XMLTV / xmltv

Utilities to obtain, generate, and post-process TV listings data in XMLTV format
GNU General Public License v2.0
266 stars 93 forks source link

tv_grab_zz_sdjson: improve episode/season metadata handling #222

Closed kgroeneveld closed 4 months ago

kgroeneveld commented 5 months ago

What type of Pull Request is this?

Does this PR close any currently open issues?

221

Please explain what this PR does

The program metadata from SD is an array of different providers. The previous code was specifically looking for 'Gracenote' provider in the first array element and ignoring everything else. For example, it seems the first element is now often 'TVmaze' (at least for the lineups I use) and the episode/season data was all being ignored.

The code now iterates over all the metadata providers generally preferring the earlier entries unless a later one looks more complete.

Where have you tested these changes?

Operating System: ubuntu 22.04

Perl Version: v5.34.0

I have only tested this very quickly so far. I would like to do more testing / review before merging. But I thought I would create the PR now in case @rob-vh who reported the issue wants to test it or is very eager for a fix.

kgroeneveld commented 5 months ago

I just noticed my commit changes the file permissions which I should probably fix before merging.

honir commented 5 months ago

Does the fix work when the metadata contains just an episode number but no series?

rob-vh commented 5 months ago

I've seen programs (using EPG data from Channel 5+1) with only the episode number getting extracted correctly. In that case $season = $_season is executed only once and passed down for storing (unless there is also a provider that only contains the Season later on in the array of providers).

kgroeneveld commented 5 months ago

It is hard to decide if the code is correct without agreeing on the requirements. I didn’t even fully formulate a full set of requirements in my mind when I wrote and pushed my first attempt at fix.

Some thoughts I had while working it:

When there is more than one source of metadata, which do you choose to include in the xml output?

Do you ever mix and match info from different providers?

If one provider only has the season and one only has the episode do you combine them?

What if the providers have conflicting information, which do you use?

My initial thinking was to never mix and match info from different providers but instead try to find the most detailed entry, preferring earlier entries with the same level of detail but possibly conflicting info.

But if one source only has season and one only has episode which one is “more detailed”?

The API for the source data from Schedules Direct states the only mandatory field is season. So I was giving preference to season over episode and that is where the !length($season) part of one of the statements comes from.

Does the fix work when the metadata contains just an episode number but no series?

If the source data follows the spec this should never happen. But maybe it should be considered anyway, especially since it seems @rob-vh has found a case where it happens . Right now I think it will only use an isolated episode number if it was the first metadata entry in the array. So there is probably room for improvement there.

But before tweaking the code further I would be interested to hear others thoughts on possibly combining info from multiple providers or other requirements in general.

rob-vh commented 5 months ago

Scanning through Channel5 +1 EPG I found an occurrence where Gracenote has only the season but TVMaze has both fields:

"metadata":[{"TVmaze":{"season":2,"episode":34,"url":"https:\/\/www.tvmaze.com\/episodes\/2215086\/the-adventures-of-paddington-2x34-paddington-feels-the-music"}},{"Gracenote":{"season":2}}]

So I would not like to wipe the info collected from TVmaze when Gracenote only contains partial information.
If there is no provider with both fields, I would not mind taking season from one provider and episode from another, but don't overwrite info from a previous provider with null.

Earlier I saw isolated episode numbers with no season, but not in the data I have checked today.

rob-vh commented 5 months ago

I realize why I have isolated episode numbers in my data. I also parse the Smm/Enn info from over the air EPG, esp. for sat channels that SD does not cover. Just now I saw Doctor Who mentioned on ONS with E01.

honir commented 5 months ago

I probably wouldn't recommend merging different sources. I have seen many instances where Gracenote, Epguides website, IMDb website have different opinions. Althoough, to be fair, it mostly seems to be IMDb which is wrong (Gracenote and Epguides usually agree).

honir commented 5 months ago

(In the UK at least) we used to get isolated episode numbers, particularly for things like soaps, or multi-part documentaries. Recently though it seems the soaps are coming through with the year as the season (S/E = year/num-in-year). I assume that is in the Gracenote data, and seems to be something invented by Gracenote.

For example, EastEnders tonight is listed on TVGuide as Episode 6843, but I think Gracenote has it as Season 2024 Episode 12 !

TVGuide as Ep. 6843 BBC use the broadcast date as the Ep. 22/01/2024 Sky hedge their bets and call it S/E 2024/6843 !

Many guides use the isolated episode number ("6843")

It seems the json feed S/E numbers are invented by Gracenote (or SD?).

Especially since they are actually out of sequence:

S/E 2024/11 Friday 19 January, 2024 (EP100330186088) S/E 2024/12 Monday 22 January, 2024 (EP100330186900) S/E 2024/13 Wednesday 17 January, 2024 (EP100330187077) **** OUT OF SEQUENCE S/E 2024/14 Tuesday 23 January, 2024 (EP100330187121) S/E 2024/15 Wednesday 24 January, 2024 (EP100330187338) .

Whether Gracenote / SD should be inventing S/E numbers is maybe a discussion for another thread.

garybuhrmaster commented 5 months ago

(In the UK at least) we used to get isolated episode numbers, particularly for things like soaps, or multi-part documentaries. Recently though it seems the soaps are coming through with the year as the season (S/E = year/num-in-year). I assume that is in the Gracenote data, and seems to be something invented by Gracenote.

I don't know who invented it, but I have seen the series being the year in more than one guide source, so it seems to have become some new form of normal.

Especially since they are actually out of sequence:

Are they really out of some, arguably valid, sequence? As discussed before, there are numerous orderings to chose from. Some may be original production numbers/order (as assigned during the planning process at the beginning of the series just to have unique numbers to fill in the blanks later), some may be order of start, or completion, of production (some shows may start production earlier than others but take longer to complete than others), some may be scheduled broadcast order, and some may be actual broadcast order (which due to preemption may not be the same as scheduled broadcast order). All of those orders may correct to some of the people some of the time for some of the shows (and I have seen some of all of those values from different sources).

honir commented 5 months ago

All of those orders may correct to some of the people some of the time for some of the shows

Spoken like a true academic :-) Anyone watching this serial in that s/e order is going to get very confused though (with dramatic events happening out of sequence)

rob-vh commented 5 months ago

I would, however, be great to have consistency in the choice of season number. I use PVR automation logic to record episodes of favorite shows for season > season last completed. So flipping between "sequence season" and "annual season" numbers is going to hurt.

rmeden commented 5 months ago

tv_grab_na_dd uses the Gracenote "flat file" feed as it's data source.  That doesn't provide Season/Episode numbers are we humans understated it.  It passes on a producer reported (and used) episode number that makes sense only to the series.  It calls it the episode system "onscreen".   It also provides a "dd_progid" system to pass on the Gracenote Episode-id.  It doesn't know (nor does it set), the "normal" xmltv_ns system.

I think a grabber should report all available episode-num options using different "system" attributes, and report its best guess (based on knowledge of the data source)  in the "xmltv_ns" system.

Because

garybuhrmaster commented 5 months ago

All of those orders may correct to some of the people some of the time for some of the shows

Spoken like a true academic :-) Anyone watching this serial in that s/e order is going to get very confused though (with dramatic events happening out of sequence)

And then there is the order that the episodes SHOULD have been broadcast in (and should be viewed in) except for the stupidity of certain network executives (cough Firefly cough).

And when a series gets rebooted, some (but not all) guide sources start with a new series (season) number of 1.

Standards are wonderful things, and there are many of them: https://xkcd.com/927/

honir commented 5 months ago

I think a grabber should report all available episode-num options using different "system" attributes, and report its best guess (based on knowledge of the data source)  in the "xmltv_ns" system.

That sounds very sensible. Then it is up to the consuming program to pick what it thinks best.

( I can easily imagine a situation where a source has good data for US programmes but poor for UK ones (in which they are not really interested!). And vice-versa. )

BST (blue-sky thinking): Perhaps the grabber could offer a choice of which source to use for the xmltv_ns system?

rmeden commented 5 months ago

On 1/25/2024 2:34 AM, Geoff wrote:

BST (blue-sky thinking): Perhaps the grabber could offer a choice of which source to use for the xmltv_ns system?

I'll match your BST and raise... An external utility script that promotes your favorite xmltv_ns system from any xmltv file. :)

(personally, I don't think it's necessary, but for those in a situation with a bad xmltv_ns source, it could be handy)

Robert

kgroeneveld commented 5 months ago

I am not sure if all of the above helps or just leaves me even more confused...

I think everyone pretty much agrees that mixing and matching the data from different providers for the xmltv_ns system entry should not be done, which was my original thought as well.

I think a grabber should report all available episode-num options using different "system" attributes, and report its best guess (based on knowledge of the data source) in the "xmltv_ns" system.

This sounds reasonable. And it looks like the tv_grab_zz_sdjson_sqlite grabber is doing something along these lines.

But it is unclear to me how the SD JSON data should be formatted in the XML. Sample output from tv_grab_zz_sdjson_sqlite:

<episode-num system="dd_progid">EP02696547.0168</episode-num>
<episode-num system="xmltv_ns"> 5 . 21 .  </episode-num>
<episode-num system="tvmaze.com">episode/22</episode-num>
<episode-num system="tvmaze.com">series/6</episode-num>
<episode-num system="tvmaze.com">url/https://www.tvmaze.com/episodes/2505976/swat-6x22-legacy</episode-num>
<episode-num system="schedulesdirect.org">programID/EP026965470168</episode-num>
<episode-num system="schedulesdirect.org">series/6</episode-num>
<episode-num system="schedulesdirect.org">episode/22</episode-num>
<episode-num system="schedulesdirect.org">resourceID/14158903</episode-num>
<episode-num system="schedulesdirect.org">originalAirDate/20230519 +0000</episode-num>

Is this schema defined anywhere? Or does each grabber author just make up whatever they want?

It seems strange to have multiple entries with the same system. It seems instead it should be something more like:

<episode-num system="tvmaze.com">
  <episode>22</episode>
  <series>6</series>
  <url>https://www.tvmaze.com/episodes/2505976/swat-6x22-legacy</url>
</episode-num>

In some ways it seems system is being used as the "provider" and not a "system" in the above.

And if we are just trying to pass on all the keys in the SD JSON data in a uniform way maybe it should be something more like:

<episode-num system="sdjson">
  <provider name="Gracenote">
    <season>6</season>
    <episode>22</episode>
  </provider>
  <provider name="TVmaze">
    <season>6</season>
    <episode>22</episode>
    <url>https://www.tvmaze.com/episodes/2505976/swat-6x22-legacy</url>
  </provider>
<episode-num>

(plus the standard dd_progid and xmltv_ns entries)

I don't really have a strong preference myself on any of this. Maybe this is already defined a bit better somewhere. Ideally the different xmltv grabbers would be somewhat consistent.

garybuhrmaster commented 5 months ago

Some random thoughts (which should likely be ignored):

(*) At least for content originated in the US. Outside of the US Gracenote has to depend (to various extents) on the local guide source providers, but, importantly, Gracenote is still responsible for curating the data, and they pay people to do it.

honir commented 5 months ago

I agree. There has not been a general call for numbering systems outside of xmltv_ns, and this is the one which is pretty-much universally output and used by downstream programs.

Any other numbering systems (i.e. other 'system' elements/attributes) is simply froth. It MAY be used by a downstream program but I think that would be a niche use case.

The focus here should be on populating xmltv_ns and, as pointed out, I would expect Gracenote data in preference to any other. (Gracenote data is the reason I signed up to Schedules Direct.)

Only if Gracenote was fully empty should another provider be considered for populating xmltv_ns. (On balance I'd rather have 'some' data than 'none', even if it's potentially unreliable.)

kgroeneveld commented 5 months ago

Thanks everyone for the feedback. I have been getting more helpful feedback than expected.

It should also be noted that in this case the root cause of the error (no episode/series) appears to actually be due to coding error presuming that the first element of an unordered array

That is about the only thing I have been sure of in this discussion. It was so long ago when I wrote that code I don't remember what I was thinking at the time but I would like to think I knew it was dumb / wrong at the time and that it was just a quick temporary way to get the the other parts working and then I forget to fix it.

Based on the most recent comments I am also starting to think it is best to keep things simple for now (and probably for a long time) and only populate xmltv_ns (and dd_progid) preferring the Gracenote data if available. Anything beyond that seems unlikely to even be used with standard way to represent it anyway.

I just pushed an updated version of my branch which implements the above.

kgroeneveld commented 5 months ago

And of course immediately after pushing the update I see a mistake... so I just pushed again. But with my luck there is still something stupid in there.

knowledgejunkie commented 5 months ago

@honir @rmeden @garybuhrmaster @kgroeneveld With a view to getting a new release out very soon, are we happy that the latest changes provide the expected data (if available)?

kgroeneveld commented 4 months ago

With a view to getting a new release out very soon, are we happy that the latest changes provide the expected data (if available)?

No news is good news?

I have been using the latest version of this PR without any issues on my MythTV system from around the time I pushed it. Although, I must admit I don't pay much attention to the episode / season metadata. I did also do some more specific testing with various hand created test data and everything I tried worked as per my expectations.

Unless someone spots a major bug I say we should merge it before the next release. It is definitely wrong in the current release.

Maybe I should just merge it myself in a couple days if I don't hear anything. I think I can merge it... I have not made an actual commit / merge since XMLTV was still in CVS!

knowledgejunkie commented 4 months ago

Thank you @kgroeneveld !