XMLTV / xmltv

Utilities to obtain, generate, and post-process TV listings data in XMLTV format
GNU General Public License v2.0

uk_tvguide: improve alternative method #166

Closed: mkbloke closed this 2 years ago

mkbloke commented 2 years ago

What type of Pull Request is this?

Does this PR close any currently open issues?

No.

Please explain what this PR does

Improves the alternative method for getting the list of channels. It now gets 939 channels, although this includes duplicates. I toyed with the idea of trying to remove the unusable duplicates (they have no schedule data), but as this is only a backup method, I wasn't sure if it was really that important.

Any other information?

Discussed in #159.

Where have you tested these changes?

Operating System: Linux Mint 20.3 Una

Perl Version: 5.30.0

honir commented 2 years ago

Thanks @mkbloke, I'll try to take a look at your PR presently.

mkbloke commented 2 years ago

Just for information: I had a go at the de-duping today. It seems there are actually a lot of IDs that don't carry schedule data: not only the duplicates, but also all the radio stations as far as I can see, along with numerous TV channels. Keeping only the IDs that carry schedule data left 489 of the 939 found by the code in this PR. I think 489 is probably the true number of stations that TVG actually has schedule data for.

honir commented 2 years ago

Is there any particular reason you disabled cookies in the POST request?

mkbloke commented 2 years ago

My only reasoning was to enable just what was necessary for getting the page. As for the initialisation of the ua, I decided to do it where I did because it's only used for one request, and only when the grabber is called for configuration or channel listing, which I guess won't be very often for most people.

If you think it's better to enable it as it would have been, i.e. with cookies enabled, I can do that. Likewise, I could make ua a global so it can be reused in future once initialised.

honir commented 2 years ago

There's definitely some weirdness going on over at TVG. Sometimes I can get the HTML "select" list (i.e. the original method) to appear, but a page refresh kills it again. When I do get it, I get the same list of channels that your method fetches (939). But as you point out, many of these channels have no data (maybe they are old numbers in their database?). For example, Movies24 appears 6 times (with different IDs) but only one (1756) actually has data. Broken.

So I think we need a flexible approach to getting the channels list.

I hope you don't mind, I've taken your code and inserted it into a reworked fetch_channels sub.

I've added a command-line parameter (--method N) to select which alternative method(s) people would like to try.

    --method N This program has three methods for fetching the list of channels available.
    The preferred method can be overridden with an option of either --method 1 or --method 2
    to use one of the two alternative methods.

    If no --method parameter is supplied then the various methods will be tried in sequence
    until a channels list is obtained. A parameter of --method 0 will run the preferred method only.

    Normally you should omit this parameter.

Method 0 = 939 (original method)
Method 1 = 939 (lots of empty channels)
Method 2 = 387 (some channels missing, e.g. Movies24)
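For anyone skimming, the selection logic described in the help text can be sketched roughly as follows. This is an illustration in Python, not the grabber's Perl code: the name `fetch_channels` is borrowed from the sub mentioned above, but the signature, the `fetchers` list, and the error handling are all assumptions made for the sketch.

```python
from typing import Callable, Dict, Optional, Sequence

# Assumed shape of a channel list: channel id -> display name.
ChannelList = Dict[str, str]
Fetcher = Callable[[], ChannelList]


def fetch_channels(fetchers: Sequence[Fetcher],
                   method: Optional[int] = None) -> ChannelList:
    """Return the first non-empty channel list.

    fetchers[0] stands in for the preferred method, fetchers[1] and
    fetchers[2] for the two alternatives. With method=None every
    fetcher is tried in order; with method=N only fetchers[N] runs.
    """
    if method is None:
        candidates = list(fetchers)
    else:
        if not 0 <= method < len(fetchers):
            raise ValueError(f"unknown method {method}")
        candidates = [fetchers[method]]

    for fetch in candidates:
        try:
            channels = fetch()
        except Exception:
            # Assumption: a failed fetch (e.g. page missing) just means
            # "try the next method", it is not fatal.
            continue
        if channels:
            return channels
    return {}
```

Treating an empty result the same as a failure matters here, because the original method sometimes returns a page without the "select" list rather than an error.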

At the present time I think method 2 is likely to be the most useful (387 channels), as it seems to avoid the 'null data' IDs; however, some channels are missing. :-(

Reporting all those null channels is not going to be helpful to people. Have to think how to tidy those...

mkbloke commented 2 years ago

OK. Yes, I have noticed that sometimes the original method has worked, perhaps even several times in a row during testing, then stops working again. Very strange.

Here is some code I used that might be a useful technique to identify IDs that actually have schedule data:

https://github.com/mkbloke/xmltv/blob/tv_grab_uk_tvguide-alt/grab/uk_tvguide/tv_grab_uk_tvguide

That currently gets 492 IDs, 3 more than I reported yesterday in https://github.com/XMLTV/xmltv/pull/166#issuecomment-1049269798. It may be that some channels do not broadcast every day; on those days they have no schedule data and so do not appear in the resulting page. One way around this could be to check the current day plus the next 2 days, or perhaps more (are there channels that only broadcast one or two days a week, or only on weekends?).

https://github.com/mkbloke/xmltv/blob/1e0ff09668949b9a1a6a44d148a9ad51828d8840/grab/uk_tvguide/tv_grab_uk_tvguide#L891-L892

It may or may not be necessary to do this in chunks; I added the chunking as a kindness to the server. The splice value of 100 can be increased, that's just the last value I had in there. I think I was using 250 originally, but perhaps with testing it could be higher than that.
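As a rough illustration of the idea (in Python rather than the grabber's Perl, and not a copy of the linked code), checking IDs in chunks over several days might look like the sketch below. `fetch_listed_ids` is a hypothetical callable standing in for the request that returns which of the submitted IDs appear on the resulting page for a given day; `days=3` and `chunk_size=100` simply mirror the numbers mentioned above. Skipping IDs already confirmed on an earlier day is an extra assumption to reduce requests.

```python
from datetime import date, timedelta
from typing import Callable, Iterable, Iterator, List, Set


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def ids_with_schedule(
    all_ids: List[str],
    fetch_listed_ids: Callable[[List[str], date], Iterable[str]],
    start_day: date,
    days: int = 3,
    chunk_size: int = 100,
) -> Set[str]:
    """Collect every ID that shows schedule data on at least one day."""
    found: Set[str] = set()
    for offset in range(days):
        day = start_day + timedelta(days=offset)
        # Only re-check IDs that have not shown any data yet.
        remaining = [cid for cid in all_ids if cid not in found]
        for chunk in chunked(remaining, chunk_size):
            listed = set(fetch_listed_ids(chunk, day))
            # Ignore anything the page returns that was not asked for.
            found.update(listed & set(chunk))
    return found
```

Taking the union across days is what picks up channels that do not broadcast every day; the cost is up to `days` times as many requests for the IDs that never show data.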

Feel free to make use of that code if you think the technique is worth pursuing.

Edited to add:

The following lines are all redundant and should be deleted:

https://github.com/mkbloke/xmltv/blob/1e0ff09668949b9a1a6a44d148a9ad51828d8840/grab/uk_tvguide/tv_grab_uk_tvguide#L861
https://github.com/mkbloke/xmltv/blob/1e0ff09668949b9a1a6a44d148a9ad51828d8840/grab/uk_tvguide/tv_grab_uk_tvguide#L882
https://github.com/mkbloke/xmltv/blob/1e0ff09668949b9a1a6a44d148a9ad51828d8840/grab/uk_tvguide/tv_grab_uk_tvguide#L935

The original idea was to keep track of the IDs that couldn't be found and try another technique to check them, but with so many IDs to check that idea was soon put to rest.

honir commented 2 years ago

Thanks for the pointers.

It's hard to work around failings in the TVG data. For example, Yanga! has no guide data (which is ok), but nor does Chelsea TV as it went off-air in 2019 and so shouldn't be in the list at all.

De-duping the channel names reduces the 939 possible channels to 652. But (as you know) we don't know which is the 'live' id (i.e. which one has data).
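To make the problem concrete, here is a small Python sketch (an illustration only, not code from the grabber) of grouping IDs by channel name and then resolving a name only when exactly one of its IDs is known to have data. The function names and the `ids_with_data` set are hypothetical; where that set would come from is exactly the open question. The sample IDs are the ones mentioned in this thread.

```python
from collections import defaultdict
from typing import Dict, List, Set


def group_ids_by_name(channels: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each channel name to every ID that carries that name.

    `channels` is assumed to map channel id -> display name. Names are
    compared exactly, apart from surrounding whitespace.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for channel_id, name in channels.items():
        groups[name.strip()].append(channel_id)
    return dict(groups)


def pick_live_ids(groups: Dict[str, List[str]],
                  ids_with_data: Set[str]) -> Dict[str, str]:
    """Return name -> id for names with exactly one ID that has data.

    Names with no such ID, or with several, are left unresolved.
    """
    live: Dict[str, str] = {}
    for name, ids in groups.items():
        candidates = [cid for cid in ids if cid in ids_with_data]
        if len(candidates) == 1:
            live[name] = candidates[0]
    return live


channels = {
    "523": "Sky Cinema Family HD",
    "1354": "Sky Cinema Family HD",
    "1756": "Movies24",
}
groups = group_ids_by_name(channels)
live = pick_live_ids(groups, ids_with_data={"1354", "1756"})
# live == {"Sky Cinema Family HD": "1354", "Movies24": "1756"}
```

The grouping step is cheap; all the difficulty sits in producing a trustworthy `ids_with_data`.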

For a while it looked like we could use whether they had a valid icon on the mychannels.asp page to remove the duplicate channels. (Channels without data return a 403 when you fetch the icon.) That looked promising but fell down when I checked Sky Cinema Family HD (523 has an icon but no guide; 1354 has data but no icon). (And fails with Chelsea TV, of course, which has working icons.)

I think there is little that can be done until TVG fix their broken website. Maybe the approach is to tell them "the mychannels page has lots of duplicates and some channels which don't exist". I think if they fix that page, then it may fix the original channel 'select' method as a by-product.

mkbloke commented 2 years ago

For what it's worth, over a week ago I tested a modified grabber that checks for schedule data for channels over a 7-day period, and it only found 491 channels in total. The method worked, but it's rather slow.

I'll close this now as there's no point leaving it open.