XMLTV / xmltv

Utilities to obtain, generate, and post-process TV listings data in XMLTV format
GNU General Public License v2.0
275 stars 93 forks source link

tv_grab_uk_tvguide: fix for missing form options in channel listings #125

Closed mkbloke closed 3 years ago

mkbloke commented 3 years ago

What type of Pull Request is this?

Does this PR close any currently open issues?

No.

Please explain what this PR does

So, for a long time, I've been running tv_grab_uk_tvguide in a loop (with up to 5 retries and a sleep between of 1 hour) if it doesn't exit with 0. The problem is the tvguide.co.uk website is pretty flaky (the UX sucks too, but that's by the by) and often I end up with a few log entries of Unable to retrieve web page for $channel_id which causes retries and sometimes a total failure if the max retries is exceeded, which means manual intervention. I've been meaning to look at this for a long time now, but have only just got around to it.

The problem is that - for whatever reason - the tvguide.co.uk website does not always insert the options into the form that is searched to find $channelname - see the attached channel listing HTML page for an example of this. Having got around to understanding the problem, I now realise my retry solution has not been a very good one, due to the transparent caching also coming into play.

This PR modifies the grabber to obtain the channel name from the second occurrence of:

<span class=programmeheading>

It has been working pretty well for me over the last 5 days:

2020-12-03-05:37: finished on try: 1
2020-12-03-17:37: finished on try: 1
2020-12-04-05:37: finished on try: 1
2020-12-04-17:37: finished on try: 1
2020-12-05-05:37: finished on try: 2
2020-12-05-17:49: finished on try: 1
2020-12-06-05:37: finished on try: 1
2020-12-06-17:37: finished on try: 1
2020-12-07-05:37: finished on try: 1
2020-12-07-17:37: finished on try: 1

Note: I had to manually kill the tv_grab_uk_tvguide process that started on 2020-12-05-05:37, as it hung on the first try at 05:39 for no obvious reason (CPU usage 0%, memory usage 2.2% (~22MB), plenty of disk space). That is something I have very occassionally seen before. Perhaps it's a network issue where something in LWP or called by LWP never returns and doesn't time out? There was nothing in the error log that was out of place - using grep -v ^Fetching on it returned nothing. Annoyingly, by the time I killed the hung first try, the second try didn't quite finish before the scheduled evening run at 17:37, so that had to be manually run a little later.

Any other information?

121.html.txt

Where have you tested these changes?

Operating System: Debian 9

Perl Version: v5.24.1

honir commented 3 years ago

Good spot! :-) That's curious that the options input is sometimes missing: it's not as though it's fetched asynchronously.

Do you think there's any mileage in trying both methods? Try for the 'option' tag first and if not found then go for the programmeheading class.

I suggest this because the css for programmeheading is very generic

.programmeheading {
    color: #ffffff;
    font-size: 2.2em;
    font-weight: normal;
}

so it's possible they might use it elsewhere in the future which might then mess up the list counter.

mkbloke commented 3 years ago

That's curious that the options input is sometimes missing: it's not as though it's fetched asynchronously.

Yes, it's quite strange. I guess something in their back-end must occasionally fail or timeout...

Do you think there's any mileage in trying both methods? Try for the 'option' tag first and if not found then go for the programmeheading class.

Yes, that sounds like a great idea.

I suggest this because the css for programmeheading is very generic

.programmeheading {
    color: #ffffff;
    font-size: 2.2em;
    font-weight: normal;
}

so it's possible they might use it elsewhere in the future which might then mess up the list counter.

I did wonder about the anchoring, as ideally it would be better to anchor on something more specific. I will look into trying to anchor on something a bit more specific as well, so hopefully the back-up method will be a bit more robust than it is right now.

honir commented 3 years ago

TOL: a possible approach might be to store the option tags (from the first page download, and anchored within the "select name=ch" for safety) in an array and use that for the rest of the program run. Dunno...that might be a sledgehammer...?

mkbloke commented 3 years ago

TOL: a possible approach might be to store the option tags (from the first page download, and anchored within the "select name=ch" for safety) in an array and use that for the rest of the program run. Dunno...that might be a sledgehammer...?

That could work. I guess it would be necessary to consider the possibility of missing options in the first channel listings request and do something about it though. I used a sledgehammer approach in my first attempt at fixing this:

                # Fetch the page
                #   my $tree = XMLTV::Get_nice::get_nice_tree($url);
                my $tree = undef;
                my $channelname = undef;
                my $maxretry = 5;

                # Retry on invalid data
                for (my $retry = 1; $retry <= $maxretry; $retry++) {
                    $tree = fetch_url($url);
                    $channelname = $tree->look_down('_tag' => 'option', 'value' => $channel_id) if defined $tree;
                    last if defined $channelname;

                    print STDERR "Channel data not found, retrying URL: $url (attempt $retry of $maxretry) \n";
                    my $cache_filename = HTTP::Cache::Transparent::_urlhash($url);
                    $conf->{cachedir}->[0] =~ /\/$/ or $cache_filename = "/$cache_filename";
                    unlink $conf->{cachedir}->[0] . $cache_filename;
                    print STDERR "Deleted cache file: " . $conf->{cachedir}->[0] . $cache_filename . "\n";
                }
                # $tree->dump; exit;

                # Scrub the page

:-)

mkbloke commented 3 years ago

OK, so I've gone with the suggestion in your first comment. The most recent push:

What do you think?

mkbloke commented 3 years ago

https://travis-ci.org/github/XMLTV/xmltv/jobs/748836560 https://travis-ci.org/github/XMLTV/xmltv/jobs/749308909

Hmm, how to give Travis a good kicking? :-)

mkbloke commented 3 years ago

Any thoughts on the latest code, @honir? Also, why the Travis failures for builds on CentOS8?

Cheers.

Edit: Just to add, I've been using the most recently pushed code since I pushed it and it appears to be working very well - 5 full days (10 runs in total) that have all exited with 0 on the first try.

honir commented 3 years ago

Thanks for the code contribution :-)

I've added a method to get the channel names from the first set of 'options' tag it finds (which may not be the first webpage retrieved).