Closed: mkbloke closed this issue 3 years ago.
Good spot! :-) That's curious that the options input is sometimes missing: it's not as though it's fetched asynchronously.
Do you think there's any mileage in trying both methods? Try for the 'option' tag first and if not found then go for the programmeheading class.
I suggest this because the CSS for programmeheading is very generic:
.programmeheading {
color: #ffffff;
font-size: 2.2em;
font-weight: normal;
}
so it's possible they might use it elsewhere in the future which might then mess up the list counter.
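The try-the-option-tag-first, fall-back-to-programmeheading order could be sketched like this. It's a minimal, self-contained illustration: the regex parsing and the sample markup are invented for the example, whereas the real grabber works on an HTML::TreeBuilder tree via `look_down`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper showing the suggested fallback order.
sub channel_name {
    my ($html, $channel_id) = @_;

    # 1) Preferred: the <option> whose value matches the channel id
    if ($html =~ /<option[^>]*\bvalue="\Q$channel_id\E"[^>]*>([^<]+)</i) {
        return $1;
    }

    # 2) Fallback: the second <span class=programmeheading> on the page
    my @headings = $html =~ /<span\s+class="?programmeheading"?[^>]*>([^<]+)</gi;
    return $headings[1] if @headings > 1;

    return undef;    # neither anchor found
}

# Invented sample pages for illustration only
my $page_with_options = '<select name="ch"><option value="121">BBC One</option></select>';
my $page_without      = '<span class=programmeheading>Listings</span>'
                      . '<span class=programmeheading>BBC One</span>';

print channel_name($page_with_options, '121'), "\n";  # BBC One (from the option tag)
print channel_name($page_without,      '121'), "\n";  # BBC One (from the second heading)
```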
> That's curious that the options input is sometimes missing: it's not as though it's fetched asynchronously.
Yes, it's quite strange. I guess something in their back-end must occasionally fail or timeout...
> Do you think there's any mileage in trying both methods? Try for the 'option' tag first and if not found then go for the programmeheading class.
Yes, that sounds like a great idea.
> I suggest this because the css for programmeheading is very generic
> .programmeheading { color: #ffffff; font-size: 2.2em; font-weight: normal; }
> so it's possible they might use it elsewhere in the future which might then mess up the list counter.
I did wonder about the anchoring, as ideally it would be better to anchor on something more specific. I'll look into that, so hopefully the back-up method will be more robust than it is right now.
TOL: a possible approach might be to store the option tags (from the first page download, and anchored within the "select name=ch" for safety) in an array and use that for the rest of the program run. Dunno...that might be a sledgehammer...?
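That store-the-options-for-the-whole-run idea might look something like this. It's a core-Perl sketch with regex parsing and invented sample pages; the real grabber would harvest from its HTML::TreeBuilder tree, and the helper names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# %channel_names persists for the whole run; it is filled from any page
# whose <select name="ch"> block actually contains option tags.
my %channel_names;

sub remember_options {
    my ($html) = @_;
    # Anchor inside <select name="ch"> for safety, as suggested
    my ($select) = $html =~ m{<select[^>]*\bname="?ch"?[^>]*>(.*?)</select>}is;
    return unless defined $select;
    while ($select =~ /<option[^>]*\bvalue="?([^">]+)"?[^>]*>([^<]+)</gi) {
        $channel_names{$1} //= $2;    # first sighting wins
    }
}

sub channel_name {
    my ($html, $channel_id) = @_;
    remember_options($html);              # harvest options if present
    return $channel_names{$channel_id};   # else fall back to the cached set
}

# Invented sample pages: the second is missing its options
my $good = '<select name="ch"><option value="121">BBC One</option></select>';
my $bad  = '<select name="ch"></select>';

print channel_name($good, '121'), "\n";   # BBC One (parsed from this page)
print channel_name($bad,  '121'), "\n";   # BBC One (remembered from earlier)
```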
> TOL: a possible approach might be to store the option tags (from the first page download, and anchored within the "select name=ch" for safety) in an array and use that for the rest of the program run. Dunno...that might be a sledgehammer...?
That could work. I guess it would be necessary to consider the possibility of missing options in the first channel listings request and do something about it though. I used a sledgehammer approach in my first attempt at fixing this:
# Fetch the page
# my $tree = XMLTV::Get_nice::get_nice_tree($url);
my $tree        = undef;
my $channelname = undef;
my $maxretry    = 5;

# Retry on invalid data
for (my $retry = 1; $retry <= $maxretry; $retry++) {
    $tree = fetch_url($url);
    $channelname = $tree->look_down('_tag' => 'option', 'value' => $channel_id) if defined $tree;
    last if defined $channelname;
    print STDERR "Channel data not found, retrying URL: $url (attempt $retry of $maxretry)\n";
    # Delete the transparently-cached copy so the retry actually refetches
    my $cache_filename = HTTP::Cache::Transparent::_urlhash($url);
    $conf->{cachedir}->[0] =~ /\/$/ or $cache_filename = "/$cache_filename";
    unlink $conf->{cachedir}->[0] . $cache_filename;
    print STDERR "Deleted cache file: " . $conf->{cachedir}->[0] . $cache_filename . "\n";
}
# $tree->dump; exit;

# Scrub the page
:-)
OK, so I've gone with the suggestion in your first comment. The most recent push:
What do you think?
https://travis-ci.org/github/XMLTV/xmltv/jobs/748836560 https://travis-ci.org/github/XMLTV/xmltv/jobs/749308909
Hmm, how to give Travis a good kicking? :-)
Any thoughts on the latest code, @honir? Also, why the Travis failures for builds on CentOS8?
Cheers.
Edit: Just to add, I've been using the most recently pushed code since I pushed it and it appears to be working very well - 5 full days (10 runs in total) that have all exited with 0 on the first try.
Thanks for the code contribution :-)
I've added a method to get the channel names from the first set of 'option' tags it finds (which may not be on the first webpage retrieved).
What type of Pull Request is this?
Does this PR close any currently open issues?
No.
Please explain what this PR does
So, for a long time, I've been running
tv_grab_uk_tvguide
in a loop (with up to 5 retries and a sleep between of 1 hour) if it doesn't exit with 0. The problem is the tvguide.co.uk website is pretty flaky (the UX sucks too, but that's by the by) and often I end up with a few log entries ofUnable to retrieve web page for $channel_id
which causes retries and sometimes a total failure if the max retries is exceeded, which means manual intervention. I've been meaning to look at this for a long time now, but have only just got around to it.The problem is that - for whatever reason - the tvguide.co.uk website does not always insert the options into the form that is searched to find
$channelname
- see the attached channel listing HTML page for an example of this. Having got around to understanding the problem, I now realise my retry solution has not been a very good one, due to the transparent caching also coming into play.

This PR modifies the grabber to obtain the channel name from the second occurrence of:
<span class=programmeheading>
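Standalone, that fallback looks roughly like this. It's a core-Perl regex sketch with invented sample markup; the grabber itself works on the HTML::TreeBuilder tree from the fetched page rather than raw HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the text of every <span class=programmeheading> on the page
# and take the second one (the first is a page-level heading).
sub heading_channel_name {
    my ($html) = @_;
    my @headings = $html =~ /<span\s+class="?programmeheading"?[^>]*>([^<]+)</gi;
    return @headings > 1 ? $headings[1] : undef;
}

# Invented sample markup for illustration only
my $page = '<span class=programmeheading>TV Guide</span>'
         . '<span class=programmeheading>BBC One</span>';

print heading_channel_name($page), "\n";   # BBC One
```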
It has been working pretty well for me over the last 5 days:
Note: I had to manually kill the
tv_grab_uk_tvguide
process that started on 2020-12-05-05:37, as it hung on the first try at 05:39 for no obvious reason (CPU usage 0%, memory usage 2.2% (~22MB), plenty of disk space). That is something I have very occasionally seen before. Perhaps it's a network issue where something in LWP or called by LWP never returns and doesn't time out? There was nothing in the error log that was out of place - using `grep -v ^Fetching` on it returned nothing. Annoyingly, by the time I killed the hung first try, the second try didn't quite finish before the scheduled evening run at 17:37, so that had to be manually run a little later.

Any other information?
121.html.txt
Where have you tested these changes?
Operating System: Debian 9
Perl Version: v5.24.1