XMLTV / xmltv

Utilities to obtain, generate, and post-process TV listings data in XMLTV format
GNU General Public License v2.0

tv_grab_uk_tvguide exits with: Can't call method "tag" on an undefined value at /usr/bin/tv_grab_uk_tvguide line 606. #28

Closed mkbloke closed 6 years ago

mkbloke commented 6 years ago

As per the subject. I'm using the libxmltv-perl package and the xmltv package version 0.5.70-1 from Debian testing on a Debian stable system. No dependency issues with the actual packages upon installation.

Starting the grabber as below, for today (not usually with --debug):

tv_grab_uk_tvguide --quiet --days 1 --offset 1 \
  --output /tmp/tv_guide_tvguide.xml --debug \
  &> /tmp/tv_guide_tvguide.log

I've attached the debug log, if that helps at all.

I have tried deleting the cache and running the grabber again, to no avail.

It appears that a page for a programme on BBC Parliament is the issue, so I disabled it in the grabber config and managed to get listings again.

Thanks.

Cheers, Ian

tv_guide_tvguide.log

honir commented 6 years ago

Thanks for the log - that helped.

I can't see any issue with the data for that programme although I can confirm the error you are getting. For some reason the Perl library which parses the programme listing is creating some spurious data, and it's that which is causing the error.

It's just the one programme ("House of Lords", Tues 7:30pm) which seems to be the problem. So the best I can suggest is to reinstate BBC Parliament after tomorrow and see if the issue recurs.

mkbloke commented 6 years ago

I have not as yet bothered enabling BBC Parliament again. I'm also getting the error:

Could not parse the page Can't call method "delete" on an undefined value at /usr/bin/tv_grab_uk_tvguide line 636.

Is that related to the 403 Forbidden, which I'm guessing means I've exceeded the site's HTTP traffic threshold? The URL that is forbidden can be accessed if opened in a browser.

At the moment I'm trying to build up my listings, so I last ran tv_grab_uk_tvguide as below:

tv_grab_uk_tvguide --quiet --days 1 --offset 6 --debug \
  --output /tmp/tv_guide_tvguide.xml \
  &> /tmp/tv_guide_tvguide.log

Thanks, Ian

tv_guide_tvguide.log

honir commented 6 years ago

> Is that related to the 403 Forbidden, which I'm guessing means I've exceeded the site's HTTP traffic threshold?

Yes, that seems to be the cause.

mkbloke commented 6 years ago

Hi Geoff,

Thanks for your previous reply.

Just to tie this up, I can confirm that having enabled BBC Parliament again, the original issue has disappeared for the time being.

It would be good to get the issue sorted properly if possible, as temporarily removing channels that caused the original issue means going without programme data, which could be problematic on channels with programmes I'd like to record via MythTV.

Out of interest, which Perl module/library was the cause of the original issue (if you have managed to identify it)?

Thanks, Ian

honir commented 6 years ago

Hi Ian,

Thanks for the feedback.

I suspect a Perl library (HTML::TreeBuilder) rather than the data source, which looked fine on manual inspection.

I'm afraid this sort of thing is always a possibility with website scraping. If you need something nearer 100% reliable then I can recommend the sd-json service from Schedules Direct. Not free, but under £20 p.a. at current exchange rates. No direct affiliation other than I'm a satisfied customer. They have a few issues with UK channel lineups which they/their upstream provider seem unable/unwilling to fix (e.g. ch183 Vice UK), but nothing disastrous.

Geoff

mkbloke commented 6 years ago

Hi Geoff,

Cheers for the info; I might try to get further into debugging if necessary and as time goes on, then feed back anything that might be helpful to the XMLTV project in the future.

I realise that website scraping is a reactive sport! Although nowhere near as complicated as XMLTV, I'm running a few bespoke Perl scripts to scrape websites and turn the results into RSS feeds using the XML::RSS::SimpleGen module.

I have been considering using Schedules Direct. It's not expensive, and given its more reliable nature it could well be worth the money. The only thing putting me off a little at the moment is that my cheap set-up (a VPS costing less than £11/year!), which I'm currently using to stream only 19 Freeview channels on demand, would become quite a bit more expensive once you add the almost £19 for the SD feed. If the hassle with scraping becomes too much, I'll probably do it, but for the time being I'll carry on with tv_grab_uk_tvguide.

Do you know anything about http://www.xmltv.co.uk/ by the way? Initially it looked rather promising, but upon trying to use the EPG data it became apparent that it's not that great, which was disappointing. I was wondering where the data comes from; probably from a tuner card, I should think. Unfortunately there are no contact details for the site owner anywhere, not even in whois, where I noted that the domain expires this coming 10th July. I'm wondering whether it will be renewed, as the site doesn't seem to have had much development thus far.

Cheers, Ian

honir commented 6 years ago

Yes that's true - it makes it a bit pricey for just 19 channels.

I don't know anything about www.xmltv.co.uk. AFAIK it's not an 'official' site. I think you're right - it looks like it's scraping an OTA EPG. Looks as though it's not maintained any more (e.g. it still has ITV Encore which closed down 2 months ago).

mkbloke commented 6 years ago

Hmm, so I now have exactly the same problem as the original one, on the same channel and exactly the same programme, but a different timeslot. This time it's House of Lords on Thu 5 Jul 1:45am-6:00am on BBC Parliament.

Interestingly, I can download the page using curl with no problem at all, but wget complains about a TLS error:

$ wget 'https://www.tvguide.co.uk/detail/158428658/139868244/house-of-lords'
--2018-06-29 13:43:31-- https://www.tvguide.co.uk/detail/158428658/139868244/house-of-lords
Resolving www.tvguide.co.uk (www.tvguide.co.uk)... 54.221.235.161
Connecting to www.tvguide.co.uk (www.tvguide.co.uk)|54.221.235.161|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘house-of-lords’

house-of-lords [ <=> ] 81.85K 541KB/s in 0.2s

2018-06-29 13:43:31 (541 KB/s) - Read error at byte 83811 (The TLS connection was non-properly terminated.).Retrying.

--2018-06-29 13:43:32-- (try: 2) https://www.tvguide.co.uk/detail/158428658/139868244/house-of-lords
Connecting to www.tvguide.co.uk (www.tvguide.co.uk)|54.221.235.161|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘house-of-lords’

house-of-lords [ <=> ] 81.85K 532KB/s in 0.2s

2018-06-29 13:43:33 (532 KB/s) - Read error at byte 83811 (The TLS connection was non-properly terminated.).Retrying.

^C

I don't know if that's significant, or perhaps just a quirk of wget on my Debian system...

EDIT: Yes, it appears to be a quirk of the wget version I have. I've tried multiple URLs for House of Lords, along with other programmes on other channels and all seem to exhibit the same TLS error.

EDIT: Once again, disabling BBC Parliament in the channel list in tv_grab_uk_tvguide.conf has fixed the problem.

Cheers, Ian

tv_guide_tvguide.log

mkbloke commented 6 years ago

Good news, I've finally had a eureka moment after a couple of days of debugging!

The following programme information causes the error on line 606: https://www.tvguide.co.uk/detail/158428658/139868244/house-of-lords

Why? Because the programme description contains the word 'Rating'. This causes the code on line 603 to skip further down the HTML tree, matching the following:

<span class="programmetext">Wednesday's business in the House of Lords, including the third reading of the Domestic Gas and Electricity (Tariff Cap) Bill, and the report stage of the Rating (Property in Common Occupation) and Council Tax (Empty Dwellings) Bill<br><br></span>

Of course, that then blows up in the code that immediately follows line 603.

So, changing line 603 is the answer. The version below works, although by my reading of the HTML::Element documentation it shouldn't: as far as I can see, ->as_text should include all the white space too, given that no options are being supplied to HTML::TreeBuilder. It could be due to $root->ignore_ignorable_whitespace defaulting to true, but I'm not clear on that:

my $showrating = $show->look_down('_tag' => 'span', 'class' => 'programmetext', sub { $_[0]->as_text =~ /^Rating$/ } );

This also works and seems a little better to be sure of what you want to match on without worrying about white space defaults:

my $showrating = $show->look_down('_tag' => 'span', 'class' => 'programmetext', sub { $_[0]->as_trimmed_text =~ /^Rating$/ } );
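For anyone following along, the difference the anchors make can be sketched with core Perl alone, without HTML::TreeBuilder. This is only an illustration under assumptions: the original line 603 isn't quoted in this thread, so the "loose" match below assumes it was an unanchored /Rating/; the trimmed() helper is a hypothetical stand-in that only strips leading and trailing white space (as_trimmed_text also collapses internal runs); and the label string is made up, while the description is the text from the page above.

```perl
use strict;
use warnings;

# Hypothetical text of two <span class="programmetext"> elements:
# the label the grabber is looking for, and the programme description
# that happens to contain the word 'Rating'.
my $label       = "  Rating \n";
my $description = "Wednesday's business in the House of Lords, including "
                . "the third reading of the Domestic Gas and Electricity "
                . "(Tariff Cap) Bill, and the report stage of the Rating "
                . "(Property in Common Occupation) and Council Tax "
                . "(Empty Dwellings) Bill";

# Core-only stand-in for trimming: strip leading and trailing white space.
sub trimmed {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

# Assumed shape of the old test: 'Rating' anywhere in the text.
sub matches_loose { return $_[0] =~ /Rating/ ? 1 : 0 }

# Shape of the fixed test: the trimmed text must be exactly 'Rating'.
sub matches_strict { return trimmed($_[0]) =~ /^Rating$/ ? 1 : 0 }

for my $text ($label, $description) {
    printf "%-20.20s loose=%d strict=%d\n",
        trimmed($text), matches_loose($text), matches_strict($text);
}
```

The loose test accepts both strings, so a description mentioning 'Rating' can be picked up in place of the label; the anchored test on trimmed text accepts only the label.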

Cheers, Ian

honir commented 6 years ago

Good spot! Many thanks for the work in tracking this down, and for the code fix. Duly committed.

Cheers, Geoff

mkbloke commented 6 years ago

Great, thank you!

Ian