XMLTV / xmltv

Utilities to obtain, generate, and post-process TV listings data in XMLTV format
GNU General Public License v2.0

tv_grab_uk_freeview produces bad XML for some channels #244

Closed nhathaway closed 1 week ago

nhathaway commented 2 months ago

XMLTV Version?

(Please specify release version or git commit ID) f84e2eb

XMLTV Component?

(Grabber name or utility) tv_grab_uk_freeview

Perl Version

5.38.2

Operating System

Ubuntu 24.04 - note: only the grabber is from github. The rest is from the Ubuntu distro.

What happened?

Aborted and produced invalid file(s)

What did you expect to happen?

Run to completion and produce valid file(s)

Did you see any warnings/errors?

(Please paste any warnings/errors, if available)

Code point \u0018 is not a valid character in XML at /usr/share/perl5/XMLTV.pm line 2197.
Code point \u0018 is not a valid character in XML at /usr/share/perl5/XMLTV.pm line 2197.
Code point \u0018 is not a valid character in XML at /usr/share/perl5/XMLTV.pm line 2197.
malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/XMLTV/Get_nice.pm line 136.
malformed JSON string, neither tag, array, object, number, string or atom, at character offset 0 (before "(end of string)") at /usr/share/perl5/XMLTV/Get_nice.pm line 136.
Code point \u001C is not a valid character in XML at /usr/share/perl5/XMLTV.pm line 2197.
no programmes found
no programmes found

What steps are needed to reproduce this issue?

(Please provide the full commands you are running)

  1. Generate the config for my postcode
  2. Split the config down into multiple files, one channel per config file
  3. Run the grabber for each channel, one at a time

Please attach your config file below:

(Remember to remove any usernames/passwords) I have attached the entire output as well as the main config file, and the resulting per-channel xml files. I ran tv_validate_file on each and marked the bad ones as bad:

grab267.xml is bad
grab269.xml is bad
grab272.xml is bad
grab273.xml is bad
grab43.xml is bad
grab707.xml is bad
grab790.xml is bad

Any other information?

(For example, is this a new or intermittent issue?) This gives more in-depth information on problems that others have reported.

Maybe Unicode::Escape could be used to convert \uNNNN to UTF-8? https://manpages.ubuntu.com/manpages/mantic/man3/Unicode::Escape.3pm.html

I'm not sure what is being received for the bad JSON string. I have 6 errors and 7 bad files, so it's difficult to tell which error corresponds to which file, but the errors and the bad files are likely to be in the same order (in the 2 lists above). In any case, there are not many to check to find out what is going wrong.

tv_grab_uk_freeview.zip

honir commented 2 months ago

I only see the .conf file not the others?

What are you running: --days 1 --offset 0 ?

nhathaway commented 2 months ago

tv_grab_uk_freeview.tar.gz Sorry, bad zip file. The new tarball also has all the cache files.

It was all channels, all days

honir commented 2 months ago

'code point' and 'no programmes' xml fixed. I can't do anything with the 'malformed json' unless you know the specific channel+day on which it occurred.

nhathaway commented 2 months ago

I sent a full set of cache files. Will one of these contain the offending JSON? If so, is it possible to run a batch file to read them all and see which ones fail?

nhathaway commented 2 months ago

Maybe this?

  <programme start="20240913024000 +0000" stop="20240913030000 +0000" channel="707.freeview.co.uk">
    <title lang="en">The Rise and Fall of Oasis</title>
    <desc lang="en">

honir commented 2 months ago

is it possible to run a batch file to...

Possibly, but I don't have time to do that. (I don't get paid for this :) )

Maybe this?...

I think that was a control code problem.

nhathaway commented 2 months ago

How about this:

xmltv@ubuntu:~/.xmltv/cache$ for FILE in `ls -1`; do if ! tail -n +7 $FILE | jq -e . >/dev/null 2>&1; then echo $FILE failed; fi; done
0a52542532cac77375c4ea0776f8eb85 failed
8211b15a956759e5600eb82ed82418fc failed
xmltv@ubuntu:~/.xmltv/cache$

Both those have no json in.

honir commented 2 months ago

Nice! Good idea.

Both those have no json in.

That seems to be it. Neither of the main Perl JSON packages seems to handle an empty string without croaking.
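
A guard of the kind described here can be sketched as follows. This is an illustration in Python rather than the grabber's Perl, and it is not the change that was actually committed; the function name `decode_json_or_none` and the `"programs"` key in the sample data are made up for the example. The idea is simply to treat an empty or whitespace-only body as "no data" instead of handing it to the JSON decoder.

```python
import json
from typing import Any, Optional


def decode_json_or_none(text: Optional[str]) -> Any:
    """Decode a JSON body, returning None when the body is empty.

    Hypothetical helper: an empty cache file or empty HTTP body is
    reported as None so the caller can skip it instead of aborting.
    """
    if text is None or not text.strip():
        return None
    return json.loads(text)


# Made-up sample bodies: one valid document, two empty responses.
for body in ['{"programs": []}', "", "  \n"]:
    data = decode_json_or_none(body)
    if data is None:
        print("empty response, skipping")
    else:
        print("decoded:", data)
```

Genuinely malformed (non-empty) JSON still raises an error in this sketch, so a truncated response would not be silently ignored.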

honir commented 2 months ago

I've made a change to fix the missing JSON. Please give it a try.

nhathaway commented 2 months ago

Output from the cron job, which ran in the early hours of this morning:

could not fetch https://www.freeview.co.uk/api/program?sid=10&nid=64321&pid=crid://csi.enh.digitaluk.co.uk/af452102-916c-42da-b7e8-26b2d66a093c&start=2024-09-14T01:00:00+0000&duration=PT30M, error: 502 Bad Gateway, aborting
Code point \u001D is not a valid character in XML at /usr/share/perl5/XMLTV.pm line 2197.
no programmes found
no programmes found
grab10.xml is bad
grab707.xml is bad
grab790.xml is bad

New Unicode escape sequences seem to appear at any time. It might be better to use Unicode::Escape than to keep adding new exceptions.

5xx errors seem to be a regular feature of the Freeview website. Most runs I have done have had at least one of these. The script currently seems to abort the first time it encounters one of these errors. The documentation for HTTP::Cache::Transparent has an "approve" interface which can be implemented to say "use the cached data on error". But then the cache timeout would probably want to be governed by a parameter.
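
The "use the cached data on error, up to a configurable age" idea could look roughly like the sketch below. It is written in Python with a plain in-memory dictionary and does not use HTTP::Cache::Transparent or any of its options; `fetch_with_fallback`, the `fetch` callable, and `max_stale_seconds` are all hypothetical names chosen for the illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical in-memory cache: url -> (time stored, response body)
Cache = Dict[str, Tuple[float, str]]


def fetch_with_fallback(
    url: str,
    fetch: Callable[[str], Tuple[int, str]],
    cache: Cache,
    max_stale_seconds: float,
    now: Optional[float] = None,
) -> Optional[str]:
    """Fetch url; on a 5xx error fall back to a sufficiently recent cached copy.

    `fetch` is assumed to return (status_code, body). Returns None when
    the request failed and no acceptable cached copy exists.
    """
    now = time.time() if now is None else now
    status, body = fetch(url)
    if 200 <= status < 300:
        cache[url] = (now, body)  # refresh the cache on success
        return body
    if 500 <= status < 600 and url in cache:
        stored_at, cached_body = cache[url]
        if now - stored_at <= max_stale_seconds:
            return cached_body  # server error: reuse the cached data
    return None
```

Returning None rather than aborting lets the caller skip a single programme and carry on with the rest of the run; whether that is the right behaviour for the grabber is a design decision outside this sketch.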

honir commented 1 month ago

Unicode::Escape only fixes non-ASCII characters.
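
The code points reported in this issue (\u0018, \u001C, \u001D) are ASCII control characters, which XML 1.0 does not allow at all, so converting escapes to UTF-8 would not make them valid. One general alternative to listing individual exceptions is to remove everything outside the XML 1.0 character range after the JSON has been decoded. The sketch below shows that idea in Python, not in the grabber's Perl, and is not the fix that was applied; the helper name and the sample title are made up.

```python
import json
import re

# Matches any character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Remove characters that may not appear in an XML 1.0 document."""
    return _INVALID_XML_CHARS.sub("", text)


# Made-up JSON sample: the \u0018 and \u001C escapes decode to control characters.
raw = '{"title": "Example\\u0018 title\\u001C"}'
title = json.loads(raw)["title"]
clean = strip_invalid_xml_chars(title)
print(repr(title), "->", repr(clean))
```

Tab, line feed, carriage return, and ordinary non-ASCII text pass through unchanged, so only the characters an XML writer would reject are dropped.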