XMLTV / xmltv

Utilities to obtain, generate, and post-process TV listings data in XMLTV format
GNU General Public License v2.0

tv_sort & tv_grep failure to parse output from tv_grab_uk_tvguide due to URL encoding of "£" symbol in show title #173

Closed Tricky-M closed 2 years ago

Tricky-M commented 2 years ago


XMLTV Version?

1.1.1

XMLTV Component?

tv_sort, tv_grep, tv_grab_uk_tvguide (Probably tv_grab_uk_tvguide)

What happened?

I downloaded the schedule using "tv_grab_uk_tvguide --days 14 --nodetailspage". There is a programme, "My £2 Dream Home" on Channel 4, that is downloaded with the URL shown below. The "£" is not being interpreted correctly, which causes an error in both tv_sort and tv_grep. <url>https://www.tvguide.co.uk/detail/4575868/65526501/my-<A3>2-dream-home</url>

Original URL from TV Guide = https://www.tvguide.co.uk/detail/4575868/65526501/my-£2-dream-home

The downloaded data for this show in the cache file contains the correct "£" representation.

What did you expect to happen?

Handle the string correctly (vi does, less doesn't). I suspect the handling or encoding of the URL is converting "£" to "A3" in angle brackets, which is causing a problem when parsing the XML generated by tv_grab_uk_tvguide.
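A minimal sketch of the suspected failure, in Python rather than Perl. It rests on two assumptions: that the "A3" in angle brackets is how less displays a raw 0xA3 byte (the Latin-1 encoding of "£"), and that the grabber output is declared as UTF-8. Python's `xml.parsers.expat` wraps the same expat library that Perl's XML::Parser uses, so it should reject the byte in the same way. The helper name `parse_error` is made up for illustration.

```python
import xml.parsers.expat


def parse_error(xml_bytes):
    """Return the expat error message for xml_bytes, or None if it parses."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)
    except xml.parsers.expat.ExpatError as exc:
        return str(exc)
    return None


# Assumption: the grabber output declares UTF-8.
header = b'<?xml version="1.0" encoding="UTF-8"?>\n'
prefix = "https://www.tvguide.co.uk/detail/4575868/65526501/my-"

# Raw Latin-1 byte 0xA3 inside a document declared as UTF-8.
bad = header + b"<url>" + prefix.encode("ascii") + b"\xa32-dream-home</url>"

# The same character (U+00A3, the pound sign) encoded as UTF-8.
good = header + ("<url>" + prefix + "\u00a32-dream-home</url>").encode("utf-8")

print(parse_error(bad))   # an expat "invalid token" error
print(parse_error(good))  # None
```

If the first call reports an invalid token and the second returns None, the angle brackets are not the problem; the single non-UTF-8 byte is.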

Did you see any warnings/errors?

(Please paste any warnings/errors, if available) Error from tv_sort / tv_grep:

  not well-formed (invalid token) at line 110194, column 62, byte 6363821 at /usr/lib/arm-linux-gnueabihf/perl5/5.32/XML/Parser.pm line 187.
  at /usr/local/share/perl/5.32.1/XMLTV.pm line 605.
  at /usr/local/share/perl/5.32.1/XMLTV.pm line 605.
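The error gives only a line, column and byte offset. To list every offending byte in a large listings file in one pass, something like the following sketch could help. It assumes the file is meant to be UTF-8; `find_invalid_utf8` is a hypothetical helper, and its column is a 0-based byte offset within the line, which may not match expat's numbering exactly.

```python
def find_invalid_utf8(data):
    """Return (line, column, byte_offset, raw_bytes) for each span that is not valid UTF-8."""
    found = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            start = pos + exc.start
            end = pos + exc.end
            line = data.count(b"\n", 0, start) + 1
            column = start - (data.rfind(b"\n", 0, start) + 1)
            found.append((line, column, start, data[start:end]))
            pos = end  # resume scanning after the invalid span
    return found


# Small in-memory sample standing in for the grabber output.
sample = b"line1\n<url>my-\xa32-dream</url>\n<url>\xa1three</url>\n"
for line, column, offset, raw in find_invalid_utf8(sample):
    print(f"line {line}, column {column}, byte {offset}: 0x{raw.hex()}")
```

For a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the function.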

What steps are needed to reproduce this issue?

(Please provide the full commands you are running) Full command running on a Raspberry Pi:

  1. /usr/local/bin/tv_grab_uk_tvguide --days 14 --nodetailspage | /usr/local/bin/tv_sort | /usr/local/bin/tv_grep --on-after now > guide_file.xml

Any other information?

(For example, is this a new or intermittent issue?) This has been occurring for a couple of weeks, and I've now found that it is a result of the URL encoding of the "£" symbol. I run this as part of a script which, once the download has finished, runs tv_check to find shows I'm interested in and then emails me the output.

I manually updated from the package-management version 1.0.0 to 1.1.1, as the original had the GMT->BST date issue (I might have manually patched the Perl in tv_grab_uk_tvguide).

I think this is the section extracting the URL: https://github.com/XMLTV/xmltv/blob/master/grab/uk_tvguide/tv_grab_uk_tvguide#L683-L688

But the problem might be within the XMLTV.pm writer function?
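One possible way to keep such URLs safe regardless of the output encoding is to percent-encode non-ASCII characters as UTF-8 before writing them. This is only an illustration in Python, not the grabber's Perl code and not necessarily what a fix in XMLTV would do; `ascii_safe_url` is a made-up name, and whether tvguide.co.uk accepts the percent-encoded form has not been checked here.

```python
from urllib.parse import quote


def ascii_safe_url(url):
    """Percent-encode non-ASCII characters as UTF-8, leaving URL delimiters alone."""
    # '%' is kept in the safe set so URLs that are already encoded are not double-encoded.
    return quote(url, safe=":/?#[]@!$&'()*+,;=%")


print(ascii_safe_url("https://www.tvguide.co.uk/detail/4575868/65526501/my-\u00a32-dream-home"))
print(ascii_safe_url("https://www.tvguide.co.uk/detail/295480/65536317/\u00a1three-amigos"))
```

The result is pure ASCII, so it is byte-identical in Latin-1 and UTF-8 and cannot produce an invalid sequence in either.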

What other software are you using?

Operating System: Raspberry Pi OS (Bullseye)

Perl Version: v5.32.1

Tricky-M commented 2 years ago

Found another example with a show called "¡Three Amigos! (1986)", which presents the "raw" URL as (viewed in less): <url>https://www.tvguide.co.uk/detail/295480/65536317/<A1>three-amigos</url>
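Both examples fit the same pattern: A3 and A1 are the single-byte Latin-1 (ISO-8859-1) encodings of "£" and "¡", whereas UTF-8 needs two bytes for each. A quick check in Python (the helper `encodings` is just for illustration):

```python
def encodings(ch):
    """Return the Latin-1 and UTF-8 byte sequences of ch as hex strings."""
    return ch.encode("latin-1").hex(), ch.encode("utf-8").hex()


for ch in "\u00a3\u00a1":  # pound sign, inverted exclamation mark
    latin1, utf8 = encodings(ch)
    print(f"U+{ord(ch):04X}: latin-1={latin1} utf-8={utf8}")
# U+00A3: latin-1=a3 utf-8=c2a3
# U+00A1: latin-1=a1 utf-8=c2a1
```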

honir commented 2 years ago

Thanks for the report. And for the hint :-)

Fix committed to github.