XMLTV / xmltv

Utilities to obtain, generate, and post-process TV listings data in XMLTV format
GNU General Public License v2.0

tv_grab_uk_tvguide xml fails to validate due to channel names giving error Input is not proper UTF-8, indicate encoding #194

Closed FizzyTea closed 1 year ago

FizzyTea commented 1 year ago

XMLTV Version?

XMLTV module version 1.1.2

XMLTV Component?

tv_grab_uk_tvguide version 1.1.2

Perl Version

Perl v5.28.1

Operating System

Raspbian 10 Buster

What happened?

Grabber appears to run successfully but the xml file does not validate. Irish channel names such as RTÉ One and RTÉ2 are not correctly displayed.

On running the grabber with --configure I notice a problem with such channel names. See attached screenshot. Upon inspecting the xml file from a seemingly successful grab I notice a similar problem though the channel names are displayed differently to before. See attached screenshot.

What did you expect to happen?

I expect the channel names to be correctly displayed in the console and the xml.

Did you see any warnings/errors?

I get the following error running tv_validate_file listings.xml

The file is not well-formed xml:
listings.xml:48: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC9 0x20 0x4F 0x6E
    <display-name lang="en">RT� One</display-name>
                              ^

The file did not validate as well-formed XML, so no further
processing was performed.
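
The bytes reported by the parser can be checked by hand. The following is a minimal Perl sketch for illustration only, not part of the grabber; it assumes the stray byte originated as ISO-8859-1 (Latin-1) text, which is consistent with 0xC9 being É in that encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The four bytes quoted in the parser error: 0xC9 0x20 0x4F 0x6E
my $bytes = "\xC9\x20\x4F\x6E";

# In UTF-8, 0xC9 is a lead byte that must be followed by a continuation
# byte (0x80-0xBF). 0x20 is not one, so strict decoding fails.
my $copy = $bytes;    # decode() with FB_CROAK may modify its input
my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
print "valid UTF-8: ", ($ok ? "yes" : "no"), "\n";

# Read as Latin-1 (an assumption about the origin), the first byte is U+00C9.
my $text = decode('ISO-8859-1', $bytes);
printf "first character as Latin-1: U+%04X\n", ord($text);

# The UTF-8 encoding of U+00C9 is two bytes.
my $utf8 = encode('UTF-8', "\x{C9}");
print "UTF-8 bytes for U+00C9: ",
    join(' ', map { sprintf '0x%02X', ord } split //, $utf8), "\n";
```

If the assumption holds, the file contains a single Latin-1 byte where the parser expects the two-byte UTF-8 sequence 0xC3 0x89.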

What steps are needed to reproduce this issue?

  1. tv_grab_uk_tvguide --configure
  2. tv_grab_uk_tvguide --days 2 --nodetailspage --output listings.xml
  3. tv_validate_file listings.xml

Any other information?

I wonder if this is related to the encoding system on my OS or in my console? I ssh into my Raspberry Pi from Ubuntu 20.04 (XFCE). Initially my rpi locale was set to en_GB.UTF-8. After encountering issues I changed my locale to en_IE.UTF-8 and went through the configuration process again (after removing the cached files from previous runs) but the problems remain.

[Attached screenshots: Screenshot_tvguide_config, Screenshot_tvguide_validate]

FizzyTea commented 1 year ago

Further investigation reveals some curious results

Running the command tv_grab_uk_tvguide --list-channels > channels.xml produces an XML file in which the problematic characters are displayed correctly, e.g.

  <channel id="1305.tvguide.co.uk">
    <display-name lang="en">RTÉ One</display-name>
    <icon src="https://cdn.tvguide.co.uk/channel_logos/60x35/1305.png" />
    <url>https://www.tvguide.co.uk/channellistings.asp?ch=1305</url>
  </channel>
  <channel id="1306.tvguide.co.uk">
    <display-name lang="en">RTÉ One +1</display-name>
    <icon src="https://cdn.tvguide.co.uk/channel_logos/60x35/1306.png" />
    <url>https://www.tvguide.co.uk/channellistings.asp?ch=1306</url>
  </channel>
  <channel id="719.tvguide.co.uk">
    <display-name lang="en">RTÉ One +1</display-name>
    <icon src="https://cdn.tvguide.co.uk/channel_logos/60x35/719.png" />
    <url>https://www.tvguide.co.uk/channellistings.asp?ch=719</url>
  </channel>
  <channel id="1307.tvguide.co.uk">
    <display-name lang="en">RTÉ One HD</display-name>
    <icon src="https://cdn.tvguide.co.uk/channel_logos/60x35/1307.png" />
    <url>https://www.tvguide.co.uk/channellistings.asp?ch=1307</url>
  </channel>
  <channel id="900.tvguide.co.uk">
    <display-name lang="en">RTÉ One HD</display-name>
    <icon src="https://cdn.tvguide.co.uk/channel_logos/60x35/900.png" />
    <url>https://www.tvguide.co.uk/channellistings.asp?ch=900</url>
  </channel>

However, after removing the cache as a precaution and moving the previously generated config files aside, running tv_grab_uk_tvguide --configure produces a configuration file with malformed channel names, e.g.

channel!1519   # RT� 2 +1
channel!716   # RT� Jr
channel!342   # RT� One
channel=1305   # RT� One
channel!1306   # RT� One +1
channel!719   # RT� One +1
channel=1355   # RT� One HD
channel=900   # RT� One HD
channel=1307   # RT� One HD
channel!1236   # RT� Radio 1 FM
channel!1237   # RT� Raidi� na Gaeltachta
channel!1235   # RT� lyric fm
channel=363   # RT�2
channel=718   # RT�2 HD

And of course upon running the grabber the results are similarly problematic e.g.

  <channel id="1305.tvguide.co.uk">
    <display-name lang="en">RT� One</display-name>
  </channel>
  <channel id="1355.tvguide.co.uk">
    <display-name lang="en">RT� One HD</display-name>
  </channel>
  <channel id="342.tvguide.co.uk">
    <display-name lang="en">RT� One</display-name>
  </channel>
  <channel id="363.tvguide.co.uk">
    <display-name lang="en">RT�2</display-name>
  </channel>
  <channel id="718.tvguide.co.uk">
    <display-name lang="en">RT�2 HD</display-name>
  </channel>
  <channel id="900.tvguide.co.uk">
    <display-name lang="en">RT� One HD</display-name>
  </channel>
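
To find every affected line in a generated file rather than only the first one the validator reports, a small check along these lines could be used. This is a hedged sketch: the helper name invalid_utf8_lines and the script name in the usage comment are hypothetical and are not part of XMLTV.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return the 1-based numbers of the lines that fail strict UTF-8 decoding.
sub invalid_utf8_lines {
    my ($fh) = @_;
    my @bad;
    while (my $line = <$fh>) {
        my $n = $.;    # current line number of the handle just read
        push @bad, $n
            unless eval { decode('UTF-8', $line, Encode::FB_CROAK); 1 };
    }
    return @bad;
}

# Usage (hypothetical script name): perl check_utf8.pl listings.xml
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print "line $_ is not valid UTF-8\n" for invalid_utf8_lines($fh);
    close $fh;
}
```

The file is opened with the :raw layer so that the bytes are examined exactly as written, without any decoding by Perl itself.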

I suspect the problem may well be at my end but I am at a loss to solve this issue so any help is much appreciated.

honir commented 1 year ago

Thanks for the detailed report.

Before I commit a change to git do you want to check it out for me, please?

Find your tv_grab_uk_tvguide on your RPi and change line 711 from

$channels->{$channel_id} = { 'id'=> $xmlchannel_id , 'display-name' => [[$channelname, 'en']] };

to

$channels->{$channel_id} = { 'id'=> $xmlchannel_id , 'display-name' => [[ encode('utf-8', $channelname), 'en' ]] };

.
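
As a rough illustration of what the change does, the standalone sketch below compares the bytes written with and without encode(). It is not the grabber's code: the variable here is a hypothetical stand-in for the grabber's $channelname, and it assumes that value is a decoded character string written to a handle with no encoding layer. The encode() function comes from the core Encode module; whether the grabber already loads it is not shown in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for the grabber's $channelname: a character
# string containing U+00C9 (É).
my $channelname = "RT\x{C9} One";

# Written to a raw handle (no :encoding layer), Perl emits each character
# below U+0100 as one byte, which is Latin-1 rather than UTF-8.
my $raw = '';
{
    open my $fh, '>:raw', \$raw or die $!;
    print {$fh} $channelname;
    close $fh;
}

# With encode(), the string is converted to UTF-8 bytes before writing.
my $fixed = '';
{
    open my $fh, '>:raw', \$fixed or die $!;
    print {$fh} encode('utf-8', $channelname);
    close $fh;
}

sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }
print "without encode: ", hex_bytes($raw),   "\n";
print "with encode:    ", hex_bytes($fixed), "\n";
```

Under these assumptions the first line shows the lone C9 byte from the validator error, and the second shows the C3 89 pair that a UTF-8 parser accepts.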

p.s. the en_GB.UTF-8 locale should be fine

p.p.s. the channel name may or may not display incorrectly in the config file: it depends on your system. I guess I should fix that. I never tested this with Irish channels, since this is (notionally) a "UK" grabber ;-)

FizzyTea commented 1 year ago

That change has fixed the described issues. Thanks very much.

p.s. the issue also affects at least one BBC Wales channel.

p.p.s. I have some further validation issues. I do not think they are connected to this issue, though, so I should probably open a separate thread if I can't resolve them.