UTF-8 Encoding bug in tags

flexersan commented 3 years ago

Here is a common Genre in my library: 180°

Note the degree symbol. This symbol causes the code to barf.

For now I've modified this line of code to fix this: https://github.com/ZeroQI/Lambda.bundle/blob/81ef5a4c8490f67e30727adc810c7452b8f66c53/Contents/Code/__init__.py#L400

Locally I simply strip anything non-ASCII:

  for tag in xml.iterdescendants('Genre'):
    SaveFile(tag.get('tag').encode('ascii', 'ignore'), path, 'movies_nfo', nfo_xml=nfo_xml, xml_field='genre', metadata_field=metadata.genres, multi=True, tag_multi='genre')
    genres.append(tag.get('tag').encode('ascii', 'ignore'))

A better fix would be to coax the symbol into the proper format, since stripping non-ASCII characters is not localization friendly. I don't know enough about character encodings (nor did I dig into it) to figure out a better fix. Hopefully someone can chime in. All the tags could likely use some character encoding validation.

ZeroQI commented 3 years ago

Thanks for the very well documented post. I am very bad with encodings, and I would like to avoid breaking languages with multi-byte encodings (Japanese, Chinese or Korean). If you manually add a tag in one of these languages to a series, request a metadata update for it, and it is saved properly, I can include the fix in the master code.

ZeroQI commented 3 years ago

To try: genres.append(u"{}".format(tag.get('tag'))) or genres.append(tag.get('tag').encode("utf-8"))
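
To make the difference between the current workaround and the UTF-8 suggestion concrete, here is a minimal sketch. It is not taken from the bundle: the variable names are made up for illustration, and it is written so that it runs on Python 3 (the plugin code itself may behave differently if `tag.get('tag')` returns a byte string rather than a unicode string).

```python
# -*- coding: utf-8 -*-
# Illustrative values only; these names do not exist in the bundle.
genre = u"180\u00b0"              # "180°", the genre from the original report
anime = u"\u30a2\u30cb\u30e1"     # a Japanese tag, to cover the multi-byte concern

# Current workaround: every non-ASCII character is silently dropped.
stripped_genre = genre.encode("ascii", "ignore")
stripped_anime = anime.encode("ascii", "ignore")

# Suggested alternative: UTF-8 can represent every character, so nothing is lost.
utf8_genre = genre.encode("utf-8")
utf8_anime = anime.encode("utf-8")

print(repr(stripped_genre), repr(stripped_anime))
print(utf8_genre.decode("utf-8") == genre, utf8_anime.decode("utf-8") == anime)
```

With the ASCII workaround the degree sign disappears and the Japanese tag becomes empty, while the UTF-8 bytes decode back to the original text in both cases.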

Marco4223 commented 2 years ago

Hi Benjamin, maybe I have the same problem with the NFO files. In the Plex UI I see German words like "Jäger", but in the NFO files I get this: "J├ñger" in the element. Could you give me a hint where I can start looking for a fix for this problem?

I also found this in the logs:

2021-11-04 13:36:42,389 (7fa705647b38) : INFO (init:212) - local_value - Exception: "All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters", {'actor': {'role': {'text': 'Bermann'}, 'name': {'text': u'Peter R\xfchring'}, 'thumb': {'text': ''}}}, Peter Rühring, None

Cheers Marco

ZeroQI commented 2 years ago

All strings with accented characters should be unicode (u'string'); see line 212.
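
As a rough illustration of "should be unicode", the sketch below decodes byte strings before they are handed to an XML library. `to_text` is a hypothetical helper, not a function from the bundle, and it assumes the incoming bytes are UTF-8, which the thread does not confirm.

```python
def to_text(value, encoding="utf-8"):
    """Return value as unicode text; byte strings are decoded with `encoding`.

    Hypothetical helper for illustration. UTF-8 is an assumption about the input.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


name_bytes = b"Peter R\xc3\xbchring"   # UTF-8 bytes for the actor name in the log
name_text = to_text(name_bytes)        # now a unicode string
role_text = to_text(u"Bermann")        # unicode input passes through unchanged

print(repr(name_text), repr(role_text))
```

Decoding once at the boundary keeps the rest of the code working with unicode strings only, which is what the exception message asks for.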

Marco4223 commented 2 years ago

I'm sorry, but maybe I'm blind. Line 212 is an "else".

ZeroQI commented 2 years ago

init:212 points to line 212, but it is in fact line 179.

Marco4223 commented 2 years ago

Sorry, I think we are talking about different things here. Line 179 is this, right?

except Exception as e: Log.Info('local_value - Exception: "{}", {}, {}, {}'.format(e, xml_field, thumb, tag)); return

ZeroQI commented 2 years ago

Indeed, lines 164-173 crashed and line 179 output the exception information:

INFO (init:212) - local_value - Exception: "All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters", {'actor': {'role': {'text': 'Bermann'}, 'name': {'text': u'Peter R\xfchring'}, 'thumb': {'text': ''}}}, Peter Rühring, None

except Exception as e: Log.Info('local_value - Exception: "{}", {}, {}, {}'.format(e, xml_field, thumb, tag)); return

One would need to add log output and run it again until it crashes to locate which line of code has the issue.
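
One way to find the failing line without adding output statement by statement is to log the full traceback from the `except` block. The sketch below uses only the standard library; `describe_failure` and `build_actor` are hypothetical stand-ins, not functions from the bundle, and in the plugin the resulting string would presumably be passed to `Log.Info` instead of `print`.

```python
import traceback


def describe_failure(func, *args):
    """Call func(*args); return None on success or the formatted traceback on failure."""
    try:
        func(*args)
    except Exception:
        # format_exc() includes the file, line number and function of the failing call
        return traceback.format_exc()
    return None


def build_actor(name):
    # Hypothetical stand-in for the code that crashed: it rejects byte strings
    if isinstance(name, bytes):
        raise ValueError("All strings must be XML compatible")
    return {"name": {"text": name}}


report = describe_failure(build_actor, b"Peter R\xc3\xbchring")
print(report)
```

The printed traceback names the function and line that raised the exception, so there is no need to guess which of several statements failed.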

Marco4223 commented 2 years ago

Ok :) But where is the line of code where you write the NFO file? :)

ZeroQI commented 2 years ago

Lines 691-697

  ### Save NFOs if different from local copy or file didn't exist #############################################################################################
  Log.Info('NFO files')
  for nfo in sorted(NFOs, key=natural_sort_key):
    nfo_string_xml     = XML.StringFromElement(NFOs[nfo]['xml'  ], encoding='utf-8')
    if nfo_string_xml == XML.StringFromElement(NFOs[nfo]['local'], encoding='utf-8'):  Log.Info('[=] {:<12} path: "{}"'.format(nfo, NFOs[nfo]['path']))
    elif NFOs[nfo]['path'].endswith('Ignored'):                                        Log.Info('[ ] {:<12} path: "{}"'.format(nfo, NFOs[nfo]['path']))
    else:                    Core.storage.save(NFOs[nfo]['path' ], nfo_string_xml);    Log.Info('[X] {:<12} path: "{}"'.format(nfo, NFOs[nfo]['path']))  #NFOs[nfo]['xml'].write(NFOs[nfo]['path' ])
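
For readers without a Plex environment, the same "save only if different" idea can be sketched with the standard library. This is a simplified analogue, not the bundle's code: `xml.etree.ElementTree` stands in for Plex's `XML` helper, plain file I/O stands in for `Core.storage.save`, `save_if_changed` is a hypothetical name, and the comparison is made against the bytes on disk rather than against a parsed local copy.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def save_if_changed(path, element):
    """Serialize element to UTF-8 bytes and write them only if the file differs."""
    new_bytes = ET.tostring(element, encoding="utf-8")
    if os.path.exists(path):
        with open(path, "rb") as handle:
            if handle.read() == new_bytes:
                return False  # same content already on disk, like the '[=]' case
    with open(path, "wb") as handle:  # binary mode: the bytes are written untouched
        handle.write(new_bytes)
    return True  # file created or updated, like the '[X]' case


movie = ET.Element("movie")
ET.SubElement(movie, "genre").text = u"J\u00e4ger"

nfo_path = os.path.join(tempfile.mkdtemp(), "movie.nfo")
first_write = save_if_changed(nfo_path, movie)
second_write = save_if_changed(nfo_path, movie)

with open(nfo_path, "rb") as handle:
    saved_bytes = handle.read()
print(first_write, second_write, repr(saved_bytes))
```

Because the element is serialized to bytes before anything touches the disk, the encoding of the file is fixed by the `encoding="utf-8"` argument and not by how the file is opened.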

Marco4223 commented 2 years ago

If I'm not mistaken, "Core.storage.save" is a Plex function? (I can't find any Python documentation for it. Is there any?) In nfo_string_xml all of my characters (äüö...) are correct. When I looked into WebTools (v3.0.0), I found something odd:

LANG   | en_US.UTF-8
LC_ALL | en_US.UTF-8
Locale | ('en_US', 'UTF-8')

But I have set everything to German (de-DE). Could this be causing the issue?

ZeroQI commented 2 years ago

Yes, it is a Plex function. The locale could be a reason; that needs to be tested.
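
A quick way to test whether the locale matters is to check what Python reports and to write a file with an explicit encoding. How `Core.storage.save` handles encodings internally is not known from this thread, so the sketch below only illustrates the general principle with standard-library calls; `write_text_utf8` is a hypothetical helper.

```python
import io
import locale
import os
import tempfile

# What Python considers the default text encoding for this process
print(locale.getpreferredencoding(False))


def write_text_utf8(path, text):
    """Write unicode text as UTF-8, regardless of the process locale."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


demo_path = os.path.join(tempfile.mkdtemp(), "demo.nfo")
write_text_utf8(demo_path, u"J\u00e4ger")

with open(demo_path, "rb") as handle:
    raw = handle.read()
print(repr(raw))
```

When the encoding is passed explicitly, the bytes on disk are the same under an `en_US` and a `de_DE` locale. Note also that both locales shown above use UTF-8, so the language part alone would not change how "ä" is encoded.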

Marco4223 commented 2 years ago

Hi, I now had a bit more time and double-checked everything. The problem is not your encoding or anything else. All files are encoded correctly:

# dfeal /volume1/video/test_Hellboy\ (2004).nfo
{ "encoding": "UTF-8", "language": "german", "confidence": { "encoding": 1, "language": 0 } }

The problem is that Notepad++ misinterprets the encoding. Here is the bug: https://github.com/notepad-plus-plus/notepad-plus-plus/issues/9153

Sorry for the inconvenience.
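
The garbled "J├ñger" reported earlier is what correct UTF-8 bytes look like when a viewer decodes them with a legacy code page. The sketch below reproduces it with code page 437; that choice is an assumption made because it yields the same characters, and the thread does not say which code page the editor actually picked. `is_valid_utf8` is a hypothetical helper.

```python
text = u"J\u00e4ger"                     # "Jäger"
utf8_bytes = text.encode("utf-8")        # what a correctly encoded NFO file contains

# A viewer that guesses the wrong encoding shows garbage for the same bytes
misread = utf8_bytes.decode("cp437")

# Decoding with the right encoding gives the original text back
reread = utf8_bytes.decode("utf-8")


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(misread, reread, is_valid_utf8(utf8_bytes))
```

The file on disk is identical in both cases; only the interpretation differs, which is why checking the raw bytes (or using a detection tool, as done above) is more reliable than trusting what an editor displays.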

ZeroQI / Lambda.bundle

UTF-8 Encoding bug in tags #25