LMS-Community / slimserver

Server for Squeezebox and compatible players. This server is also called Lyrion Music Server.
https://lyrion.org

SBS/Lyrion incorrectly encodes title when scanning unusual Unicode characters #1225

Open simonltwick opened 3 days ago

simonltwick commented 3 days ago

The Coldplay album Music of the Spheres has tracks whose titles are unusual Unicode characters such as ❤️. The track metadata, including titles, has been downloaded from MusicBrainz and seems to be encoded correctly.

However, after rescanning, the tracks database on Squeezebox Server/Lyrion holds an incorrectly encoded title which cannot be decoded to a Python string.

The url/filename, which is also set to the track title from MusicBrainz, is encoded correctly. Track 9, entitled 🌎, has this problem, while others such as ❤️ above (track 6) work fine.

Data printed from the tracks database: using select id, title, url - with an sqlite3 text_factory that returns repr(exception) if a UnicodeError is encountered in decoding the bytes to string:

track: (9, "UnicodeDecodeError('utf-8', b'\xed\xa0\xbc\xed\xbc\x8e', 0, 1, 'invalid continuation byte')", 'file:///srv/share2/music/Coldplay/Music%20of%20the%20Spheres/09%20-%20%F0%9F%8C%8E.mp3')
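A minimal sketch of the kind of text_factory described above, assuming an in-memory table with only the three selected columns (the real library.db schema is much larger, and the url value here is a hypothetical placeholder). The function name `tolerant_text` is also made up for illustration.

```python
import sqlite3


def tolerant_text(raw: bytes) -> str:
    """Decode a TEXT value as UTF-8, or return repr() of the decoding error."""
    try:
        return raw.decode("utf-8")
    except UnicodeError as exc:
        return repr(exc)


# In-memory stand-in for library.db (assumption: three columns only).
con = sqlite3.connect(":memory:")
con.text_factory = tolerant_text
con.execute("CREATE TABLE tracks (id INTEGER, title TEXT, url TEXT)")

# CAST(... AS TEXT) stores the raw bytes as text without validating them,
# which reproduces a title that is not valid UTF-8.
con.execute(
    "INSERT INTO tracks VALUES (9, CAST(? AS TEXT), ?)",
    (b"\xed\xa0\xbc\xed\xbc\x8e", "file:///music/09.mp3"),  # placeholder url
)

rows = con.execute("SELECT id, title, url FROM tracks").fetchall()
for row in rows:
    print("track:", row)
```

With the default text_factory, sqlite3 would raise an error on the title column instead of returning a row, which is why the fallback to repr() is useful for inspection.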

From this I assume the correct byte encoding is b'\xf0\x9f\x8c\x8e', and in Python this correctly decodes using UTF-8 to the 🌎 character.
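The six stored bytes are consistent with the two UTF-16 surrogates of U+1F30E (0xD83C and 0xDF0E) each having been encoded as a separate three-byte UTF-8 sequence, a form sometimes called CESU-8. The sketch below only shows that the stored value can be mapped back to the expected character under that assumption; it says nothing about where in the scanner the bytes were produced. The helper name `repair_surrogate_utf8` is hypothetical.

```python
def repair_surrogate_utf8(raw: bytes) -> str:
    """Turn UTF-8-encoded surrogate halves back into real code points."""
    # "surrogatepass" lets the UTF-8 codec accept encoded surrogates.
    halves = raw.decode("utf-8", errors="surrogatepass")
    # A round trip through UTF-16 joins each surrogate pair into one code point.
    return halves.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")


stored = b"\xed\xa0\xbc\xed\xbc\x8e"   # title bytes found in the tracks table
title = repair_surrogate_utf8(stored)
# Print the code point in hex rather than the emoji itself, to stay
# independent of the terminal encoding.
print(hex(ord(title)), title.encode("utf-8"))
```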

Please let me know if I can provide further information. Thanks in advance for your help with this.

simonltwick commented 3 days ago

PS. SBS release: Logitech Media Server Version: 8.5.2 - 1716215514 @ Sun 26 May 2024 05:43:14 PM CEST. Running on Ubuntu 24.10 x86_64.

michaelherger commented 3 days ago

That album is looking good here:

[Screenshot, 2024-11-23 at 01:29:50]

This is imported from Spotify, but it shows that the problem must be elsewhere, not in the database. Please give LMS9 (https://downloads.lyrion.org) another try. It should be pretty stable: I plan to release it in the next few days.

simonltwick commented 3 days ago

Thanks Michael. This is what I get on 8.5.2, but I look forward to trying LMS9 and will update you after that. [image]

simonltwick commented 3 days ago

I have now installed the latest Lyrion nightly build (easy to install and looking good!): Lyrion Music Server Version: 9.0.0 - 1732300171. It initiated a full rescan but the results are still the same: [image]

Tracks 1 and 9 are still not showing with their correct (pictorial) titles.

So I am now looking elsewhere for the source of the problem, since your tracks from Spotify are scanned OK. The tracks I am using were ripped from the CD and then tagged using the MusicBrainz Picard scanner, a widely-used and usually reliable piece of software. I've also scanned the same tracks into another sqlite database using my own scanner written in Python and the mutagen tagging library, and the titles were scanned correctly into that sqlite DB.

I wonder if the LMS scanner uses a different encoding to put the tags into the database? I have used utf-8. But I tried to decode what was in the LMS database with several possible encodings (utf-16, utf-32, latin-1) and none of them worked.

Not quite sure how to proceed, as the tags in the files seem to be correct as far as I can tell.

I realise you are probably pretty busy now and the priority is getting LMS9 released and dealing with questions arising to make it a successful launch.

I think it's not legal to send you the actual track, but I have applied the same tags to another non-copyrighted file from the Free Music Archive and attached it here. If you get a moment please could you scan it and see if it works on your test system? You can search for it by looking for path contains "Jahzzar". This one still has problems showing the title on LMS.

As another possibility, we could compare the contents of the title tag for the Coldplay track, your version compared to mine? I could write a short Python program to do that maybe. [attachment: test_track.zip]

michaelherger commented 3 days ago

I'm pretty sure this is not a database issue. Because if it was, all of those characters would be broken. What browser are you using? What operating system is the browser running on? And what are the LMS details in Settings/Information? I wonder whether the client platform is a bit dated and fails to render some Unicode characters.

simonltwick commented 3 days ago

Hi Michael, Thanks again for your help with this.

I'm using the latest version of Ubuntu Linux and the Chrome browser, but the problem happens when I access the tracks table with Python/sqlite3, so it is definitely not the browser. (Latest version of Python, 3.12, by the way.)

The correct character is extracted from the track by python/mutagen and inserted into another sqlite table by my own code, but it appears that it is scanned / inserted into the LMS table as a different byte sequence, (b'\xed\xa0\xbc\xed\xbc\x8e' as above) - this cannot be decoded back to Unicode. So I wonder if it's something to do with the way Perl does encoding or how the Perl bindings for sqlite produce bytestrings, but I don't know enough about Perl to test it out.

Here is the settings/information page, I hope this has all the info you need. [image]

Would you be able to try scanning the test track I attached above, to see what result you get on your system? I think that would show whether there is something different / wrong on my system.

michaelherger commented 2 days ago

Could you please provide me with a copy of your library.db and two sample music files with the given tags? Feel free to send me a file with all audio removed, or just silence, but with the tags which you'd use.

https://www.dropbox.com/request/T3RctyzGgNg0oFDubq6a

Your system seems to be using pretty much the latest of everything. So that should definitely be fine.

simonltwick commented 1 day ago

Thank you for looking at this. I've uploaded my library.db file and also two tracks from Free Music Archive which have been tagged using MusicBrainz Picard as tracks 1 and 9 from Coldplay's Music of the Spheres.

When I search for them (using Path Contains "Jahzzar"), the title is shown as Unicode error characters. [image]

I hope this gives you some useful info to suggest where the problem lies. Simon

michaelherger commented 1 day ago

Thanks for the files. This is just to confirm that I'm seeing the same with them as you do.

michaelherger commented 1 day ago

Something's odd about those files: I opened them in Meta (a tagger on Mac) - it would render correctly.

Then I spit out all metadata using exiftool:

Title                           : ??????

I duplicated the emoji and saved the file again in Meta, then ran exiftool again:

Title                           : 🪐🪐

And now it's rendering correctly in LMS as well.

I would think that the metadata is not saved correctly, and some applications are more forgiving than others. And LMS would render perfectly well if the file was saved correctly. Would you have an alternative tagger to try to save the metadata once again?

simonltwick commented 1 day ago

I don't have another tagging tool available, I'm afraid, but I have discovered that the TIT2 tag is encoded using UTF-16. Not sure if that is what the scanner expects? I used the python mutagen library to extract the tags from one of the test tracks I sent you.

$ python3
Python 3.12.7 (main, Oct  3 2024, 15:15:22) [GCC 14.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from mutagen import File
>>> f = File("Jahzzar - Railroad's Whiskey Co.mp3")
>>> f['TIT2']
TIT2(encoding=<Encoding.UTF16: 1>, text=['🌎'])
>>> str(f['TIT2'])
'🌎'
>>> 

I know that according to the ID3 standard, different encodings can be supported, but I'm not sure how common UTF-16 is? (The track artist tag, TPE1, is encoded using UTF-8.) Does the scanner take note of the encoding in the tag or does it always assume UTF-8? However, some of the other Coldplay tracks, for example '06 - ❤️.mp3', also have the TIT2 tag encoded using UTF-16, but these seem to make it into the library.db OK.

>>> f=File('06 - ❤️.mp3')
>>> f["TIT2"]
TIT2(encoding=<Encoding.UTF16: 1>, text=['❤️'])
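For reference, an ID3v2 text frame starts with one byte that names the encoding of the text that follows: 0 is ISO-8859-1, 1 is UTF-16 with a byte order mark, and ID3v2.4 adds 2 for UTF-16BE without a BOM and 3 for UTF-8. The value 1 matches the Encoding.UTF16 shown by mutagen above. Below is a minimal sketch of a reader that honours that byte; the function and table names are hypothetical, it handles a single string only, and it is not a description of what the LMS scanner does.

```python
# Encoding byte of an ID3v2 text frame mapped to a Python codec name.
ID3_TEXT_CODECS = {
    0: "latin-1",    # ISO-8859-1
    1: "utf-16",     # UTF-16 with BOM
    2: "utf-16-be",  # UTF-16BE without BOM (ID3v2.4 only)
    3: "utf-8",      # UTF-8 (ID3v2.4 only)
}


def decode_text_frame(body: bytes) -> str:
    """Decode a text frame body: one encoding byte followed by the text."""
    codec = ID3_TEXT_CODECS[body[0]]
    # Drop any trailing NUL terminator after decoding.
    return body[1:].decode(codec).rstrip("\x00")


# Frame bodies built by hand for illustration, not read from a real file.
earth_utf16 = b"\x01" + "\U0001F30E".encode("utf-16")
earth_utf8 = b"\x03" + "\U0001F30E".encode("utf-8")

print(decode_text_frame(earth_utf16) == decode_text_frame(earth_utf8))
```

A reader that follows the encoding byte gets the same string from either frame, so a UTF-16 title is not a problem in itself.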

The bytes encoding of the earth symbol is different in UTF8 from UTF16:

>>> '🌎'.encode("utf16")
b'\xff\xfe<\xd8\x0e\xdf'
>>> '🌎'.encode("utf8")
b'\xf0\x9f\x8c\x8e'
>>> 
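One difference between the two titles is that ❤️ (U+2764 U+FE0F) lies entirely in the Basic Multilingual Plane, while 🌎 (U+1F30E) needs a surrogate pair in UTF-16. The sketch below simulates, purely as a hypothesis, a converter that turns each 16-bit UTF-16 code unit into UTF-8 on its own. Under that assumption the heart survives and the globe yields exactly the bytes seen in the tracks table. The function name `per_unit_utf8` is made up, and nothing here is taken from the LMS or Perl source.

```python
def per_unit_utf8(text: str) -> bytes:
    """Encode each UTF-16 code unit separately as UTF-8 (hypothetical bug)."""
    units = text.encode("utf-16-le")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "little")
        # "surrogatepass" allows lone surrogates to be written as UTF-8.
        out += chr(unit).encode("utf-8", errors="surrogatepass")
    return bytes(out)


heart = "\u2764\ufe0f"   # track 6 title, BMP characters only
earth = "\U0001F30E"     # track 9 title, outside the BMP

print(per_unit_utf8(heart) == heart.encode("utf-8"))
print(per_unit_utf8(earth))
```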

Does this help at all?

michaelherger commented 1 day ago

TBH I'm not sure about utf16 support. We certainly don't have code specific to this encoding type. So this could very well be the problem here.

Maybe you could give mp3tag a try (https://www.mp3tag.de/en/download.html)? Or can you configure your tagger to only use UTF-8?

simonltwick commented 1 day ago

Michael, leave it with me for a bit ... I'm going to investigate how much of my music library is encoded using UTF-16, to see if it's an outlying case. I don't know of a way to configure what encoding is used, it's all handled under the covers by some low-level library but I will check it out.

I have a suspicion that there might be a difference between the way Python and Perl handle encoding of certain characters, which would be a bug in either Python or Perl, but I'll do some research first.

I am pretty busy for the next few days, so I hope to come back to you by the end of the week. Thank you for all your help so far.

michaelherger commented 1 day ago

I have a suspicion that there might be a difference between the way Python and Perl handle encoding of certain characters, which would be a bug in either Python or Perl, but I'll do some research first.

FWIW: ExifTool (https://exiftool.org) is written in Perl, too. Which would support your theory. I'd rather think it was a shortcoming of Perl, as my tagger is reading the data correctly.