playlist: Address Unicode normalization differences when constructing queries

scztt commented 1 year ago

Problem

Paths in m3u files that contain Unicode characters do not generate proper queries, and thus the target files are never found. Given a file path like:

/Volumes/Music/Music/_organized/Gregor Cürten & Anselm Rogmans/Planes [Entr’acte]/02-Planes II.flac

I see (via some print statements) that this path is converted to this before querying the database:

b'/Volumes/Music/Music/_organized/Gregor C\xc3\xbcrten & Anselm Rogmans/Planes [Entr\xe2\x80\x99acte]/02-Planes II.flac'

This bytestring path does not return any results, even though the same file is queryable via other commands like beet info, and the path displayed by beet ls -f '$path' looks identical.

It looks like this may be related to PlaylistQuery:match being slightly invalid? If I populate my playlist with (1) the original path as it was in the playlist, and (2) the path copy-pasted from the output of a beet info command, I see the following paths being queried:

b'/Volumes/Music/Music/_organized/Gregor C\xc3\xbcrten & Anselm Rogmans/Planes [Entr\xe2\x80\x99acte]/02-Planes II.flac'
b'/Volumes/Music/Music/_organized/Gregor Cu\xcc\x88rten & Anselm Rogmans/Planes [Entr\xe2\x80\x99acte]/02-Planes II.flac'

If I decode each of these as Unicode strings (i.e. not as bytestrings) in Python, they look identical and compare equal once normalized to the same form, but of course as bytestrings they are not.
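
For reference, the two byte sequences above are the UTF-8 encodings of the precomposed (NFC) and decomposed (NFD) forms of the same text. A minimal sketch using only the standard library's unicodedata module reproduces them for the "Cürten" component. Note that Python compares str values code point by code point, so the two forms only compare equal after both are normalized to the same form.

```python
import unicodedata

# "Cürten" written with a precomposed u-umlaut (U+00FC).
composed = "C\u00fcrten"

nfc = unicodedata.normalize("NFC", composed)
nfd = unicodedata.normalize("NFD", composed)

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(nfc_bytes)                                 # b'C\xc3\xbcrten'
print(nfd_bytes)                                 # b'Cu\xcc\x88rten'
print(nfc_bytes == nfd_bytes)                    # False
print(unicodedata.normalize("NFC", nfd) == nfc)  # True
```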

AFAICT these two strings are equivalent unicode representations, so probably the fix here is to simply normalize the paths in the m3u before querying. Experimentally, this seems to resolve the problem for several of the cases I'm seeing:

                line = unicodedata.normalize('NFD', line)

But I don't know enough to say whether this is correct - possibly normalization like this should be part of the query pipeline in a more generalized way, rather than requiring it to be added ad hoc to every plugin?
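
To make the one-line fix above concrete, here is a rough sketch of what normalizing playlist entries on the way in could look like. The function name normalized_playlist_paths is hypothetical and this is not the playlist plugin's actual code. It assumes the playlist is UTF-8 and that NFD is the form stored in the database, which is only what this particular report suggests.

```python
import unicodedata
from typing import Iterator


def normalized_playlist_paths(playlist_path: str, form: str = "NFD") -> Iterator[bytes]:
    """Yield each m3u entry as UTF-8 bytes in the given normalization form."""
    with open(playlist_path, "rb") as f:
        for raw in f:
            raw = raw.rstrip(b"\r\n")
            # Skip blank lines and m3u directives/comments.
            if not raw or raw.startswith(b"#"):
                continue
            try:
                line = raw.decode("utf-8")
            except UnicodeDecodeError:
                # Not valid UTF-8: pass the bytes through untouched.
                yield raw
                continue
            yield unicodedata.normalize(form, line).encode("utf-8")
```

Entries that are not valid UTF-8 are left alone, since there is no meaningful Unicode normalization to apply to them.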

sampsyo commented 1 year ago

Interesting issue! Filename encoding problems are by far the most complicated and persistent set of problems in beets, and this is no exception.

The way to address this is to track down exactly why the filename is getting normalized differently in the different situations. To begin with, this depends heavily on your platform: what OS are you using? (I'm guessing macOS, where the filesystem manipulates Unicode normalization internally.)

What we want is this: for the bytes in the playlist file to exactly match the bytes in the beets database. That way, when we read the file in binary mode (without any encoding conversion or re-normalization), we can use exactly those bytes to construct the query. Does this general notion make sense? The next steps are not exactly easy, but they involve trying to carefully inspect where the playlist gets read and written to find the place where the bytes are being changed.
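
As one possible aid for that inspection, a small helper can report which normalization form a given byte path is in, so that the playlist bytes and the database bytes can be compared at each step. The name describe_normalization is made up for this sketch, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata


def describe_normalization(raw: bytes) -> str:
    """Report which Unicode normalization form(s) a UTF-8 byte path is in."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8"
    forms = [f for f in ("NFC", "NFD") if unicodedata.is_normalized(f, text)]
    # Pure ASCII is in both forms; text mixing both styles is in neither.
    return "+".join(forms) if forms else "mixed"


print(describe_normalization(b"/music/C\xc3\xbcrten"))   # NFC
print(describe_normalization(b"/music/Cu\xcc\x88rten"))  # NFD
print(describe_normalization(b"/music/plain-ascii"))     # NFC+NFD
```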

scztt commented 1 year ago

My guess is that the testcase that would catch this (and one that is definitely not passing now) is something like:

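import unicodedata
from beets import util

stringA = "some/unicode/path"  # one Unicode representation, e.g. NFC
stringB = "some/unicode/path"  # different but equivalent Unicode representation, e.g. NFD

# Python compares str by code point, so compare the normalized forms.
assert unicodedata.normalize("NFC", stringA) == unicodedata.normalize("NFC", stringB)
assert util.normpath(stringA) == util.normpath(stringB)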

Adding Unicode normalization to util.normpath would be a simple fix for this, and I imagine it would be very safe as long as this is the only code path where bytestring paths are generated? It's hard to imagine a valid scenario where the above test should not pass.

scztt commented 1 year ago

> What we want is this: for the bytes in the playlist file to exactly match the bytes in the beets database.

But, I think this is not possible, unless I'm misunderstanding? The playlist in question is generated by something outside of beets (this might not have been clear in my original report), which is free to put any valid unicode representation of paths in the file.

sampsyo commented 1 year ago

> The playlist in question is generated by something outside of beets (this might not have been clear in my original report), which is free to put any valid unicode representation of paths in the file.

Ah, that would indeed make things a bit more complicated.

The core problem here is that, in general, Unix paths are byte strings—they are not Unicode data. This means that, if we want to retrieve something by matching on its paths, the only universally correct way to do this is by matching on the exact bytes. It is possible on many filesystems, for example, to have two different files in the same directory whose names only differ in Unicode normalization! This sounds crazy, and it is crazy, but unfortunately we can't ignore this reality—normalizing everything will cause other subtle problems.
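
A small experiment can show this on a given machine. The sketch below creates two files whose names differ only in normalization form and counts the resulting directory entries. I would expect 2 on a typical Linux filesystem such as ext4 and 1 on macOS, but that depends on the filesystem, so the result is not hard-coded here. The function name is invented for illustration.

```python
import os
import tempfile
import unicodedata


def count_normalization_variants() -> int:
    """Create NFC- and NFD-named files in a temp dir; return the entry count."""
    name = "C\u00fcrten.txt"
    with tempfile.TemporaryDirectory() as tmp:
        for form in ("NFC", "NFD"):
            path = os.path.join(tmp, unicodedata.normalize(form, name))
            with open(path, "w") as f:
                f.write(form)
        # 2 means the filesystem treats the two names as distinct files.
        return len(os.listdir(tmp))


print(count_normalization_variants())
```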

Can you clarify what platform you're on? Is it macOS? There, the FS actually does its own Unicode normalization. So maybe we could consider a special case just for macOS? It would be a tricky thing to get right, but it's possible…

scztt commented 1 year ago

The platform is OSX.

Yeah, my understanding was that on OSX (and at least some POSIX cases, definitely NFS/SMB?), the underlying filesystem is responsible for resolving an arbitrary stream of path bytes, which means the way particular Unicode expressions are matched to concrete paths can vary depending on where in the filesystem you are (e.g. depending on mount points). I wonder if normalizing all paths through OS APIs is the only remotely "correct" way to handle this? There are some particularly cursed scenarios on macOS related to using "non-canonical" capitalization in paths on a case-insensitive filesystem, where there are OS caches that can have paths with mismatched caps depending on the APIs being called...
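
One way to approximate "asking the OS" with only the standard library is to walk the path one component at a time and pick whichever directory entry matches after normalization. This is just a sketch: find_on_disk is a hypothetical helper, it handles normalization only (case folding could be added in the same comparison), and it costs one directory listing per path component.

```python
import os
import unicodedata
from typing import Optional


def find_on_disk(root: str, relative: str) -> Optional[str]:
    """Return the path under root as actually spelled on disk, or None.

    Each component is compared to the directory listing after NFC
    normalization, so an NFC query can find an NFD-named entry and vice versa.
    """
    current = root
    for part in relative.split("/"):
        if not part:
            continue
        wanted = unicodedata.normalize("NFC", part)
        try:
            entries = os.listdir(current)
        except OSError:
            return None
        matches = [e for e in entries if unicodedata.normalize("NFC", e) == wanted]
        if not matches:
            return None
        # Prefer the exact spelling when both variants coexist.
        current = os.path.join(current, part if part in matches else matches[0])
    return current
```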

I checked, and it looks like the path in my playlist file is the one that diverges from both the filesystem path and the beets database path. It's not exactly "bad" because it's still totally valid, but obviously it doesn't match the canonical path. I can fix the paths on my end: I'm already processing the playlists with a script, so it's not hard to do.

It still feels like it would be good to canonicalize m3u paths on the way in to beets? It would definitely fix this issue + comparable issues with capitalization, and I can't really imagine a common scenario where it would make things worse?

sampsyo commented 1 year ago

> It still feels like it would be good to canonicalize m3u paths on the way in to beets? It would definitely fix this issue + comparable issues with capitalization, and I can't really imagine a common scenario where it would make things worse?

I think it's worth a shot for the playlist plugin specifically, perhaps as a special case just for macOS. I am concerned, however, that doing it on all platforms could cause just as many problems as it solves.
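
A macOS-only special case might look roughly like the sketch below. The function name canonicalize_playlist_entry is hypothetical and this is not existing plugin code. Choosing NFD is an assumption based on the decomposed database path in this report. The platform is a parameter only so that the behavior can be exercised on any OS.

```python
import sys
import unicodedata


def canonicalize_playlist_entry(raw: bytes, platform: str = sys.platform) -> bytes:
    """Normalize a playlist path to NFD on macOS; return it unchanged elsewhere."""
    if platform != "darwin":
        # Elsewhere, paths are opaque bytes and must match exactly.
        return raw
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw
    return unicodedata.normalize("NFD", text).encode("utf-8")


print(canonicalize_playlist_entry(b"C\xc3\xbcrten", platform="darwin"))  # b'Cu\xcc\x88rten'
print(canonicalize_playlist_entry(b"C\xc3\xbcrten", platform="linux"))   # b'C\xc3\xbcrten'
```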

beetbox / beets

playlist: Address Unicode normalization differences when constructing queries #4739
