Open brunal opened 9 years ago
I'm all for it. Our current path handling stuff is too fiddly.
I have one important concern: encodings. The core problem is that Unix paths need to be bytes and Windows paths need to be Unicode strings (more or less). We need to store paths in a database, so eventually they need to be represented internally as one or the other. Beets currently chooses byte strings, and on Windows the "true" paths are encoded with UTF-8 just for storage and uniform manipulation. I'm not quite clear on how pathlib would interact with this constraint.
I also don't know how pathlib handles long names on Windows (the issue described here).
Maybe the right approach here would be to start with a small prototype, see how it goes on different platforms, and then attempt the full migration.
Currently,
beets.util._fsencoding()
displayable_path()
does that operationsyspath()
does some ugly dark magicbytestring_path()
ensures always gives a byte string path, no matter the inputnormpath()
expands everything and gives a byte stringIs that right? Am I forgetting something?
Python 3.2 introduced os.fs{en,de}code()
: https://docs.python.org/3/library/os.html#os.fsencode
The source is pretty simple (Lib/os.py:800
) and shares some traits with our functions. However mbcs
encoding only triggers another error mode: strict
instead of surrogateescape
.
I count ~100 {en,de}code calls in the beets/ folder. It'd be good to get rid of most of them.
surrogateescape
is not natively available in python 2I think the solution is to use python-future (https://github.com/PythonCharmers/python-future) which provides a future.utils.surrogateescape
module.
Yes, surrogate escaping is probably now the Pythonic way of doing this. A few thoughts/questions:
fsencode
. We'll at least need to handle this case to avoid crashing.\\?\
prefix on Windows that syspath
does. It's possible I'm missing something, but if that's the case, it's an important oversight—we'll need to find some way to add that back. :grimacing:Roughly,
OK, this migration plan sounds good.
On Windows exceptions: Yes, if the OS can be trusted to always supply Unicode, then we should be OK. Hopefully, paths do not come from any other source that could contain surrogate escapes.
On the backport: FWIW, it looks like someone has tried starting a maintained backport, but it appears abandoned.
I don't think that pathlib is able to replace truncate/sanitize _path from what I've seen of it. It's a module for providing many of the functions of os.path with some corrections in an object-oriented way, and making some useful information about the path available (eg. root, suffix, filename). While this in itself is useful and could clean up a lot of code, it looks to me like we would still have to do the encoding/decoding of the path, and removal of unsafe characters.
Agreed; while pathlib will definitely be nice, it will not solve our encoding and sanitation problems. Those will still be 100% up to us.
there's now a maintained pathlib port that matches the stdlib version https://github.com/mcmtroffaes/pathlib2
@brunal : do you still wanna work on this? I'd love to help out.
I'd like to tackle this and I want to make sure I'm considering all the potential issues. The proposed solution I am thinking is using pathlib everywhere and storing paths as strings in the database (using pathlib's string representation).
The hope is that people will gravitate towards path objects like pathlib and that will move people away from operating directly with bytes.
Some current issues:
Long windows paths
Issue: Pathlib doesn't seem to correctly deal with these situations, as illustrated by the answer to that stackoverflow question. Potential Solution: Don't attempt to solve this on beet's end. Instead, direct users to enable long path support explicitly. This is the same decision that python made.
Other notes:
os
modules and open
now accept pathlib objects. The issue is that the above PEP was introduced to Python 3.6. Whether or not these features are necessary for beet's adoption, I'm not sure.These are my rough initial thoughts on the matter, and I'm only just now looking into these issues, so I'm sure I'm missing something and am hoping others can help fill in the gaps so we can work this out.
Hi! Thanks for being interested in taking this on!
I think the big obstacle here is about Unicode, surrogate escapes, and SQLite storage. Namely, because surrogate escapes are a Python-specific quirk, SQLite cannot store strings that contain them. For example:
>>> weird_string = 'café'.encode('latin1').decode('utf8', 'surrogateescape')
>>> import sqlite3
>>> sqlite3.connect(':memory:').execute('select ?;', (weird_string,)).fetchall()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 3: surrogates not allowed
Basically, SQLite stores values as bytes internally. Python tries to "pretend" that it's storing Unicode objects by relying on UTF-8 encoding.
Regardless, it seems easier to attempt such a migration while leaving the database intact. At some point, we will need to convert between Path objects and "primitive" values that can go into the database. We might as well continue using the same conversion we use now—namely, represent the paths as the actual bytes that came from the operating system.
Separately, one thing I'm pretty worried about here is the interaction between command-line argument encodings and filesystem encodings. On Unix, if you specify a file as a CLI argument, that's bytes—and probably exactly the same bytes that appear on the filesystem. Python 3, however, will try to decode that using the argument encoding and then, when you send it to the filesystem, re-encode it using the filesystem encoding—neither of which might actually match the filename bytes as written! Surrogate escapes don't really solve this because it has to do with representable bytes rather than unrepresentable ones. I'm not sure how pathlib will make this worse or better.
Finally, for long filenames on Windows, I wonder whether we couldn't find a nice pathlib-native solution. I don't think the idea from SO where the actual Path object contains the magic prefix is the way to go, but maybe we can convince the library to add the prefix just before calling an OS function, as we currently do manually with syspath
. Maybe with a subclass of WindowsPath
or something.
An ambitious alternative would be to roll our own PEP 519-based library that uses a native representation everywhere.
Finally finally, you briefly remarked that type annotations could help with this. I agree! Maybe a good plan of attack would be (after dropping Python 2 support) just go whole-hog with type annotations in parts of the code that deal with paths. Then a later phase can explore holistic bytes
-> Path
conversion.
Regardless, it seems easier to attempt such a migration while leaving the database intact. At some point, we will need to convert between Path objects and "primitive" values that can go into the database. We might as well continue using the same conversion we use now—namely, represent the paths as the actual bytes that came from the operating system.
I'll keep it as is, at least for the first iteration.
Separately, one thing I'm pretty worried about here is the interaction between command-line argument encodings and filesystem encodings. On Unix, if you specify a file as a CLI argument, that's bytes—and probably exactly the same bytes that appear on the filesystem. Python 3, however, will try to decode that using the argument encoding and then, when you send it to the filesystem, re-encode it using the filesystem encoding—neither of which might actually match the filename bytes as written! Surrogate escapes don't really solve this because it has to do with representable bytes rather than unrepresentable ones. I'm not sure how pathlib will make this worse or better.
Hopefully this is something the abstraction of the os module and pathlib objects will take care of. I suppose we'll see if that's true or not.
Finally, for long filenames on Windows, I wonder whether we couldn't find a nice pathlib-native solution. I don't think the idea from SO where the actual Path object contains the magic prefix is the way to go, but maybe we can convince the library to add the prefix just before calling an OS function, as we currently do manually with syspath. Maybe with a subclass of WindowsPath or something.
An ambitious alternative would be to roll our own PEP 519-based library that uses a native representation everywhere.
Might be worth digging into. If it begins to create more problems than it's worth, then maybe we can decide otherwise.
My initial plan is to pass pathlib.PurePath objects around everywhere, converting to pathlib.Path objects if actual file operations are needed, and converting the objects to and from bytes for the database. Consider the following example:
>>> weird_bytes = os.fsencode('café')
>>> weird_bytes
b'caf\xc3\xa9'
>>> weird_str = os.fsdecode(weird_bytes)
>>> weird_str
'café'
>>> p = Path(weird_str)
>>> p
PosixPath('café')
>>> bytes(p)
b'caf\xc3\xa9'
Does this seem reasonable?
Note, it is possible to call str(path)
for all windows paths and bytes(path)
for all unix paths for storing in the database. I'm not sure if that makes things better or not.
The more I get into it though, the more it seems life would be much easier if we had the benefits of PEP 519 i.e. dropping python 3.5 support. For reference, here's the download stats for the last 180 days:
$ pypinfo -d 180 beets pyversion
Served from cache: False
Data processed: 643.30 GiB
Data billed: 643.30 GiB
Estimated cost: $3.15
| python_version | download_count |
| -------------- | -------------- |
| 3.8 | 4,146 |
| 3.7 | 2,570 |
| 2.7 | 1,923 |
| 3.6 | 1,861 |
| 3.5 | 132 |
| 3.9 | 77 |
| 3.4 | 32 |
| 3.3 | 2 |
| Total | 10,743 |
I wish I could do more, but I hit my free quota on these stats 😆
Does this seem reasonable?
Note, it is possible to call
str(path)
for all windows paths andbytes(path)
for all unix paths for storing in the database. I'm not sure if that makes things better or not.
That seems reasonable! The only difference is that we will want to make sure we are doing the same str
-to-bytes
conversion as we currently do on Windows for storage.
No objection to dropping Python 3.5 at this point.
No objection to dropping Python 3.5 at this point.
Great, this would be much harder to implement without the 3.6 improvements, and would probably be worth putting off at that point. I should have this done by the time beets 1.5.x (or whatever the next one is) gets released.
I'm happy to give this a shot, it seems like it'd be very helpful.
Pathlib is awesome. It abstracts path the platform for path manipulation. It would allow us to have more robust code and get rid to all the calls to
beets.util.*_path()
functions. It could be extended to provide desiredrepr()
.I feel like working on it. It is huge though.