beetbox / beets

music library manager and MusicBrainz tagger
http://beets.io/
MIT License
12.97k stars 1.83k forks source link

Use pathlib everywhere in beets #1409

Open brunal opened 9 years ago

brunal commented 9 years ago

Pathlib is awesome. It abstracts path the platform for path manipulation. It would allow us to have more robust code and get rid to all the calls to beets.util.*_path() functions. It could be extended to provide desired repr().

I feel like working on it. It is huge though.

sampsyo commented 9 years ago

I'm all for it. Our current path handling stuff is too fiddly.

I have one important concern: encodings. The core problem is that Unix paths need to be bytes and Windows paths need to be Unicode strings (more or less). We need to store paths in a database, so eventually they need to be represented internally as one or the other. Beets currently chooses byte strings, and on Windows the "true" paths are encoded with UTF-8 just for storage and uniform manipulation. I'm not quite clear on how pathlib would interact with this constraint.

I also don't know how pathlib handles long names on Windows (the issue described here).

Maybe the right approach here would be to start with a small prototype, see how it goes on different platforms, and then attempt the full migration.

brunal commented 9 years ago

Currently,

Is that right? Am I forgetting something?

Python 3.2 introduced os.fs{en,de}code(): https://docs.python.org/3/library/os.html#os.fsencode The source is pretty simple (Lib/os.py:800) and shares some traits with our functions. However mbcs encoding only triggers another error mode: strict instead of surrogateescape.

I count ~100 {en,de}code calls in the beets/ folder. It'd be good to get rid of most of them.

brunal commented 9 years ago

I think the solution is to use python-future (https://github.com/PythonCharmers/python-future) which provides a future.utils.surrogateescape module.

sampsyo commented 9 years ago

Yes, surrogate escaping is probably now the Pythonic way of doing this. A few thoughts/questions:

brunal commented 9 years ago

Roughly,

sampsyo commented 9 years ago

OK, this migration plan sounds good.

On Windows exceptions: Yes, if the OS can be trusted to always supply Unicode, then we should be OK. Hopefully, paths do not come from any other source that could contain surrogate escapes.

On the backport: FWIW, it looks like someone has tried starting a maintained backport, but it appears abandoned.

LordSputnik commented 9 years ago

I don't think that pathlib is able to replace truncate/sanitize _path from what I've seen of it. It's a module for providing many of the functions of os.path with some corrections in an object-oriented way, and making some useful information about the path available (eg. root, suffix, filename). While this in itself is useful and could clean up a lot of code, it looks to me like we would still have to do the encoding/decoding of the path, and removal of unsafe characters.

sampsyo commented 9 years ago

Agreed; while pathlib will definitely be nice, it will not solve our encoding and sanitation problems. Those will still be 100% up to us.

ghost commented 8 years ago

there's now a maintained pathlib port that matches the stdlib version https://github.com/mcmtroffaes/pathlib2

@brunal : do you still wanna work on this? I'd love to help out.

jtpavlock commented 4 years ago

I'd like to tackle this and I want to make sure I'm considering all the potential issues. The proposed solution I am thinking is using pathlib everywhere and storing paths as strings in the database (using pathlib's string representation).

PEP 519 on pathlib

The hope is that people will gravitate towards path objects like pathlib and that will move people away from operating directly with bytes.

Some current issues:

  1. Long windows paths

    • If I'm understanding it correctly, the issue is basically defined here.

    Issue: Pathlib doesn't seem to correctly deal with these situations, as illustrated by the answer to that stackoverflow question. Potential Solution: Don't attempt to solve this on beet's end. Instead, direct users to enable long path support explicitly. This is the same decision that python made.

  2. Dealing with user's current databases Issue: Currently, all user's paths are bytes, and pathlib doesn't accept bytes. Potential Solution: I'm not the most experienced with databases, but is this something that could be solved with a database migration?
  3. Unix uses bytes Issue: So pathlib abstracts this layer and has it's own method for opening files and such. However, how do we deal with native Unix utilities that return and deal with bytes? Potential Solution: Python3 has surrogate escapes, which can represent bytes as strings, which can then be passed to pathlib to create a path object. These path objects then get passed around instead of the surrogate escape strings, which should hopefully prevent any unforeseen issues that come with using surrogate escapes directly.

Other notes:

  1. As far as enforcement, type hints and then static type checking may prove to be useful.
  2. PEP 519 (mentioned above) introduces a lot of nice adoption features for pathlib, notably, most (all?) os modules and open now accept pathlib objects. The issue is that the above PEP was introduced to Python 3.6. Whether or not these features are necessary for beet's adoption, I'm not sure.

These are my rough initial thoughts on the matter, and I'm only just now looking into these issues, so I'm sure I'm missing something and am hoping others can help fill in the gaps so we can work this out.

Other relevant PEPS: 529, 538, 540, 383

sampsyo commented 4 years ago

Hi! Thanks for being interested in taking this on!

I think the big obstacle here is about Unicode, surrogate escapes, and SQLite storage. Namely, because surrogate escapes are a Python-specific quirk, SQLite cannot store strings that contain them. For example:

>>> weird_string = 'café'.encode('latin1').decode('utf8', 'surrogateescape')
>>> import sqlite3
>>> sqlite3.connect(':memory:').execute('select ?;', (weird_string,)).fetchall()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 3: surrogates not allowed

Basically, SQLite stores values as bytes internally. Python tries to "pretend" that it's storing Unicode objects by relying on UTF-8 encoding.

Regardless, it seems easier to attempt such a migration while leaving the database intact. At some point, we will need to convert between Path objects and "primitive" values that can go into the database. We might as well continue using the same conversion we use now—namely, represent the paths as the actual bytes that came from the operating system.


Separately, one thing I'm pretty worried about here is the interaction between command-line argument encodings and filesystem encodings. On Unix, if you specify a file as a CLI argument, that's bytes—and probably exactly the same bytes that appear on the filesystem. Python 3, however, will try to decode that using the argument encoding and then, when you send it to the filesystem, re-encode it using the filesystem encoding—neither of which might actually match the filename bytes as written! Surrogate escapes don't really solve this because it has to do with representable bytes rather than unrepresentable ones. I'm not sure how pathlib will make this worse or better.


Finally, for long filenames on Windows, I wonder whether we couldn't find a nice pathlib-native solution. I don't think the idea from SO where the actual Path object contains the magic prefix is the way to go, but maybe we can convince the library to add the prefix just before calling an OS function, as we currently do manually with syspath. Maybe with a subclass of WindowsPath or something.

An ambitious alternative would be to roll our own PEP 519-based library that uses a native representation everywhere.


Finally finally, you briefly remarked that type annotations could help with this. I agree! Maybe a good plan of attack would be (after dropping Python 2 support) just go whole-hog with type annotations in parts of the code that deal with paths. Then a later phase can explore holistic bytes -> Path conversion.

jtpavlock commented 4 years ago

Regardless, it seems easier to attempt such a migration while leaving the database intact. At some point, we will need to convert between Path objects and "primitive" values that can go into the database. We might as well continue using the same conversion we use now—namely, represent the paths as the actual bytes that came from the operating system.

I'll keep it as is, at least for the first iteration.

Separately, one thing I'm pretty worried about here is the interaction between command-line argument encodings and filesystem encodings. On Unix, if you specify a file as a CLI argument, that's bytes—and probably exactly the same bytes that appear on the filesystem. Python 3, however, will try to decode that using the argument encoding and then, when you send it to the filesystem, re-encode it using the filesystem encoding—neither of which might actually match the filename bytes as written! Surrogate escapes don't really solve this because it has to do with representable bytes rather than unrepresentable ones. I'm not sure how pathlib will make this worse or better.

Hopefully this is something the abstraction of the os module and pathlib objects will take care of. I suppose we'll see if that's true or not.

Finally, for long filenames on Windows, I wonder whether we couldn't find a nice pathlib-native solution. I don't think the idea from SO where the actual Path object contains the magic prefix is the way to go, but maybe we can convince the library to add the prefix just before calling an OS function, as we currently do manually with syspath. Maybe with a subclass of WindowsPath or something.

An ambitious alternative would be to roll our own PEP 519-based library that uses a native representation everywhere.

Might be worth digging into. If it begins to create more problems than it's worth, then maybe we can decide otherwise.

jtpavlock commented 4 years ago

My initial plan is to pass pathlib.PurePath objects around everywhere, converting to pathlib.Path objects if actual file operations are needed, and converting the objects to and from bytes for the database. Consider the following example:

>>> weird_bytes = os.fsencode('café')
>>> weird_bytes
b'caf\xc3\xa9'
>>> weird_str = os.fsdecode(weird_bytes)
>>> weird_str
'café'
>>> p = Path(weird_str)
>>> p
PosixPath('café')
>>> bytes(p)
b'caf\xc3\xa9'

Does this seem reasonable?

Note, it is possible to call str(path) for all windows paths and bytes(path) for all unix paths for storing in the database. I'm not sure if that makes things better or not.

The more I get into it though, the more it seems life would be much easier if we had the benefits of PEP 519 i.e. dropping python 3.5 support. For reference, here's the download stats for the last 180 days:

$ pypinfo -d 180 beets pyversion 
Served from cache: False
Data processed: 643.30 GiB
Data billed: 643.30 GiB
Estimated cost: $3.15

| python_version | download_count |
| -------------- | -------------- |
| 3.8            |          4,146 |
| 3.7            |          2,570 |
| 2.7            |          1,923 |
| 3.6            |          1,861 |
| 3.5            |            132 |
| 3.9            |             77 |
| 3.4            |             32 |
| 3.3            |              2 |
| Total          |         10,743 |

I wish I could do more, but I hit my free quota on these stats 😆

sampsyo commented 4 years ago

Does this seem reasonable?

Note, it is possible to call str(path) for all windows paths and bytes(path) for all unix paths for storing in the database. I'm not sure if that makes things better or not.

That seems reasonable! The only difference is that we will want to make sure we are doing the same str-to-bytes conversion as we currently do on Windows for storage.

No objection to dropping Python 3.5 at this point.

jtpavlock commented 4 years ago

No objection to dropping Python 3.5 at this point.

Great, this would be much harder to implement without the 3.6 improvements, and would probably be worth putting off at that point. I should have this done by the time beets 1.5.x (or whatever the next one is) gets released.

Serene-Arc commented 2 years ago

I'm happy to give this a shot, it seems like it'd be very helpful.