archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0
16 stars 1 forks source link

Problem: metsrw filepath string type uses Unicode in Python 3 (mets-reader-writer) #580

Open sevein opened 5 years ago

sevein commented 5 years ago

Expected behaviour metsrw.FSEntry prefers to store the binary representation of the filepath. I understood that's a design choice because of the following snippet in the FSEntry constructor:

# path can validly be any encoding; if this value needs
# to be spliced later on, it's better to treat it as a
# bytestring than as actually being encoded text.
# TODO update this with six and bytes
if path:
    path = str(path)

In metsrw 0.3.0, the snippet above was refactored for better support of unicode input encoding (work done for https://github.com/artefactual/archivematica-storage-service/pull/438) - however the issue described here with Python 3 is still relevant.

Current behaviour In Python 3, metsrw uses str aka unicode - whereas in Python 2, it's using bytes aka binary. When changed to binary, tests don't pass mostly due to the lack of coercion.

Steps to reproduce

  1. Install Python 3
  2. Install metsrw
  3. Create a new FSEntry where path is bytes.
  4. Check the type of FSEntry.path

Your environment (version of Archivematica, OS version, etc) metsrw (all versions seem affected)


For Artefactual use: Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:

jraddaoui commented 4 years ago

Hi @sevein, was this addressed in https://github.com/artefactual-labs/mets-reader-writer/commit/3347a142e4bf3f6fdc5708b679061171b382b201? Can we close this or is there anything else to do?

Not sure what the testing plan is for all the Py3 issues but maybe we should start closing those that we consider "done" and report new ones when we test Py3 altogether.

sevein commented 4 years ago

I don't think this is fixed since path is still unicode in Py3 and the original intention was to use a bytestring. But we may want to consider again whether this is an actual problem. @replaceafill, what do you think?

replaceafill commented 4 years ago

I don't think this is fixed since path is still unicode in Py3 and the original intention was to use a bytestring.

Sorry for answering your question with a question :blush: but do you think using a pathlib object instead of strings might help with the splicing issue mentioned in the original comment? It might be a bigger change to the library and client code but sounds like the right approach to me.

sevein commented 4 years ago

What do you think the comment means exactly with "splicing"? 🤷

replaceafill commented 4 years ago

Oh, I assumed it meant path handling: joins, subpaths, base names, etc. which are a problem when you start mixing bytes and unicode.

sevein commented 4 years ago

So I guess pathlib embraces the melted unicode sandwich? :D

replaceafill commented 4 years ago

FWIW pathlib.Path doesn't accept bytestrings in Python 3:

TypeError: argument should be a str object or an os.PathLike object returning str, not <class 'bytes'>