LibraryOfCongress / bagit-python

Work with BagIt packages from Python.
http://libraryofcongress.github.io/bagit-python

Filenames with spaces at the end cause bag incompleteness errors. #143

Open nkrabben opened 4 years ago

nkrabben commented 4 years ago

Because bagit-python removes whitespace characters at the end of each manifest line, the in-memory representation of a filename like "path_with_space_at_the_end.txt " is "path_with_space_at_the_end.txt", which causes a completeness failure.

https://github.com/LibraryOfCongress/bagit-python/blob/4b76c143e61d815043f1e8bdfbb159ce98f7d978/bagit.py#L669

I think the likely reason is that L669 is being used to remove the newline characters, but is overly aggressive. I can write a test for this, but I want some feedback about potential solutions before trying to code that up.
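To make the failure mode concrete, here is a minimal sketch of the difference between stripping all surrounding whitespace and stripping only line terminators. It is not bagit-python's actual code: `parse_manifest_line`, its `strip_all_whitespace` flag, and the sample checksum are hypothetical and exist only for illustration.

```python
def parse_manifest_line(line, strip_all_whitespace):
    """Split one manifest line into (checksum, filename).

    Hypothetical helper for illustration; not part of bagit-python.
    """
    if strip_all_whitespace:
        # Removes the newline AND any trailing spaces in the filename.
        cleaned = line.strip()
    else:
        # Removes only the line terminator, leaving trailing spaces alone.
        cleaned = line.rstrip("\r\n")
    # Split once on the first run of whitespace: checksum, then the rest.
    checksum, filename = cleaned.split(None, 1)
    return checksum, filename


# Assumed sample data: an MD5-style checksum and a filename ending in a space.
line = "d41d8cd98f00b204e9800998ecf8427e  data/path_with_space_at_the_end.txt \n"
on_disk = "data/path_with_space_at_the_end.txt "

_, aggressive = parse_manifest_line(line, strip_all_whitespace=True)
_, conservative = parse_manifest_line(line, strip_all_whitespace=False)

print(aggressive == on_disk)    # False: the trailing space was lost
print(conservative == on_disk)  # True: the filename matches the payload file
```

With the aggressive variant, the manifest entry no longer matches the file on disk, which is the kind of mismatch that would surface as a completeness error.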

kieranjol commented 4 years ago

This isn't a blocker or anything, but trailing spaces in filenames can produce terrible issues on Windows (maybe NTFS in general?). Windows prevents you from creating them, but I've seen drives where the space was created in another operating system, and the folder/file often not only doesn't appear to a user, but the space on the disk is designated as free space ready for overwriting. See cause 6 here: https://support.microsoft.com/en-ie/help/320081/you-cannot-delete-a-file-or-a-folder-on-an-ntfs-file-system-volume

kieranjol commented 4 years ago

Also, I doubt that the line of code was concerned with this issue; it was more likely about cleaning up trailing whitespace.

acdha commented 4 years ago

Doesn’t the RFC say one or more spaces as a separator? If I remember correctly - on phone, sorry - that would require using the URL-encoded form.
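As a rough illustration of what a URL-encoded form could look like, the sketch below percent-encodes the trailing space so that it survives whitespace stripping. This only shows the round trip with Python's standard `urllib.parse`. Whether the BagIt specification calls for encoding spaces, and whether bagit-python would decode `%20` when reading a manifest, are assumptions not confirmed in this thread. The sample checksum is made up.

```python
from urllib.parse import quote, unquote

# Assumed sample filename ending in a space.
filename = "data/path_with_space_at_the_end.txt "

# Percent-encode the path; "/" is left as-is, the space becomes "%20".
encoded = quote(filename)

# Build a manifest-style line and parse it with aggressive stripping.
manifest_line = "d41d8cd98f00b204e9800998ecf8427e  " + encoded + "\n"
_, parsed = manifest_line.strip().split(None, 1)

# Decoding restores the original filename, trailing space included.
decoded = unquote(parsed)
print(decoded == filename)  # True
```

The trade-off is that every reader of the manifest would also need to apply the same decoding step, so this only helps if both writer and reader agree on the convention.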