LibraryOfCongress / bagit-python

Work with BagIt packages from Python.
http://libraryofcongress.github.io/bagit-python

Graceful handling of filename unicode normalization discrepancies #82

Closed acdha closed 7 years ago

acdha commented 7 years ago

Some filesystems and utilities, most notably Apple's HFS+ filesystem, will transparently apply Unicode normalization to filenames. This means that the comparison between the list of files present on disk and the list of files in the manifests must be performed with a Unicode-equivalence test. Otherwise we get confusing situations such as #51, where a visually identical filename is reported as non-existent because the encoded characters differ (see http://www.unicode.org/reports/tr15/).
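To make the failure mode concrete, here is a minimal sketch using only the standard library's `unicodedata` module. The filenames and the `same_filename` helper are made up for illustration and are not part of bagit-python.

```python
import unicodedata

# The same visible name, encoded two ways:
# precomposed "é" (NFC) vs. "e" followed by a combining acute accent (NFD).
nfc_name = "caf\u00e9.txt"
nfd_name = "cafe\u0301.txt"


def same_filename(a, b, form="NFC"):
    """Hypothetical helper: compare two names after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# A plain string comparison sees two different names...
naive_equal = nfc_name == nfd_name

# ...while comparing under a single normalization form treats them as equal.
normalized_equal = same_filename(nfc_name, nfd_name)
```

The choice of NFC here is arbitrary; what matters is that both sides of the comparison use the same form.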

To preserve backwards compatibility, bagit-python will not alter the values stored in the manifests or saved to disk. Instead, it will compare the two lists after applying a consistent normalization form. This also allows raising an exception on attempts to bag files whose names differ only in normalization form (which is both user-hostile and likely to lead to data loss).
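A rough sketch of that comparison follows, assuming both lists are plain lists of path strings. The names `normalized_index`, `compare_payload`, and `NormalizationConflict` are hypothetical and do not reflect the actual patch or the bagit-python API. The original strings are kept as values, so nothing stored in the manifest or on disk is rewritten.

```python
import unicodedata
from collections import defaultdict


class NormalizationConflict(ValueError):
    """Hypothetical error: two distinct names collapse to the same normalized form."""


def normalized_index(paths, form="NFC"):
    """Map each normalized path to its original spelling, rejecting collisions."""
    groups = defaultdict(set)
    for path in paths:
        groups[unicodedata.normalize(form, path)].add(path)

    conflicts = [sorted(names) for names in groups.values() if len(names) > 1]
    if conflicts:
        raise NormalizationConflict(
            "Paths differ only in Unicode normalization form: %r" % conflicts
        )
    # Each group holds exactly one original spelling at this point.
    return {key: next(iter(names)) for key, names in groups.items()}


def compare_payload(manifest_paths, disk_paths, form="NFC"):
    """Return (only_in_manifest, only_on_disk), compared under one normalization form."""
    manifest = normalized_index(manifest_paths, form)
    disk = normalized_index(disk_paths, form)
    only_in_manifest = sorted(manifest[k] for k in manifest.keys() - disk.keys())
    only_on_disk = sorted(disk[k] for k in disk.keys() - manifest.keys())
    return only_in_manifest, only_on_disk
```

With this approach a manifest entry stored as NFC and a file on disk reported as NFD are matched to each other, while a directory containing both spellings of the same name is refused up front.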

This patch is somewhat more involved than it would otherwise need to be, in order to maintain backwards compatibility with existing code. It might be time for another round of cleanup to the existing code structure, especially regarding Bag.entries and the way it's populated.

See #51
See #81