LibraryOfCongress / bagit-conformance-suite

Test cases for validating BagIt implementations

Add tests for filename encoding inconsistencies #1

Open acdha opened 8 years ago

acdha commented 8 years ago

Some tests will probably require a helper utility to create files on the local filesystem, to avoid having to fight the filename normalization performed by Git, the operating system, or an archival utility. That normalization is usually a great feature, but it complicates reliably testing error cases:

johnscancella commented 7 years ago

See v0.97/warning/filename-normalization and v0.97/warning/same-filename-listed-twice-with-different-normalization.

acdha commented 7 years ago

The simple file-in-the-repo approach will only work if you have the right combination of local filesystem and Git configuration.

acdha commented 7 years ago

Since we cannot rely on anything about the filename being preserved, the best way to test this would probably be to have two files and ensure that the manifest uses a different normalization form for each, so that a filesystem or DVCS which normalizes names will break one of them.
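A minimal sketch of such a helper, assuming Python and an md5 manifest. The function name `write_mixed_normalization_bag` and the payload file names are hypothetical and are not part of the conformance suite:

```python
import hashlib
import os
import unicodedata


def write_mixed_normalization_bag(bag_dir):
    """Create a tiny bag whose manifest lists one path in NFC and one in NFD."""
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir, exist_ok=True)

    # "é" is a single code point in NFC and "e" + combining acute in NFD.
    # Both file names are assumptions chosen for illustration.
    names = [
        unicodedata.normalize("NFC", "caf\u00e9-nfc.txt"),
        unicodedata.normalize("NFD", "caf\u00e9-nfd.txt"),
    ]

    manifest_lines = []
    for name in names:
        payload = b"test payload\n"
        # The filesystem may silently re-normalize this name on disk;
        # the manifest below keeps the form requested here regardless.
        with open(os.path.join(data_dir, name), "wb") as f:
            f.write(payload)
        digest = hashlib.md5(payload).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}\n")

    with open(os.path.join(bag_dir, "bagit.txt"), "w", encoding="utf-8") as f:
        f.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    with open(os.path.join(bag_dir, "manifest-md5.txt"), "w", encoding="utf-8") as f:
        f.writelines(manifest_lines)
    return names
```

Because the manifest is written from the in-memory strings rather than from a directory listing, it keeps both normalization forms even when the filesystem stores the names differently.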

johnscancella commented 7 years ago

Then https://github.com/loc-rdc/bagit-conformance-suite/tree/master/v0.97/warning/same-filename-listed-twice-with-different-normalization should take care of that, no?

acdha commented 7 years ago

Both tests should be complete and standalone since they're not the same thing. That warning could be satisfied by an implementation which simply does a Unicode equivalence test on the manifest contents.
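To illustrate why that warning alone is not enough, here is a rough sketch of a manifest-only equivalence check, assuming Python. The function name `find_equivalent_paths` is hypothetical. It never looks at the filesystem, so it says nothing about how an implementation copes with renamed files on disk:

```python
import unicodedata
from collections import defaultdict


def find_equivalent_paths(manifest_paths):
    """Group manifest paths that differ only by Unicode normalization."""
    groups = defaultdict(set)
    for path in manifest_paths:
        # NFC is used as the comparison key; any single form would do.
        groups[unicodedata.normalize("NFC", path)].add(path)
    # Only groups with more than one distinct spelling are suspicious.
    return [sorted(spellings) for spellings in groups.values() if len(spellings) > 1]
```

Given a manifest listing `data/caf\u00e9.txt` and `data/cafe\u0301.txt`, this would report the pair as one group while ignoring every other entry.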

The first test isn't a warning but a compatibility check: it confirms that the implementation handles normalization differences between the manifest and the filesystem, so it won't choke when given a bag which has passed through a Mac, Git, some ZIP tools, etc. The easiest way to do that would be to have two separate files and list one in the manifest normalized as NFC and the other as NFD, so that a naive implementation will report an error whenever the filenames on disk are consistently normalized in either form.
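For comparison, a sketch of what a passing implementation might do, assuming Python: compare manifest and on-disk paths under a common normalization form instead of byte for byte. The function name `match_manifest_to_disk` is hypothetical and is not taken from any BagIt library:

```python
import unicodedata


def match_manifest_to_disk(manifest_paths, disk_paths):
    """Pair each manifest path with the on-disk path that is Unicode-equivalent."""

    def key(path):
        return unicodedata.normalize("NFC", path)

    # Caveat: if two on-disk names are equivalent, the later one wins here.
    on_disk = {key(path): path for path in disk_paths}

    matched, missing = {}, []
    for path in manifest_paths:
        actual = on_disk.get(key(path))
        if actual is None:
            missing.append(path)
        else:
            matched[path] = actual
    return matched, missing
```

With a manifest listing one NFC and one NFD path and a disk listing where both names are NFD, a plain string comparison would report the NFC entry as missing, while this lookup matches both.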