--validation warnings related to diacritics

tw4l commented 8 years ago

Yesterday I created a very large bag (approximately 82,000 files; 300 GB) using bagit.py on my Mac. Afterwards, the bag validated without issue.

Today, I copied the bag to our Archivematica transfer server and ran bagit.py --validate on the bag. This resulted in many errors, seemingly related to diacritics/character encoding issues. Some of the sample warnings:

2016-01-20 09:34:44,183 - WARNING - data/VILLANURBS/CD-DVD/07_200606/05_0929 VNU/VNU_planta arriba/cerámica/produccion/prueba/06 0119 ceramica VNU prueba 2 b.pdf exists on filesystem but is not in manifest
2016-01-20 09:34:43,267 - WARNING - data/VILLANURBS/CD-DVD/07_200606/05_0929 VNU/VNU_planta arriba/cera ́mica/produccion/prueba/06 0119 ceramica VNU prueba 2 b.pdf exists in manifest but not found on filesystem

2016-01-20 09:34:43,303 - WARNING - data/VILLANURBS/SCAPE/2005/05_0929 VNU/05_0929 VNU/VNU_planta arriba/cerramientos/fachada vidrio/dibujos/06 1116 perfilieri ́a 3D.dgn exists in manifest but not found on filesystem
2016-01-20 09:34:44,261 - WARNING - data/VILLANURBS/SCAPE/2005/05_0929 VNU/05_0929 VNU/VNU_planta arriba/cerramientos/fachada vidrio/dibujos/06 1116 perfiliería 3D.dgn exists on filesystem but is not in manifest

When I look at the manifest via cat in a bash terminal, the paths appear to match what exists on the filesystem.

Thanks!

edsu commented 8 years ago

This warning is telling you that a file present in the bag's payload directory (data) is not present in the manifest. Is it possible you added some files to the bag after you created it, and before you transferred it to the other system?

tw4l commented 8 years ago

Right. The thing is, the file does exist - just not at the path that bagit.py is reading from the manifest (the formatting above doesn't make this obvious, but they are in pairs).

The file exists at /path/filenáme, but to check that it exists and verify the hash, bagit.py is looking for /path/filena ́me. So I get one warning saying the file can't be found, another that the md5 can't be verified because the file isn't found, and a third saying an unexpected file is in /data.

What I can't figure out is why bagit.py isn't successfully reading paths that contain accented characters from the manifest when the manifest was written/encoded by bagit.py itself. (I'm assuming it's a character encoding issue, but that's where I start to get out of my depth.)

tw4l commented 8 years ago

For what it's worth, --validate --fast also validates successfully, so the payload oxum checks out.

edsu commented 8 years ago

Can you share a line from your manifest that includes a file that isn't being found?

edsu commented 8 years ago

Also what version of Python are you using?

tw4l commented 8 years ago

On the Mac (where validation worked fine):

5232da61f988043e6b29c37d9b0e1ae4 data/VILLANURBS/SCAPE/2005/05_0929 VNU/05_0929 VNU/VNU_planta arriba/cerramientos/fachada vidrio/dibujos/06 1116 perfiliería 3D.dgn

Version: Python 2.7.10

On Archivematica server (where I ran into the issue):

5232da61f988043e6b29c37d9b0e1ae4 data/VILLANURBS/SCAPE/2005/05_0929 VNU/05_0929 VNU/VNU_planta arriba/cerramientos/fachada vidrio/dibujos/06 1116 perfiliería 3D.dgn

Version: Python 2.7.6

Running file -i on both files shows "text/plain; charset=us-ascii", and JHOVE2 says they have UTF-8 character encoding.

edsu commented 8 years ago

Very strange. I created a test bag with that filename and it worked just fine on OS X and on Linux (after I copied it there).

Perhaps you noticed this already, but if you look closely at the error snippet you provided, you can see something is going on with the way the Unicode is being composed:

2016-01-20 09:34:43,303 - WARNING - data/VILLANURBS/SCAPE/2005/05_0929 VNU/05_0929 VNU/VNU_planta arriba/cerramientos/fachada vidrio/dibujos/06 1116 perfilieri ́a 3D.dgn exists in manifest but not found on filesystem

2016-01-20 09:34:44,261 - WARNING - data/VILLANURBS/SCAPE/2005/05_0929 VNU/05_0929 VNU/VNU_planta arriba/cerramientos/fachada vidrio/dibujos/06 1116 perfiliería 3D.dgn exists on filesystem but is not in manifest

Note the difference between how this word appears in the manifest:

perfilieri ́

and how it appears coming back from the filesystem:

perfilierí

You can see that in the former, í is decomposed into i ́. The BagIt specification indicates that UTF-8 should be used, but I don't think it is explicit about which normalization form to use... maybe bagit.py should assume NFD?
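To make the composed/decomposed difference concrete, here is a minimal sketch using Python's standard unicodedata module. It uses Python 3 syntax for brevity (the machines in this thread run Python 2.7), and the word is taken from the warnings above. It only illustrates the two forms; it is not what bagit.py does.

```python
import unicodedata

# The same visible word in two Unicode normalization forms.
# NFC: the accented letter is a single code point (U+00ED).
composed = unicodedata.normalize("NFC", "perfilier\u00eda")
# NFD: a plain "i" followed by the combining acute accent (U+0301).
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))  # 11 12
print(composed == decomposed)          # False: the strings differ code point by code point

# The UTF-8 bytes differ too, so a byte-for-byte path lookup fails.
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))

# Normalizing both to one form makes them compare equal again.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Both strings are valid UTF-8, which would explain why encoding checks on the manifest look fine while path comparisons still fail.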

edsu commented 8 years ago

I wonder if bagit.py should assume NFC normalization when reading from the manifest and the filesystem?

edsu commented 8 years ago

Ok, I've added a unit test that seems to demonstrate the problem.

acdha commented 8 years ago

Linux, OS X and Windows all differ on normalization, and I believe that at least on Linux it varies by filesystem. The best thing to do here is probably to add a function that we apply to all filenames, using one fixed normalization form on both sides (filesystem and tag file) so that the comparisons are consistent. I've used this for years on WDL:

unicodedata.normalize("NFC", v)
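As a rough sketch of what "normalize both sides before comparing" could look like, the example below wraps that call in a helper and applies it to manifest entries and filesystem entries alike. The names normalize_name and compare_entries are hypothetical and not part of bagit.py, the paths are shortened stand-ins for the ones in the warnings, and the code assumes Python 3.

```python
import unicodedata


def normalize_name(name, form="NFC"):
    """Return name in one fixed normalization form (hypothetical helper)."""
    return unicodedata.normalize(form, name)


def compare_entries(manifest_paths, filesystem_paths):
    """Compare two collections of paths after normalizing both sides.

    Returns (only_in_manifest, only_on_filesystem) as sets of normalized paths.
    """
    in_manifest = {normalize_name(p) for p in manifest_paths}
    on_disk = {normalize_name(p) for p in filesystem_paths}
    return in_manifest - on_disk, on_disk - in_manifest


# Illustrative data: the manifest entry is decomposed, the filesystem entry composed.
manifest_paths = ["data/dibujos/06 1116 perfilieri\u0301a 3D.dgn"]
filesystem_paths = ["data/dibujos/06 1116 perfilier\u00eda 3D.dgn"]

missing, extra = compare_entries(manifest_paths, filesystem_paths)
print(missing, extra)  # both sets are empty once the two sides share a form
```

Note that this only fixes the comparison. Opening the file for checksumming still needs the name exactly as the filesystem stores it.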

acdha commented 8 years ago

I just created https://github.com/LibraryOfCongress/bagit-python/pull/57 (which includes the work from #55) to apply NFD normalization when reading or writing a manifest.

acdha commented 8 years ago

Looking at this more, I think this is going to be something of a hairball. OS X on HFS+ normalizes everything to NFD (not sure about NFS or SMB access by an OS X client). Windows (always) and Linux (apparently almost always) do not normalize at all, which means it's often possible to have files which differ only in normalization form even if that's terrible for usability.

If we're going to address this in BagIt, it will require some complicated test cases to handle all of the possibilities. It seems like we could make validation pass in many cases by doing something like using the filename as specified in the manifest and, if that does not exist, testing for the NFD form to see if it's been copied to an HFS+ filesystem. The more involved form would probably be something like reading the contents of the directory and seeing whether our set of “missing” files matches the set of “extra” files after normalization. The one I'm not sure we should support at all is the case where two files of different normalization exist – I think that would be a good candidate for a spec note and raising an exception.
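The "more involved form" could be sketched roughly as follows: take the set of paths reported missing and the set reported extra, pair up those that are equal after normalization, and raise when two distinct names collapse to the same normalized form. Everything here (NormalizationCollision, index_by_normal_form, reconcile, the sample paths) is a hypothetical illustration under Python 3, not code from bagit.py or from the pull request.

```python
import unicodedata
from collections import defaultdict


class NormalizationCollision(ValueError):
    """Raised when two distinct paths are identical after normalization."""


def index_by_normal_form(paths, form="NFC"):
    """Map each normalized path to its original spelling, rejecting collisions."""
    index = defaultdict(list)
    for path in paths:
        index[unicodedata.normalize(form, path)].append(path)
    collisions = {k: v for k, v in index.items() if len(set(v)) > 1}
    if collisions:
        raise NormalizationCollision(sorted(collisions))
    return {k: v[0] for k, v in index.items()}


def reconcile(missing, extra, form="NFC"):
    """Pair "missing" manifest paths with "extra" filesystem paths.

    Returns (matches, still_missing, still_extra), where matches maps the
    manifest spelling to the spelling actually found on the filesystem.
    """
    missing_index = index_by_normal_form(missing, form)
    extra_index = index_by_normal_form(extra, form)
    shared = missing_index.keys() & extra_index.keys()
    matches = {missing_index[k]: extra_index[k] for k in shared}
    still_missing = {missing_index[k] for k in missing_index.keys() - shared}
    still_extra = {extra_index[k] for k in extra_index.keys() - shared}
    return matches, still_missing, still_extra


# Illustrative data: one pair differs only in normalization, two are real problems.
missing = {"data/cera\u0301mica/a.pdf", "data/gone.txt"}
extra = {"data/cer\u00e1mica/a.pdf", "data/new.txt"}
matches, still_missing, still_extra = reconcile(missing, extra)
print(matches)
print(still_missing, still_extra)
```

Only the paths left in still_missing and still_extra would then need to be reported as validation problems, and the values in matches give the on-disk names to open when verifying checksums.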

--validation warnings related to diacritics #51