LibraryOfCongress / bagit-python

Work with BagIt packages from Python.
http://libraryofcongress.github.io/bagit-python
218 stars 85 forks source link

Links #120

Open edsu opened 5 years ago

edsu commented 5 years ago

Currently if you attempt to bag a directory that contains symbolic links bagit.py happily creates the bag, but the manifests will not include the linked files or directories.

bagit.py --validate will say this bag is valid, because the manifests don't mention the linked content, and the code that walks the filesystem during validation ignores links.

However when you are looking at the filesystem you can see there is content present that is not in the manifest. This seems to be contrary to 3.A.4:

For BagIt 1.0, every payload file MUST be listed in every payload manifest.

This PR includes changes to support following symbolic links when creating and validating bags:

bagit.py --follow-links mydir
bagit.py --validate --follow-links mydir

I think an argument could be made that follow links should be the default. I'm thinking of situations where bags are quite large and may use symlinks to manage large payload files/directories.

Is this a crazy idea?

kba commented 5 years ago

I see two different use cases for symbolic links:

1) Organize complex data within the payload dir 2) Avoid redundancy across bags by replacing file/dir with symlink to file/dir outside payload dir

For 1) the most reasonable behavior seems to follow the links as this PR implements and should indeed be the default.

The bahvior for 2) should only be enabled explicitly, e.g. while developing or for testing, with a flag like --allow-external-symlinks. If not enabled explicitly, symlinks pointing beyond the payload directory should be a validation error. Because while that bag is considered valid, serialized (which must keep symlinks intact so 1) won't break), transferred to another client and deserialized, it will not be valid anymore.

There's also security issues when allowing payload data to point outside the bag dir, data/legit.data pointing to /etc/passwd and such.

edsu commented 5 years ago

Yes. It seems simplest for bagit-python to trust the integrity of the filesystem, and if it can read files in the payload directory it should do so. Perhaps this is naive. But the current behavior where symlinks could be present in the payload and totally ignored by the bagging and validation process seems wrong.

coveralls commented 5 years ago

Coverage Status

Coverage decreased (-1.4%) to 82.166% when pulling 1dc1cc675372a9fc057fb63dfcb88719f97b3a78 on edsu:links into c39b650a80837a1d32314599143d0b5b04159248 on LibraryOfCongress:master.