Open edsu opened 5 years ago
I see two different use cases for symbolic links:
1) Organize complex data within the payload dir 2) Avoid redundancy across bags by replacing file/dir with symlink to file/dir outside payload dir
For 1) the most reasonable behavior seems to follow the links as this PR implements and should indeed be the default.
The bahvior for 2) should only be enabled explicitly, e.g. while developing or for testing, with a flag like --allow-external-symlinks
. If not enabled explicitly, symlinks pointing beyond the payload directory should be a validation error. Because while that bag is considered valid, serialized (which must keep symlinks intact so 1) won't break), transferred to another client and deserialized, it will not be valid anymore.
There's also security issues when allowing payload data to point outside the bag dir, data/legit.data
pointing to /etc/passwd
and such.
Yes. It seems simplest for bagit-python to trust the integrity of the filesystem, and if it can read files in the payload directory it should do so. Perhaps this is naive. But the current behavior where symlinks could be present in the payload and totally ignored by the bagging and validation process seems wrong.
Currently if you attempt to bag a directory that contains symbolic links bagit.py happily creates the bag, but the manifests will not include the linked files or directories.
bagit.py --validate
will say this bag is valid, because the manifests don't mention the linked content, and the code that walks the filesystem during validation ignores links.However when you are looking at the filesystem you can see there is content present that is not in the manifest. This seems to be contrary to 3.A.4:
This PR includes changes to support following symbolic links when creating and validating bags:
I think an argument could be made that follow links should be the default. I'm thinking of situations where bags are quite large and may use symlinks to manage large payload files/directories.
Is this a crazy idea?