Open ajnelson-nist opened 6 years ago
Bagit.py is designed to package entire directories. If it can’t do that the only safe thing to do is fail so the unreadable files can be corrected.
The long-term refactoring to make bagit more customizable might allow you to create a subclass which could implement custom ignore logic but I haven’t had time to work on that in awhile.
Could there instead be a flag that allows broken soft links to be preserved, instead of failing the whole process? In the application where I'm using bags, "correcting" files is not an option. They are legitimately included as "direction" files.
v1.0 of the BagIt specification doesn't preclude the use of links in the payload directory, it simply requires the filename be readable as binary data (2.1.2):
Each payload file is treated as an opaque octet stream when verifying file correctness.
In addition to the constraint in _can_read
identified by @ajnelson-nist above, the current version of bagit-python does not allow symbolic links (broken or not) to exist in the payload directory. Both bagging and validation operations use Bag._path_is_dangerous
to sanity check the path, and if the link points outside the payload directory it will fail.
We could consider relaxing bagging and validation in bagit-python to allow symbolic links, since they don't seem as dangerous as things like .../../
, /etc/passwd
, etc which Bag._is_dangerous
guards against. In fact, symlinks could actually be useful in some situations, like when managing storage volumes.
However if a symbolic link in the payload directory is broken, it is not possible to read it as an opaque octet stream, and thus it is an invalid payload file name. It seems like this use case is out of scope for the bagit-python utility, at least with respect to the v1.0 specification.
I do like the idea of a customizable bagit-python that could allow the Bag
class to be extended in various ways for custom behavior. For example you could imagine an allow_broken_symlinks
option that would create a new tag file broken_symlinks.txt
that could be populated during bagging and consulted during verification.
@acdha I'd be interested in learning more about the refactor you had planned, and if I could perhaps help with it. Full disclosure, I am working with @ajnelson-nist at NIST on the use of BagIt within the National Software Reference Library. Perhaps me doing a heavy lift on the refactor you had in mind could be mutually amenable to LC and NIST?
Just to be transparent about this, I met up with @acdha @dbrunton and Liz Madden from LC and we discussed the possibility of me contributing to a v2.0 version of bagit-python that would be backward compatible with the existing implementation but provide a more modular approach that could be easier to extend for various purposes (like the use case this issue is concerned with).
I'm going to put together a proposal for how we talked about moving things around to get some feedback from people who are using bagit-python.
Hello,
I encountered an issue today with
bagit.py
failing to deal with a broken soft link, and halting bagging an otherwise-intact file system tree. I was attempting to use the script on a directory tree that included a soft-linked file apparently meant to be set at a later date (e.g..../foo.cfg
pointing to.../config/populated_by_user/foo.cfg
, similar to what the Apache webserver does with config files). The execution environment in this case is in a POSIX-interfaced file system.The problem in
bagit.py
appears to stem from the function_can_read
, on (today's) Line 1362. The broken soft link in this case pointed at a non-existent directory, raising an error on Line 206.I suggest that a broken soft link should not prevent a directory tree from being bagged. It may be better for
_can_read
to only report actual directories and files that are unreadable, possibly with broken links as a new third output.If helpful, there is a script that converts a file system walk (via
os.walk
) to DFXML, and it has an if-ladder that goes through all file-system-level file types, not just directories and regular files. See thewalk_to_dfxml.py
functionfilepath_to_fileobject
, and all assignment statements matchingname_type =
(starting on Line 36 today). You may want_can_read
to skip operating on other file types as well.--Alex