LibraryOfCongress / bagit-python

Work with BagIt packages from Python.
http://libraryofcongress.github.io/bagit-python
218 stars 85 forks source link

Allow bag in-place on empty directories. #100

Closed mikedarcy closed 6 years ago

mikedarcy commented 6 years ago

As of the 1.6.x release series, it is no longer possible to create a bag in-place from an empty directory or directory path that does not contain any payload files. This worked in 1.5.4 and previous versions and feels like a regression.

For example:

mkdir bag
bagit ./bag/

Yields:

2017-12-12 22:14:24,032 - INFO - Creating bag for directory /home/mdarcy/bag
2017-12-12 22:14:24,032 - INFO - Creating data directory
2017-12-12 22:14:24,033 - INFO - Moving data to /home/mdarcy/bag/tmpMP6HXC/data
2017-12-12 22:14:24,033 - INFO - Moving /home/mdarcy/bag/tmpMP6HXC to data
2017-12-12 22:14:24,033 - INFO - Using 1 processes to generate manifests: sha256, sha512
2017-12-12 22:14:24,033 - ERROR - An error occurred creating a bag in /home/mdarcy/bag
Traceback (most recent call last):
  File "/usr/bin/bagit.py", line 207, in make_bag
    encoding=encoding)
  File "/usr/bin/bagit.py", line 1148, in make_manifests
    raise RuntimeError(_('Expected the same number of files for each checksum'))
RuntimeError: Expected the same number of files for each checksum
2017-12-12 22:14:24,033 - ERROR - Failed to create bag in ./bag/: Expected the same number of files for each checksum
Traceback (most recent call last):
  File "/usr/bin/bagit.py", line 1419, in main
    checksums=args.checksums)
  File "/usr/bin/bagit.py", line 207, in make_bag
    encoding=encoding)
  File "/usr/bin/bagit.py", line 1148, in make_manifests
    raise RuntimeError(_('Expected the same number of files for each checksum'))
RuntimeError: Expected the same number of files for each checksum

The following error checking logic in make_manifests causes this:

    if len(file_count_set) != 1:
        raise RuntimeError(_('Expected the same number of files for each checksum'))

    if len(byte_value_set) != 1:
        raise RuntimeError(_('Expected the same number of bytes for each checksums'))

In our application, we make use of the ability to create a bag in-place without payload files in order to create entirely "holey" bags where all payload files are referenced via fetch.txt. We do this by first creating a empty bag with no payload as a placeholder, and then later create/update the bag manifests with the respective file checkums from our fetch.txt entries.

We think that just as an empty directory is a valid directory (although arguably not very useful), so should a bag without any payload be. As long as the manifests don't contain any entries that reference the empty directory(ies), then the bag is still spec-compliant.

This PR adds code to test if the payload is empty and bypass the set-based error check in such a case. It restores functionality from previous versions of bagit-python which, in our opinion, is useful and relatively harmless.

If this behavior is to be intentionally removed from 1.6.x and onward, then it might be worth clarifying the exception handling a bit so that a user can better understand why this exception is being thrown in the specific "empty bag" case.

coveralls commented 6 years ago

Coverage Status

Coverage decreased (-0.1%) to 83.087% when pulling 5981c18e98907d69bf5557cf436b76b7c056df1a on mikedarcy:master into b6d1c26ee5092452e6ba7f6c2bdad27fc9e60f85 on LibraryOfCongress:master.

coveralls commented 6 years ago

Coverage Status

Coverage increased (+0.06%) to 83.272% when pulling 110ae136e64b4a28a8b1ee7d31c3b4204674dbb9 on mikedarcy:master into b6d1c26ee5092452e6ba7f6c2bdad27fc9e60f85 on LibraryOfCongress:master.

prairiewest commented 6 years ago

Forgive my drive-by comment, but I was just reading this spec today and happened on this PR. According to the spec, empty directories are not allowed:

Payload manifests only include the pathnames of files. Because of this, a payload manifest cannot reference empty directories. To account for an empty directory, a bag creator may wish to include at least one file in that directory; it suffices, for example, to include a zero-length file named ".keep".

I agree that the exception should be caught, though.

I also agree that your use case is a good one (creating an empty bag and then adding to it), so alternatively the spec should be updated to support the case. It did seem strange to me on reading that they specifically disallowed empty directories.

acdha commented 6 years ago

It seems like this should be technically valid as long as the manifest is completely empty, which would only allow top-level directories.

Longer-term, this seems like a good topic for bagit 2.0. We've been trying to hit 1.0 and one of the goals there was not breaking compatibility with older text format used in the early history of the spec. For 2.0 I'd like to propose dropping support with the md5deep, et al. tools so we could consider things like tracking directories or even filesystem permissions.

mikedarcy commented 6 years ago

Any update on this @acdha? We have a downstream dependency on this issue and would like to push a release of our software that coincides with a bagit-python release containing this fix.

acdha commented 6 years ago

v1.6.3 is on PyPI now

mikedarcy commented 6 years ago

Many thanks!