LibraryOfCongress / bagit-python

Work with BagIt packages from Python.
http://libraryofcongress.github.io/bagit-python
216 stars 85 forks source link

Hashlib paradigm inconsistent with Bagit spec, but also causing runtime errors #158

Open ross-spencer opened 2 years ago

ross-spencer commented 2 years ago

Bagit-python is currently using a generic pattern to import the list of checksums available to the tool to be used:

https://github.com/LibraryOfCongress/bagit-python/blob/bff20d27ee191a1a21381bcf633724ef7b36f727/bagit.py#L125-L131

This creates a help-command with the following in Python 3.9.6.

--sha3_224 
--sha3_256 
--blake2s 
--sha3_384 
--sha384 
--md5 
--sha224 
--shake_256 
--sha256 
--shake_128 
--sha3_512 
--blake2b 
--sha512 
--sha1 

@Pwinckles brings up something similar here: https://github.com/LibraryOfCongress/bagit-python/issues/156 I am not clear what is invalid about those names - that being said:

The spec states:

To avoid future ambiguity, the checksum algorithm SHOULD be registered in IANA's "Named Information Hash Algorithm Registry"

The most obvious example of algorithm's not in the registry at time of writing are those from the shake family. Further, there is ambiguity in the Blake algorithms presented by Python's hashlib.

Errors caused by shake

Additionally, the shake algorithms require a length argument not handled by the code: https://docs.python.org/3/library/hashlib.html#shake-variable-length-digests

Causing the following when selected.

Traceback (most recent call last):
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 241, in make_bag
    total_bytes, total_files = make_manifests(
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1254, in make_manifests
    checksums = [manifest_line_generator(i) for i in _walk(data_dir)]
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1254, in <listcomp>
    checksums = [manifest_line_generator(i) for i in _walk(data_dir)]
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1412, in generate_manifest_lines
    results = [
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1413, in <listcomp>
    (alg, hasher.hexdigest(), decoded_filename, total_bytes)
TypeError: hexdigest() missing required argument 'length' (pos 1)
2022-05-13 10:38:36,341 - ERROR - Failed to create bag in govdoc-gifs: hexdigest() missing required argument 'length' (pos 1)
Traceback (most recent call last):
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1599, in main
    make_bag(
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 241, in make_bag
    total_bytes, total_files = make_manifests(
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1254, in make_manifests
    checksums = [manifest_line_generator(i) for i in _walk(data_dir)]
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1254, in <listcomp>
    checksums = [manifest_line_generator(i) for i in _walk(data_dir)]
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1412, in generate_manifest_lines
    results = [
  File "C:\temp\testing\bags\venv\lib\site-packages\bagit.py", line 1413, in <listcomp>
    (alg, hasher.hexdigest(), decoded_filename, total_bytes)
TypeError: hexdigest() missing required argument 'length' (pos 1)

While it's nice there's some extensibility for free, perhaps, as this issue can come up again in future, the algorithms can be provided from a known list compatible with IANA today, and Python, and perhaps cross-referenced against, checksums_guaranteed instead?

acdha commented 2 years ago

If memory serves, that bit of code predates the work on the RFC which strengthened the guidance around things like this. If you'd like to send a pull-request I think it would definitely make sense to trim that down to be only the list of non-draft algorithms from https://www.iana.org/assignments/named-information/named-information.xhtml.

ross-spencer commented 2 years ago

That makes sense @acdha. I'll have a look and see if I can submit something. Folks reading this list might want to avail themselves on whether they're going to be affected by any of the changes, i.e. checking the IANA link above for any checksums removed they may be using.