APTrust / dart

Create bags based on BagIt profiles and send them off into the ether (EasyStore is now DART)
BSD 2-Clause "Simplified" License

Manifest filename escaping #509

Open diamondap opened 2 years ago

diamondap commented 2 years ago

This is a general bug filed against the manifest spec at https://github.com/LibraryOfCongress/bagit-spec/issues/46.

TLDR: No existing baggers correctly implement the BagIt spec's filename percent-encoding requirements. Fixing this in DART likely means breaking compatibility with all other baggers and validators.
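
For reference, here is a minimal sketch of the encoding side. It assumes the requirement in question is the BagIt 1.0 (RFC 8493) rule that CR, LF, and the percent sign in a file name are percent-encoded when written to a manifest. The function name `encode_manifest_path` is hypothetical and is not taken from DART or any other bagger.

```python
def encode_manifest_path(path: str) -> str:
    """Percent-encode the characters assumed to require it in a manifest path.

    The percent sign is replaced first so that the '%' characters introduced
    by the CR and LF replacements are not encoded a second time.
    """
    return (
        path.replace("%", "%25")
        .replace("\r", "%0D")
        .replace("\n", "%0A")
    )


if __name__ == "__main__":
    # A file literally named "test%201.txt" inside the payload directory.
    print(encode_manifest_path("data/test%201.txt"))
```

An implementation that skips this step would write the path to the manifest unchanged, which is the behavior described above for existing baggers.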

Wait to see how the community moves on this.

pwinckles commented 2 years ago

I was just starting to file issues for implementations. I would also note:

  1. If desired, it would be possible to support validating bags whose manifests are not encoded correctly by falling back to the old behavior when spec-compliant validation fails.
  2. Paths in fetch.txt are supposed to be encoded the same way as manifest paths. I have not surveyed fetch.txt implementations, but I would be surprised if there were not a general problem here as well. I'm not sure how this could be addressed in a backward-compatible way.
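
A rough sketch of the fallback in point 1, covering only path resolution (checksum comparison is left out). The function name `resolve_manifest_entry` and the idea of passing the payload as a set of paths are assumptions made for illustration, not how any existing validator is structured.

```python
from urllib.parse import unquote


def resolve_manifest_entry(entry, payload_paths):
    """Return the payload path a manifest entry refers to, or None.

    First try the spec-compliant reading (percent-decode the entry).
    If that does not match a payload file, fall back to the old behavior
    and treat the entry as a literal, unencoded path.
    """
    decoded = unquote(entry)
    if decoded in payload_paths:
        return decoded
    if entry in payload_paths:
        return entry
    return None


if __name__ == "__main__":
    payload = {"data/test%201.txt"}
    print(resolve_manifest_entry("data/test%25201.txt", payload))  # correctly encoded entry
    print(resolve_manifest_entry("data/test%201.txt", payload))    # entry from an existing bagger
```

The fallback can pick the wrong file if the payload happens to contain both the decoded and the literal name, so a real validator would probably still need the checksum to confirm the match.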
diamondap commented 2 years ago

Thanks. The fallback for backwards compatibility is a good idea.

pwinckles commented 2 years ago

It occurred to me that another approach to backward-compatible validation would be to decode only %0D, %0A, and %25 when decoding paths in manifests. Normally, when percent-decoding, you'd decode all encoded characters as described here. However, by decoding only these three, a correct BagIt 1.0 implementation would still be able to validate most bags produced by existing implementations.

For example, if a bag contains the file test%201.txt, then an existing implementation would write it to the manifest as data/test%201.txt when it should actually be data/test%25201.txt. However, if you only decode %0D, %0A, and %25, then the paths are equivalent.

This approach does not work for files whose names naturally include one of these three strings. For example, if a bag has a file named test%251.txt, existing implementations would write it to the manifest as data/test%251.txt when it should be data/test%25251.txt. These paths are not equivalent under the restricted decoding: the first decodes to data/test%1.txt and the second decodes to data/test%251.txt.

While not perfect, I think this approach would greatly improve validation compatibility.
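
To make the two cases above concrete, here is a minimal sketch of the restricted decoding. The function name `restricted_decode` is hypothetical. It uses a single regex pass so that a character produced by decoding %25 is never decoded again.

```python
import re

# Match only the three sequences discussed above; hex digits are
# accepted in either case (%0d and %0D are treated the same).
_RESTRICTED = re.compile(r"%(0[Dd]|0[Aa]|25)")


def restricted_decode(path: str) -> str:
    """Decode only %0D, %0A, and %25, leaving every other %XX untouched."""
    return _RESTRICTED.sub(lambda m: chr(int(m.group(1), 16)), path)


if __name__ == "__main__":
    # File named test%201.txt: both manifest forms decode to the same path.
    print(restricted_decode("data/test%201.txt"))    # as written by existing baggers
    print(restricted_decode("data/test%25201.txt"))  # as the spec requires

    # File named test%251.txt: the two manifest forms decode differently.
    print(restricted_decode("data/test%251.txt"))
    print(restricted_decode("data/test%25251.txt"))
```

The first pair resolves to the same path, so a validator using this decoder accepts either manifest. The second pair does not, which is the limitation noted above.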