bagit-profiles / bagit-profiles-specification

https://bagit-profiles.github.io/bagit-profiles-specification/

Support an only-holes profile. #31

Closed paulmillar closed 1 year ago

paulmillar commented 1 year ago

The forthcoming v4.5 of the DataCite metadata scheme introduces the concept of a Distribution. A Distribution identifies a way to obtain the underlying data of some dataset. For datasets with multiple files and where (for whatever reason) the dataset files are not available as a single archive (e.g., as a .zip file), DataCite recommends that there be a BagIt file that describes the dataset. The URL for this BagIt file is then included in the Distribution for this dataset.

The preferred format for this BagIt file is that it contains no data: the /data directory exists but is empty. The fetch.txt file must exist and is populated with details of the files of the dataset. More details are available in the Using Distribution for a collection of files draft section.
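As a rough illustration of that layout, here is a minimal Python sketch that checks whether an unpacked bag is "only holes". It assumes the fetch.txt line format from the BagIt specification (URL, then a length or `-`, then the file path, separated by whitespace). The names `parse_fetch_line` and `is_only_holes_bag` are made up for this example and do not come from any BagIt library.

```python
from pathlib import Path


def parse_fetch_line(line):
    """Split one fetch.txt line into (url, length, path).

    Assumes the 'URL LENGTH FILEPATH' layout; LENGTH may be '-' when unknown.
    """
    url, length, path = line.strip().split(None, 2)
    return url, (None if length == "-" else int(length)), path


def is_only_holes_bag(bag_dir):
    """Return True if data/ exists and is empty and fetch.txt lists at least one file."""
    bag = Path(bag_dir)
    data_dir = bag / "data"
    fetch_file = bag / "fetch.txt"
    # The data directory must exist but hold no entries at all.
    if not data_dir.is_dir() or any(data_dir.iterdir()):
        return False
    if not fetch_file.is_file():
        return False
    lines = fetch_file.read_text(encoding="utf-8").splitlines()
    entries = [parse_fetch_line(line) for line in lines if line.strip()]
    return len(entries) > 0
```

This only inspects the directory layout; it does not download anything or verify manifests, which a real validator would also need to do.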

With this description, DataCite is describing a profile of BagIt. The semantics of BagIt are not modified, but additional constraints are placed on the allowed values.

I believe it is currently not possible to describe this DataCite profile completely using the language of this profile specification. Prima facie, there appear to be two problems:

ruebot commented 1 year ago

  1. Would adding Fetch.txt-Required: true|false address the first issue? I believe that's an easy fix.
  2. Is an empty /data a valid bag? To me, the spirit of 2.1.2 reads like at least something has to be in the directory, maybe a .ignore file?

jscancella commented 1 year ago

> 1. Would adding Fetch.txt-Required: true|false address the first issue? I believe that's an easy fix.
> 2. Is an empty /data a valid bag? To me, the spirit of 2.1.2 reads like at least something has to be in the directory, maybe a .ignore file?

During my time at LoC they did have the idea of a "holey" bag that was purely fetch.txt files. But in practice they didn't work well. Personally I don't think using BagIt for archiving is a good idea, and the fetch file is an ugly historical hack (just my $0.02). But to answer your second question - yes, that is valid (https://github.com/LibraryOfCongress/bagit-conformance-suite/tree/master/v0.97/valid/holey-bag)

paulmillar commented 1 year ago

@ruebot yes, I think adding Fetch.txt-Required would be fine: an easy fix.

@jscancella Thanks for the link to the valid holey-bag example and your comments.

As an aside:

I'm not convinced that BagIt is the right decision for DataCite. Independently of DataCite, I was trying to solve this distribution problem and was planning to use Metalink instead. Then I came across DataCite's v4.5 request-for-comments page. I think Metalink has a few advantages over a holey-bag (principally, the possibility to include multiple URLs per file), but otherwise they're somewhat similar ... at least, on paper.

My concern comes from a feeling that the completely holey bag isn't really what people had in mind: it's off the beaten path, so likely to hit bugs and other problems. For example, a dataset might link to a regular (non-holey) BagIt file as the Distribution URL, so how should a client distinguish between these two cases without downloading the file and looking at the contents? Profiles might answer that problem, but there's currently no place in DataCite metadata to put the BagIt profile (other than the MIME-Type, but then that's another problem with BagIt: no MIME-Type).

So, in general, your comments chime with my fears about this direction.

I'm not sure whether to plough on with the plans to use Metalink, or to switch to using BagIt (following DataCite direction) and introduce a new tag to hold alternative URLs. Therefore, this issue is me hedging my bets ;-)

jscancella commented 1 year ago

I know lots of places use holey bags as part of their archiving. But again, just my opinion, I think that is the wrong choice. BagIt is great when you are sending big hard drives across the country. But for preserving and making data available there are better options that deal with bit-rot, availability, redundancy, etc.

ruebot commented 1 year ago

Oh right! I completely forgot about the holey bags work at LoC. Thanks for the reminder @jscancella.

I guess we could do something like: `Data-Empty: true|false` (allow or forbid a "holey" bag)? How's that sound?
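To make the proposal a bit more concrete, a validator could treat the two keys roughly as in the sketch below. The semantics here are an assumption based on this thread, not settled wording: `Fetch.txt-Required: true` is taken to mean that fetch.txt must be present, and `Data-Empty: true` that the data directory must contain no entries. The `check_holey_constraints` helper and its error strings are hypothetical and not taken from any existing validator.

```python
from pathlib import Path


def check_holey_constraints(profile, bag_dir):
    """Return a list of violations of the two proposed profile keys.

    Assumed semantics (not confirmed by the thread):
      - "Fetch.txt-Required": True  -> fetch.txt must be present
      - "Data-Empty": True          -> data/ must contain no entries
    A missing or False key imposes no constraint.
    """
    bag = Path(bag_dir)
    errors = []
    if profile.get("Fetch.txt-Required", False) and not (bag / "fetch.txt").is_file():
        errors.append("fetch.txt is required but missing")
    if profile.get("Data-Empty", False):
        data_dir = bag / "data"
        # Only complain when data/ exists and has something in it.
        if data_dir.is_dir() and any(data_dir.iterdir()):
            errors.append("data/ must be empty")
    return errors
```

A DataCite-style "only holes" profile would then set both keys to true, for example `check_holey_constraints({"Fetch.txt-Required": True, "Data-Empty": True}, bag_dir)`, and accept the bag only when the returned list is empty.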

paulmillar commented 1 year ago

Thanks for the pull request @ruebot .

Overall, the change looks good.

I've added a few comments where I think the language could be tightened slightly.

ruebot commented 1 year ago

@paulmillar updated with your feedback. Let me know if that works now.

paulmillar commented 1 year ago

Thanks @ruebot. The update looks good to me. Let's see what the reviewers say.