bagit-profiles / bagit-profiles-specification

https://bagit-profiles.github.io/bagit-profiles-specification/

Add Payload-Files-Required? #36

Closed stain closed 9 months ago

stain commented 1 year ago

Hi, I am trying to update our Research Object BagIt Profile to match our RO-Crate requirements for BagIt.

One of the big changes is that we moved our metadata file from metadata/metadata.json to be inside data/ro-crate-metadata.json (to avoid rewriting paths to "../data" if archiving an existing RO-Crate).

However, in BagIt Profiles there is no payload equivalent to Tag-Files-Required, so I can't express any requirements for the data/ folder. It would technically be possible to do "Tag-Files-Required": ["data/ro-crate-metadata.json"], but even if these tag files need not be listed in tag manifest files, being in data/ this is a payload file, not a tag file.

I would suggest adding Payload-Files-Required -- and I guess for consistency Payload-Files-Allowed:

  1. Payload-Files-Required: LIST
    A list of payload files and/or directories that MUST be included in a conformant Bag. Entries are full path names relative to the Bag base directory, e.g. data/LICENSE.txt or data/src/. Paths requiring directory existence MUST end with /, indicating that in conformant Bags the directory MUST be present with at least one file or subdirectory (which may be .keep to indicate an empty directory).
    Every file in Payload-Files-Required MUST also be covered by the path patterns in Payload-Files-Allowed.

  2. Payload-Files-Allowed: LIST
    A list of payload files or directories that MAY be included in a conformant Bag. Each entry MUST be either a full path name relative to the bag base directory or a path name pattern in which asterisks can represent zero or more characters (cf. glob(7)). Entries that permit a directory MUST end with /* (not /).
    Conformant bags MUST NOT contain any payload files that do not match the specified Payload-Files-Allowed. If Payload-Files-Allowed is not provided, its value is assumed to be ['*'], i.e. all payload files are allowed.
    At least all the payload paths listed in Payload-Files-Required MUST be covered by the list of path patterns in Payload-Files-Allowed.
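A minimal sketch of how a validator might apply the two proposed fields, assuming the payload paths have already been collected as bag-relative strings. The function names `matches` and `check_payload` are hypothetical, the profile is modelled as a plain dict, and a directory entry such as data/src/ is interpreted here as "at least one payload path starts with this prefix". None of this is part of the specification.

```python
import re


def matches(pattern, path):
    """Asterisk-only glob: '*' stands for zero or more characters, '/' included."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.fullmatch(regex, path, flags=re.DOTALL) is not None


def check_payload(profile, payload_paths):
    """Return a list of problems; an empty list means the payload conforms."""
    problems = []
    required = profile.get("Payload-Files-Required", [])
    # Default taken from the proposal text: everything is allowed if unset.
    allowed = profile.get("Payload-Files-Allowed", ["*"])

    for entry in required:
        if entry.endswith("/"):
            # Assumption: a required directory needs at least one file below it.
            present = any(p.startswith(entry) for p in payload_paths)
        else:
            present = entry in payload_paths
        if not present:
            problems.append("missing required payload entry: " + entry)

    for path in payload_paths:
        if not any(matches(pattern, path) for pattern in allowed):
            problems.append("payload file not allowed: " + path)
    return problems


profile = {
    "Payload-Files-Required": ["data/ro-crate-metadata.json", "data/src/"],
    "Payload-Files-Allowed": ["data/ro-crate-metadata.json", "data/src/*"],
}
print(check_payload(profile, ["data/ro-crate-metadata.json", "data/src/main.py"]))
print(check_payload(profile, ["data/ro-crate-metadata.json", "data/notes.txt"]))
```

The matcher deliberately supports only the asterisk, since that is the only wildcard the proposal describes; a real implementation might choose fuller glob(7) semantics instead.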

BTW, I think we need to add the "Conformant bags MUST NOT contain" phrase to entry 12 on "Tag-Files-Allowed" as well, since technically the use of "MAY" means the files are optional; it doesn't specify that other files are not permitted!

ruebot commented 1 year ago

I'm ok with this, and it makes sense. @tdilauro @jscancella do you have any concerns if all this was added?

ruebot commented 1 year ago

Other question -- do we want to create a new version with this (1.5.0)? Or lump it in with all the updates we just did for #31, and call it all 1.4.0?

tdilauro commented 1 year ago

Other question -- do we want to create a new version with this (1.5.0)? Or lump it in with all the updates we just did for https://github.com/bagit-profiles/bagit-profiles-specification/issues/31, and call it all 1.4.0?

I think it's fine to include in a 1.4.0 (or 1.4.1) release. Since we do have releases, I think it's fair to not consider a spec version to be published until it has been released.

tdilauro commented 1 year ago

@stain Regarding the presence of a .keep file to indicate an empty directory, I think that is an implementation detail to accommodate current archive formats that don't track empty directories. I don't think we should necessarily include it in the specification.

tdilauro commented 1 year ago

@stain Would you share your rationale for the distinction between the directory suffixes / and /*?

jscancella commented 1 year ago

I'm ok with this, and it makes sense. @tdilauro @jscancella do you have any concerns if all this was added?

I think it can work, but really, at what point are we duplicating the payload manifest file? I don't know, to me this just seems a bit too far (but that's just my $0.02).

tdilauro commented 1 year ago

@jscancella This is for the bag profile, so these would effectively be saying what must appear in the payload manifest files (directories excluded, of course). The payload manifest is used to help determine if a bag is a valid bag, whereas this would be used to help determine if a valid bag conforms to a specific profile.
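To make that distinction concrete, here is a rough sketch of how a profile checker could take its input from a payload manifest rather than from the file system. It assumes the RFC 8493 line layout (checksum, whitespace, file path) and ignores the percent-encoding of special characters in paths. The helper names `manifest_paths` and `implied_directories` and the sample checksums are made up for illustration.

```python
import posixpath


def manifest_paths(manifest_text):
    """Extract the payload file paths listed in a manifest-<alg>.txt body."""
    paths = []
    for line in manifest_text.splitlines():
        if not line.strip():
            continue
        # Split once only, so that paths containing spaces stay intact.
        _checksum, path = line.split(None, 1)
        paths.append(path)
    return paths


def implied_directories(paths):
    """Directories that must exist because a listed file lives inside them."""
    dirs = set()
    for path in paths:
        parent = posixpath.dirname(path)
        while parent:
            dirs.add(parent + "/")
            parent = posixpath.dirname(parent)
    return dirs


manifest = (
    "aaaa  data/ro-crate-metadata.json\n"
    "bbbb  data/src/main.py\n"
)
files = manifest_paths(manifest)
print(files)
print(sorted(implied_directories(files)))
```

The manifest only ever names files, so directory requirements can at best be inferred from the file paths, which is why an empty directory cannot be verified this way.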

jscancella commented 1 year ago

@tdilauro thanks. To be clear, I understand the reasoning, I just think it is going too far. But I am just 1 person on this team, and if everyone else agrees that's ok!

tdilauro commented 1 year ago

Another question for everyone here: How do these constraints interact with "holey" bags / fetch.txt? My assumption is that a "holey" bag should be allowed to work the way that it is designed to work, but the current language would require these files/directories to be present in the bag. Requiring/allowing the files to be in the payload manifest(s), rather than present in the bag, could solve this problem for files, but would not work for directories. 🧐

tdilauro commented 1 year ago

@ruebot @jscancella I'm not working in the digital preservation/archiving space these days, but I'm wondering if there are opportunities for broader community engagement on changes to these specs.

jscancella commented 1 year ago

Yeah, I haven't worked in this space either for some time (currently I am at USPTO). I think adding new team members or otherwise engaging the broader community would be a great idea.

ruebot commented 1 year ago

I'm still in the space, but don't really work with bags anymore. So, it might be worth asking for more feedback on the digital curation list.

stain commented 1 year ago

Another question for everyone here: How do these constraints interact with "holey" bags / fetch.txt?

Upstream BagIt is not much of a fan of holey bags anymore, but I am (and they are used in bdbag together with ARK identifiers). So I would consider the profile and Payload-Files-Required to apply after completing the bag from fetch.txt, which would be needed anyway to successfully validate the manifest files.
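As a rough illustration of "apply after completing the bag", a checker could compute the payload paths that will exist once every fetch.txt entry has been retrieved and evaluate the profile against that set. The sketch assumes the RFC 8493 fetch.txt layout (URL, length or "-", file path, separated by whitespace). The helper names and the example.org URL are hypothetical, and nothing is downloaded here.

```python
def fetch_entries(fetch_text):
    """Parse fetch.txt into (url, length_or_None, path) tuples."""
    entries = []
    for line in fetch_text.splitlines():
        if not line.strip():
            continue
        url, length, path = line.split(None, 2)
        # A length of "-" means the size is not stated.
        entries.append((url, None if length == "-" else int(length), path))
    return entries


def completed_payload(present_paths, fetch_text):
    """Payload paths expected once all fetch.txt entries have been fetched."""
    fetched = {path for _url, _length, path in fetch_entries(fetch_text)}
    return sorted(set(present_paths) | fetched)


fetch_txt = "https://example.org/files/big.dat 1024 data/raw/big.dat\n"
print(completed_payload(["data/ro-crate-metadata.json"], fetch_txt))
```

A profile check run against this combined list would treat a holey bag the same as its completed form, which matches the order of operations suggested above.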

stain commented 1 year ago

@stain Would you share your rationale for the distinction between the directory suffixes / and /*?

I think it's just for consistency with the rest of the profile: Tag-Files-Allowed allows globs, while Tag-Files-Required only takes exact paths. It also avoids Payload-Files-Allowed permitting both foo/ and foo/*. Likewise, Payload-Files-Required insists on a trailing / so you can't have an ambiguous file-or-folder requirement foo.

Although BagIt can't record empty directories [without .keep](https://datatracker.ietf.org/doc/html/rfc8493#section-2.1.3) (as there's no checksum to add to the manifest), I would not want to mandate that .keep exists, as the directory may be non-empty.

(This means the profile can't mandate an empty directory, which I think is OK)

Thus the reasoning is that if a directory is permitted, it must contain something, and hence the *.
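One consequence of the / versus /* convention is that the proposed rule "every Payload-Files-Required entry must be covered by Payload-Files-Allowed" can be checked with plain pattern matching, because a required data/src/ matches an allowed data/src/* when the asterisk stands for zero characters. The following is only a sketch under that reading; `matches` and `uncovered_required` are hypothetical names.

```python
import re


def matches(pattern, path):
    """Asterisk-only glob: '*' stands for zero or more characters, '/' included."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.fullmatch(regex, path, flags=re.DOTALL) is not None


def uncovered_required(profile):
    """Required entries that no Payload-Files-Allowed pattern covers."""
    allowed = profile.get("Payload-Files-Allowed", ["*"])
    return [
        entry
        for entry in profile.get("Payload-Files-Required", [])
        if not any(matches(pattern, entry) for pattern in allowed)
    ]


profile = {
    "Payload-Files-Required": ["data/LICENSE.txt", "data/src/"],
    "Payload-Files-Allowed": ["data/src/*"],
}
print(uncovered_required(profile))  # ['data/LICENSE.txt']
```

Here data/src/ is covered while data/LICENSE.txt is not, so this hypothetical profile would be inconsistent with the proposed rule until data/LICENSE.txt is added to Payload-Files-Allowed.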

stain commented 1 year ago

@stain Regarding the presence of a .keep file to indicate an empty directory, I think that is an implementation detail to accommodate current archive formats that don't track empty directories. I don't think we should necessarily include it in the specification.

Not just an implementation detail, but the filename suggested by the BagIt spec. We can nevertheless change this to just say "placeholder file" and refer back to RFC 8493 section 2.1.3:

A manifest MUST NOT reference directories. Bag creators who wish to create an otherwise empty directory have typically done so by creating an empty placeholder file with a name such as ".keep".