bagit-profiles / bagit-profiles-specification

https://bagit-profiles.github.io/bagit-profiles-specification/
Other
35 stars 11 forks source link

Draft of #31 updates for feedback. #35

Closed ruebot closed 1 year ago

ruebot commented 1 year ago

I'm torn on how to implement @paulmillar's first issue in #31.

This I think would be cleaner, but can break backwards compatibility since we'd be removing an implementation detail.

Fetchable: forbidden|required|optional

Allow, forbid or require fetching of Bags. Default is optional.

I ended up implementing it as it is in the PR, since it's just adding an additional detail, and it works in conjunction with the existing one, instead of replacing it. If we want to go with the cleaner option, let me know.

@paulmillar please feel free to provide feedback as well since I can't tag you as a reviewer here. I definitely want your thumbs up before we merge anything.

ruebot commented 1 year ago

@tdilauro @jscancella y'all good with this? If so, I can squash and merge.

tdilauro commented 1 year ago

@ruebot I think this is fine as far as the spec goes.

One (minor?) point concerns one of the constraints added by what @paulmillar rightly notes as a DataCite profile of BagIt. The DataCite spec specifically says 'A TSV file named “fetch.txt.', whereas the BagIt spec allows spaces or tabs. I don't think that should hold us up.

One other concern is that an absolutely empty data directory might not be visible in some archive formats, since they are recording files. This might've been the origin of your thought of an .ignore file. I remember using .keep, in a similar vein.

paulmillar commented 1 year ago

Thanks for the feedback @tdilauro .

On TSV vs "spaces or tabs", this is something we could report back to DataCite. The may be able to fix this before releasing v4.5 of the metadata specification. Even if they don't, I (too) don't think this is a showstopper.

On /data containing place-holder files to prevent the directory being garbage collected. I think you have a good point.

How about the following:

If "true", the /data directory must hold the minimum contents for the directory to be included in the archive. If "false", no constraints are placed on the /data directory. Default is "false".

ruebot commented 1 year ago

I'd good with the suggested language change to take care of the empty directory issue.

tdilauro commented 1 year ago

Yeah, I'm good with this, too.

ruebot commented 1 year ago

Updated.

ruebot commented 1 year ago

@tdilauro forgot about the bag-requirements-section for ABNF. @jscancella added that a while back, and I'm not 100% familiar with it. I take a stab at adding what is missing there.

ruebot commented 1 year ago

Updated.

ruebot commented 1 year ago

Awesome! Thanks @tdilauro.

I'll give it a day or so for @jscancella. He seemed good with proceeding, so I'll assume he'll be fine with this merge if we don't hear from him.

jscancella commented 1 year ago

Awesome! Thanks @tdilauro.

I'll give it a day or so for @jscancella. He seemed good with proceeding, so I'll assume he'll be fine with this merge if we don't hear from him.

My apologies, I have been extremely busy with work and life events! I still need to update my implementation for this. Do you still need my help with the ABNF stuff?

jscancella commented 1 year ago

Sorry, just thinking about this more:

Data-Empty: true|false

If "true", the `/data` directory must hold the minimum contents for the directory to be included in the archive. If "false", no constraints are placed on the `/data` directory. Default is "false".

So if I set this to true and run it through a validator what does that mean? Is it enough to test if data directory contains at least 1 file?

ruebot commented 1 year ago

No worries @jscancella! I think we got the ABNF sorted. But, let me know if you think we missed anything.

So if I set this to true and run it through a validator what does that mean? Is it enough to test if data directory contains at least 1 file?

I take it to mean the validator would look for a dot file such as .keep or .ignore, or just an empty directory. If it contained anything more, then I don't think it'd be a holey bag. That make sense?

jscancella commented 1 year ago

@ruebot yeah that makes sense, just trying to tighten the language up from a programming perspective (since I had some time today on my day off and I was updating the code to be compliant with this version). Perhaps we want to update the language to be specific that the data directory may only contain a single zero byte file?

ruebot commented 1 year ago

Perhaps we want to update the language to be specific that the data directory may only contain a single zero byte file?

I like that! @tdilauro @paulmillar y'all good with that too?

tdilauro commented 1 year ago

The Bagit spec doesn't prohibit a bag with a fetch.txt from containing payload files. A "holey" bag is allowed to contain some (or none) of its payload, which is probably why it's said to have "holes," rather than to be devoid of payload files. See section 2.2.3.

ruebot commented 1 year ago

@tdilauro good catch!

paulmillar commented 1 year ago

Hi @jscancella,

The phrase "minimum contents for the directory to be included in the archive" was meant to be flexible enough to cover both cases: where a directory can be empty (e.g., zip file) and where a directory cannot be empty (e.g., git).

The zip archive format accepts empty directories. Therefore, the "minimum content" for /data (when archived with zip) would be nothing: the /data directory is completely empty.

On the other hand, a git repository cannot store an empty /data directory. If a git repo were to be created from a BagIt filesystem with a completely empty /data directory then anyone who cloned that repository would not see a /data directory (violating the spec, per §2.1.2) Therefore, the "minimum content" for /data (when stored with git) would be for /data to contain a single, empty file.

As @ruebot says, people often use a standard name for these empty files; for example, .keep or .ignore. Currently the profile doesn't specify the name of this empty file.

We could tighten this by specifying what the single, empty file (if needed) is called. I don't have a strong opinion on this.

We could also change it so that the /data directory MAY contain a single, empty file, irrespective of whether the archive supports empty directories. With this change, it would be valid for /data to contain the empty file .keep when creating a BagIt zip archive, even though this file isn't necessary. Again, I don't have a strong opinion on this.

In terms of validating, here's a rough plan:

Add a new test DATA-CONTENTS with the following behaviour:

The last clause would allow a zip archive to contain an empty file, but we might want to allow this, anyway.

HTH, Paul.

jscancella commented 1 year ago

I understand wanting to make it vague so that any archive is supported. Unfortunately that then makes it impossible for validation to be done programmatically.

I would suggest we change it to be

<li>
            `Data-Empty`: `true`|`false`
            <p>
              If "true", the `/data` directory must not contain any files or it must contain a single zero byte file(the file has no naming restrictions). If "false", no constraints are placed on the `/data` directory. Default is "false".
            </p>

mostly because from a programming perspective that is easy and unambiguous to implement.

paulmillar commented 1 year ago

Hi @jscancella ,

You version is basically fine.

There are two (minor) points we could address: duplication of "must" in the "if true" part, and it's a bit wordy.

I've tried to rephrase it so that there's only one "must" and avoids the duplication of "contain".

I've also added a "then" in the sentence, but we can go with the shorter version (If "true", the '/data' directory...) if you prefer that form.

I've replaced zero byte with empty. I think this would be the better term, but I'd be happy to switch back if you prefer zero-byte.

I've also omitted that the filename is unconstrained. I think it's commonly understanding that things that are not specified are unconstrained. However, if you'd like, we can certainly make this explicit.

Here's the result:

If "true" then the `/data` directory must contain either no files or a single empty file.
If "false", no constraints are placed on the `/data` directory. Default is "false".

Does this work for you?

jscancella commented 1 year ago

I've replaced zero byte with empty. I think this would be the better term, but I'd be happy to switch back if you prefer zero-byte.

Yes I would change it back to zero byte as it is less ambiguous. I could have a blank file that is filled with white space. A zero byte file is unambiguous no matter what the filesystem.

Personally I like the explicit wording on the filename, but that is the programmer in me and I am fine in leaving it off.

So I would propose:

If "true" then the `/data` directory must contain either no files or a single zero byte length file.
If "false", no constraints are placed on the `/data` directory. Default is "false".
paulmillar commented 1 year ago

Looks good to me.

I would create a new pull request with this updated version, but I'm not sure how to (re-)generate the index.html file.

ruebot commented 1 year ago

@paulmillar if you want to do a PR, you can just do it against the index.html currently in the root. There's a couple of issues open by @stain that we're working through, and that all can be 1.4.0.