Closed ruebot closed 1 year ago
@tdilauro @jscancella y'all good with this? If so, I can squash and merge.
@ruebot I think this is fine as far as the spec goes.
One (minor?) point concerns one of the constraints added by what @paulmillar rightly notes as a DataCite profile of BagIt. The DataCite spec specifically says 'A TSV file named “fetch.txt.', whereas the BagIt spec allows spaces or tabs. I don't think that should hold us up.
One other concern is that an absolutely empty data
directory might not be visible in some archive formats, since they are recording files. This might've been the origin of your thought of an .ignore
file. I remember using .keep
, in a similar vein.
Thanks for the feedback @tdilauro .
On TSV vs "spaces or tabs", this is something we could report back to DataCite. The may be able to fix this before releasing v4.5 of the metadata specification. Even if they don't, I (too) don't think this is a showstopper.
On /data
containing place-holder files to prevent the directory being garbage collected. I think you have a good point.
How about the following:
If "true", the
/data
directory must hold the minimum contents for the directory to be included in the archive. If "false", no constraints are placed on the/data
directory. Default is "false".
I'd good with the suggested language change to take care of the empty directory issue.
Yeah, I'm good with this, too.
Updated.
@tdilauro forgot about the bag-requirements-section
for ABNF. @jscancella added that a while back, and I'm not 100% familiar with it. I take a stab at adding what is missing there.
Updated.
Awesome! Thanks @tdilauro.
I'll give it a day or so for @jscancella. He seemed good with proceeding, so I'll assume he'll be fine with this merge if we don't hear from him.
Awesome! Thanks @tdilauro.
I'll give it a day or so for @jscancella. He seemed good with proceeding, so I'll assume he'll be fine with this merge if we don't hear from him.
My apologies, I have been extremely busy with work and life events! I still need to update my implementation for this. Do you still need my help with the ABNF stuff?
Sorry, just thinking about this more:
Data-Empty
:true
|false
If "true", the `/data` directory must hold the minimum contents for the directory to be included in the archive. If "false", no constraints are placed on the `/data` directory. Default is "false".
So if I set this to true and run it through a validator what does that mean? Is it enough to test if data
directory contains at least 1 file?
No worries @jscancella! I think we got the ABNF sorted. But, let me know if you think we missed anything.
So if I set this to true and run it through a validator what does that mean? Is it enough to test if data directory contains at least 1 file?
I take it to mean the validator would look for a dot file such as .keep
or .ignore
, or just an empty directory. If it contained anything more, then I don't think it'd be a holey bag. That make sense?
@ruebot yeah that makes sense, just trying to tighten the language up from a programming perspective (since I had some time today on my day off and I was updating the code to be compliant with this version). Perhaps we want to update the language to be specific that the data directory may only contain a single zero byte file?
Perhaps we want to update the language to be specific that the data directory may only contain a single zero byte file?
I like that! @tdilauro @paulmillar y'all good with that too?
The Bagit spec doesn't prohibit a bag with a fetch.txt
from containing payload files. A "holey" bag is allowed to contain some (or none) of its payload, which is probably why it's said to have "holes," rather than to be devoid of payload files. See section 2.2.3.
@tdilauro good catch!
Hi @jscancella,
The phrase "minimum contents for the directory to be included in the archive" was meant to be flexible enough to cover both cases: where a directory can be empty (e.g., zip file) and where a directory cannot be empty (e.g., git).
The zip archive format accepts empty directories. Therefore, the "minimum content" for /data
(when archived with zip) would be nothing: the /data
directory is completely empty.
On the other hand, a git repository cannot store an empty /data
directory. If a git repo were to be created from a BagIt filesystem with a completely empty /data
directory then anyone who cloned that repository would not see a /data
directory (violating the spec, per §2.1.2) Therefore, the "minimum content" for /data
(when stored with git) would be for /data
to contain a single, empty file.
As @ruebot says, people often use a standard name for these empty files; for example, .keep
or .ignore
. Currently the profile doesn't specify the name of this empty file.
We could tighten this by specifying what the single, empty file (if needed) is called. I don't have a strong opinion on this.
We could also change it so that the /data
directory MAY contain a single, empty file, irrespective of whether the archive supports empty directories. With this change, it would be valid for /data
to contain the empty file .keep
when creating a BagIt zip archive, even though this file isn't necessary. Again, I don't have a strong opinion on this.
In terms of validating, here's a rough plan:
Add a new test DATA-CONTENTS
with the following behaviour:
Data-Empty: false
then test DATA-CONTENTS
passes.Data-Empty: true
then check the contents of the /data
directory.
/data
directory is empty then test DATA-CONTENTS
passes./data
directory contains at least one non-empty file then test DATA-CONTENTS
fails./data
directory contains more than one file then test DATA-CONTENTS
fails./data
directory contains exactly one file, that is empty then test DATA-CONTENTS
passes.The last clause would allow a zip archive to contain an empty file, but we might want to allow this, anyway.
HTH, Paul.
I understand wanting to make it vague so that any archive is supported. Unfortunately that then makes it impossible for validation to be done programmatically.
I would suggest we change it to be
<li>
`Data-Empty`: `true`|`false`
<p>
If "true", the `/data` directory must not contain any files or it must contain a single zero byte file(the file has no naming restrictions). If "false", no constraints are placed on the `/data` directory. Default is "false".
</p>
mostly because from a programming perspective that is easy and unambiguous to implement.
Hi @jscancella ,
You version is basically fine.
There are two (minor) points we could address: duplication of "must" in the "if true" part, and it's a bit wordy.
I've tried to rephrase it so that there's only one "must" and avoids the duplication of "contain".
I've also added a "then" in the sentence, but we can go with the shorter version (If "true", the '/data' directory
...) if you prefer that form.
I've replaced zero byte
with empty
. I think this would be the better term, but I'd be happy to switch back if you prefer zero-byte.
I've also omitted that the filename is unconstrained. I think it's commonly understanding that things that are not specified are unconstrained. However, if you'd like, we can certainly make this explicit.
Here's the result:
If "true" then the `/data` directory must contain either no files or a single empty file.
If "false", no constraints are placed on the `/data` directory. Default is "false".
Does this work for you?
I've replaced zero byte with empty. I think this would be the better term, but I'd be happy to switch back if you prefer zero-byte.
Yes I would change it back to zero byte as it is less ambiguous. I could have a blank file that is filled with white space. A zero byte file is unambiguous no matter what the filesystem.
Personally I like the explicit wording on the filename, but that is the programmer in me and I am fine in leaving it off.
So I would propose:
If "true" then the `/data` directory must contain either no files or a single zero byte length file.
If "false", no constraints are placed on the `/data` directory. Default is "false".
Looks good to me.
I would create a new pull request with this updated version, but I'm not sure how to (re-)generate the index.html file.
@paulmillar if you want to do a PR, you can just do it against the index.html
currently in the root. There's a couple of issues open by @stain that we're working through, and that all can be 1.4.0.
I'm torn on how to implement @paulmillar's first issue in #31.
This I think would be cleaner, but can break backwards compatibility since we'd be removing an implementation detail.
Fetchable
:forbidden
|required
|optional
Allow, forbid or require fetching of Bags. Default is
optional
.I ended up implementing it as it is in the PR, since it's just adding an additional detail, and it works in conjunction with the existing one, instead of replacing it. If we want to go with the cleaner option, let me know.
@paulmillar please feel free to provide feedback as well since I can't tag you as a reviewer here. I definitely want your thumbs up before we merge anything.