ComPlat / chemotion_REPO

Repository for samples, reactions and related research data
https://www.chemotion-repository.net
GNU Affero General Public License v3.0

Improve BagIt implementation #99

Open tilfischer opened 2 months ago

tilfischer commented 2 months ago

Dear all,

This is also connected to #10, #32 and #98.

Users viewing e.g. https://dx.doi.org/10.14272/UVLHGRADYISRGZ-UHFFFAOYSA-N/CHMO0000025.18 can select the blue button labelled "download data + metadata". The data is downloaded as a ZIP. This ZIP contains the data shown in the pop-up modal of the corresponding dataset of a CV analysis, as well as a dataset_description.txt and another ZIP.

The additional ZIP appears to be a BagIt Bag: it has two folders, "data" and "metadata", as well as bagit.txt, manifest-sha256.txt and manifest-sha512.txt.

The BagIt specification v0.97 is available here: https://www.digitalpreservation.gov/documents/bagitspec.pdf .

1) On PDF page 7 it is stated, "A bag MUST NOT contain more than one payload manifest for a particular bag checksum algorithm". To my knowledge, SHA256 and SHA512 are two variants of the SHA-2 algorithm, i.e. one of the manifest-....txt files needs to be removed to be compliant with the specification. Edit: The later BagIt specification v1.0 also states, "A bag can have more than one payload manifest, with each using a different checksum algorithm."

2) Unlike other BagIt implementations I know, the Bag does not contain bag-info.txt or tagmanifest-md5.txt. Both are optional according to the BagIt specification, but could be added so that checksums also exist for bagit.txt and the manifest-shaXXX.txt files.

3) The Bag contains the folders "data" and "metadata". The "data" folder contains the payload files. According to the BagIt specification, other optional folders within the Bag root may contain optional tag files, which must adhere to the text tag file format described in the specification. These tag files must have the extension ".txt" (PDF page 11). I think this is not what Chemotion wants to do, i.e. the "metadata" folder should be moved into the "data" (payload) folder. The folder "data" might then have the subfolders "dataset" and "metadata". Another approach for naming the latter is to use the same wording as in RADAR, which would be "descriptive-md" (they also have technical MD).

4) The data in the ZIP downloaded on https://dx.doi.org/10.14272/UVLHGRADYISRGZ-UHFFFAOYSA-N/CHMO0000025.18 via the blue "download data + metadata" button (but not the BagIt ZIP in the ZIP...) should be included in the "dataset". Edit: ...should be included in the /data/dataset/ folder.

5) The DataCite Metadata e.g. of https://dx.doi.org/10.14272/UVLHGRADYISRGZ-UHFFFAOYSA-N/CHMO0000025.18 should also be added to the /data/metadata/ folder in the Bag.
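
Points 1) to 3) can be checked mechanically on an unpacked Bag. The following is only a minimal sketch: `audit_bag_root` and its report keys are hypothetical names made up for illustration, and it looks at the top level of the Bag only, without validating any checksums.

```python
import re
from pathlib import Path

# File name patterns taken from the BagIt naming scheme.
MANIFEST_RE = re.compile(r"^manifest-([a-z0-9]+)\.txt$")
TAGMANIFEST_RE = re.compile(r"^tagmanifest-([a-z0-9]+)\.txt$")


def audit_bag_root(bag_dir):
    """Summarise the top level of an unpacked bag (hypothetical helper)."""
    bag_dir = Path(bag_dir)
    report = {
        "has_bagit_txt": (bag_dir / "bagit.txt").is_file(),
        "has_bag_info": (bag_dir / "bag-info.txt").is_file(),
        "has_data_dir": (bag_dir / "data").is_dir(),
        "payload_manifests": [],  # algorithms with a manifest-<alg>.txt
        "tag_manifests": [],      # algorithms with a tagmanifest-<alg>.txt
        "extra_dirs": [],         # directories in the bag root other than data/
    }
    for entry in sorted(bag_dir.iterdir()):
        if entry.is_dir():
            if entry.name != "data":
                report["extra_dirs"].append(entry.name)
            continue
        payload_match = MANIFEST_RE.match(entry.name)
        if payload_match:
            report["payload_manifests"].append(payload_match.group(1))
        tag_match = TAGMANIFEST_RE.match(entry.name)
        if tag_match:
            report["tag_manifests"].append(tag_match.group(1))
    return report
```

For the Bag described above I would expect `payload_manifests` to list `sha256` and `sha512`, `tag_manifests` to be empty, `has_bag_info` to be false and `extra_dirs` to contain `metadata`.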

At some later point in time, BagIt might also be combined with RO-Crate, see: https://www.researchobject.org/ro-crate/1.1/appendix/implementation-notes.html#adding-ro-crate-to-bagit by adding an RO-Crate (based on Schema.org metadata) to BagIt. Currently not everything can be described with this type of metadata. Until we are there, Chemotion could provide an update for the current BagIt implementation.

Best, Tillmann

Edit: Found BagIt specification v1.0 i.e. RFC-8493 and linked this below.

tilfischer commented 2 months ago

Link to BagIt v1.0 i.e. RFC-8493: https://www.rfc-editor.org/rfc/rfc8493.html

tilfischer commented 2 months ago

How the folder structure could look, adapted from the RO-Crate link (see above; RO-Crate also had difficulties reading the specification, and there are errors in their figure, which I tried to fix):

<BagIt base directory>/
  |   bagit.txt                            # As per BagIt specification
  |   bag-info.txt                         # Optional, As per BagIt specification
  |   manifest-<algorithm>.txt             # As per BagIt specification
  |   tagmanifest-<algorithm>.txt          # Optional, As per BagIt specification
  |   fetch.txt                            # Optional, per BagIt Specification
  |   data/                                # Payload (would also be  RO-Crate root directory, see link provided. Later!)
      |   dataset                          # data here
      |   metadata or descriptive-md       # metadata here
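
To make the layout above concrete, here is a rough sketch of how such a Bag could be written with the Python standard library. It is not how Chemotion builds its Bags: `write_bag`, `sha256_of` and the file names used with them are assumptions for illustration, only SHA-256 is used, and file names are assumed to need no percent-encoding in the manifest.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_bag(bag_dir, dataset_files, metadata_files):
    """Write a bag with data/dataset and data/metadata (hypothetical helper).

    dataset_files and metadata_files map file names to bytes.
    """
    bag_dir = Path(bag_dir)
    for sub, files in (("dataset", dataset_files), ("metadata", metadata_files)):
        target = bag_dir / "data" / sub
        target.mkdir(parents=True, exist_ok=True)
        for name, content in files.items():
            (target / name).write_bytes(content)

    # Bag declaration as defined for BagIt 1.0.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path relative to bag root>" line per file.
    payload = sorted(p for p in (bag_dir / "data").rglob("*") if p.is_file())
    lines = [f"{sha256_of(p)}  {p.relative_to(bag_dir).as_posix()}\n" for p in payload]
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")

    # Optional tag manifest covering bagit.txt and the payload manifest.
    tag_files = ["bagit.txt", "manifest-sha256.txt"]
    tag_lines = [f"{sha256_of(bag_dir / name)}  {name}\n" for name in tag_files]
    (bag_dir / "tagmanifest-sha256.txt").write_text("".join(tag_lines), encoding="utf-8")
    return bag_dir
```

Everything the user downloads, including the metadata, ends up below data/, so no extra folder is left in the Bag root.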

Best, Tillmann

cllde8 commented 2 months ago

Dear @tilfischer,

Thank you for bringing up this topic. Your comprehensive information is greatly appreciated. I'd also like to provide some information:

  1. The BagIt version in use is 1.0.
  2. According to BagIt version 1.0 specifications, a bag can contain multiple payload manifest files [p.8]. Both SHA-256 and SHA-512 are cryptographic checksum algorithms supported by BagIt [p.14]. While SHA-512 offers a higher level of security due to its longer hash length (512-bit), SHA-256 (256-bit) remains widely used.
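
As an illustration of what several payload manifests mean in practice, the sketch below checks the payload against every manifest-<algorithm>.txt it finds in a Bag. `verify_payload` is a hypothetical helper, not part of Chemotion or of any BagIt library; it assumes the algorithm name in the file name is one that Python's `hashlib` knows (such as sha256 or sha512) and that paths in the manifest are not percent-encoded.

```python
import hashlib
from pathlib import Path


def verify_payload(bag_dir):
    """Return {algorithm: [paths that failed]} for each payload manifest found."""
    bag_dir = Path(bag_dir)
    results = {}
    # "manifest-*.txt" does not match tagmanifest-*.txt, so only payload
    # manifests are checked here.
    for manifest in sorted(bag_dir.glob("manifest-*.txt")):
        algorithm = manifest.stem[len("manifest-"):]  # e.g. "sha256"
        failures = []
        for line in manifest.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            expected, rel_path = line.split(None, 1)
            target = bag_dir / rel_path
            if not target.is_file():
                failures.append(rel_path)  # listed in the manifest but missing
                continue
            actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
            if actual != expected.lower():
                failures.append(rel_path)  # checksum mismatch
        results[algorithm] = failures
    return results
```

With both manifest-sha256.txt and manifest-sha512.txt present, an intact Bag should give an empty failure list for each of the two algorithms.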

Regarding the mention of "dataset" in point 4, could you please provide further clarification?

  1. The data in the ZIP downloaded on https://dx.doi.org/10.14272/UVLHGRADYISRGZ-UHFFFAOYSA-N/CHMO0000025.18 via the blue "download data + metadata" button (but not the BagIt ZIP in the ZIP...) should be included in the "dataset".

Thank you!

Best regards, Claire

tilfischer commented 2 months ago

Dear Claire,

Thank you for your prompt reply!

In the specification v1.0 on page 7 it says "A bag can have more than one payload manifest with each using a different checksum algorithm." I thought that both sha256 and sha512 have the same sha2 algorithm in the background but different bit lengths. Maybe I am wrong, and I must also admit that I only just noticed this, but this is actually of minor importance.
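
On the sha256/sha512 question: as far as I can tell, both belong to the SHA-2 family, but they are separate hash functions with different digest lengths, and a SHA-512 digest is not simply a longer version of the SHA-256 digest of the same data. A small sketch with Python's `hashlib` shows the difference; `digest_bits` and the example payload are made up for illustration.

```python
import hashlib


def digest_bits(algorithm):
    """Return the digest length in bits for a hashlib algorithm name."""
    return hashlib.new(algorithm).digest_size * 8


payload = b"example payload bytes"
short = hashlib.sha256(payload).hexdigest()
long_ = hashlib.sha512(payload).hexdigest()

print(digest_bits("sha256"), len(short))  # 256 bits, 64 hex characters
print(digest_bits("sha512"), len(long_))  # 512 bits, 128 hex characters
print(long_.startswith(short))            # False: the digests are unrelated
```

Read this way, manifest-sha256.txt and manifest-sha512.txt would use two different checksum algorithms in the sense of the v1.0 wording.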

On bullet point 4.: I simply want to say that the whole thingy downloaded with the blue "download data + metadata" button should be a BagIt Bag, rather than having a BagIt Bag (as ZIP) within the downloaded ZIP.

Best, Tillmann

cllde8 commented 2 months ago

Dear @tilfischer,

Thank you for clarifying point 4. I now understand the distinction.

On bullet point 4.: I simply want to say that the whole thingy downloaded with the blue "download data + metadata" button should be a BagIt Bag, rather than having a BagIt Bag (as ZIP) within the downloaded ZIP.

The "download metadata" and "download metadata + data" functions serve as user-friendly tools to retrieve data efficiently. The former provides data in xlsx format, while the latter not only includes the metadata excel but also a "data list" file and the data itself. The system packages the information into a zip file for convenience, but it's important to note that it's not in BagIt format. You can find more information at link.

Thank you.

Best regards, Claire

tilfischer commented 2 months ago

Dear Claire,

Unfortunately, I must disagree. The ZIP users get when selecting "download data + metadata" does not include the metadata in XLSX format. It does include a converter.json within the BagIt Bag inside the downloaded dataset.

One of my suggestions is to put everything into a single BagIt Bag, as a successor to the current implementation, which has a BagIt Bag within the downloaded dataset.

Best, Tillmann

cllde8 commented 2 months ago

Dear @tilfischer,

Thank you for pointing this out. Upon testing the function, I confirmed that the metadata excel is missing when accessed without logging in; the feature functions correctly only when the user is logged in, as described in the documentation. This discrepancy is certainly an issue, and we will address it promptly to ensure consistent functionality regardless of user login status.

Thank you once again for bringing it to our attention.

Best regards, Claire

cllde8 commented 2 months ago

Dear @tilfischer,

We're pleased to inform you that the missing metadata excel issue has been resolved. The "download data + metadata" function now works properly whether the user is logged in or not, as described in the documentation.

Thank you once again for your help in improving the system.

Best regards, Claire

tilfischer commented 2 months ago

Dear Claire,

Thank you!

What is the status of the other points mentioned above? The first two are of minor importance (at least to me), but 3-4 should be taken into account. In short: the folder structure needs to be fixed, and all data needs to move into the Bag, so that users do not download data that includes a Bag, but just a Bag that includes all data (with the metadata in a separate folder).

Best, Tillmann