ResearchObject / ro-crate

Research Object Crate
https://w3id.org/ro/crate/
Apache License 2.0

Use Case: How to get the contents of a RO as a zip file? #228

Closed dgarijo closed 1 month ago

dgarijo commented 1 year ago

As a programmer, I want to obtain the aggregated contents of a Research Object as a downloadable resource.

Ideally, I would like to do so through a request with content negotiation. But I do not see an agreement about how to serve the RO-Crate itself. Can we agree on something like application/zip? Can we have some community-agreed guidelines?
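A minimal sketch of what such a content-negotiated request could look like from the client side, assuming the server honours an Accept header of application/zip. The URL and the function name `build_crate_request` are hypothetical, and nothing is actually sent over the network here.

```python
from urllib.request import Request


def build_crate_request(pid_url: str, media_type: str = "application/zip") -> Request:
    """Prepare (but do not send) a GET request asking for one representation."""
    return Request(pid_url, headers={"Accept": media_type}, method="GET")


# Hypothetical persistent identifier of a Research Object.
req = build_crate_request("https://example.com/crates/42")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the download; whether a given server answers with a ZIP, JSON-LD, or HTML is exactly the open question of this issue.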

stain commented 1 year ago

https://signposting.org/adopters/#workflowhub documents how we do this with Signposting in WorkflowHub. Could we generalize this?

Let's make a new section for Retrieving RO-Crate and move out some of the content negotiation described in https://www.researchobject.org/ro-crate/1.2-DRAFT/profiles#how-to-retrieve-a-profile-crate to perhaps allow for both application/zip and application/ld+json.

We can then add signposting particularly where the persistent identifier has a HTML landing page (which may be ro-crate-preview.html as suggested by Profile Crate) -- see #160

See also #149

stain commented 1 year ago

Not sure we should close this, as we don't detail what to expect in the zip file.

@dgarijo -- is the text in https://www.researchobject.org/ro-crate/1.2-DRAFT/root-data-entity.html#root-data-entity-identifier sufficient for 1.2 to close this?

Here's one take with BagIt: https://trefx.uk/trusted-wfrun-crate/0.3/#archive-serialisation which assumes a single folder (with an arbitrary name) that in turn contains bagit.txt and manifest-sha512.txt with checksums, and then data/ro-crate-metadata.json. I'm trying to formalize this into an update of the https://github.com/ResearchObject/bagit-ro profile, but it is mostly already in https://www.researchobject.org/ro-crate/1.2-DRAFT/appendix/implementation-notes.html#adding-ro-crate-to-bagit

Then there is Workflow RO-Crate, which takes a different approach: the ZIP file has no top-level directory at all (that is, ro-crate-metadata.json and other files are directly in the ZIP root). This is easy to access programmatically, but may surprise classical unzip users, as the current directory will be filled with multiple files. (I think the Windows/macOS integrations will make a folder for you.)

ROHub also exports directly with ro-crate-metadata.json in the root.
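To illustrate why a consumer has to care about the two layouts described above, here is a rough sketch that locates ro-crate-metadata.json in either a "plain" ZIP (file in the ZIP root) or a BagIt-style ZIP (single top-level folder containing bagit.txt and data/). The helper names `find_metadata_path` and `make_zip` and the folder name `crate-1` are made up for this example, and the detection rules are an assumption, not something the spec prescribes.

```python
import io
import zipfile
from typing import Dict, Optional

METADATA = "ro-crate-metadata.json"


def find_metadata_path(zf: zipfile.ZipFile) -> Optional[str]:
    """Return the archive path of the metadata file, or None if not found."""
    names = zf.namelist()
    # Layout 1: metadata file directly in the ZIP root.
    if METADATA in names:
        return METADATA
    # Layout 2: <folder>/bagit.txt next to <folder>/data/ro-crate-metadata.json.
    for name in names:
        parts = name.split("/")
        if len(parts) == 3 and parts[1] == "data" and parts[2] == METADATA:
            if f"{parts[0]}/bagit.txt" in names:
                return name
    return None


def make_zip(files: Dict[str, str]) -> zipfile.ZipFile:
    """Build an in-memory ZIP from {path: content} and reopen it for reading."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for path, content in files.items():
            zf.writestr(path, content)
    buf.seek(0)
    return zipfile.ZipFile(buf)


plain = make_zip({"ro-crate-metadata.json": "{}", "workflow.cwl": ""})
bag = make_zip({
    "crate-1/bagit.txt": "BagIt-Version: 1.0\n",
    "crate-1/manifest-sha512.txt": "",
    "crate-1/data/ro-crate-metadata.json": "{}",
})
print(find_metadata_path(plain))
print(find_metadata_path(bag))
```

A declared profile on the download link would let clients skip this kind of guessing.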

As I listed in https://trefx.uk/trusted-wfrun-crate/0.3/#zip-expectations, certain ZIP features should not be used, e.g. multipart archives (for floppies!), while ZIP64 extensions are needed for files larger than 4 GiB, etc. These are documented fairly well in https://www.w3.org/publishing/epub32/epub-ocf.html#sec-zip-container-zipreqs
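As a hedged illustration of checking such ZIP expectations, the sketch below flags entries that are encrypted or use a compression method other than stored or Deflate. Those two rules are an assumption loosely modelled on the EPUB OCF requirements linked above; the function name `zip_feature_problems` is hypothetical, and multipart or ZIP64 checks are not covered.

```python
import io
import zipfile
from typing import List

# Assumed allow-list: uncompressed ("stored") or Deflate only.
ALLOWED_METHODS = {zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED}


def zip_feature_problems(zf: zipfile.ZipFile) -> List[str]:
    """List human-readable problems found in the archive's entries."""
    problems = []
    for info in zf.infolist():
        # Bit 0 of the general purpose flags marks an encrypted entry.
        if info.flag_bits & 0x1:
            problems.append(f"{info.filename}: encrypted")
        if info.compress_type not in ALLOWED_METHODS:
            problems.append(
                f"{info.filename}: compression method {info.compress_type}"
            )
    return problems


# Build a small Deflate-compressed archive in memory and check it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as out:
    out.writestr("ro-crate-metadata.json", "{}")
buf.seek(0)
archive = zipfile.ZipFile(buf)
print(zip_feature_problems(archive))
```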

stain commented 1 year ago

I'm starting to think that we need multiple profiles, depending on whether it's a BagIt-wrapping ZIP, a "plain" RO-Crate, or a detached RO-Crate JSON-LD.

A ZIP archive with ro-crate-metadata.json in the root:

Link: <https://example.com/workflows/419/ro_crate.zip> ;
      rel="item" ;
      type="application/zip" ;
      profile="https://w3id.org/ro/crate#archive" 

(or make a new w3id PID space for that)

A bagit zip according to https://www.researchobject.org/ro-crate/1.2-DRAFT/appendix/implementation-notes.html#adding-ro-crate-to-bagit aka foo-something/data/ro-crate-metadata.json:

Link: <https://example.com/workflows/419/bagit.zip> ;
      rel="item" ;
      type="application/zip" ;
      profile="https://w3id.org/ro/bagit/profile/0.3" 

An RO-Crate Metadata Document straight on the web (Detached or Attached):

Link: <https://example.com/workflows/419/ro-crate-metadata.json> ;
      rel="item" ;
      type="application/ld+json" ;
      profile="https://w3id.org/ro/crate" 

And then only the final one corresponds to the profile registered in https://www.iana.org/assignments/profile-uris/profile-uris.xhtml as a JSON-LD profile.
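The three variants above could be modelled on the server side roughly as follows. This is only a sketch: the class `CrateLink`, the function `pick`, and the `OPTIONS` list are invented for illustration, and the `#archive` profile URI is the tentative one proposed in this comment, not a registered identifier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CrateLink:
    target: str
    media_type: str
    profile: str
    rel: str = "item"

    def header_value(self) -> str:
        """Format this link as the value of an HTTP Link header."""
        return (
            f'<{self.target}>; rel="{self.rel}"; '
            f'type="{self.media_type}"; profile="{self.profile}"'
        )


BASE = "https://example.com/workflows/419"
OPTIONS = [
    CrateLink(f"{BASE}/ro_crate.zip", "application/zip",
              "https://w3id.org/ro/crate#archive"),  # tentative profile URI
    CrateLink(f"{BASE}/bagit.zip", "application/zip",
              "https://w3id.org/ro/bagit/profile/0.3"),
    CrateLink(f"{BASE}/ro-crate-metadata.json", "application/ld+json",
              "https://w3id.org/ro/crate"),
]


def pick(links: List[CrateLink], media_type: str,
         profile: Optional[str] = None) -> Optional[CrateLink]:
    """Return the first link matching the media type and, if given, the profile."""
    for link in links:
        if link.media_type == media_type and profile in (None, link.profile):
            return link
    return None


for option in OPTIONS:
    print("Link:", option.header_value())
```

Because both ZIP variants share application/zip, only the profile parameter lets a client tell them apart, which is the motivation for minting separate profile URIs.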

In either case, when retrieving, the profile will be provided as a Link as described in https://trefx.uk/trusted-wfrun-crate/0.3/#media-type-and-profiles

GET http://example.com/crates/42.zip HTTP/1.1

HTTP/1.1 200 OK
Content-Type: application/zip
Link: <https://w3id.org/ro/crate#archive>; rel="profile"

Or from a landing page, with signposting as above:

HEAD http://example.com/crates/42.html HTTP/1.1

HTTP/1.1 200 OK
Content-Type: text/html
Link: <https://example.com/query-12389.zip>; rel="item"; type="application/zip"
Link: <https://w3id.org/ro/crate>; rel="profile"; type="application/zip";
   anchor="https://example.com/query-12389.zip"
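On the client side, following the signposting from a landing page could look something like the sketch below. It is a deliberately simplified Link header parser (regular expressions, parameters separated by `;`, no escaped quotes), not a complete implementation of the Web Linking syntax; the names `parse_link_header`, `find_item`, and `profiles_for` are hypothetical, and `HEADER` combines the two Link fields of the example response into one comma-separated value.

```python
import re
from typing import Dict, List, Optional, Tuple

Link = Tuple[str, Dict[str, str]]

# One link-value: <target> followed by zero or more ;name=value parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value: str) -> List[Link]:
    """Split a Link header value into (target, params) pairs."""
    links = []
    for match in LINK_RE.finditer(value):
        params = {}
        for p in PARAM_RE.finditer(match.group(2)):
            quoted, bare = p.group(2), p.group(3)
            params[p.group(1).lower()] = quoted if quoted is not None else bare
        links.append((match.group(1), params))
    return links


def find_item(links: List[Link], media_type: str) -> Optional[str]:
    """Return the first rel="item" target with the given media type."""
    for target, params in links:
        if "item" in params.get("rel", "").split() and params.get("type") == media_type:
            return target
    return None


def profiles_for(links: List[Link], resource: str) -> List[str]:
    """Return the profile URIs whose anchor is the given resource."""
    return [
        target for target, params in links
        if "profile" in params.get("rel", "").split()
        and params.get("anchor") == resource
    ]


HEADER = (
    '<https://example.com/query-12389.zip>; rel="item"; type="application/zip", '
    '<https://w3id.org/ro/crate>; rel="profile"; type="application/zip"; '
    'anchor="https://example.com/query-12389.zip"'
)
links = parse_link_header(HEADER)
zip_url = find_item(links, "application/zip")
print(zip_url, profiles_for(links, zip_url))
```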

dgarijo commented 1 year ago

Hmm, you may be correct, although it complicates things a little.

From my end, I am interested in knowing what to prepare when someone asks for one of my ROs with permanent IDs. For example, for https://w3id.org/dgarijo/ro/sepln2022 I set up JSON-LD (the RO-Crate metadata file) and the HTML. But I did not find a recommendation on how to create the ZIP file when I last browsed the spec.

The text in https://www.researchobject.org/ro-crate/1.2-DRAFT/root-data-entity.html#root-data-entity-identifier points me to https://www.researchobject.org/ro-crate/1.2-DRAFT/profiles.html#how-to-retrieve-a-profile-crate, but it is not clear how I should structure the contents of the zip file.

Also, should my root data entity contain a link to the ZIP file with the downloadable RO-Crate? Maybe using the schema.org distribution properties used for datasets.
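To make the question concrete, here is one possible shape of such a link, sketched as a Python dict that serialises to the JSON-LD of ro-crate-metadata.json. It assumes the schema.org `distribution` property pointing at a `DataDownload` entity; the ZIP URL and the crate name are hypothetical, and whether this is the recommended pattern for the root data entity is exactly what is being asked, not something settled in this thread.

```python
import json

# Hypothetical download location of the zipped crate.
ZIP_URL = "https://example.com/crates/sepln2022.zip"

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Example crate",  # placeholder name
            # Assumed: schema.org distribution linking to the ZIP download.
            "distribution": {"@id": ZIP_URL},
        },
        {
            "@id": ZIP_URL,
            "@type": "DataDownload",
            "encodingFormat": "application/zip",
        },
    ],
}

root = next(e for e in crate["@graph"] if e["@id"] == "./")
print(json.dumps(crate, indent=2))
```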