ResearchObject / ro-crate

Research Object Crate
https://w3id.org/ro/crate/
Apache License 2.0

Add concept Detached RO-Crate #189

Closed stain closed 1 year ago

stain commented 2 years ago

.. to support #183 my logical conclusion is that we need the concept of a Detached RO-Crate.

Suggest definition (from this pull request's structure.md)

There are two classes of RO-Crate detailed below:

Regular RO-Crate : A crate that has a well-defined RO-Crate Root directory and can carry an explicit payload of local data entities as regular files (combined with Web-based Data Entities where needed). This type of RO-Crate can be suitable for long-term preservation, transfer and publishing, as the RO-Crate Metadata File is stored alongside the crate's payload.

Detached RO-Crate : A crate without a defined payload directory. In this kind of crate, all data references are absolute. This approach may be suitable for use with dynamic web service APIs and repositories that can't preserve file paths. As the data of these crates can only be Web-based Data Entities, the payload is implicit and must be preserved/transferred/archived independent of the RO-Crate Metadata File.

See further definition of detached RO-Crate

I think this is necessary because of #183 allowing @id to be any ID, as here proposed in new sub section Root Data Entity identifier - then

If the @id of the Root Data Entity is an absolute URI, the Crate SHOULD NOT contain data entities using relative URI references, but MAY contain Web-based Data Entities using absolute URIs.

And from that my logical conclusion is that the whole concept of "RO-Crate Root" and any relative URIs becomes ambiguous and difficult if we no longer have "@id": "./" for the Root Dataset and the URI that serves ro-crate-metadata.json is no longer grounded in something similar to a folder.

I would hope for some discussion on this in the RO-Crate meeting today 2022-01-27.

stain commented 2 years ago

From RO-Crate meeting 2022-01-27:

ptsefton commented 2 years ago

I suggest we use the term "Attached RO-Crate".

Suggest definition (from this pull request's structure.md)

There are two classes of RO-Crate detailed below:

Attached RO-Crate : A crate that has a well-defined RO-Crate Root directory and can carry an explicit payload of local data entities as regular files (combined with Web-based Data Entities where needed) using relative URIs. This type of RO-Crate can be suitable for long-term preservation, transfer and publishing, as the RO-Crate Metadata File is stored alongside the crate's payload.

If a crate makes any relative references then it is considered an Attached RO-Crate and the Root Dataset ID MUST be "./".

Detached RO-Crate : A crate without a defined payload directory. In this kind of crate, all data references are absolute. This approach may be suitable for use with dynamic web service APIs and repositories that can't preserve file paths. As the data of these crates can only be Web-based Data Entities, the payload is implicit and must be preserved/transferred/archived independent of the RO-Crate Metadata File.

See further definition of detached RO-Crate

I think this is necessary because of #183 allowing @id to be any ID, as here proposed in new sub section Root Data Entity identifier - then

If the @id of the Root Data Entity is an absolute URI, the Crate SHOULD NOT contain data entities using relative URI references, but MAY contain Web-based Data Entities using absolute URIs.
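The two rules above (relative references imply a Root Dataset @id of "./", and an absolute root goes with absolute data references) could be checked mechanically. The following is only a minimal sketch under stated assumptions: the helper names (is_absolute, find_root_id, classify) are hypothetical, the Root Data Entity is located through the metadata descriptor's about, and only the root's direct hasPart entries are inspected rather than nested Datasets.

```python
from urllib.parse import urlparse


def is_absolute(uri):
    """Treat any reference that carries a URI scheme as absolute."""
    return bool(urlparse(uri).scheme)


def find_root_id(graph):
    """Follow the metadata descriptor's 'about' to the Root Data Entity."""
    for entity in graph:
        if entity.get("@id", "").endswith("ro-crate-metadata.json") and "about" in entity:
            return entity["about"]["@id"]
    raise ValueError("no RO-Crate metadata descriptor found")


def classify(graph):
    """Return ("attached" | "detached", list of rule violations).

    Assumption: only the root's direct hasPart entries count as data references.
    """
    root_id = find_root_id(graph)
    root = next(e for e in graph if e.get("@id") == root_id)
    parts = root.get("hasPart", [])
    if isinstance(parts, dict):  # a single value may not be wrapped in a list
        parts = [parts]
    relative_parts = [p["@id"] for p in parts if not is_absolute(p["@id"])]

    problems = []
    if relative_parts and root_id != "./":
        problems.append(
            'relative references %r require the Root Dataset @id to be "./"' % relative_parts
        )
    # Any relative reference, or a relative root, makes the crate attached.
    kind = "attached" if relative_parts or not is_absolute(root_id) else "detached"
    return kind, problems
```

A crate whose root is "./" with hasPart "data.csv" would come out as attached with no problems; a root with an absolute @id and only absolute hasPart entries would come out as detached.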

stain commented 2 years ago

Terminology attached/detached RO-Crate agreed in RO-Crate meeting 2022-02-10.

stain commented 2 years ago

I started drafting a section Converting from attached to detached

just wanted to check if we are OK with what comes out of the JSON-LD flattening:

{
  "@context": [
    {"@base": "arcp://uuid,d6be5c9b-132a-4a93-9837-3e02e06c08e6/"},
    "https://w3id.org/ro/crate/1.1/context"
  ],
  "@graph": [
    {
      "@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
      "about": {"@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/"},
      "creator": {"@id": "https://orcid.org/0000-0001-9842-9718"}
    },
    {
      "@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/",
      "@type": "Dataset",
      "hasPart": [
        {"@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/index.html"},
        {"@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/example/"}
      ],
      "name": "Workflow RO-Crate profile"
    },
    {
      "@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json#include-ComputationalWorkflow",
      "@type": "Recommendation",
      "category": "MUST",
      "name": "Include Main Workflow",
      "itemReviewed": {
        "@id": "https://bioschemas.org/ComputationalWorkflow"
      }
    }
  ]
}

  1. ro-crate-metadata.json now has an absolute @id (breaks a SHOULD?)
  2. The root dataset matches the public Web presence (but then why did we need Detached?)
  3. #local identifiers become grounded in the original RO-Crate
  4. The identifier of the corresponding attached RO-Crate is preserved in @id
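
As a rough illustration of how identifiers like those in the flattened output could be produced without a full JSON-LD processor, here is a minimal sketch that resolves every relative @id in the @graph against the URL the metadata file is served from. It is not the JSON-LD flattening algorithm itself: the function name absolutize is hypothetical, the @context is deliberately left untouched, and Python's urljoin only resolves against schemes it knows (http, https, file, ...), so an arcp:// base would be left unresolved by this code.

```python
from urllib.parse import urljoin


def absolutize(node, base):
    """Return a copy of node with every relative "@id" resolved against base."""
    if isinstance(node, dict):
        return {
            key: urljoin(base, value) if key == "@id" and isinstance(value, str)
            else absolutize(value, base)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [absolutize(item, base) for item in node]
    return node  # strings, numbers, etc. are kept as-is


# Example input: a cut-down attached crate (assumed, modelled on the output above).
attached_graph = [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "index.html"}, {"@id": "example/"}]},
    {"@id": "#include-ComputationalWorkflow", "@type": "Recommendation",
     "itemReviewed": {"@id": "https://bioschemas.org/ComputationalWorkflow"}},
]

# The base is the URL of the served metadata file, so "#local" ids stay grounded in it.
base = "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json"
detached_graph = absolutize(attached_graph, base)
```

With this base, "./" resolves to the folder-like URL ending in /1.0/, while "#include-ComputationalWorkflow" is appended to the metadata file URL, which matches points 2 and 3 above.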

For anything more "proper" I think you would need manual processing, e.g. manual deposit and rewrite of each data entity file, manual UUID for each contextual entity.

simleo commented 2 years ago

I think we should recommend removing the {"@base": "arcp://uuid,.../"} from the converted output. It can be confusing (I was wondering what its purpose was until I read the section about converting to detached).

ptsefton commented 2 years ago

Are the uuids intended to be unique? Cos people will copy and paste, or hardcode them into their crate.

Regarding attached crates, can we deal with the relativity of paths using @base: "./" or similar (or is that not allowed?). I know I have @base: null in crates to stop JSON-LD libraries from messing with my paths, but I would have to refresh my memory.

jmfernandez commented 2 years ago

If you leave the arcp-based UUID, you should add a small recipe about using Python's arcp library, or about how to generate those UUIDs in the URL namespace in a couple of programming languages.

# UUID derived from a URL (name-based, in the URL namespace)
import uuid
the_url = 'https://example.org'
the_uuid = uuid.uuid5(uuid.NAMESPACE_URL, the_url)
# str(the_uuid) is the hyphenated UUID string; the_uuid.hex is the same without hyphens

# arcp URI derived from a location
import arcp
the_arcp = arcp.arcp_location("http://example.com/data.zip", "/file.txt")
# the_arcp has the ARCP string representation

# Random UUID
import uuid
the_random_uuid = uuid.uuid4()
# str(the_random_uuid) is the hyphenated UUID string; the_random_uuid.hex is the same without hyphens

# Random arcp URI
import arcp
the_random_arcp = arcp.arcp_random()
# the_random_arcp has the ARCP string representation

ptsefton commented 2 years ago

On reflection I don't think we need this attached/detached distinction. I think we should look at providing clear info about how to use relative and absolute paths for various resources.

Based on experience where we implemented an API that uses the API URL as the @id, after which it is not clear how to reconstitute a crate, I think that approach was a mistake. It might be better to go back to an approach where @ids are:

  1. Relative URIs: to describe how resources are, or would be, laid out on disk as a set of relative paths, with ./ for the root
  2. Absolute URIs: for URL-addressable resources

For packaged crates-on-disk, use @base: null with relative paths for data entities.

For crates over an API, use the dcat:downloadURL property on data entities for the place where you can get a file and, as per (1) above, make its @id the filename it should have relative to the root. Use identifier for IDs like DOIs.

ptsefton commented 2 years ago

Further to my last comment @stain & @simleo. I think I have found a neat solution to the problem we were having with letting "@id" in for a File be a URI - how would you save it to disk and re-construct the relative path structure of a package?

Solution: In RO-Crate Metadata Documents served from a service, leave the @ids as relative paths but use DCAT accessURL (to point to RO-Crate Metadata served over an API) and downloadURL for the actual datastream. We can then recommend a process for reconstituting an RO-Crate that uses the @id to create directories and write file contents.

I have written this up in the work I was doing on a new intro - this detail probably does not all belong in the intro though.

Here's a copy and paste from that Google doc.

When an RO-Crate Metadata Document is served from a service, use the following DCAT properties:

dcat:accessURL: RO-Crate Metadata Documents

dcat:downloadURL: Direct downloads of bitstreams (files)

Client software to construct RO-Crates SHOULD:

  1. Save the RO-Crate Metadata File into an empty crate download directory
  2. For each Dataset that has a relative URI, make a subdirectory with the same path as the Dataset @id - eg /data/pictures
  3. For each File entity, fetch the datastream using downloadURL and write it to a relative path that corresponds to its "@id" (creating directories as needed, even if they are not described in the RO-Crate)

In the case where the RO-Crate Metadata Document is being served from a service:

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@type": "CreativeWork",
      "@id": "ro-crate-metadata.json",
      "conformsTo": {"@id": "https://w3id.org/ro/crate/2.0"},
      "about": {"@id": "./"}
    },
    {
      "@id": "./",
      "accessURL": "https://example.com/ro-crate/api/crate/000001",
      "@type": "Dataset",
      "datePublished": "2022-02-01",
      "name": "Example dataset for RO-Crate specification",
      "description": "If this were real data it would contain a minute-by-minute rainfall readings for my weather gauge",
      "license": "CC BY-NC-SA 3.0",
      "hasPart": {"@id": "data.csv"}
    },
    {
      "@id": "data.csv",
      "downloadURL": "https://example.com/ro-crate/api/crate/000001?file=data.csv",
      "@type": "File",
      "encodingFormat": "text/csv",
      "name": "Rainfall data for Katoomba, NSW Australia, 2022-02-01",
      "license": "CC BY-NC-SA 3.0 AU"
    }
  ]
}
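
The client-side process proposed in this comment could look roughly like the sketch below. It is an illustration under assumptions, not an agreed algorithm: the function name reconstitute and its fetch parameter are hypothetical, the property key is taken to be "downloadURL" as in the example, relative @ids are percent-decoded before being used as paths, and the check that refuses paths escaping the download directory is an extra precaution not mentioned in the thread.

```python
import json
from pathlib import Path
from urllib.parse import unquote, urlparse
from urllib.request import urlopen


def _types(entity):
    """Normalise "@type" to a list (it may be a string or a list)."""
    value = entity.get("@type", [])
    return [value] if isinstance(value, str) else list(value)


def _http_fetch(url):
    """Default fetcher: read the whole datastream into memory."""
    with urlopen(url) as response:
        return response.read()


def reconstitute(metadata, target_dir, fetch=_http_fetch):
    """Rebuild a crate on disk from a served RO-Crate Metadata Document."""
    root = Path(target_dir).resolve()
    root.mkdir(parents=True, exist_ok=True)
    # Step 1: save the metadata file into the (empty) download directory.
    (root / "ro-crate-metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
    written = []
    for entity in metadata.get("@graph", []):
        entity_id = entity.get("@id", "")
        if (not entity_id or urlparse(entity_id).scheme or entity_id.startswith("#")
                or entity_id in ("./", "ro-crate-metadata.json")):
            continue  # absolute, local-fragment, root and descriptor ids need no path
        dest = (root / unquote(entity_id)).resolve()
        if root not in dest.parents:
            raise ValueError("refusing to write outside the crate: %r" % entity_id)
        if "Dataset" in _types(entity):
            # Step 2: a relative Dataset id becomes a subdirectory.
            dest.mkdir(parents=True, exist_ok=True)
        elif "File" in _types(entity) and "downloadURL" in entity:
            # Step 3: fetch the datastream and write it at the path given by @id.
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(fetch(entity["downloadURL"]))
            written.append(dest)
    return written
```

Passing fetch as a parameter keeps the network access replaceable, for instance with an authenticated client or a stub in tests; the default simply reads the downloadURL with urllib.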

jmfernandez commented 2 years ago

Looks nice, @ptsefton @stain @simleo !! I have several questions, some of them offtopic.

simleo commented 2 years ago

I think I have found a neat solution to the problem we were having with letting "@id" in for a File be a URI - how would you save it to disk and re-construct the relative path structure of a package?

The current spec already allows File @ids to be URIs. In ro-crate-py, this is handled via a fetch_remote keyword argument that allows the library user to decide what happens when the crate is written out to disk (the user might not want to download the file -- or be able to do so -- for various reasons):

from rocrate.rocrate import ROCrate

crate = ROCrate()
url = "http://example.com/foo.txt"
# Download file; it will be placed under <CRATE_DIR>/examples when the crate is written out
crate.add_file(url, "examples/foo.txt", fetch_remote=True)
# Don't download file; its @id will still be a URI in the output crate
crate.add_file(url, fetch_remote=False)

In the latter case, a "url": "http://example.com/foo.txt" is automatically added to the entity; however, we're currently not doing that in the former case, and I now realize that we should. But maybe we should use "downloadUrl" rather than "url".

@jmfernandez I don't think there's any requirement for URL schemes to be http[s] in Schema.org.

UPDATE: https://schema.org/downloadUrl is only used in SoftwareApplication, so I guess we should use url

UPDATE 2: on second (third?) thought, automatically adding a "url" property in ro-crate-py is a bad thing. Users might want to create a local copy that doesn't list any remote reference (it could be considered a different crate, in a way), or specify a different URI (e.g., from a mirror), which is always possible via properties anyway.

stain commented 2 years ago

UPDATE: https://schema.org/downloadUrl is only used in SoftwareApplication, so I guess we should use url

@simleo url is what we've said in https://www.researchobject.org/ro-crate/1.1/data-entities.html#embedded-data-entities-that-are-also-on-the-web - but how about https://schema.org/contentUrl, which is defined for MediaObject (aka our File)? That would be the schema.org way of doing dcat:downloadURL, rather than url, which is just "URL of the item" (and therefore weird to understand when it's different from @id).
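
Since the thread has not settled on a single property, a consumer might have to look in several places for where a File can be fetched. The sketch below is purely illustrative: the function name fetch_location and the lookup order (contentUrl, then downloadURL, then url, then an absolute @id) are assumptions made for this example, not something agreed in this issue or defined by the specification.

```python
from urllib.parse import urlparse

# Assumed order of preference; nothing in the thread fixes this.
CANDIDATE_PROPERTIES = ("contentUrl", "downloadURL", "url")


def fetch_location(entity):
    """Return an absolute URL to fetch the entity from, or None if there is none."""
    for prop in CANDIDATE_PROPERTIES:
        value = entity.get(prop)
        if isinstance(value, dict):  # allow {"@id": "..."} references
            value = value.get("@id")
        if isinstance(value, str) and urlparse(value).scheme:
            return value
    # Fall back to the @id itself when it is already an absolute URI.
    entity_id = entity.get("@id", "")
    return entity_id if urlparse(entity_id).scheme else None
```

For an entity with a relative @id and no such property the function returns None, which corresponds to a purely local file.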

stain commented 1 year ago

Call 2023-03-23 agreed to merge all outstanding PRs.

Still outstanding is how to reconstruct the relative path. @simleo may also have thoughts on this now from the Workflow Run profile perspective, which also needed to do this.