Closed stain closed 1 year ago
From RO-Crate meeting 2022-01-27:
@base tricks. I suggest we use the term "Attached RO-Crate".
Suggested definition (from this pull request's structure.md):
There are two classes of RO-Crate detailed below:
Attached RO-Crate : A crate that has a well-defined RO-Crate Root directory and can carry an explicit payload of local data entities as regular files (combined with Web-based Data Entities where needed) using relative URIs. This type of RO-Crate can be suitable for long-term preservation, transfer and publishing, as the RO-Crate Metadata File is stored alongside the crate's payload.
If a crate makes any relative references then it is considered an Attached RO-Crate and the Root Dataset ID MUST be "./".
Detached RO-Crate : A crate without a defined payload directory. In this kind of crate, all data references are absolute. This approach may be suitable for use with dynamic web service APIs and repositories that can't preserve file paths. As the data of these crates can only be Web-based Data Entities, the payload is implicit and must be preserved/transferred/archived independent of the RO-Crate Metadata File.
See further definition of detached RO-Crate
I think this is necessary because of #183 allowing @id to be any ID, as proposed here in the new subsection Root Data Entity identifier:
If the @id of the Root Data Entity is an absolute URI, the Crate SHOULD NOT contain data entities using relative URI references, but MAY contain Web-based Data Entities using absolute URIs.
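To make the two rules above concrete (relative references imply a Root Dataset @id of "./", and an absolute root @id implies no relative data entities), here is a minimal sketch of a checker. It is not part of the proposal: `classify_crate` and `is_absolute` are hypothetical helper names, the example.com URLs are placeholders, and it assumes that data entities are exactly those listed under `hasPart` of the Root Dataset.

```python
from urllib.parse import urlparse


def is_absolute(uri: str) -> bool:
    """True if the URI reference has a scheme (https:, arcp:, urn:, ...)."""
    return bool(urlparse(uri).scheme)


def classify_crate(metadata: dict) -> str:
    """Return "attached" or "detached" for an RO-Crate metadata document.

    Assumption: the metadata descriptor is the entity whose @id ends in
    "ro-crate-metadata.json", and data entities are those listed in hasPart.
    """
    graph = metadata["@graph"]
    descriptor = next(
        e for e in graph if e["@id"].rsplit("/", 1)[-1] == "ro-crate-metadata.json"
    )
    root_id = descriptor["about"]["@id"]
    root = next(e for e in graph if e["@id"] == root_id)

    parts = root.get("hasPart", [])
    if isinstance(parts, dict):  # JSON-LD allows a single value instead of a list
        parts = [parts]

    uses_relative = any(not is_absolute(part["@id"]) for part in parts)
    if uses_relative and root_id != "./":
        raise ValueError('relative references require the Root Dataset @id "./"')
    return "detached" if is_absolute(root_id) else "attached"


# Placeholder documents, trimmed to the properties the check looks at.
attached = {"@graph": [
    {"@id": "ro-crate-metadata.json", "about": {"@id": "./"}},
    {"@id": "./", "hasPart": [{"@id": "data.csv"}]},
]}
detached = {"@graph": [
    {"@id": "https://example.com/crate/ro-crate-metadata.json",
     "about": {"@id": "https://example.com/crate/"}},
    {"@id": "https://example.com/crate/",
     "hasPart": {"@id": "https://example.com/crate/data.csv"}},
]}

print(classify_crate(attached))
print(classify_crate(detached))
```

The sketch raises an error for the mixed case (absolute root, relative data entity), although the quoted text only says SHOULD NOT; a real validator would probably warn instead.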
Terminology attached/detached RO-Crate agreed in RO-Crate meeting 2022-02-10.
I started drafting a section Converting from attached to detached. I just wanted to check if we are OK with what comes out of the JSON-LD flattening:
{
  "@context": [
    {"@base": "arcp://uuid,d6be5c9b-132a-4a93-9837-3e02e06c08e6/"},
    "https://w3id.org/ro/crate/1.1/context"
  ],
  "@graph": [
    {
      "@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json",
      "@type": "CreativeWork",
      "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
      "about": {"@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/"},
      "creator": {"@id": "https://orcid.org/0000-0001-9842-9718"}
    },
    {
      "@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/",
      "@type": "Dataset",
      "hasPart": [
        {"@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/index.html"},
        {"@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/example/"}
      ],
      "name": "Workflow RO-Crate profile"
    },
    {
      "@id": "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json#include-ComputationalWorkflow",
      "@type": "Recommendation",
      "category": "MUST",
      "name": "Include Main Workflow",
      "itemReviewed": {
        "@id": "https://bioschemas.org/ComputationalWorkflow"
      }
    }
  ]
}
ro-crate-metadata.json now has an absolute @id (breaks a SHOULD?)
#local identifiers become grounded in the original RO-Crate @id
For anything more "proper" I think you would need manual processing, e.g. manual deposit and rewrite of each data entity file, manual UUID for each contextual entity.
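As a rough illustration of the automatic part of that conversion, the sketch below resolves every `@id` in a graph against the URL the metadata file was served from, which reproduces the shape of the output above (including the `#local` identifier ending up grounded in `ro-crate-metadata.json`). This is plain relative-reference resolution with Python's `urljoin`, not a JSON-LD flattening implementation; `absolutize_ids` is a hypothetical helper, and the relative identifiers in `attached_graph` are my guess at what the attached input looked like.

```python
from urllib.parse import urljoin

# URL the RO-Crate Metadata File was retrieved from (taken from the output above).
BASE = "https://about.workflowhub.eu/Workflow-RO-Crate/1.0/ro-crate-metadata.json"


def absolutize_ids(node, base: str):
    """Return a copy of node with every "@id" string resolved against base."""
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            if key == "@id" and isinstance(value, str):
                # Absolute URIs are returned unchanged by urljoin.
                result[key] = urljoin(base, value)
            else:
                result[key] = absolutize_ids(value, base)
        return result
    if isinstance(node, list):
        return [absolutize_ids(item, base) for item in node]
    return node


# Assumed attached-style input, trimmed to the identifiers.
attached_graph = [
    {"@id": "ro-crate-metadata.json", "about": {"@id": "./"}},
    {"@id": "./", "hasPart": [{"@id": "index.html"}, {"@id": "example/"}]},
    {"@id": "#include-ComputationalWorkflow",
     "itemReviewed": {"@id": "https://bioschemas.org/ComputationalWorkflow"}},
]

detached_graph = absolutize_ids(attached_graph, BASE)
for entity in detached_graph:
    print(entity["@id"])
```

One caveat: `urljoin` only resolves relative references for schemes it treats as hierarchical (http, https, file and similar), so with an `arcp://` base it returns the relative reference unchanged; an arcp-based conversion would need a proper JSON-LD processor or the arcp library.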
I think we should recommend removing the {"@base": "arcp://uuid,.../"} from the converted output. It can be confusing (I was wondering what its purpose was until I read the section about converting to detached).
Are the UUIDs intended to be unique? I ask because people will copy and paste them, or hardcode them into their crate.
Regarding attached crates, can we deal with the relativity of paths using base: "./" or similar (or is that not allowed)? I know I have base: null in crates to stop JSON-LD libraries from messing with my paths, but I would have to refresh my memory.
If you keep the arcp-based UUID, you should add a small recipe for using Python's arcp library, or for generating those UUIDs in the URL namespace in a couple of programming languages.
# Name-based (deterministic) UUID in the URL namespace
import uuid
the_url = 'https://example.org'
the_uuid = uuid.uuid5(uuid.NAMESPACE_URL, the_url)
# str(the_uuid) is the hyphenated UUID string; the_uuid.hex is the same value without hyphens

# ARCP URI derived from the location of an archive
import arcp
the_arcp = arcp.arcp_location("http://example.com/data.zip", "/file.txt")
# the_arcp has the ARCP string representation

# Random UUID
import uuid
the_random_uuid = uuid.uuid4()
# str(the_random_uuid) is the hyphenated UUID string; the_random_uuid.hex omits the hyphens

# ARCP URI based on a random UUID
import arcp
the_random_arcp = arcp.arcp_random()
# the_random_arcp has the ARCP string representation
On reflection I don't think we need this attached/detached distinction. I think we should look at providing clear info about how to use relative and absolute paths for various resources.
Based on experience where we have implemented an API that uses the API URL as the @id but it is then not clear how to reconstitute a crate, I think that approach was a mistake. It might be better to go back to an approach where @ids are
1. For packaged crates-on-disk, use @base: null with relative paths for data entities.
2. For crates over an API, use the dcat:downloadURL property on Data Entities for the place where you can get a file and, as per (1) above, make its @id the filename it should have relative to the root. Use Identifier for IDs like DOIs.
Further to my last comment, @stain & @simleo: I think I have found a neat solution to the problem we were having with letting the "@id" for a File be a URI - how would you save it to disk and reconstruct the relative path structure of a package?
Solution: In RO-Crate Metadata Documents served from a service, leave the @ids as relative paths but use DCAT accessURL (to point to RO-Crate Metadata served over an API) and downloadURL for the actual datastream. We can then recommend a process for reconstituting an RO-Crate by using the @id to create directories and write file contents.
I have written this up in the work I was doing on a new intro - this detail probably does not all belong in the intro though.
Here's a copy and paste from that Google doc.
When an RO-Crate Metadata Document is served from a service, use the following DCAT properties:
dcat:accessURL – RO-Crate Metadata Documents
dcat:downloadURL - Direct downloads of bitstreams (files)
Client software that constructs RO-Crates SHOULD:
1. Save the RO-Crate Metadata File into an empty crate download directory.
2. For each Dataset that has a relative URI, make a subdirectory with the same path as the Dataset @id, e.g. /data/pictures.
3. For each File entity, fetch the datastream using downloadURL and write it to a relative path that corresponds to its "@id" (creating directories as needed, even if they are not described in the RO-Crate).
In the case where the RO-Crate Metadata Document is being served from a service:
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@type": "CreativeWork",
      "@id": "ro-crate-metadata.json",
      "conformsTo": {"@id": "https://w3id.org/ro/crate/2.0"},
      "about": {"@id": "./"}
    },
    {
      "@id": "./",
      "accessURL": "https://example.com/ro-crate/api/crate/000001",
      "@type": "Dataset",
      "datePublished": "2022-02-01",
      "name": "Example dataset for RO-Crate specification",
      "description": "If this were real data it would contain minute-by-minute rainfall readings for my weather gauge",
      "license": "CC BY-NC-SA 3.0",
      "hasPart": {"@id": "data.csv"}
    },
    {
      "@id": "data.csv",
      "downloadURL": "https://example.com/ro-crate/api/crate/000001?file=data.csv",
      "@type": "File",
      "encodingFormat": "text/csv",
      "name": "Rainfall data for Katoomba, NSW Australia, 2022-02-01",
      "license": "CC BY-NC-SA 3.0 AU"
    }
  ]
}
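The client-side steps described above could look roughly like the following sketch. It is only an illustration under assumptions: `reconstitute` and `local_path` are hypothetical names, the download function is injected so that the example runs without network access, the `pictures/gauge.jpg` entity is made up to show directory creation, and refusing `..` path segments is my own addition rather than something proposed in this thread.

```python
import json
import tempfile
from pathlib import Path, PurePosixPath
from typing import Callable
from urllib.parse import urlparse


def local_path(root: Path, rel_id: str) -> Path:
    """Map a relative @id onto a path below root, refusing escapes."""
    rel = PurePosixPath(rel_id)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"unsafe relative @id: {rel_id!r}")
    return root.joinpath(*rel.parts)


def reconstitute(metadata: dict, root: Path, fetch: Callable[[str], bytes]) -> None:
    """Rebuild a crate directory from a served RO-Crate Metadata Document."""
    # Step 1: save the metadata file into the (empty) download directory.
    root.mkdir(parents=True, exist_ok=True)
    (root / "ro-crate-metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
    for entity in metadata["@graph"]:
        entity_id = entity["@id"]
        # Only relative path @ids map onto the local directory structure.
        if urlparse(entity_id).scheme or entity_id.startswith("#"):
            continue
        types = entity.get("@type", [])
        types = [types] if isinstance(types, str) else types
        if "Dataset" in types:
            # Step 2: one subdirectory per Dataset with a relative @id.
            local_path(root, entity_id).mkdir(parents=True, exist_ok=True)
        elif "File" in types and "downloadURL" in entity:
            # Step 3: fetch the datastream and write it at the @id path.
            destination = local_path(root, entity_id)
            destination.parent.mkdir(parents=True, exist_ok=True)
            destination.write_bytes(fetch(entity["downloadURL"]))


example = {"@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset",
     "hasPart": [{"@id": "data.csv"}, {"@id": "pictures/gauge.jpg"}]},
    {"@id": "data.csv", "@type": "File",
     "downloadURL": "https://example.com/ro-crate/api/crate/000001?file=data.csv"},
    {"@id": "pictures/gauge.jpg", "@type": "File",  # hypothetical extra file
     "downloadURL": "https://example.com/ro-crate/api/crate/000001?file=pictures/gauge.jpg"},
]}


def fake_fetch(url: str) -> bytes:
    """Stand-in for a real HTTP download."""
    return b"placeholder bytes for " + url.encode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    crate_dir = Path(tmp) / "crate"
    reconstitute(example, crate_dir, fake_fetch)
    written = sorted(
        p.relative_to(crate_dir).as_posix() for p in crate_dir.rglob("*") if p.is_file()
    )
print(written)
```

A real client would pass a function that performs the HTTP request (for example one built on `urllib.request.urlopen`) and would need to decide what to do when a File has no `downloadURL`.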
Looks nice, @ptsefton @stain @simleo !! I have several questions, some of them offtopic.
@id to represent the internal, relative placement of the resource?
downloadURL predicate the recommended one to provide it?
downloadURL, is there some proper way to declare that the resource is under controlled access? For instance, data from EGA (European Genome-phenome Archive) or dbGaP (NCBI's database of Genotypes and Phenotypes).
I think I have found a neat solution to the problem we were having with letting "@id" in for a File be a URI - how would you save it to disk and re-construct the relative path structure of a package?
The current spec already allows File @ids to be URIs. In ro-crate-py, this is handled via a fetch_remote keyword argument that allows the library user to decide what happens when the crate is written out to disk (the user might not want to download the file -- or be able to do so -- for various reasons):
from rocrate.rocrate import ROCrate

crate = ROCrate()
url = "http://example.com/foo.txt"
# Download file; it will be placed under <CRATE_DIR>/examples when the crate is written out
crate.add_file(url, "examples/foo.txt", fetch_remote=True)
# Don't download file; its @id will still be a URI in the output crate
crate.add_file(url, fetch_remote=False)
In the latter case, a "url": "http://example.com/foo.txt" is automatically added to the entity; however, we're currently not doing that in the former case, and I now realize that we should. But maybe we should use "downloadUrl" rather than "url".
@jmfernandez I don't think there's any requirement for URL schemes to be http[s] in Schema.org.
UPDATE: https://schema.org/downloadUrl is only used in SoftwareApplication, so I guess we should use url
UPDATE 2: on second (third?) thought, automatically adding a "url" property in ro-crate-py is a bad thing. Users might want to create a local copy that doesn't list any remote reference (it could be considered a different crate, in a way), or specify a different URI (e.g., from a mirror), which is always possible via properties anyway.
UPDATE: https://schema.org/downloadUrl is only used in SoftwareApplication, so I guess we should use url
@simleo url is how we've said in https://www.researchobject.org/ro-crate/1.1/data-entities.html#embedded-data-entities-that-are-also-on-the-web - but how about https://schema.org/contentUrl, which is defined for MediaObject (aka our File) and which would be the schema.org way of doing dcat:downloadURL - rather than url, which is just "URL of the item" (and therefore weird to understand when it's different from @id)
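To illustrate what a consumer might have to do while the choice between these properties is still open, here is a small sketch of a lookup that tries each candidate in turn. `download_location` is a hypothetical helper, and the preference order (contentUrl, then downloadURL, then url, then an absolute @id) is purely my assumption for the example, not something agreed in this thread.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumed preference order; the thread has not settled on one.
CANDIDATE_PROPERTIES = ("contentUrl", "downloadURL", "url")


def download_location(entity: dict) -> Optional[str]:
    """Return an absolute URI to fetch the File's bytes from, if any is listed."""
    for prop in CANDIDATE_PROPERTIES:
        value = entity.get(prop)
        if isinstance(value, dict):  # allow {"@id": "..."} style references
            value = value.get("@id")
        if isinstance(value, str) and urlparse(value).scheme:
            return value
    # Fall back to the @id itself when it is already an absolute URI.
    if urlparse(entity["@id"]).scheme:
        return entity["@id"]
    return None


print(download_location({
    "@id": "data.csv",
    "contentUrl": "https://example.com/ro-crate/api/crate/000001?file=data.csv",
}))
print(download_location({"@id": "http://example.com/foo.txt"}))
print(download_location({"@id": "data.csv"}))
```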
Call 2023-03-23 agreed to merge all outstanding PRs.
There's an outstanding question of how to reconstruct the relative path -- @simleo may also have thoughts on this now from the Workflow Run profile perspective, which also needed this.
.. to support #183 my logical conclusion is that we need the concept of a Detached RO-Crate.
Suggested definition (from this pull request's structure.md):
Regular RO-Crate : A crate that has a well-defined RO-Crate Root directory and can carry an explicit payload of local data entities as regular files (combined with Web-based Data Entities where needed). This type of RO-Crate can be suitable for long-term preservation, transfer and publishing, as the RO-Crate Metadata File is stored alongside the crate's payload.
Detached RO-Crate : A crate without a defined payload directory. In this kind of crate, all data references are absolute. This approach may be suitable for use with dynamic web service APIs and repositories that can't preserve file paths. As the data of these crates can only be Web-based Data Entities, the payload is implicit and must be preserved/transferred/archived independent of the RO-Crate Metadata File.
See further definition of detached RO-Crate
I think this is necessary because of #183 allowing @id to be any ID, as proposed here in the new subsection Root Data Entity identifier.
And from that my logical conclusion is that the whole concept of "RO-Crate Root" and any relative URIs becomes ambiguous and difficult if we no longer have "@id": "./" for the Root Dataset and the URI that serves ro-crate-metadata.json is no longer grounded in something similar to a folder.
I would hope for some discussion on this in the RO-Crate meeting today 2022-01-27.