Open simleo opened 1 year ago
How about making the directory the Dataset, with one part that lives outside the directory?
{
"@id": "Mirax2-Fluorescence-2",
"@type": "Dataset", <-- or Collection (where the @id would have to be #Mirax2-Fluorescence-2, or could it be a URI?)
"mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
"hasPart": [
{"@id": "Mirax2-Fluorescence-2/Index.dat"},
{"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
{"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
{"@id": "Mirax2-Fluorescence-2.mrxs"},
...
]
}
Use https://schema.org/Collection as a contextual entity (referenced via mentions from the root) for grouping data entities? Collection can have hasPart properties.
Summing up the discussion we had at the latest meeting and suggestions here, the representation of:
https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip
would be:
{
"@id": "./",
"@type": "Dataset",
"hasPart": [
{"@id": "Mirax2-Fluorescence-2.mrxs"},
{"@id": "Mirax2-Fluorescence-2/"}
],
"mentions": [
{"@id": "https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip"}
]
},
{
"@id": "https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip",
"@type": "Collection",
"mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
"hasPart": [
{"@id": "Mirax2-Fluorescence-2.mrxs"},
{"@id": "Mirax2-Fluorescence-2/"},
]
},
{
"@id": "Mirax2-Fluorescence-2.mrxs",
"@type": "File"
},
{
"@id": "Mirax2-Fluorescence-2/",
"@type": "Dataset",
}
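A quick way to sanity-check a layout like this is to verify that every reference resolves to an entity defined in the graph. The following is only a sketch in plain Python: `collect_refs` and `dangling_refs` are hypothetical helper names, not part of any RO-Crate library, and the entities are assumed to sit in a flat list as they would under `@graph`.

```python
# Hypothetical helpers (not from an RO-Crate library): check that every
# hasPart / mainEntity / mentions reference points at a defined entity.

def collect_refs(value):
    """Yield every @id referenced by a property value (dict or list of dicts)."""
    if isinstance(value, dict) and "@id" in value:
        yield value["@id"]
    elif isinstance(value, list):
        for item in value:
            yield from collect_refs(item)


def dangling_refs(graph, props=("hasPart", "mainEntity", "mentions")):
    """Return the sorted references that do not resolve to an entity in the graph."""
    defined = {entity["@id"] for entity in graph}
    missing = set()
    for entity in graph:
        for prop in props:
            for ref in collect_refs(entity.get(prop, [])):
                if ref not in defined:
                    missing.add(ref)
    return sorted(missing)


ZIP_URL = "https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip"
PARTS = [{"@id": "Mirax2-Fluorescence-2.mrxs"}, {"@id": "Mirax2-Fluorescence-2/"}]

graph = [
    {"@id": "./", "@type": "Dataset", "hasPart": PARTS, "mentions": [{"@id": ZIP_URL}]},
    {
        "@id": ZIP_URL,
        "@type": "Collection",
        "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
        "hasPart": PARTS,
    },
    {"@id": "Mirax2-Fluorescence-2.mrxs", "@type": "File"},
    {"@id": "Mirax2-Fluorescence-2/", "@type": "Dataset"},
]

print(dangling_refs(graph))  # → []
```

Dropping the `Mirax2-Fluorescence-2/` entity from the list would make the check report that id, since both the root and the Collection still point at it.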
Or, rather, one of the representations. One might use a local id for the collection (this dataset is on the web, but that might not always be the case) and/or choose to list every single file:
{
"@id": "./",
"@type": "Dataset",
"hasPart": [
{"@id": "Mirax2-Fluorescence-2.mrxs"},
{"@id": "Mirax2-Fluorescence-2/Index.dat"},
{"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
{"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
...
{"@id": "Mirax2-Fluorescence-2/Data0023.dat"}
],
"mentions": [
{"@id": "#Mirax2-Fluorescence-2"}
]
},
{
"@id": "#Mirax2-Fluorescence-2",
"@type": "Collection",
"mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
"hasPart": [
{"@id": "Mirax2-Fluorescence-2.mrxs"},
{"@id": "Mirax2-Fluorescence-2/Index.dat"},
{"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
{"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
...
{"@id": "Mirax2-Fluorescence-2/Data0023.dat"}
]
},
{
"@id": "Mirax2-Fluorescence-2.mrxs",
"@type": "File"
},
{
"@id": "Mirax2-Fluorescence-2/Index.dat",
"@type": "File"
},
{
"@id": "Mirax2-Fluorescence-2/Slidedat.ini",
"@type": "File"
},
{
"@id": "Mirax2-Fluorescence-2/Data0000.dat",
"@type": "File"
},
...
{
"@id": "Mirax2-Fluorescence-2/Data0023.dat",
"@type": "File"
}
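Since the flat variant repeats the same file list three times, it would probably be generated rather than written by hand. Here is a minimal sketch of that, assuming the caller already knows the main file and the auxiliary files; `build_flat_graph` is a hypothetical helper, and only the files named explicitly in the example are passed in (the elided ones are left out).

```python
import json


def build_flat_graph(main_file, aux_files, collection_id):
    """Build the root Dataset, the Collection and one File entity per path."""
    paths = [main_file, *aux_files]
    refs = [{"@id": path} for path in paths]
    root = {
        "@id": "./",
        "@type": "Dataset",
        "hasPart": refs,
        "mentions": [{"@id": collection_id}],
    }
    collection = {
        "@id": collection_id,
        "@type": "Collection",
        "mainEntity": {"@id": main_file},
        # Copy the references so the two hasPart lists stay independent.
        "hasPart": [dict(ref) for ref in refs],
    }
    files = [{"@id": path, "@type": "File"} for path in paths]
    return [root, collection, *files]


# Only the paths spelled out in the example; the files behind "..." are omitted.
flat_graph = build_flat_graph(
    "Mirax2-Fluorescence-2.mrxs",
    [
        "Mirax2-Fluorescence-2/Index.dat",
        "Mirax2-Fluorescence-2/Slidedat.ini",
        "Mirax2-Fluorescence-2/Data0000.dat",
        "Mirax2-Fluorescence-2/Data0023.dat",
    ],
    "#Mirax2-Fluorescence-2",
)

print(json.dumps(flat_graph, indent=2))
```

Passing the zip URL instead of `#Mirax2-Fluorescence-2` as `collection_id` gives the web-identified variant of the same layout.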
Yet another possibility is to list every single file and the dataset, linking to the auxiliary files from hasPart in the latter:
{
"@id": "./",
"@type": "Dataset",
"hasPart": [
{"@id": "Mirax2-Fluorescence-2.mrxs"},
{"@id": "Mirax2-Fluorescence-2/"},
{"@id": "Mirax2-Fluorescence-2/Index.dat"},
{"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
{"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
...
{"@id": "Mirax2-Fluorescence-2/Data0023.dat"}
],
"mentions": [
{"@id": "#Mirax2-Fluorescence-2"}
]
},
{
"@id": "#Mirax2-Fluorescence-2",
"@type": "Collection",
"mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
"hasPart": [
{"@id": "Mirax2-Fluorescence-2.mrxs"},
{"@id": "Mirax2-Fluorescence-2/"},
]
},
{
"@id": "Mirax2-Fluorescence-2.mrxs",
"@type": "File"
},
{
"@id": "Mirax2-Fluorescence-2/",
"@type": "Dataset",
"hasPart": [
{"@id": "Mirax2-Fluorescence-2/Index.dat"},
{"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
{"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
...
{"@id": "Mirax2-Fluorescence-2/Data0023.dat"}
]
},
{
"@id": "Mirax2-Fluorescence-2/Index.dat",
"@type": "File"
},
{
"@id": "Mirax2-Fluorescence-2/Slidedat.ini",
"@type": "File"
},
{
"@id": "Mirax2-Fluorescence-2/Data0000.dat",
"@type": "File"
},
...
{
"@id": "Mirax2-Fluorescence-2/Data0023.dat",
"@type": "File"
}
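With this nested variant a consumer has to walk hasPart recursively to get from the Collection to the actual files. A rough sketch of that traversal follows, assuming a flat list of entities; `expand_files` is a hypothetical helper, and the sample graph only includes the files named explicitly above.

```python
def expand_files(graph, entity_id):
    """Return the @id of every File reachable from entity_id through hasPart."""
    by_id = {entity["@id"]: entity for entity in graph}
    files, seen = [], set()

    def visit(eid):
        # Skip ids already visited (guards against cycles) or not defined.
        if eid in seen or eid not in by_id:
            return
        seen.add(eid)
        entity = by_id[eid]
        types = entity.get("@type", [])
        types = [types] if isinstance(types, str) else types
        if "File" in types:
            files.append(eid)
        parts = entity.get("hasPart", [])
        parts = [parts] if isinstance(parts, dict) else parts
        for part in parts:
            visit(part["@id"])

    visit(entity_id)
    return files


AUX = [
    "Mirax2-Fluorescence-2/Index.dat",
    "Mirax2-Fluorescence-2/Slidedat.ini",
    "Mirax2-Fluorescence-2/Data0000.dat",
]

nested_graph = [
    {
        "@id": "#Mirax2-Fluorescence-2",
        "@type": "Collection",
        "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
        "hasPart": [
            {"@id": "Mirax2-Fluorescence-2.mrxs"},
            {"@id": "Mirax2-Fluorescence-2/"},
        ],
    },
    {"@id": "Mirax2-Fluorescence-2.mrxs", "@type": "File"},
    {
        "@id": "Mirax2-Fluorescence-2/",
        "@type": "Dataset",
        "hasPart": [{"@id": path} for path in AUX],
    },
    *[{"@id": path, "@type": "File"} for path in AUX],
]

for path in expand_files(nested_graph, "#Mirax2-Fluorescence-2"):
    print(path)
```

The same function applied to the flat variant returns its hasPart list unchanged, so a tool could treat both layouts uniformly.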
From @pauldg: some collections may not have a mainEntity, e.g. in Galaxy
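If mainEntity is optional, consumers would need to treat it that way. A tiny sketch of what that could look like on the reading side; `main_entity_id` and the `#some-collection` id are made up for illustration.

```python
def main_entity_id(collection):
    """Return the @id of the collection's mainEntity, or None if it has none."""
    ref = collection.get("mainEntity")
    return ref["@id"] if isinstance(ref, dict) else None


with_main = {
    "@id": "#Mirax2-Fluorescence-2",
    "@type": "Collection",
    "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
}
# Hypothetical collection with no designated main file.
without_main = {"@id": "#some-collection", "@type": "Collection"}

print(main_entity_id(with_main))     # → Mirax2-Fluorescence-2.mrxs
print(main_entity_id(without_main))  # → None
```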
This issue is the outcome of a discussion with @ilveroluca after last Tuesday's Workflow Run RO-Crate meeting, where we started wondering how to represent secondary files (as in CWL's secondaryFiles) in a Workflow RO-Crate.
The actual use case that gave rise to the discussion was the representation of a Mirax image, which consists of an .mrxs file and a directory. The directory contains data files, an index file, etc. In the CRS4 tissue/tumor prediction workflow, in order to have CWL pick up all these files, we're using secondaryFiles. However, those files are not really secondary, especially the data files, which contain the actual image data. Rather, all files together contribute to the same multi-file input dataset. An example of a format with a similar layout is Zarr.
The real question, then, is how to represent such a dataset in RO-Crate. In RO-Crate, a Dataset maps to a directory, while single files are represented by File (an alias for MediaObject). What about mixes of files and directories? One of the solutions we discussed is to recommend using hasPart on the main file:
However, in RO-Crate File represents a single file, and the files listed in hasPart are not actually parts (byte chunks) of Mirax2-Fluorescence-2.mrxs, but rather of the same dataset that includes Mirax2-Fluorescence-2.mrxs. Moreover, not all formats clearly identify a "main" file: in Zarr, for instance, .zattrs and .zarray are both metadata files at the same level.
Another option could be to change the RO-Crate spec so that Dataset would map to a mix of files and directories, rather than a single directory. This is encompassed by the schema.org definition, which is very general. However, such a change at this point, when several profiles and software packages already exist, would be very disruptive, especially for tools.
Though I've used an imaging example, the problem of representing a mix of files and directories as a single entity is quite general, so I think RO-Crate should have an explicit recommendation for this. Using a nested crate seems overkill, and depending on the format there might not be a single containing directory for the metadata file.
In principle, one could use CreativeWork, but it's probably too general: it would be hard for tools to identify a multi-file dataset as such. Add a custom type, e.g. CompositeDataset? Is there an existing type that could be a good fit instead? Collection? Does anyone know of existing attempts to represent such datasets in RO-Crate?
Another problem is what to put under @id, especially when there is no clearly identified "master" file, since all actual files would have to go under hasPart. Can internal references be used for data entities? E.g.: