Representing composite datasets

simleo commented 1 year ago

This issue is the outcome of a discussion with @ilveroluca after last Tuesday's Workflow Run RO-Crate meeting, where we started wondering how to represent secondary files (as in CWL's secondaryFiles) in a Workflow RO-Crate.

The actual use case that gave rise to the discussion was the representation of a Mirax image, which consists of:

a main file whose name ends with .mrxs;
a directory in the same location as the main file, with the same name minus the extension.

The directory contains data files, an index file, etc. In the CRS4 tissue/tumor prediction workflow, we're using secondaryFiles to have CWL pick up all these files. However, those files are not really secondary, especially the data files, which contain the actual image data. Rather, all files together contribute to the same multi-file input dataset. An example of a format with a similar layout is Zarr.
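As a rough illustration of the layout just described, here is a minimal Python sketch that collects the members of a Mirax image from the path of its main file. The function name mirax_members is hypothetical, and the sketch assumes only the naming rule stated above (the companion directory is the main file's name minus the .mrxs extension); it does not inspect the contents of the index or data files.

```python
from pathlib import Path


def mirax_members(main_file):
    """Return the main .mrxs file followed by every file in its companion directory."""
    main = Path(main_file)
    if main.suffix != ".mrxs":
        raise ValueError(f"not a Mirax main file: {main}")
    # The companion directory sits next to the main file and has the
    # same name minus the extension.
    companion = main.with_suffix("")
    if not companion.is_dir():
        raise FileNotFoundError(f"missing companion directory: {companion}")
    # Sort for a deterministic order; recurse in case of subdirectories.
    data_files = sorted(p for p in companion.rglob("*") if p.is_file())
    return [main, *data_files]
```

The returned list is what a workflow would need to stage together, regardless of whether the files are labelled "secondary" or not.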

The real question, then, is how to represent such a dataset in RO-Crate. In RO-Crate, a Dataset maps to a directory, while single files are represented by File (alias for MediaObject). What about mixes of files and directories? One of the solutions we discussed is to recommend using hasPart on the main file:

{
    "@id": "Mirax2-Fluorescence-2.mrxs",
    "@type": "File",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        ...
    ]
}

However, in RO-Crate File represents a single file, and the files listed in hasPart are not actually parts (byte chunks) of Mirax2-Fluorescence-2.mrxs, but rather of the same dataset that includes Mirax2-Fluorescence-2.mrxs. Moreover, not all formats clearly identify a "main" file: In Zarr, for instance, .zattrs and .zarray are both metadata files at the same level.

Another option could be to change the RO-Crate spec so that Dataset would map to a mix of files and directories, rather than a single directory. This is allowed by the schema.org definition, which is very general. However, now that several profiles and software packages already exist, such a change would be very disruptive, especially for tools.

Though I've used an imaging example, the problem of representing a mix of files and directories as a single entity is quite general, so I think RO-Crate should have an explicit recommendation for this. Using a nested crate seems overkill, and depending on the format there might not be a single containing directory for the metadata file.

In principle, one could use CreativeWork, but it's probably too general: it would be hard for tools to identify a multi-file dataset as such. Should we add a custom type, e.g. CompositeDataset? Is there an existing type that could be a good fit instead? Collection? Does anyone know of existing attempts to represent such datasets in RO-Crate?

Another problem is what to put under @id, especially when there is no clearly identified "master" file, since all actual files would have to go under hasPart. Can internal references be used for data entities? E.g.:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "#Mirax2Fluorescence2"}
    ]
},
{
    "@id": "#Mirax2Fluorescence2",
    "@type": "CompositeDataset",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        ...
    ]
}

ptsefton commented 1 year ago

How about making the directory the dataset, with a part that is outside of the directory?

{
    "@id": "Mirax2-Fluorescence-2",
    "@type": "Dataset",
    "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        ...
    ]
}

The @type could also be Collection, in which case the @id would have to be #Mirax2-Fluorescence-2 (or could it be a URI?).

stain commented 1 year ago

Use https://schema.org/Collection as a contextual entity (referenced via mentions from the root) for grouping data entities?

ptsefton commented 1 year ago

Collection can have hasPart properties.

simleo commented 1 year ago

Summing up the discussion we had at the latest meeting and the suggestions here, the representation of

https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip

would be:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/"}
    ],
    "mentions": [
        {"@id": "https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip"}
    ]
},
{
    "@id": "https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip",
    "@type": "Collection",
    "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/"}
    ]
},
{
    "@id": "Mirax2-Fluorescence-2.mrxs",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/",
    "@type": "Dataset"
}
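For anyone who wants to generate this compact form programmatically, a minimal Python sketch follows. The helper name compact_collection_graph and its parameters are hypothetical and not part of any RO-Crate library; it simply assembles the four entities shown above as plain dictionaries.

```python
def compact_collection_graph(collection_id, main_file, directory):
    """Build the four entities of the compact representation as a list of dicts."""
    if not directory.endswith("/"):
        # Directory identifiers end with a slash in the example above.
        directory += "/"
    part_ids = [main_file, directory]
    return [
        {
            "@id": "./",
            "@type": "Dataset",
            "hasPart": [{"@id": i} for i in part_ids],
            # The Collection is a contextual entity, so the root mentions it.
            "mentions": [{"@id": collection_id}],
        },
        {
            "@id": collection_id,
            "@type": "Collection",
            "mainEntity": {"@id": main_file},
            "hasPart": [{"@id": i} for i in part_ids],
        },
        {"@id": main_file, "@type": "File"},
        {"@id": directory, "@type": "Dataset"},
    ]
```

Passing the zip URL as collection_id reproduces the entities above; passing a local identifier such as "#Mirax2-Fluorescence-2" gives the same structure for data that is not on the web.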

Or, rather, one of the representations. One might use a local id for the collection (this dataset is on the web, but that might not always be the case) and/or choose to list every single file:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        ...
        {"@id": "Mirax2-Fluorescence-2/Data0023.dat"}
    ],
    "mentions": [
        {"@id": "#Mirax2-Fluorescence-2"}
    ]
},
{
    "@id": "#Mirax2-Fluorescence-2",
    "@type": "Collection",
    "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        ...
        {"@id": "Mirax2-Fluorescence-2/Data0023.dat"}
    ]
},
{
    "@id": "Mirax2-Fluorescence-2.mrxs",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/Index.dat",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/Slidedat.ini",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/Data0000.dat",
    "@type": "File"
},
...
{
    "@id": "Mirax2-Fluorescence-2/Data0023.dat",
    "@type": "File"
}

Yet another possibility is to list every single file and the dataset, linking to the auxiliary files from hasPart in the latter:

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/"},
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        ...
        {"@id": "Mirax2-Fluorescence-2/Data0023.dat"}
    ],
    "mentions": [
        {"@id": "#Mirax2-Fluorescence-2"}
    ]
},
{
    "@id": "#Mirax2-Fluorescence-2",
    "@type": "Collection",
    "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/"}
    ]
},
{
    "@id": "Mirax2-Fluorescence-2.mrxs",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2/Index.dat"},
        {"@id": "Mirax2-Fluorescence-2/Slidedat.ini"},
        {"@id": "Mirax2-Fluorescence-2/Data0000.dat"},
        ...
        {"@id": "Mirax2-Fluorescence-2/Data0023.dat"}
    ]
},
{
    "@id": "Mirax2-Fluorescence-2/Index.dat",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/Slidedat.ini",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/Data0000.dat",
    "@type": "File"
},
...
{
    "@id": "Mirax2-Fluorescence-2/Data0023.dat",
    "@type": "File"
}
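With this many cross-references between the root, the Collection, the directory and the individual files, it is easy to list an @id somewhere without describing the corresponding entity. Below is a small Python sketch of a consistency check; the function name dangling_references is hypothetical, and it assumes the graph is available as a flat list of dictionaries (for instance the @graph array of a parsed metadata file).

```python
def dangling_references(graph):
    """Return the @id values referenced by some entity but not defined in the graph."""
    defined = {entity["@id"] for entity in graph}
    referenced = set()
    for entity in graph:
        for key, value in entity.items():
            if key.startswith("@"):
                continue  # skip @id, @type and other JSON-LD keywords
            # Properties such as mainEntity hold one reference,
            # hasPart and mentions hold a list of them.
            values = value if isinstance(value, list) else [value]
            for item in values:
                if isinstance(item, dict) and "@id" in item:
                    referenced.add(item["@id"])
    return referenced - defined
```

An empty result means every hasPart, mentions and mainEntity target is described in the graph. Note that the check treats web URLs like any other identifier, so an external reference that is intentionally left undescribed would also be reported.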

simleo commented 1 year ago

From @pauldg: some collections may not have a mainEntity, e.g. in Galaxy
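One way to accommodate that case is to treat mainEntity as optional when building the Collection. The following Python sketch illustrates the idea; make_collection is a hypothetical helper, and the rule that a main entity must also appear among the parts is an assumption taken from the examples earlier in this thread, not something the RO-Crate spec is known to require.

```python
def make_collection(collection_id, part_ids, main_entity_id=None):
    """Build a Collection entity; mainEntity is added only when one is given."""
    collection = {
        "@id": collection_id,
        "@type": "Collection",
        "hasPart": [{"@id": part_id} for part_id in part_ids],
    }
    if main_entity_id is not None:
        # Assumption: as in the examples above, the main entity is
        # also listed under hasPart.
        if main_entity_id not in part_ids:
            raise ValueError(f"{main_entity_id} is not one of the parts")
        collection["mainEntity"] = {"@id": main_entity_id}
    return collection
```

For a collection with no main file, a call such as make_collection("#paired-reads", ["reads_1.fastq", "reads_2.fastq"]) (identifiers made up for illustration) yields an entity with hasPart only, while the Mirax case passes the .mrxs file as main_entity_id.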

ResearchObject / ro-crate

Representing composite datasets #213