Dataset: how to represent a collection of distinct related files

BioSchemas / specifications

Issue tracker, technical wiki, and example markup

https://bioschemas.org


Dataset: how to represent a collection of distinct related files #575

Open cmungall opened 2 years ago

cmungall commented 2 years ago

Let's say I have a directory full of files, perhaps the results of different genome annotation analyses, all run on the same sample.

With Frictionless, this might be represented as one DataPackage with multiple DataResources.
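
To make the Frictionless comparison concrete, here is a minimal sketch of such a descriptor built in Python. The package name and the three file names are hypothetical placeholders, not files from this issue; only the `name`, `resources`, `path`, and `format` keys come from the Data Package descriptor layout.

```python
import json

# Hypothetical file names standing in for the annotation outputs in the directory.
annotation_files = ["genes.gff3", "proteins.faa", "functional_annotation.tsv"]


def make_data_package(name, paths):
    """Build a Data Package style descriptor with one resource per file."""
    resources = []
    for path in paths:
        # Split "genes.gff3" into the stem "genes" and the extension "gff3".
        stem, _, extension = path.rpartition(".")
        resources.append(
            {"name": stem.replace("_", "-"), "path": path, "format": extension}
        )
    return {"name": name, "resources": resources}


# "sample-annotations" is an assumed package name used only for illustration.
package = make_data_package("sample-annotations", annotation_files)
print(json.dumps(package, indent=2))
```

The point of the sketch is the shape: one container object, with each distinct file as its own resource rather than as an alternative format of a single resource.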

DCAT has a series of examples of loosely structured datasets, e.g. example 57, which is analogous:

https://www.w3.org/TR/vocab-dcat-3/#ex-elaborated-bag

Here there is one "container" Dataset and multiple individual Datasets, each with its own serialization.

Is Bioschemas intended to be isomorphic to DCAT3? Should we use the same structure and link to the same documentation?

hasPart is in the profile but it has a very generic description:

Schema: Indicates an item or CreativeWork that is part of this item, or CreativeWork (in some sense). Inverse property: isPartOf

Or perhaps the container should be a catalog?

AlasdairGray commented 2 years ago

The Bioschemas Dataset profile is defined over the existing schema.org Dataset type which itself is drawn from DCAT (version 2).

There are of course multiple ways you could model this, and the choice would be up to the deployer of the markup: either as one Dataset with multiple parts that are themselves Datasets, or as a collection of Datasets. In both cases, there would also be a DataCatalog, which is the web site that makes the Datasets available.

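# A collection of Datasets: the catalog lists each one directly
:dc a DataCatalog ;
    dataset :x1, :x2, ...

# One Dataset with multiple parts: the catalog lists only the container
:dc a DataCatalog ;
    dataset :x .
:x hasPart :x1, :x2, ...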

I think that both of these are compatible with the proposed profile and it comes down to the markup developer's personal choice.
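
As a rough illustration, the two layouts sketched in this comment could be written out as schema.org JSON-LD along the following lines. The `https://example.org/` base IRI and the helper function names are assumptions made for this sketch; only `DataCatalog`, `Dataset`, `dataset`, and `hasPart` come from the discussion.

```python
import json

CONTEXT = "https://schema.org/"
BASE = "https://example.org/"  # hypothetical base IRI, not from the thread


def dataset_node(local_id):
    """A bare Dataset node identified by a local name such as 'x1'."""
    return {"@type": "Dataset", "@id": BASE + local_id}


def catalog_of_datasets(part_ids):
    """Collection layout: the catalog lists every Dataset directly."""
    return {
        "@context": CONTEXT,
        "@type": "DataCatalog",
        "@id": BASE + "dc",
        "dataset": [dataset_node(i) for i in part_ids],
    }


def catalog_with_container(container_id, part_ids):
    """Container layout: the catalog lists one Dataset that has the others as parts."""
    container = dataset_node(container_id)
    container["hasPart"] = [dataset_node(i) for i in part_ids]
    return {
        "@context": CONTEXT,
        "@type": "DataCatalog",
        "@id": BASE + "dc",
        "dataset": [container],
    }


flat = catalog_of_datasets(["x1", "x2"])
nested = catalog_with_container("x", ["x1", "x2"])
print(json.dumps(flat, indent=2))
print(json.dumps(nested, indent=2))
```

Both functions describe the same two files; the only difference is whether the grouping is expressed by the catalog itself or by an intermediate container Dataset.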

cmungall commented 2 years ago

It seems that modeling this as one Dataset with multiple distributions would be discouraged, though? Even if the cardinality of distribution is >1 (#575), it seems the intent is for distribution to model an alternate serialization of the same data, rather than different parts of the dataset?

AlasdairGray commented 2 years ago

In my examples, I didn't get to the distributions. These would be added to each Dataset using the distribution property, which should allow many values, since the same data could be available in different RDF serialisations, as CSV, or in a multitude of other formats.

To keep things semi-concrete, the markup would become

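# A collection of Datasets, each with its own distributions
:dc a DataCatalog ;
    dataset :x1, :x2, ...
:x1 a Dataset ;
    distribution :x1csv, ...

# One container Dataset whose parts each have their own distributions
:dc a DataCatalog ;
    dataset :x .
:x hasPart :x1, :x2, ...
:x1 a Dataset ;
    distribution :x1csv, ...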
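
A minimal sketch of the container layout with distributions attached, written as schema.org JSON-LD from Python. It assumes each distribution is a `DataDownload` described with `contentUrl` and `encodingFormat`; the base IRI, the file URLs, and the choice of formats are hypothetical and only illustrate that distributions are alternative serializations of one part, while the distinct files are modeled as separate part Datasets.

```python
import json

BASE = "https://example.org/"  # hypothetical base IRI, not from the thread


def part_dataset(dataset_id, formats):
    """One Dataset whose distributions are alternative serializations of the same data.

    `formats` maps a file extension to a media type, e.g. {"csv": "text/csv"}.
    """
    return {
        "@type": "Dataset",
        "@id": BASE + dataset_id,
        "distribution": [
            {
                "@type": "DataDownload",
                "@id": BASE + dataset_id + ext,  # e.g. .../x1csv, mirroring :x1csv
                "contentUrl": f"{BASE}files/{dataset_id}.{ext}",  # assumed URL layout
                "encodingFormat": media_type,
            }
            for ext, media_type in formats.items()
        ],
    }


# x1 is offered in two serializations; x2 in one.
x1 = part_dataset("x1", {"csv": "text/csv", "ttl": "text/turtle"})
x2 = part_dataset("x2", {"csv": "text/csv"})

# The container groups the parts and carries no distribution of its own here.
container = {"@type": "Dataset", "@id": BASE + "x", "hasPart": [x1, x2]}
catalog = {
    "@context": "https://schema.org/",
    "@type": "DataCatalog",
    "@id": BASE + "dc",
    "dataset": [container],
}
print(json.dumps(catalog, indent=2))
```

Dropping the container and listing `x1` and `x2` directly under the catalog's `dataset` property would give the other layout without changing how the distributions are attached.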