ISA-tools / isa-specs

ISA Model and Serialization Specifications
http://isa-specs.readthedocs.io

Data object extension #15

Open HLWeil opened 1 year ago

HLWeil commented 1 year ago

Preface

Hey, here are our proposed adjustments to the data model and documentation, intended to let ISA describe data objects thoroughly. For reference, here is the discussion about this topic: https://github.com/ISA-tools/isa-api/discussions/484

General Goals

Our goal here is to improve the description of data using the ISA model.

Currently, the description given in the ISA model points only to the file, not to a location inside it. This is insufficient when the file format is not well understood, or when the actual data object resulting from a measurement or computation is not a full file but a value or set of values within a file.

So we wanted to enhance the data object with two things:

Changes made

Datamodel

We came up with the following data model:

| Property | Datatype | Description |
| -- | -- | -- |
| File name | String | A file name or full path referencing a data file produced by the related process that MAY be packaged with, or is accessible via, the ISA reference implementation content. |
| Pointer | String | A pointer referencing a location inside the data file. This SHOULD always be specified when the data of interest is not the complete file, but a specific part of it. |
| Generated By | String | A file name, full path or identifier referencing the tool with which this data object was generated. |
| Explication | Ontology Annotation | An ontology annotation qualifying what the data describes. |
| Unit | Ontology Annotation | The unit qualifying the value stored in the data object. |
| Object Type | Ontology Annotation | Specifies the format in which the value in the data object will be stored. |

ISA Json

This results in the following JSON schema:

{
    "$schema": "http://json-schema.org/draft-04/schema",
    "title": "ISA data schema",
    "description": "JSON-schema representing a data object in the ISA model",
    "type": "object",
    "properties": {
        "@id": { "type": "string", "format": "uri" },
        "filename": {
            "type": "string"
        },
        "pointer": {
            "type": "string"
        },
        "type": {
            "type": "string",
            "enum": [
                "Raw Data File",
                "Derived Data File",
                "Image File"
            ]
        },
        "generatedBy": {
            "type": "string"
        },
        "explication": {
            "$ref": "ontology_annotation_schema.json#"
        },
        "unit": {
            "$ref": "ontology_annotation_schema.json#"
        },
        "objectType": {
            "$ref": "ontology_annotation_schema.json#"
        },
        "label": {
            "type": "string"
        },
        "comments" : {
            "type": "array",
            "items": {
                "$ref": "comment_schema.json#"
            }
        }
    },
    "additionalProperties": false
}
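To make the schema more concrete, here is a minimal sketch of what an instance might look like, together with a few hand-written checks that mirror the schema's constraints. The file name, pointer syntax, tool name and annotation values are all made up for illustration, and `check_data_object` is a hypothetical helper, not part of any ISA tooling. The `annotationValue` key follows the existing ontology annotation schema.

```python
# Hypothetical instance of the proposed data object; every value is illustrative.
data_object = {
    "@id": "#data/measurements.csv#cell=2,3",
    "filename": "measurements.csv",
    "pointer": "cell=2,3",  # assumed pointer syntax, not fixed by the proposal
    "type": "Derived Data File",
    "generatedBy": "peak_picker.py",
    "explication": {"annotationValue": "peak intensity"},
    "unit": {"annotationValue": "arbitrary unit"},
    "objectType": {"annotationValue": "float"},
    "comments": [],
}

# Property names and enum values copied from the proposed schema.
ALLOWED_KEYS = {
    "@id", "filename", "pointer", "type", "generatedBy",
    "explication", "unit", "objectType", "label", "comments",
}
ALLOWED_TYPES = {"Raw Data File", "Derived Data File", "Image File"}
STRING_KEYS = ("@id", "filename", "pointer", "generatedBy", "label")


def check_data_object(obj):
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    extra = set(obj) - ALLOWED_KEYS
    if extra:
        # Mirrors "additionalProperties": false
        problems.append(f"unexpected properties: {sorted(extra)}")
    if "type" in obj and obj["type"] not in ALLOWED_TYPES:
        problems.append(f"invalid type: {obj['type']!r}")
    for key in STRING_KEYS:
        if key in obj and not isinstance(obj[key], str):
            problems.append(f"{key} must be a string")
    return problems


print(check_data_object(data_object))
```

This only covers the top-level constraints; a real validation would run the full schema, including the referenced ontology annotation and comment schemas, through a JSON Schema validator.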

ISA Tab

To integrate these model extensions into the ISA Tab Format, we propose two adjustments:

To enable processes to point into data files, we propose to add a new column, Data Pointer, to the Assay file. This column should be used to qualify the Data File column when the data object resulting from the process is not the full data file, but a value or set of values within it.

Additionally, to give context about the values in the data file, we propose to add a new file to the ISA Tab family, the Dataset file, which carries all the other data fields that we added to the data model.
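As a rough illustration of the first adjustment, an assay table with the proposed Data Pointer column could be read as sketched below. The table content, the `cell=row,column` pointer syntax and the `read_data_objects` helper are assumptions made for this example. An empty Data Pointer cell is taken to mean that the data object is the whole file.

```python
import csv
import io

# Hypothetical tab-delimited assay excerpt; "Data Pointer" is the proposed column.
ASSAY_TAB = (
    "Sample Name\tAssay Name\tRaw Data File\tData Pointer\n"
    "sample1\tassay1\tmeasurements.csv\tcell=2,3\n"
    "sample2\tassay1\tmeasurements.csv\tcell=3,3\n"
    "sample3\tassay2\tspectrum.raw\t\n"
)


def read_data_objects(text):
    """Yield (file name, pointer or None) for each row of the assay table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        # An empty cell means the data object is the complete file.
        pointer = row.get("Data Pointer") or None
        yield row["Raw Data File"], pointer


for filename, pointer in read_data_objects(ASSAY_TAB):
    print(f"{filename}#{pointer}" if pointer else filename)
```

Here two samples share one file and are distinguished only by their pointers, which is the case the current model cannot express.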

Aux

Open Questions

stain commented 1 year ago

See also https://www.w3.org/TR/annotation-model/#selectors on how fragment selectors are different for different media types. You need to indicate the type of pointer, either as a prefix or pointertype. The media type of filename will then also be essential (equivalent to encodingFormat in RO-Crate for IANA Media type) so the client can know how to resolve the pointer.

muehlhaus commented 1 year ago

I agree with Stain! We need a pointerType and encodingFormat
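
A rough sketch of why both fields matter: the client has to pick a resolver based on the media type and the kind of pointer. The `resolve_pointer` function, the `csv-cell` pointer type name and the sample data are hypothetical. The `cell=row,column` form is loosely modelled on the fragment identifiers RFC 7111 defines for text/csv.

```python
import csv
import io


def resolve_pointer(content, encoding_format, pointer_type, pointer):
    """Resolve a pointer inside file content; only one combination is sketched."""
    if encoding_format == "text/csv" and pointer_type == "csv-cell":
        # Assumed syntax: "cell=<row>,<col>", both 1-based, header row counted as row 1.
        _, _, coords = pointer.partition("=")
        row_text, col_text = coords.split(",")
        rows = list(csv.reader(io.StringIO(content)))
        return rows[int(row_text) - 1][int(col_text) - 1]
    # Without a known media type and pointer type the pointer is just an opaque string.
    raise ValueError(f"no resolver for {pointer_type!r} on {encoding_format!r}")


CSV_CONTENT = "sample,replicate,intensity\ns1,1,0.82\ns2,1,0.91\n"
print(resolve_pointer(CSV_CONTENT, "text/csv", "csv-cell", "cell=2,3"))
```

The same pointer string would mean something else, or nothing at all, for an HDF5 file or an image, which is why the pointer alone is not enough.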