matthiaskoenig commented 7 years ago

Issue

In L1V3 only NuML, CSV and TSV are defined. We have to add a section to the spec describing additional formats.

XLSX
HDF5 (used a lot, big advantage of being binary and very compact, required for large datasets)
JSON (used a lot, especially in the context of transferring data on the web for web apps to work with SED-ML)

Proposal

Define the respective URIs

urn:sedml:format:xlsx
urn:sedml:format:hdf5
urn:sedml:format:json with the restriction of the allowed data and DimensionDescriptions.

This requires the ability to specify complex sources, i.e. nested files and parts of files.

jonrkarr commented 3 years ago

EDAM (see #94) already has terms for all of these formats.

I second the use of HDF5. This is key for large datasets.

For structured datasets, another format that might make sense is SQLite.

luciansmith commented 3 years ago

I added hdf5 as an option, as it's clearly already getting a ton of use. Here's what I put in for its section (after the CSV/TSV descriptions):

HDF5 (Hierarchical Data Format version 5)

The format HDF5 is defined at https://portal.hdfgroup.org/display/HDF5/HDF5. It supports the storage of multidimensional data, and is therefore ideal for storing the SED-ML output of repeated tasks, particularly nested repeated tasks.

Each dimension of SED-ML RepeatedTask output should be labeled according to the id of the SED-ML object that describes that dimension, namely:

The id of the top-level RepeatedTask
The id of the SubTask
The id of any nested SubTask (for arbitrarily-deeply nested subtasks).
The dimension of the data itself (i.e. time for a UniformTimeCourse).
The id of the requested variable, or the infix representation of the Math from the DataGenerator.

Each dimension may also be annotated in this format, with some ontology such as the ’Semanticscience Integrated Ontology’ (SIO, https://bioportal.bioontology.org/ontologies/SIO)
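As a rough illustration of the labeling order described above, here is a minimal Python sketch. It is not part of the spec text: the `ReportedDataSet` container, the `dimension_labels` helper and all ids are hypothetical, and it assumes that each listed item contributes exactly one label, outermost first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReportedDataSet:
    """Hypothetical container describing one multidimensional result."""
    repeated_task_id: str    # id of the top-level RepeatedTask
    subtask_ids: List[str]   # ids of the SubTask and any nested SubTasks, outermost first
    data_dimension: str      # dimension of the data itself, e.g. "time"
    variable_label: str      # variable id, or infix form of the DataGenerator Math


def dimension_labels(ds: ReportedDataSet) -> List[str]:
    """Return one label per item, in the order listed in the text above."""
    return [ds.repeated_task_id, *ds.subtask_ids, ds.data_dimension, ds.variable_label]


# Assumed example ids; none of these come from a real SED-ML document.
example = ReportedDataSet(
    repeated_task_id="repeated_task_1",
    subtask_ids=["subtask_1", "nested_subtask_1"],
    data_dimension="time",
    variable_label="S1 + S2",
)
labels = dimension_labels(example)
print(labels)
```

How these labels are actually written into an HDF5 file (attributes, dimension scales, or something else) is left open here.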

luciansmith commented 3 years ago

I didn't add xlsx or JSON or SQLite. I can, though those might be more complicated?

jonrkarr commented 3 years ago

The id of the top-level RepeatedTask
The id of the SubTask
The id of any nested SubTask (for arbitrarily-deeply nested subtasks).

This information is only straightforward for datasets when datasets derive from a single top-level task. Data sets which arise from computations spanning the results of multiple tasks won't have a single top-level task id or clear semantics for other dimensions.

There are multiple options around this:

Focus storage of raw results on variables rather than on reports
Disallow calculations involving multiple tasks
Particularly when calculations involve multiple tasks, allow investigators to annotate their meaning and copy this information into files which contain results (e.g., HDF, JSON, XLSX, etc.)

I think L1V4 could say something like "when data generators only contain results from a single task, we recommend that reports of their results contain the following metadata ...". Dealing with this properly could be punted to L2.

jonrkarr commented 3 years ago

If JSON is being used, I feel like that would benefit from its own explanation, since there are multiple ways the data could be encoded.
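To make the ambiguity concrete, here is a small Python sketch of two plausible layouts for the same report. Both layouts, the ids `time` and `S1`, and the values are assumptions made up for illustration; neither layout is proposed by the spec.

```python
import json

# Hypothetical report: one data set sampled at three time points.
time = [0.0, 1.0, 2.0]
s1 = [10.0, 8.0, 6.5]

# Layout A: column-oriented, one array per data set id.
column_oriented = {"time": time, "S1": s1}

# Layout B: row-oriented, one object per time point.
row_oriented = [{"time": t, "S1": v} for t, v in zip(time, s1)]

encoded_a = json.dumps(column_oriented)
encoded_b = json.dumps(row_oriented)

# Both decode to the same values, but a reader has to know which layout to expect.
decoded_a = json.loads(encoded_a)
decoded_b = json.loads(encoded_b)
recovered_s1 = [row["S1"] for row in decoded_b]
print(decoded_a["S1"] == recovered_s1)
```

A spec section for JSON would need to pick one layout (or describe how a file declares its layout) and say how extra dimensions are nested.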

luciansmith commented 3 years ago

You're right that I should include a bit about the RemainingDimensions, but I don't know of any other way to reduce the dimensionality of SED-ML data through computation, given that we require all calculations to be element-by-element, and for cross-matrix data calculations to have identical dimensions.

I don't know of anyone using JSON; if anyone is, I would invite them to write about how they're using it to encode this data!

luciansmith commented 3 years ago

OK, I updated the HDF5 section to include:

"When a DependentVariable is used to reduce the dimensionality of a set of data, the ids of whatever dimensions remain should be used (defined by its RemainingDimension children). The dimensions may be annotated to describe the dimension reduction as well. When a DataGenerator contains a DependentVariable that outputs a matrix, that matrix can also be labeled appropriately (such as with species or reaction ids).

When output from multiple tasks are combined mathematically, their dimensions must match exactly, so the ids from either (or a combination of both) may be used. Again, annotations are recommended to describe how the data was combined."

I also added this bit to the DataGenerator class:

"When multidimensional data is output to a Report, information about the dimensions should be stored in the output format chosen for the report, such as CSV or HDF5."

(Both CSV and HDF5 are links to the relevant sections.)
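For what it's worth, the "dimensions must match exactly" rule for element-by-element calculations could be checked along these lines. This is only a sketch under the assumption that results are held as rectangular nested Python lists; `shape` and `combine_elementwise` are hypothetical helper names, not anything from a SED-ML library.

```python
import operator


def shape(data):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(data, list):
        dims.append(len(data))
        data = data[0] if data else None
    return tuple(dims)


def combine_elementwise(a, b, op):
    """Apply op element by element, refusing inputs whose shapes differ."""
    if shape(a) != shape(b):
        raise ValueError(f"shape mismatch: {shape(a)} vs {shape(b)}")
    if isinstance(a, list):
        return [combine_elementwise(x, y, op) for x, y in zip(a, b)]
    return op(a, b)


# Two hypothetical 2x2 task outputs combined by addition.
task1 = [[1, 2], [3, 4]]
task2 = [[10, 20], [30, 40]]
combined = combine_elementwise(task1, task2, operator.add)
print(combined)
```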

jonrkarr commented 3 years ago

When output from multiple tasks are combined mathematically, their dimensions must match exactly,

I think this will conflict with making the number of steps optional. Calculations beyond the shape of the smallest input can be defined to be NaN.
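A minimal Python sketch of what I mean, for the one-dimensional case only. The helper name `add_with_nan_padding` and the sample values are made up for illustration, and addition stands in for whatever the data generator's math is.

```python
import math
from itertools import zip_longest


def add_with_nan_padding(a, b):
    """Add two 1-D results element by element.

    Positions beyond the end of the shorter input are computed with NaN,
    so the result has the length of the longer input and is NaN there.
    """
    nan = float("nan")
    return [x + y for x, y in zip_longest(a, b, fillvalue=nan)]


# Hypothetical outputs of two tasks with different numbers of steps.
long_result = [1.0, 2.0, 3.0]
short_result = [10.0, 20.0]
result = add_with_nan_padding(long_result, short_result)
print(result)
```

Multidimensional results would need the same rule applied along each axis.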

so the ids from either (or a combination of both) may be used.

I don't think this is needed. The results of calculations are assigned to data generators, which have ids. Users can set these ids to be meaningful strings as with all other ids.

SED-ML / sed-ml

Add URIs & definitions for additional data formats (XLSX, HDF5, JSON) #52
