SED-ML / sed-ml

Simulation Experiment Description Markup Language (SED-ML)
http://sed-ml.org

Clarify datasets and reports (labels and multi-dimensional data-generators), label missing from UML #134

Closed matthiaskoenig closed 3 years ago

matthiaskoenig commented 3 years ago

Currently, the description of datasets and reports is very vague.

The issues I have are:

  1. The required attribute label is missing from the UML diagram of DataSet.
  2. A sentence stating that labels must be unique within a report is missing.
  3. It is not specified what should happen with multi-dimensional outputs, and the wording is too restrictive (e.g. "column" does not work for higher-dimensional data generators). The report is a report of the data in data generators and is in general not a 2D table (only in some special cases, such as single time courses). The report is in general a HashMap[label, dataGenerator] and should be described as such.

I suggest the following updates:

Old text

The Report class defines a data table consisting of several single instances of the DataSet in the child listOfDataSets (Figure 2.24). Its output returns the simulation result processed via DataGenerators in actual numbers. The columns of the report table are defined by creating an instance of the DataSet for each column. ... DataSets are labeled references to instances of the DataGenerator class. Each data set in a Report must have an unambiguous label. A label is a human readable descriptor of a data set for use in a Report. For example, for a tabular data set of time series results, the label could be the column heading.

New text

The Report class defines a data map consisting of several single instances of the DataSet in the child listOfDataSets (Figure 2.24). Its output returns the simulation result processed via DataGenerators in actual numbers. The elements of the report are defined by creating an instance of the DataSet for each element of the report and are identified by the label of the DataSet. ... DataSets are labeled references to instances of the DataGenerator class. Each data set in a Report must have an unambiguous label. A label is a human readable descriptor of a data set for use in a Report. In general, the Report is a map between labels and data from DataGenerators, but it can be interpreted as a data table for certain tasks. For example, in the special case of time series results, the report could be a tabular data set with the label being the column heading and the time series results being the columns.
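
The proposed wording can be illustrated with a minimal Python sketch. It is not part of the specification: `build_report` and `as_table` are hypothetical helper names, and nested lists stand in for whatever array type an implementation would actually use. It only shows the idea of a report as a map from labels to DataGenerator data, with the table view available only in the one-dimensional special case.

```python
from typing import Dict, List, Sequence, Tuple


def build_report(
    data_sets: Sequence[Tuple[str, str]],
    generator_data: Dict[str, list],
) -> Dict[str, list]:
    """Map each DataSet label to the data of the DataGenerator it references.

    data_sets holds (label, dataReference) pairs; generator_data maps a
    DataGenerator id to its (possibly nested) values. Both are assumptions
    made for this sketch.
    """
    report: Dict[str, list] = {}
    for label, data_reference in data_sets:
        if label in report:
            # Labels must be unambiguous within one report.
            raise ValueError(f"duplicate DataSet label: {label!r}")
        report[label] = generator_data[data_reference]
    return report


def as_table(report: Dict[str, list]) -> Tuple[List[str], List[list]]:
    """Return (column headers, rows); valid only for equal-length 1D entries."""
    labels = list(report)
    columns = [report[label] for label in labels]
    if any(isinstance(value, (list, tuple)) for col in columns for value in col):
        raise ValueError("multi-dimensional entries have no 2D table view")
    if len({len(col) for col in columns}) > 1:
        raise ValueError("entries differ in length; no 2D table view")
    rows = [list(row) for row in zip(*columns)]
    return labels, rows


report = build_report(
    [("time", "dg_time"), ("S1", "dg_s1")],
    {"dg_time": [0.0, 1.0, 2.0], "dg_s1": [10.0, 8.0, 6.5]},
)
headers, rows = as_table(report)
```

Here the time course case yields the headers `time` and `S1`, while a report whose entries are nested lists is still a valid map but is rejected by `as_table`.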

fbergmann commented 3 years ago

Good catch on 1! I have to say I still find it confusing to have both a label and a name attribute, but OK.

The old text already had "Each data set in a Report must have an unambiguous label." This does mean that it is unique, right?

I don't mind the changes for 2 and 3, but they do not add anything for me.

matthiaskoenig commented 3 years ago

Good point. The sentence about the unambiguous labels solves issue 2. Point 3 is just a bit of clarification/re-formulation which should not affect current implementations, but should make things clearer for new implementations.

luciansmith commented 3 years ago

Re: both 'name' and 'label': this is behavior inherited from earlier versions of SED-ML so that L1v4 can be backwards compatible. However, if we like, we could claim that they literally set the same value, like 'numberOfSteps' vs. 'numberOfPoints'. Would that be helpful?

matthiaskoenig commented 3 years ago

For me it would be nice having name and label. I would use these as

fbergmann commented 3 years ago

But we also have the name on the data generator.

matthiaskoenig commented 3 years ago

@Frank Good point. The name of the datagenerator is sufficient for me. So we could have only a single required name/label attribute on DataSet.

matthiaskoenig commented 3 years ago

The specification contains an example with both name and label, so both attributes are necessary for backwards compatibility.

<listOfDataSets>
  <dataSet id="d1" name="v1 over time" dataReference="dg1" label="_1"/>
</listOfDataSets>
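
To show how name and label sit side by side on a DataSet, here is a minimal sketch that reads such a fragment with Python's standard `xml.etree.ElementTree`. It assumes a self-closing `dataSet` element and no XML namespace (a real SED-ML file declares one), and `read_data_sets` is a hypothetical helper, not an API of any SED-ML library.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the fragment, assumed here for illustration.
FRAGMENT = """
<listOfDataSets>
  <dataSet id="d1" name="v1 over time" dataReference="dg1" label="_1"/>
</listOfDataSets>
"""


def read_data_sets(xml_text: str) -> dict:
    """Return {label: attributes}, rejecting missing or repeated labels."""
    root = ET.fromstring(xml_text)
    by_label = {}
    for element in root.iter("dataSet"):
        label = element.get("label")
        if label is None:
            raise ValueError("dataSet is missing the required label attribute")
        if label in by_label:
            raise ValueError(f"label {label!r} is not unique in this report")
        by_label[label] = {
            "id": element.get("id"),
            "name": element.get("name"),  # optional, human-readable
            "dataReference": element.get("dataReference"),
        }
    return by_label


data_sets = read_data_sets(FRAGMENT)
```

In this sketch the label acts as the key of the report, while the name is carried along as optional descriptive metadata.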

fbergmann commented 3 years ago

Do we need to add something as to which is supposed to be used where? Your clarification, I think, makes it clear to use the label for column headers. I'm thinking of the use case of serializing a report to file. What would you do with the name then? Add a second header row with the names? (And yes, that issue existed before, but since we are adding clarifications, we might as well.)

matthiaskoenig commented 3 years ago

It is not defined what to do with the name or how to store the report (i.e. in what formats). Currently the specification states:

The encoding of simulation results is not part of SED-ML Level 1 Version 4.

For me it would be great if we could add a recommendation to the specification along the lines of:

luciansmith commented 3 years ago

I've updated the text to simply say:

"The encoding of simulation results is not part of SED-ML \currentLV, \changed{but it is recommended that 2D output be exported as CSV files, using the \element{label} as column headers, and that output with more dimensions be exported as HDF5, again using the \element{label} to uniquely identify the data sets.}"

If we need more, that's fine with me as well.
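
As a rough illustration of the CSV half of that recommendation, the sketch below serializes a 2D report with the labels as the header row, using only Python's standard `csv` module. `report_to_csv` is a hypothetical helper and the in-memory report layout (label mapped to a flat list of values) is an assumption. The HDF5 case is left out because it would need a third-party library.

```python
import csv
import io
from typing import Dict


def report_to_csv(report: Dict[str, list]) -> str:
    """Serialize a 2D report to CSV text, one column per DataSet label."""
    labels = list(report)
    columns = [report[label] for label in labels]
    for label, column in zip(labels, columns):
        if any(isinstance(value, (list, tuple)) for value in column):
            # More than one dimension per data set: CSV is not recommended.
            raise ValueError(f"data set {label!r} is not one-dimensional")
    if len({len(column) for column in columns}) > 1:
        raise ValueError("data sets differ in length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(labels)          # labels become the column headers
    writer.writerows(zip(*columns))  # one row per time point
    return buffer.getvalue()


csv_text = report_to_csv({"time": [0.0, 1.0], "S1": [10.0, 8.5]})
```

Only the label appears in the output; what to do with the name attribute on export remains open in the thread.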