SED-ML / sed-ml

Simulation Experiment Description Markup Language (SED-ML)
http://sed-ml.org

Write Output to file (define output description for reports and plots) #33

Open matthiaskoenig opened 7 years ago

matthiaskoenig commented 7 years ago

Issue

Currently, L1V3 does not define how output results should be stored. We need a way to store outputs in files/COMBINE archives, i.e. it is not sufficient to define the output; we also need to define how the output should be stored and in which resource.

Not having a defined output format creates many issues. First it makes it difficult to compare results and check reproducibility. We need this especially for the SED-ML testsuite, i.e. to compare the results of the testsuite against implementation results.

A defined output format will simplify implementation, because we can provide reference outputs (reports) for examples. We need a clear definition of how the output file format should look (NuML, CSV), and it should mirror the allowed input formats for data, so that we are consistent in which files we read and write.

Examples

Proposals

A solution would be to use the Resource, which allows retrieving/storing complex resources, i.e. it would describe what the stored file should look like and in which format. One output destination could be the "monitor/console", which is currently the default for outputs.

OutputDescription(SED-Base):
    output: SIdRef (Output), i.e. report or plot reference
    resource: SIdRef (Resource), i.e. to which file to write and in which format 

This could then have subclasses for report (mimicking the dataDescription import) and plot

reportDescription(OutputDescription):
   format: anyURI (numl, csv, tsv, ...) 
   dimensionDescription: 

plotDescription(OutputDescription):
   format: anyURI (image format pdf, tif, png, ...) 
   size {use: optional}
   resolution {use: optional}
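
As a rough illustration only, the proposed classes could be modelled as Python dataclasses. The attribute names follow the proposal above; the Python types, the snake_case spelling of dimensionDescription, and the example values (including the format strings) are assumptions made for this sketch and are not part of any SED-ML specification. The content of dimensionDescription is left open because the proposal does not define it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class OutputDescription:
    output: str    # SIdRef to an Output, i.e. a report or plot
    resource: str  # SIdRef to a Resource, i.e. which file to write to


@dataclass
class ReportDescription(OutputDescription):
    format: str  # anyURI naming the report format (numl, csv, tsv, ...)
    # The proposal leaves dimensionDescription undefined, so it stays opaque here.
    dimension_description: Optional[object] = None


@dataclass
class PlotDescription(OutputDescription):
    format: str  # anyURI naming the image format (pdf, tif, png, ...)
    size: Optional[Tuple[int, int]] = None  # optional; (width, height) is assumed
    resolution: Optional[int] = None        # optional; unit is assumed to be dpi


# Hypothetical ids and format values, for illustration only.
report = ReportDescription(output="report1", resource="res1", format="csv")
plot = PlotDescription(output="plot1", resource="res2", format="png", resolution=300)
```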

See also

jonrkarr commented 3 years ago

For BioSimulators, we've needed outputs to be much more concrete. SED-ML is underspecified. As Matthias points out, this generates a variety of issues.

This is what we've aligned on for BioSimulators: https://biosimulators.org/standards/simulation-reports. We have 10 tools producing output in this format, spanning multiple model languages, modeling frameworks, and algorithms.

luciansmith commented 3 years ago

The SED-ML editors have consistently rejected any attempt to do this over the years, but you can always try again! The explanation always seems to make sense at the time, and then I walk away and the next day can't figure out why it made sense. @nickerso : would you mind explaining once more why a 'filename' attribute on Output objects goes against SED-ML philosophy?

jonrkarr commented 3 years ago

To chime in again, a clear protocol for output is imperative. Output is necessary to enable investigators to do further processing and visualization that is beyond the scope of SED-ML (at least without making SED-ML much more complex). Without output, SED-ML is a dead end -- only simple analyses and plots can be done and anything beyond that is treated as a hack.

If we can't agree on a filename attribute, I would be comfortable with agreeing to derive output file paths from the id attributes of outputs.

Regarding a filename attribute, I think it would be important for there to be a way to indicate a relative path so that a COMBINE archive can have multiple SED-ML files and each one can save outputs to different, non-overlapping paths. One possibility is to direct the output of each SED-ML file to a distinct subdirectory.

matthiaskoenig commented 3 years ago

> If we can't agree on a filename attribute, I would be comfortable with agreeing to derive output file paths from the id attributes of outputs.

Yes, we could write a recommendation or best practice, so that tools and services that want to write files can do so in a consistent and exchangeable manner.

jonrkarr commented 3 years ago

If we're not all comfortable with best practices, they could be described in an external document (web page) that's less formal. I think SBOL does a good job of separating the two so that there's a stream of SBOL that can evolve quickly.

nickerso commented 3 years ago

Agree that this is a best practice issue. Filenames make no sense for applications that don't make files and that would derive their own form of identifiers as needed. Also, a filename on its own is not sufficient to define a serialised output, and I really don't think we should go down that rabbit hole for L1V4.

jonrkarr commented 3 years ago

A filename attribute could be optional. I don't see why the fact that some tools don't produce outputs needs to prevent other tools from being able to do so clearly.

nickerso commented 3 years ago

Having a filename specified sets the expectation that a file is the desired outcome of executing the simulation experiment. But a filename also doesn't give any information on what to serialise to that file, how that filename should be interpreted in terms of file storage, or the size, resolution, or format for image files, etc. So at best, all a filename will do is potentially further confuse users while giving no more information than the spec suggesting that the output id is a suitable value for a unique filename in the scope of a given SED-ML document.

Not sure how a simple filename attribute would in any way make producing outputs more clear....

matthiaskoenig commented 3 years ago

The one problem I see is that different tools will overwrite the files.

I.e. if I execute an OMEX archive that creates a report ./results/report1.h5 with my tool, the file will be created in the archive. The next tool touching the archive will overwrite the file, because nothing in the filename distinguishes one tool from another. The ideal case for me would be tool-dependent filenames such as ./results/{tool_id}/report1.h5. This would allow executing the archive with multiple tools and collecting the results. The same applies to ./results/{tool_id}/figure1.svg and ./results/{tool_id}/plot1.svg.

If we could come up with a best practice how to clearly state where to write reports, plots and figures in a tool dependent manner this would be great.
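
A minimal sketch of the tool-dependent pattern ./results/{tool_id}/... described above, assuming each tool knows its own identifier. The function name `tool_output_path` and the tool ids `toolA` and `toolB` are hypothetical and used only for illustration.

```python
from pathlib import PurePosixPath


def tool_output_path(tool_id: str, filename: str, results_dir: str = "results") -> PurePosixPath:
    """Build results/{tool_id}/{filename} so that tools do not overwrite each other."""
    return PurePosixPath(results_dir) / tool_id / filename


# Two hypothetical tools writing the same report end up in separate directories.
path_a = tool_output_path("toolA", "report1.h5")
path_b = tool_output_path("toolB", "report1.h5")
print(path_a)
print(path_b)
```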

jonrkarr commented 3 years ago

Yes, outputs need to pipe to unique locations.

Unique locations can be derived from the locations of the SED-ML files within COMBINE archives. I think a natural solution is to use subdirectories that mirror the structure of the COMBINE archive:

{ path/to/results (chosen by user -- user could choose to include the name of the tool in this) }
    /{ sed-ml-location in COMBINE archive }
        /{ output-id }.{ output extension }
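
The layout above could be sketched roughly as follows. This is one possible reading of the template, in which the SED-ML file's location inside the archive (including its filename) becomes a subdirectory. The function name `output_path` and the example values are hypothetical.

```python
from pathlib import PurePosixPath


def output_path(results_dir: str, sedml_location: str, output_id: str, extension: str) -> PurePosixPath:
    """Return {results_dir}/{sedml_location}/{output_id}.{extension}."""
    # Drop a leading "/" so the archive location is always treated as relative;
    # PurePosixPath itself collapses a leading "./".
    location = PurePosixPath(sedml_location.lstrip("/"))
    return PurePosixPath(results_dir) / location / f"{output_id}.{extension}"


# Two SED-ML files in one archive, each with an output called "report1".
first = output_path("out/toolA", "./sim/exp1.sedml", "report1", "csv")
second = output_path("out/toolA", "./sim/exp2.sedml", "report1", "csv")
print(first)
print(second)
```

Because the SED-ML location is part of the path, outputs with the same id in different SED-ML files of one archive do not collide, and the user-chosen results directory can carry the tool name.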