SED-ML / sed-ml

Simulation Experiment Description Markup Language (SED-ML)
http://sed-ml.org

Create URNs for SBML fbc, multi, and qual packages #74

Open jonrkarr opened 3 years ago

jonrkarr commented 3 years ago

To facilitate the use of SED-ML beyond kinetic models encoded in SBML, I think it would be helpful to add URNs for the SBML fbc, multi, and qual packages. Because these packages are largely used separately from the core of SBML and largely implemented by different software tools, they function for investigators as substantially distinct formats. Separate URNs, or some other mechanism to express that a model uses an SBML package, would help investigators know which simulation tools are needed to execute a given model and help software tools determine whether they have the capabilities needed to execute a given model.
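To make the proposal concrete, here is a minimal sketch of how package-specific URNs might be consumed. The `.fbc` suffix on the language URN is hypothetical, not part of any published SED-ML URN scheme; the SED-ML namespace is the real L1V3 one.

```python
# Hypothetical example: package-specific language URNs on SED-ML model
# elements. The ".fbc" URN suffix is invented for illustration.
import xml.etree.ElementTree as ET

SEDML_SNIPPET = """<listOfModels xmlns="http://sed-ml.org/sed-ml/level1/version3">
  <model id="m1" language="urn:sedml:language:sbml.fbc" source="model.xml"/>
  <model id="m2" language="urn:sedml:language:sbml" source="core.xml"/>
</listOfModels>"""

NS = "{http://sed-ml.org/sed-ml/level1/version3}"

def models_requiring(urn_prefix, sedml_text):
    """Return ids of models whose language URN starts with the given prefix."""
    root = ET.fromstring(sedml_text)
    return [m.get("id") for m in root.iter(NS + "model")
            if m.get("language", "").startswith(urn_prefix)]

print(models_requiring("urn:sedml:language:sbml.fbc", SEDML_SNIPPET))  # ['m1']
```

A tool could use such a check to decide up front whether it can handle a model, without opening the model file itself.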

nickerso commented 3 years ago

Is there no way in SBML to know what packages are in use? This sounds more like an SBML issue than a SED-ML one...

I'm pretty sure people mix SBML-comp with other packages, but are there other combinations that are actually used in practice? Would you need to define URNs for all possible combinations?

jonrkarr commented 3 years ago

This can be assessed by inspecting the SBML file.

The issue is that because several of the SBML extensions are fairly distinct, to the community they function almost as separate formats -- covered by separate specification documents, executed by different tools, deposited to different repositories, used by distinct communities. This could be more clearly communicated in SED-ML through more specific URNs.
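For reference, package usage can be read off the raw SBML file itself: each SBML Level 3 package declares a distinctive namespace URI on the `<sbml>` element. A minimal sketch, using only the standard library (the fbc version-2 namespace below is the published one):

```python
# Sketch: discover which SBML Level 3 packages a model declares by scanning
# its namespace URIs. Works on the raw text, so no SBML library is needed.
import re

SBML_FILE = """<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core"
      xmlns:fbc="http://www.sbml.org/sbml/level3/version1/fbc/version2"
      level="3" version="1" fbc:required="false">
  <model id="example"/>
</sbml>"""

# L3 package namespaces have the form .../level3/versionN/<package>/versionM;
# the core namespace ends in /core and is deliberately not matched.
PACKAGE_NS = re.compile(r"http://www\.sbml\.org/sbml/level3/version\d+/(\w+)/version\d+")

def declared_packages(sbml_text):
    """Return the set of L3 package short names declared in the document."""
    return set(PACKAGE_NS.findall(sbml_text))

print(declared_packages(SBML_FILE))  # {'fbc'}
```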

Defining URNs for each combination is not the only possible solution. An alternative is to capture a list of URNs.

nickerso commented 3 years ago

I guess the question is whether a URN in a SED-ML document is going to help users?

Software tools would simply pass any SBML model document to libSBML to discover more about it, right? Is the idea to be able to have a SED-ML supporting tool discover this kind of information about the model without having to first load the document into a SBML tool to find out?

We have a similar issue with CellML, although without the different specs, in terms of DAE vs ODE vs algebraic models, and have generally assumed that tools would use libCellML to discover that kind of information about a model.

jonrkarr commented 3 years ago

Yes, this would make it easier for tools and investigators to know whether they should be able to execute a model, or perhaps that a model should be skipped because the tool doesn't expect to be able to execute it. Because SBML is heterogeneous, determining which tools are capable of executing a model isn't trivial. Rather than asking tools to infer whether they have the capabilities to execute a model by checking for the usage of all packages and classes, I think it would be easier for models to declare the capabilities that they require. Language URNs are SED-ML's closest mechanism for this.

One place this can be relevant is with archives that include multiple types of models. While BioUML might be able to execute all of them, VCell or tellurium, for example, wouldn't. This could lead VCell or tellurium to fail on the whole archive, instead of executing the tasks that they can and skipping the others. Of course, there are multiple ways that tools can try to infer the capabilities necessary to execute a model.

While people who are intimately familiar with SBML can likely sort through all of this, this issue creates confusion for the broader community of users of models.

I'm trying to be cognizant that tools aren't required to use libSBML or libCellML. Even if they do, they may not use information about the required packages that libSBML collects.

luciansmith commented 3 years ago

My inclination here is to handle this in the SED-ML algorithm: if the analysis is a flux balance analysis, it should know to expect the fbc package from an SBML model, or similar constructs in the CellML model, etc.
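This inclination could be sketched as a lookup from the simulation algorithm's KiSAO term to the package it implies. The KiSAO ids below are, to my understanding, the real ones for flux balance analysis and CVODE, but the mapping table itself is illustrative:

```python
# Sketch: infer the expected SBML package from the KiSAO algorithm named in
# the SED-ML file. The mapping is illustrative, not an official table.
ALGORITHM_EXPECTED_PACKAGE = {
    "KISAO:0000437": "fbc",   # flux balance analysis
    "KISAO:0000019": None,    # CVODE: core kinetic model, no package implied
}

def expected_package(kisao_id):
    """Return the SBML package an algorithm implies, or None for core SBML."""
    return ALGORITHM_EXPECTED_PACKAGE.get(kisao_id)

print(expected_package("KISAO:0000437"))  # fbc
```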

There are indeed parts of SBML files that different tools support or don't, and you have to find the right tool, but this is true even when you restrict yourself to the core specification: for a long time, Copasi didn't support SBML events, for example. Other things like 'non-constant compartment sizes' or 'MathML that has trig functions', or any of the other 20+ tags in the SBML test suite would also be candidates for distinguishing what set of tools are appropriate for running a given model. Or as David notes, DAE/ODE/algebraic models for CellML.

Currently, I think it should be relatively straightforward for a 'delegating tool' to parse the model file directly (whether SBML or CellML), gather information about what constructs it contains, and report back about what tools may be used accordingly.

However, I could see the utility of something in SED-ML that provided this additional information about the model directly, maybe as a 'features' child of a Model object, or some such? It would be an optional thing that could list the salient features of the model, at least in the mind of the person putting together the SED-ML file. The only issue is that this would repeat the information contained in the model itself, which worries me a little, simply in that the two could diverge.
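A hypothetical shape for such a 'features' child, with element and attribute names invented purely for illustration:

```python
# Hypothetical sketch of a 'features' child on a SED-ML Model element, as
# described above. All names here are invented, not part of the SED-ML spec.
import xml.etree.ElementTree as ET

MODEL_WITH_FEATURES = """<model id="m1" language="urn:sedml:language:sbml" source="model.xml">
  <features>
    <feature id="NonConstantCompartment"/>
    <feature id="EventWithDelay"/>
  </features>
</model>"""

def listed_features(model_xml):
    """Return the feature ids declared on a model element."""
    model = ET.fromstring(model_xml)
    return [f.get("id") for f in model.iter("feature")]

print(listed_features(MODEL_WITH_FEATURES))
# ['NonConstantCompartment', 'EventWithDelay']
```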

jonrkarr commented 3 years ago

Delegating this to individual tools, which don't really know how to handle this, is basically the status quo. The SBML test suite probably comes closest to handling this, but simulation tools don't have access to that functionality because it's not (straightforwardly) provided by libSBML. Asking tools to handle this, especially if libSBML doesn't make it easy, will likely continue to result in few tools doing so.

luciansmith commented 3 years ago

'Does this model use the FBC package' is indeed information straightforwardly provided by libSBML. If tools aren't checking that already, I'm pretty sure that adding the same information to the SED-ML file won't change the situation, unfortunately. (You can also find out the information from a basic string match on the file, for that matter.)

However, I thought you were talking about SED-ML-specific tools that parse SED-ML and parcel out simulation experiments to one or more tools that they know about. Am I wrong? What workflow are you envisioning here?

jonrkarr commented 3 years ago

The issue is that when an investigator reads a publication about a simulation study or obtains one from a repository such as BioModels, they need to know which tool(s) they can use to execute it. This information is not contained in COMBINE, SED-ML, or SBML unless it's in a comment. BioModels does not clearly communicate this either. While the small community of people who know about all of this can navigate through this, it's confusing for the average investigator, who doesn't necessarily even focus on modeling. Somewhere in the ecosystem it should be easier to figure out which tools can execute a COMBINE archive, SED-ML file, or model. The blanket URN for SBML perpetuates this confusion.

I would disagree about it being straightforward to determine which SBML features are needed to execute a model. While it's easy to query whether an SBML package is used, as you point out, there are many SBML features that need to be checked for. The ones you point out are not as straightforward to check for.

luciansmith commented 3 years ago

OK! That helps a lot.

Your initial proposal was simply to add URNs for SBML packages; those are the things that are the easiest to find from an SBML model. But I absolutely agree that other aspects of what a model contains (like the ones tagged in the test suite) are often not trivial at all, and that in the situation you describe, some hints would be nice.

What if we had something like this:

https://docs.google.com/drawings/d/1333lnFe_gvJ5jTqiXkCuHB4Njj_DV7SxPRbFD1Z57D0/edit

(Something like this could be added right now as an annotation, of course, but if it was made official, it might look something like this.)

jonrkarr commented 3 years ago

I tried to use a salient example. But the issue is more complicated than just SBML packages.

Something like what you outline would work. To make this practical, it would also be helpful for the core libraries (e.g., libSBML) to assess the features used by a model.

nickerso commented 3 years ago

Looks good to me. It should likely be on the Task class, right? As it would be the list of features a tool would need to support for the application of a specific simulation algorithm to a model... and after any changes are made to the model.

matthiaskoenig commented 3 years ago

Just some additions:

luciansmith commented 3 years ago

The list of model features could potentially be split into language-dependent and language-independent lists, I think. But even there, some things just make sense as something to check in one context but not in another. Take 'variable compartment size' for example: if an SBML model with variable compartment sizes was translated to CellML, the compartment variable would then look like any other variable, so nothing in particular would be gained by asking if a CellML tool supported compartment variation.

So overall, if we do indeed develop such a list, I would recommend trying to make it language-agnostic when we can, but to not worry about making language-specific tags if need be. (I'd definitely recommend having that list on our website and not written into the spec, so we can update it as needed.)

As far as the potential 100's of tags: in developing the test suite, we've tried to come up with a reasonable tag list that doesn't delve into the weeds, but provides enough of a feature overview that if you claim to support X, you really need to support all the tests in the suite with X, unless it comes with a different thing Y you don't support. So 'SBMLNonConstantCompartmentViaRateRule' just gets 'NonConstantCompartment', and we test all the various 'via...' options, but don't label each one: if you support non-constant compartments, you should support all the ways that might happen.

It's sort of an art, but overall, it's basically the difference between a feature set and a bug: if you don't implement a particular feature set, you're going to fail all of those tests, and if you support all but two of them, those two are due to bugs. We do occasionally get feedback from tool developers asking to refine particular tags because they support X but not Y, and we lump both X and Y together: in those cases, we'll break apart X and Y into their own tags.
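The lumping policy described above amounts to a normalization table over tags. A minimal sketch, with tag names modeled on the SBML test suite's style but the table itself illustrative:

```python
# Sketch of the tag-lumping policy: fine-grained 'Via...' variants collapse
# to one feature-set tag until tool feedback shows they need splitting.
# The table is illustrative, not the test suite's actual tag list.
TAG_LUMPING = {
    "NonConstantCompartmentViaRateRule": "NonConstantCompartment",
    "NonConstantCompartmentViaAssignmentRule": "NonConstantCompartment",
    "NonConstantCompartmentViaEvent": "NonConstantCompartment",
}

def normalize_tag(tag):
    """Map a fine-grained test tag to its feature-set tag (identity if unlisted)."""
    return TAG_LUMPING.get(tag, tag)

print(normalize_tag("NonConstantCompartmentViaRateRule"))  # NonConstantCompartment
```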

I guess the usefulness of any given tag would be: Does this usefully distinguish between simulation tool design? And can it be used to point out bugs in tools when found?

As far as updating libSBML to support this, that seems relatively straightforward. I already wrote the code you'd need as a separate program for the test suite; it's more of a design issue than anything else at this point.

matthiaskoenig commented 3 years ago

@luciansmith thanks for the clarification. I think the list of required features for models/simulations and supported features for simulators is a very good idea. As you stated, this could be easily generated for SBML simulators/tools from the test suite results and also be made part of libSBML (which would be great). This would also allow more fine-grained support by simulators. E.g., most simulators just do not simulate models with distrib (because they don't know the package), even if the model only uses distrib for uncertainties (which are annotations and do not change the simulation). Same thing with comp models. For many of my coupled models I define ports for the submodels, but also want to simulate the submodels in isolation. Because of the comp package, simulators often do not accept the model (but ports are only annotations). By having a list of simulation-relevant features, support for SBML packages would not be all-or-nothing.

As a side note: we face the same issue at many steps of SED-ML, not only at the model. I.e., which SED-ML features are supported by a simulator/tool? E.g., what kind of model changes, what kind of ranges, what kind of URIs, ...

jonrkarr commented 3 years ago

Yes, the same issue exists for support for SED-ML features.

In BioSimulators, we've created a place for simulation tools to describe the features of each format that they support. If developers annotate this information, it can be shared with the community. If the community can agree on how to describe features, tools could automatically process this information. SED-ML URNs and EDAM are too coarse, although either could be made to work. I think this needs granularity similar to that of the SBML test suite.
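Once both sides declare features, matching is a set-containment check. A sketch of the registry idea described above, with tool names and feature tags invented for illustration:

```python
# Sketch: match a model's required features against tools' declared
# capabilities, as in a BioSimulators-style registry. Names are illustrative.
REQUIRED = {"fbc", "NonConstantCompartment"}

TOOL_CAPABILITIES = {
    "tool_a": {"core", "fbc", "NonConstantCompartment", "Events"},
    "tool_b": {"core", "Events"},
}

def capable_tools(required, registry):
    """Return tools whose declared capabilities cover every required feature."""
    return sorted(name for name, caps in registry.items() if required <= caps)

print(capable_tools(REQUIRED, TOOL_CAPABILITIES))  # ['tool_a']
```

The granularity of the tags, not the matching logic, is the hard part, as discussed above.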

jonrkarr commented 3 years ago

This would be addressed by using EDAM in place of URNs (#94). Terms for these packages are in the EDAM curation pipeline.