SED-ML / sed-ml

Simulation Experiment Description Markup Language (SED-ML)
http://sed-ml.org

Add higher-level validation rules described in SED-ML validator manuscript to B1.1 #182

Closed jonrkarr closed 3 years ago

jonrkarr commented 3 years ago

https://arxiv.org/abs/2106.00844

This covers more "advanced" rules such as:

luciansmith commented 3 years ago

Do you have a list of these rules? I'm more than happy to add them in.

jonrkarr commented 3 years ago

They're in the supplementary materials of the manuscript linked above. I've copied them below. The first items are intended to encapsulate everything that's currently stated in the appendix.

\subsection{SED-ML files}
\begin{itemize}[leftmargin=6pt]
\item SED-ML files are syntactically consistent with the SED-ML schema (implemented using SED-ML).
    \begin{itemize}[labelwidth=0pt]
    \item Files are encoded in the SED-ML schema.
    \item Each required attribute is defined.
    \item The values of Boolean, integer, and float-valued attributes are valid Booleans, integers, and floats.
    \end{itemize}
\item Each element has a unique id.
\item The references that are required in specific contexts are set in those contexts (e.g., each variable of each data generator has a task reference).
\item The references which are not allowed in specific contexts are not set in those contexts (e.g., no variable of a data generator has a model reference).
\item Each reference (e.g., from a task to a model) can be resolved.
\item URLs should be used in place of URNs for model sources because URNs will likely be deprecated in a future version of SED-ML (warning).
\item Each model source can be resolved.
\item The networks of references are acyclic:
    \begin{itemize}[labelwidth=0pt]
        \item Model references.
        \item Compute model changes.
        \item Repeated tasks.
        \item Functional ranges.
    \end{itemize}
\item Each container element has at least one child:
    \begin{itemize}[labelwidth=0pt]
    \item Repeated tasks have at least one subtask.
    \item Reports have at least one data set.
    \item 2D plots have at least one curve.
    \item 3D plots have at least one surface.
    \end{itemize}
\item The source of each model is encoded in its language. 
    \begin{itemize}[labelwidth=0pt]
    \item CellML-encoded models: The validator uses LibCellML (\href{https://libcellml.org}{https://libcellml.org}) to check that each model source is a valid CellML file. 
    \item NeuroML-encoded models: The validator uses LibNeuroML \citep{vella2014libneuroml} to check that each model source is a valid NeuroML file. 
    \item SBML-encoded models: The validator uses LibSBML \citep{bornstein2008libsbml} to check that each model source is a valid SBML file. 
    \item Other XML-encoded models: The validator checks that each model source is a valid XML file. 
    \item Other formats: The validator warns users that these model sources could not be validated.
    \end{itemize}
\item For XML-encoded models, the namespaces required to describe each model change and variable of each data generator are defined.
\item For XML-encoded models, each element of each model change is an XML element or a list of XML elements. 
\item For XML-encoded models, each target of each static model change is a valid XPath.
\item Each uniform range has at least one step.
\item When a repeated task has multiple ranges, the secondary ranges are at least as long as the primary range.
\item The output start time of each time course simulation is at least its initial time.
\item The end time of each time course simulation is at least its output start time.
\item Each time course has at least one step.
\item Each time course has a number of steps evenly divisible by five (warning). This rule was motivated by the observation that many SED-ML files in BioModels have extra, unintended steps due to historical confusion about the meaning of the SED-ML attribute that stores this information and to bugs in some software tools that create SED-ML files. Note that we are helping to address the underlying causes of both issues through revisions to the SED-ML specifications and fixes to these software tools.
\item The KiSAO id of each algorithm and the KiSAO id of each algorithm parameter are ids of KiSAO terms for specific algorithms and algorithm parameters.
\item Each parameter of an algorithm has a unique KiSAO id.
\item For XML-encoded models, each target of each variable of each data generator is a valid XPath that matches a single model element (see the sketch after this list).
\item Each mathematical expression can be evaluated (e.g., all symbols are defined).
\item Each data set has a unique label within each report (warning).
\item The axes of the curves and surfaces of plots have consistent (log or linear) scales (warning).
\item The data generators for each plot are linked to basic simulation tasks and not repeated tasks (warning). The validator raises warnings for such plots because the SED-ML specifications do not officially support such plots, and some software tools do not have the capability to create these plots.
\item Each task contributes to at least one output (warning).
\item Each data generator contributes to at least one output (warning).
\item The shapes of the outputs of the sub-tasks of each repeated task are consistent (warning).
\item The shapes of the outputs of the variables of each data generator are consistent (warning).
\item The shapes of the outputs of the data sets of each report are consistent (warning).
\item The shapes of the outputs of the x and y data of each curve are consistent (warning).
\item The shapes of the outputs of the x, y, and z data of each surface are consistent (warning).
\end{itemize}
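
To make the XPath rules above concrete, here's a minimal sketch of how they could be checked in Python with lxml. The file name, target, and namespace mapping are illustrative assumptions, not part of the manuscript:

```python
# Minimal sketch (assumed details): check that a target is a valid XPath
# that matches exactly one element of an XML-encoded model source.
from lxml import etree

doc = etree.parse("model.xml")  # assumed SBML-encoded model source
target = "/sbml:sbml/sbml:model/sbml:listOfSpecies/sbml:species[@id='S1']"
namespaces = {"sbml": "http://www.sbml.org/sbml/level3/version1/core"}

try:
    matches = doc.xpath(target, namespaces=namespaces)
except etree.XPathEvalError:
    print("Invalid: target is not a valid XPath")
else:
    if len(matches) == 1:
        print("Valid: target matches a single model element")
    else:
        print(f"Invalid: target matches {len(matches)} elements, not exactly 1")
```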
luciansmith commented 3 years ago

Thanks for pointing this out!

Is there anywhere I can look for more specific rules when they're stated generally, e.g.:

\item The references that are required in specific contexts are set in those contexts (e.g., each variable of each data generator has a task reference).

Is there code somewhere where all of these checks are performed in one place?

luciansmith commented 3 years ago

Question about:

\item The networks of references are acyclic:
    \begin{itemize}[labelwidth=0pt]
        \item Compute model changes.

Does this mean that BioSimulators does not allow a model change that is (for example):

model.S1 = model.S1 + 3

Or is there some other cycle I'm not thinking of here that you're referencing?

jonrkarr commented 3 years ago

My interpretation is that model changes are processed in the order they're defined. Under this interpretation, it's fine to say something like model.S1 = model.S1 + 3 above.

The network issue is different. The source of a model can be another model. E.g., the source of model A could be model B while the source of model B is model A, which forms a cycle.

In my opinion, such descriptions are ill-posed. They should be considered invalid, and users should see errors, preferably prior to any execution.

I don't know how other tools are handling this. They're probably doing one of a few things:

jonrkarr commented 3 years ago

There are a few cases where ill-defined cycles could be described with SED-ML. These cycles are relatively easy to validate (e.g., with networkx in Python), as in the sketch below.

Another example is a repeated task that is, directly or indirectly, a subtask of itself. This should be invalid because it describes an infinite loop.
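
Here's a minimal sketch of that kind of check with networkx. The subtask_refs mapping and the task ids are illustrative, not an actual libSEDML API; extracting the references from a SED-ML document is assumed to have happened already:

```python
import networkx as nx

# Assumed input: each task id mapped to the ids of the tasks its
# sub-tasks reference (illustrative data).
subtask_refs = {
    "repeatedTask1": ["repeatedTask2"],
    "repeatedTask2": ["repeatedTask1"],  # indirect self-reference -> cycle
    "task1": [],
}

# Build a directed graph of the task references.
graph = nx.DiGraph()
for task_id, referenced_ids in subtask_refs.items():
    graph.add_node(task_id)
    graph.add_edges_from((task_id, ref) for ref in referenced_ids)

try:
    cycle = nx.find_cycle(graph)
    print("Invalid: cyclic task references:", cycle)
except nx.NetworkXNoCycle:
    print("Valid: task references are acyclic")
```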

luciansmith commented 3 years ago

The source cycle I did get--that was actually handled in a different line-item on your list (which I hadn't quoted in its entirety):

\item The networks of references are acyclic:
    \begin{itemize}[labelwidth=0pt]
        \item Model references.
        \item Compute model changes.
        \item Repeated tasks.
        \item Functional ranges.
    \end{itemize}

Model references are where the 'source' attributes are cyclical. Repeated task cycles are when a subtask references its parent RepeatedTask. I think FunctionalRanges are where a Variable child of the FunctionalRange references the FunctionalRange itself?

But I couldn't come up with a similar problem for a ComputeChange: if a Variable child of a ComputeChange references the parent Model, I would have counted that as being OK--but perhaps you all have a reason to think it's a problem, or were actually referencing a different problem.

jonrkarr commented 3 years ago

Each functional range is a transformation of another range (FunctionalRange@range). This could be used to describe a cycle.

Cycles can arise through expressions involving the variables of compute changes (e.g., a change to one model that references a variable of a second model, combined with a change to the second model that references a variable of the first).

As I understand it, the instances of model (listOfModels) should be thought of as an unordered set. As a result, there's no defined order in which models (and their changes) must be resolved, except for the dependencies implied by compute changes that involve other models. This means that models can be resolved independently (e.g., in parallel), except for these dependencies; see the sketch below.

As I understand it, set value is different because there is an order of operations (listOfChanges is an ordered list).
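
A minimal sketch of resolving models in an order consistent with these dependencies, again with networkx (the model ids and edges are illustrative):

```python
import networkx as nx

# Edge A -> B means a compute change of model B references a variable of
# model A, so A must be resolved before B (illustrative ids).
dependencies = nx.DiGraph()
dependencies.add_nodes_from(["modelA", "modelB", "modelC"])
dependencies.add_edge("modelA", "modelB")

# topological_sort raises NetworkXUnfeasible if the dependencies are
# cyclic, which doubles as the acyclicity check discussed above.
for model_id in nx.topological_sort(dependencies):
    print("resolve", model_id)  # independent models could run in parallel
```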

luciansmith commented 3 years ago

Aha! So directly referencing yourself is actually OK in this context; it's just indirectly referencing yourself that's the problem. That makes sense. Thanks!

(For what it's worth, in SBML the convention is that self-referential changes like these (in particular for event assignments) generally apply to all assignments at once. That is, if the model contains:

S1 = 3
S2 = 4

And in your ListOfEventAssignments, you have:

S1 = S2+3
S2 = S1+3

You use the previous value for all right-hand calculations, so in the end, you have

model.S1 = 7
model.S2 = 6

Execution can happen in any order in this situation.)
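
For illustration, a minimal Python sketch of this "use the previous values" semantics, using the toy assignments above (this is not an SBML or libSBML API):

```python
# Previous state, from the example above.
state = {"S1": 3, "S2": 4}

# Event assignments: each right-hand side is evaluated against the old state.
assignments = {
    "S1": lambda s: s["S2"] + 3,
    "S2": lambda s: s["S1"] + 3,
}

# Compute all new values from the previous state, then apply them at once,
# so execution order does not matter.
new_values = {target: rhs(state) for target, rhs in assignments.items()}
state.update(new_values)
print(state)  # {'S1': 7, 'S2': 6}
```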

I don't believe people were thinking of the ListOfChanges as being ordered, so it would have to fall under the 'use the previous values for everything' category in order to work. I doubt anyone has yet created a file where this is an actual issue, but it should probably be resolved.

jonrkarr commented 3 years ago

I've thought of changes as ordered because I've tried to take as permissive a view as possible of what can be expressed in the SED-ML syntax. If changes were instead considered unordered, this would imply an additional constraint (e.g., that a target cannot be changed twice in the same listOfChanges).

I have no preference between the two interpretations, except that (a) tools should handle this consistently, (b) the specifications should be clear, and (c) the specifications should be enforced by validation rules.

That said, the unordered interpretation is more declarative, whereas the ordered interpretation is more procedural. The unordered interpretation is more in line with the spirit of pursuing declarative descriptions of simulations.

luciansmith commented 3 years ago

I think this is like the fifth time I've re-discovered that writing up validation rules is a super useful thing to do, and can reveal lots of subtle inconsistencies or lack of clarity in the spec. I don't know why it surprises me every time...

luciansmith commented 3 years ago

Another comment/question: you have rules in your list that check the referenced model (and presumably could check referenced external data, too). Does it make sense to put validation rules in SED-ML itself that require parsing external models/data to check? My instinct was to say that those issues would be OMEX issues, not SED-ML issues, but maybe it would be helpful to put them in anyway.

jonrkarr commented 3 years ago

I think these rules conceptually belong to SED-ML since SED-ML files reference models, and those references are independent of whether the SED-ML file is in a COMBINE archive.

From an implementation perspective, I think this validation should be done outside libSEDML. I don't think we want to add libraries like libSBML as dependencies of libSEDML. I think the two should be combined at a higher level, such as the BioSimulators package. That library has each model language library as an optional dependency; a sketch of this pattern is below.
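
A minimal sketch of that optional-dependency pattern, assuming libSBML's Python bindings (the helper function is hypothetical; BioSimulators' actual structure may differ):

```python
try:
    import libsbml  # optional: only needed to validate SBML model sources
except ImportError:
    libsbml = None

def validate_sbml_source(path):
    # Hypothetical helper: warn, rather than fail, when the optional
    # language library is unavailable.
    if libsbml is None:
        print("warning: libSBML not installed; SBML model source not validated")
        return
    doc = libsbml.readSBMLFromFile(path)
    if doc.getNumErrors(libsbml.LIBSBML_SEV_ERROR) > 0:
        print("invalid SBML model source:", path)
```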

luciansmith commented 3 years ago

All of the mentioned issues were incorporated into the JIB submission!