SED-ML / sed-ml

Simulation Experiment Description Markup Language (SED-ML)
http://sed-ml.org

Stochastic simulations (number of runs) - SimpleRepeatedTask #22

Closed matthiaskoenig closed 3 years ago

matthiaskoenig commented 7 years ago

Issue

A highly requested feature is better/simpler support for stochastic simulations, which are currently quite complicated to express with repeatedTasks. It should be simple to just define the number of runs for a stochastic simulation (instead of incurring the overhead of a repeated task).

In addition, it must be possible to easily calculate summary functions over the runs of the stochastic simulation, such as mean, variance and standard deviation. The individual runs must also be easily indexable.
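A minimal sketch of what such summary functions amount to, assuming the results of the runs are available as a two-dimensional structure indexed as `results[run][time_index]`. The data and the name `summarize_over_runs` are hypothetical and not part of SED-ML.

```python
import statistics

# Hypothetical results of 3 stochastic runs, each sampled at 4 time points.
# Layout assumed here: results[run][time_index].
results = [
    [10, 8, 7, 5],
    [10, 9, 6, 4],
    [10, 7, 8, 6],
]

def summarize_over_runs(runs):
    """Return per-time-point mean, sample variance and sample std over all runs."""
    columns = list(zip(*runs))  # transpose: one tuple of values per time point
    return {
        "mean": [statistics.mean(c) for c in columns],
        "variance": [statistics.variance(c) for c in columns],
        "std": [statistics.stdev(c) for c in columns],
    }

summary = summarize_over_runs(results)
second_run = results[1]  # an individual run is simply one row
```

The statistics reduce over the run dimension and keep the time dimension, which is why a clear definition of the dimensions matters for the output side.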

Examples

Proposals

SimpleRepeatedTask(Task):
    numRepeats: int
    reset: boolean



Related issues

* Related to this is how to calculate summary statistics over repeatedTasks or stochastic simulations ( #53 )
* Dealing with multidimensional data ( #21 ), i.e., indexing and calculating math on certain dimensions

edit: I opened a separate issue for the math-related part ( #53 )

matthiaskoenig commented 7 years ago

Mentioning @cjmyers so that he is informed about updates on this.

luciansmith commented 7 years ago

I am not super excited about having two ways to do exactly the same thing. I think this leads to incompatibilities between software tools, and confusion for our users about how to accomplish things.

I also don't really buy the 'but repeated tasks are too complicated!' argument. If you're only going to support a small subset of repeated tasks for stochastic simulations, then just support that subset of RepeatedTask abilities, and move on. Just don't implement the parts you don't care about.

My feeling is that once we solve the other end of things (namely, how best to treat the results of a repeated stochastic task in the output, and how to get means and stddevs from them), this end of things won't matter so much.

cjmyers commented 7 years ago

Repeated tasks are NOT stochastic tasks. Repeated tasks are for sweeping parameters, etc. Their semantics are not consistent with stochastic tasks, which do not change anything but rerun with new random values. The semantics of repeated tasks are to restart with new values. Used for stochastic simulation, this would make every run identical, since the SEED would be reset each time.

I agree with Lucian that we should not have two ways to do things. Repeated Tasks should be forbidden as a means for stochastic simulation for the reasons I just gave. If we are introducing many new types of Tasks as the new UMLs indicate, then I don't see any problem with making one of the new types of tasks a stochastic task. We have found in our experiments that trying to shoehorn repeated tasks into doing stochastic simulation just does not work.

luciansmith commented 7 years ago

I disagree with your semantic assessment of what a RepeatedTask is. In my view, a RepeatedTask is just a task that is repeated. The class is agnostic as to the reason (in my head, at least).

My hypothesis as to why repeated tasks for stochastic simulation don't work for you is because of a lack of support on the output side. Let's get that end working (which will have to work with all repeated tasks of every stripe, including stochastic repeats), and then revisit this issue at that point, perhaps?

cjmyers commented 7 years ago

That is incorrect. There are two reasons. One which I've already mentioned is that it is too heavy a hammer. It is an extremely complicated way to express repeat for N runs.

However, the main reason is the fact that you must either set resetModel to true or false. If you set it to true, then you reset everything each time around, which would mean resetting the SEED too, so you get identical simulations each time. If you set it to false, then the initial values do not get set back to their initial value as they should.
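The two readings of the reset behavior can be illustrated with a toy example. This is only a sketch: `stochastic_run` is a hypothetical stand-in for a stochastic simulator, and Python's `random.Random` stands in for the simulator's random number generator.

```python
import random

def stochastic_run(rng, n_steps=20):
    """Toy stand-in for one stochastic simulation: a random walk driven by rng."""
    state, trajectory = 0, []  # the model state starts from its initial value
    for _ in range(n_steps):
        state += rng.choice((-1, 1))
        trajectory.append(state)
    return trajectory

SEED = 42

# Reading 1: resetting "everything" also re-seeds the generator before each repeat.
reseeded = [stochastic_run(random.Random(SEED)) for _ in range(3)]

# Reading 2: model state is reset per repeat, but the generator state carries over.
shared_rng = random.Random(SEED)
carried_over = [stochastic_run(shared_rng) for _ in range(3)]
```

Under the first reading all three trajectories are identical. Under the second, each run restarts from the same initial state but continues the random sequence where the previous run stopped.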

The way I see repeatedTask is that it enables you to potentially call a simulator like in a script to run a series of tasks. Each time you call the simulator you send it the Model (perhaps with changed parameter values) and the simulation options (including the SEED). Then you simulate. The simulator does not need to maintain any state information.

StochasticSimulation is NOT like this. It is a single indivisible Task. Namely, you should send it to the simulator as one single task to execute. StochasticTasks are therefore Tasks, NOT RepeatedTasks.

luciansmith commented 7 years ago

How is the seed part of the model? This is a genuine question. You can't actually store it there, can you? What would be the purpose of ever storing the seed with a model?

cjmyers commented 7 years ago

It is not in the model, but there is no other switch in repeatedTask that says anything about resetting a parameter or not.

Anyway, this is not the point. The more important point is the semantics: as I explained above, a StochasticTask is a Task, i.e., an indivisible action that must be considered to be run as a unit (this is the only way you can get proper stochastic behavior). A repeated task, on the other hand, is a set of separable tasks that can be run independently. Stochastic tasks are not truly independent. If you call a simulator with the same algorithm parameters multiple times, then you would be sending the same SEED over and over again and getting the same result. There is state that is preserved from one task to the next to ensure that the random number generator is not reset.

There is currently no way that I'm aware of in RepeatedTasks to change algorithm parameters, and even if there were this would cause one to attempt to encode changes to the SEED for each run for something as simple as run a set of stochastic simulation runs. Furthermore, it would not emulate what is actually happening in the simulation that treats this as a single complete simulation task.

For all these reasons, we are simply using a single Task for stochastic simulation with an algorithm parameter setting the number of runs. The reason why we want a StochasticTask is to enable us to say that a StochasticTask actually is different from a Task in that it returns a 2-dimensional array of results over time and runs. Namely, StochasticTask gives us the ability to do better validation once we are able to access the results as arrays.
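A minimal sketch of this single-Task view, assuming a hypothetical simulator entry point `run_stochastic_task` that is called exactly once and returns results indexed as `result[run][time_index]`. The toy decay model is illustrative only.

```python
import random

def run_stochastic_task(n_runs, n_points, seed):
    """Hypothetical simulator entry point, called exactly once per Task.

    Returns a 2-dimensional result indexed as result[run][time_index].
    """
    rng = random.Random(seed)  # seeded once; its state persists across all runs
    result = []
    for _ in range(n_runs):
        state, trajectory = 100, []  # initial values are restored for every run
        for _ in range(n_points):
            trajectory.append(state)
            state -= rng.randint(0, 3)  # toy stochastic decay step
        result.append(trajectory)
    return result

result = run_stochastic_task(n_runs=4, n_points=6, seed=1)
shape = (len(result), len(result[0]))  # (runs, time points)
```

Knowing in advance that the shape is always (runs, time points) is what would allow the kind of validation described above.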

luciansmith commented 7 years ago

So, what I hear you saying is that we need to say something about the random number seed in our explanation of the 'reset' parameter. (We clearly need to do this whether or not we introduce a separate StochasticTask task.) I would say that the most obvious thing to do would be to say "The random number seed never resets in the completion of a SED-ML experiment." Maybe we could introduce a special sedml-defined 'seed' csymbol (like we do for 'time') if people really wanted to reset the seed for some reason. (Or maybe there's a KiSAO term? Hmm.)

cjmyers commented 7 years ago

It is not just about the SEED. The current approach that I've taken is fine, and as far as I can tell perfectly legal in SED-ML now. Namely, I have a Task and an algorithm parameter for the number of runs that task completes. I'm not going to change this to RepeatedTask because of the semantic problems I've just described. I want the simulator to be called exactly once. This is a Task, not a RepeatedTask. No solution for SEED is going to change the fact that I want to consider a stochastic run as a single indivisible task.

However, once we consider the fact that Tasks can return arrays, we need a clean way to indicate how many dimensions to expect. A stochastic simulation will always return an array of values with two dimensions, so a StochasticTask that inherits from Task would allow us to specify that fact.

Note that I'm not against RepeatedTasks. We use them for parameter sweeps. However, the Task in the parameter sweep may be a StochasticTask. In this case, we would not necessarily care if we did reset the SEED, since each StochasticTask run in this parameter sweep can be thought of as a single simulation, and starting with the same SEED might not be a bad thing in this case.

luciansmith commented 7 years ago

Wait, wait wait. You already have a solution? Why don't we just use that? It sounds like you don't need a StochasticTask; you need a way to indicate the dimensions of the returned array for an arbitrary Task. If you've already discovered one way to change the expected return dimensions of a Task, surely people will find other ways as we move forward.

cjmyers commented 7 years ago

Ok, almost. I still need a proper KISAO term for number of runs. I'm currently using: KISAO:0000326 which is technically Number of Samples. It is the closest I could find. I need a term for Number of Runs. I've just submitted a ticket for a new term. Not sure if anyone is watching this tracker though as there is one open issue submitted in 2011.
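For illustration, the approach of carrying the number of runs as an algorithm parameter could be generated as follows. This is a sketch only: the element and attribute names follow SED-ML's `algorithm` / `algorithmParameter` structure, the SED-ML XML namespace is omitted for brevity, the function name is hypothetical, and KISAO:0000029 is assumed here to denote a Gillespie-type direct method.

```python
import xml.etree.ElementTree as ET

def stochastic_algorithm(algorithm_kisao, n_runs, runs_kisao="KISAO:0000326"):
    """Build an <algorithm> element carrying the number of runs as a parameter."""
    algorithm = ET.Element("algorithm", kisaoID=algorithm_kisao)
    params = ET.SubElement(algorithm, "listOfAlgorithmParameters")
    ET.SubElement(params, "algorithmParameter", kisaoID=runs_kisao, value=str(n_runs))
    return algorithm

# KISAO:0000029 is an assumption for the simulation algorithm (see lead-in).
element = stochastic_algorithm("KISAO:0000029", n_runs=100)
xml_text = ET.tostring(element, encoding="unicode")
```

Once a dedicated "number of runs" term exists, only the `runs_kisao` default would need to change.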

Even if we develop a different technique to indicate the number of dimensions for a task, I'm still concerned about people being confused and using RepeatedTask for stochastic simulation. Having a StochasticTask would make this clear. We will end up having this same discussion over and over again. Creating a StochasticTask would make it clear that this is what should be used for doing stochastic simulation. I could relent on this though, if the specification expressly points out that one should NOT use repeated tasks for stochastic simulation.

matthiaskoenig commented 7 years ago

> The reason why we want a StochasticTask is to enable us to say that a StochasticTask actually is different from a Task in that it returns a 2-dimensional array of results over time and runs. Namely, StochasticTask gives us the ability to do better validation once we are able to access the results as arrays.

The repeatedTask is exactly the same and also returns a 2-dimensional array of results. Personally I don't see any difference between a repeatedTask and a stochasticTask with multiple runs.

As Lucian said, it all boils down to dealing with the multi-dimensional data that repeatedTasks return. If there were a clear way to handle this, most of our issues would be solved. This is the one major thing we have to get right in the next weeks (for L1V4): access to multi-dimensional array data and a definition of what kind of data simulations and tasks create. All the other features will fall out of this one, i.e., multi-dimensional plotting, stochastic runs, and stochastic means and standard deviations.

> Repeated Tasks should be forbidden as a means for stochastic simulation for the reasons I just gave.

Personally, I also see the problem that there are suddenly multiple ways to do things, because most implementations of SED-ML already use repeatedTasks for stochastic simulations. There will also be the issue of distrib models run with deterministic algorithms (is this a stochastic task?) and of sampling from distributions to initialize models (is this a stochastic task?).

The reasons are not really convincing:

> However, the main reason is the fact that you must either set resetModel to true or false. If you set it to true, then you reset everything each time around, which would mean resetting the SEED too, so you get identical simulations each time. If you set it to false, then the initial values do not get set back to their initial value as they should.

This is wrong: the examples in the L1V2 specification clearly show that the SEED is not reset and that repeatedTasks are used for stochastic simulations.

> There are two reasons. One which I've already mentioned is that it is too heavy a hammer. It is an extremely complicated way to express repeat for N runs.

It's a format read by computers. IMHO it does not make a big difference whether you read an annotation or iterate over a repeated task (you don't even have to support any of the functionalRanges; you just get the number of repeats out).

Also implementing the repeatedTasks would allow you to perform parameter scans which you could not do right now.

But I clearly see that a repeated task is somewhat overkill. We would need some easy way to run something repeatedly (not necessarily with a stochastic simulator). I could imagine something like

SimpleRepeatedTask:
    numRepeats: int
    reset: boolean

which is a subclass of task.

This is equivalent to a repeatedTask with no listOfChanges and a simple range 0, ..., numRepeats-1, and it only allows a single task to be performed. I would not call it stochasticTask, so that one can also use it to run deterministic models with distrib.

Personally I like this SimpleRepeatedTask and people could easily map it to the repeated Tasks in their implementations.
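The mapping onto a repeated task can be sketched as follows. Both classes are simplified, hypothetical stand-ins (in particular, `RepeatedTask` here is not the full SED-ML class), and `expand` is an illustrative name.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleRepeatedTask:
    """Proposed shortcut: run one task numRepeats times."""
    task: str          # id of the single task to run
    num_repeats: int
    reset: bool

@dataclass
class RepeatedTask:
    """Simplified stand-in for the existing construct (not the full SED-ML class)."""
    sub_tasks: list
    range_values: list
    reset_model: bool
    changes: list = field(default_factory=list)

def expand(simple):
    """Map a SimpleRepeatedTask onto the equivalent RepeatedTask."""
    return RepeatedTask(
        sub_tasks=[simple.task],                       # exactly one task
        range_values=list(range(simple.num_repeats)),  # 0 ... numRepeats-1
        reset_model=simple.reset,
    )                                                  # changes stays empty

expanded = expand(SimpleRepeatedTask(task="task1", num_repeats=5, reset=True))
```

An implementation that already supports repeated tasks would only need such an expansion step to support the shortcut.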

M

cjmyers commented 7 years ago

Again, my point is that a Task is an indivisible computation. A stochastic simulation is an indivisible computation. It is not the complexity of RepeatedTask that is the main issue for me. It is the semantics problem of RepeatedTask referring to Tasks. Namely, a RepeatedTask states "loop through these Tasks". I don't like SimpleRepeatedTask either, since it would presumably still refer to a Task that is being repeated. I would prefer to add "numRepeats" to the Task class. That is much simpler, and, more importantly, semantically correct.

matthiaskoenig commented 7 years ago

But adding numRepeats is exactly what SimpleRepeatedTask does (or whatever a good name for it is). It is a subclass of Task and adds the attributes numRepeats and reset (in case people don't want to reset the initial concentrations). Via a KiSAO term for the seed in the simulation you can then even set whether the SEED is reset on every run (analogous to the repeatedTask right now).

Task:
    modelReference:
    simulationReference
/ \
 |
SimpleRepeatedTask(Task):
    numRepeats: int
    reset: boolean

You need a new subclass where one can define the behavior (otherwise, what happens with the second run? Is everything reset or not?).

The name SimpleRepeatedTask is probably confusing, because it is not a subclass of RepeatedTask; it is a Task which runs the simulation multiple times (as I understand it, this is exactly what you want).

cjmyers commented 7 years ago

I'm okay with StochasticTask still for the name, since even if you are doing multiple runs of an ODE simulation with random distributions for the initialAssignments, this is still a stochastic task.

cjmyers commented 7 years ago

If the task is not stochastic, there would be little point to repeat.

matthiaskoenig commented 7 years ago

How would you easily check that it is not stochastic?

How about IteratedTask, or perhaps SampledTask? If we can agree that the construct is what you want, i.e., a subclass of Task with numRepeats (and reset), then we can argue about the name :)

luciansmith commented 7 years ago

There are two different types of 'stochastic', in this case. A simulation with a 'stochastic' KiSAO term means 'treat the reactions in a stochastic manner'. In this way, each repeat of the simulation is different, if there are any active reactions in the model.

But a second kind of stochastic model uses 'distrib' or the like to set values during the simulation. In that case, any repeated simulation of the model, regardless of the KiSAO term used, would produce a different result. And in fact, you might want to run either a stochastic-reaction run of a model with 'distrib'-set parameters, or a deterministic-reaction run of a model with 'distrib'-set parameters.
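The second kind can be sketched with a purely deterministic solver whose only random input is a sampled initial value. This is a toy illustration: `deterministic_decay` and the Gaussian draw are hypothetical stand-ins for a deterministic simulation of a model with a 'distrib'-style assignment.

```python
import random

def deterministic_decay(x0, rate, n_points, dt=1.0):
    """Explicit-Euler solution of dx/dt = -rate * x; no randomness inside."""
    values, x = [], x0
    for _ in range(n_points):
        values.append(x)
        x += dt * (-rate * x)
    return values

rng = random.Random(7)
# The only random element is the initial value, drawn once per run
# (standing in for a 'distrib'-style assignment).
runs = [
    deterministic_decay(rng.gauss(10.0, 1.0), rate=0.1, n_points=5)
    for _ in range(3)
]
```

Each repeat yields a different trajectory even though the algorithm is deterministic, so repeating such a simulation is meaningful regardless of the KiSAO term used.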

Another option that might work is if we just adjusted 'repeatedtask' slightly to allow just a 'numRepeats' attribute instead of a 'range' reference:

https://docs.google.com/drawings/d/1CbShcFxJYWyrAOmp6YnUpxeE8Pr3ij3Z1AQeVNY3C_4/edit

Alternatively, we could create a simpler 'UniformRange' object (https://docs.google.com/drawings/d/195Wmeo8WtE6daf80rLELxOP_jdclZbiRtadzq_vew5Q/edit) with just one attribute: 'numberOfPoints'.

cjmyers commented 7 years ago

Lucian: I think I'm sounding like a broken record, but RepeatedTask for this is a non-starter. It is not the complexity of RepeatedTask, but its semantics. What I want is a Task that has a number of runs, period. This is the only thing that makes sense semantically. Having a RepeatedTask that is simpler does not change the fact that it is acting on Tasks and repeating them.

Matthias: I'm okay with "SampledTask" deriving from Task with NumberOfRuns as an added variable.

matthiaskoenig commented 7 years ago

@luciansmith I get Chris's point. I think we should provide an easy way to do multiple runs of a model. All the solutions based on RepeatedTasks make things much more complicated (they achieve the right behavior but are somewhat hacky, which makes things more confusing).

I prefer the solution of giving the users what they want, i.e., a SampledTask which makes it easy to run the same simulation multiple times with no overhead. It does not make life more difficult for people who have already implemented repeated tasks, and it allows others to easily run things like stochastic simulations. I have to admit that sometimes RepeatedTasks are too heavy. A SampledTask would also allow me to easily do a multi-dimensional parameter scan and just put the SampledTask in the inner nested repeated task :)

I had some informal agreement with Chris: If we make this happen in a timely manner, they will implement data reading in iBioSim (L1V3 data).

luciansmith commented 7 years ago

Well, yes: we're circling back to the original disagreement where you think RepeatedTask means one thing, and I think it means something else.

However, more importantly, handling the post-task data is obviously a problem for everyone, and needs to be addressed. Similarly, we need to handle 'seed' semantics, so everyone can be on the same page with that, too.

If we must define repeated tasks three different ways, I would vote that the third way simply be the addition of an optional attribute 'numRepeats' on Task. No need to sub-class anything.

(There is zero way that I can think of to implement Chris's suggestion that we somehow forbid people from using the RepeatedTask construct to repeat tasks.)

cjmyers commented 7 years ago

I prefer "NumberOfRuns" rather than "NumRepeats", since the latter makes it sound like it is a "RepeatedTask", where we already have some confusion about its meaning.

I was not saying to "forbid" people. I was saying to make it clear in the specification that RepeatedTasks are not appropriate for stochastic simulation. Essentially, it should be made clear that in a RepeatedTask, for example, all the algorithm parameters including the SEED are re-assigned at the beginning of each Task.

matthiaskoenig commented 7 years ago

@luciansmith

As I understand it, the major reason for your dislike of a new SampledTask is that there would suddenly be two ways to do certain things. But there is already a similar case in SED-ML which provides a simple shortcut for a common use case: ChangeAttribute vs. ChangeXML.

> The \hyperref[class:changeXML]{ChangeXML} class covers the possibilities provided by the \hyperref[class:changeAttribute]{ChangeAttribute} class. I.e.\ everything that can be expressed by a \concept{ChangeAttribute} construct can also be expressed by a \hyperref[class:changeXML]{ChangeXML}. However, for the common case of changing an attribute value \concept{ChangeAttribute} is easier to use, and so it is recommended to use the \concept{ChangeAttribute} for any changes of an XML attribute's value, and to use the more general \hyperref[class:changeXml]{ChangeXML} for other cases.

Most of the current SED-ML implementations only implement the ChangeAttribute because it covers a lot of use cases and is much simpler than the full ChangeXML.

In my opinion the introduction of SampledTask vs the RepeatedTask would be similar. We would provide a common shortcut for an otherwise very complicated implementation (of the full RepeatedTask) which provides many use cases. In my opinion this would help the adoption of SED-ML.

jonrkarr commented 3 years ago

I read this issue to try to better understand the intended meaning of resetModel. I think my interpretation is the same as Chris', that resetModel=True encompasses resetting ALL state, including that of any random number generators. Because the specifications don't describe any boundary for resetting, I assumed this must include ALL state, including what could be considered "simulation" or "simulator" state. While I would consider the state of a random number generator "simulation" state rather than "model" state, it seems that this should be covered by resetModel because the description of resetModel suggests that both model specifications and simulation state should be reset.

Regarding whether multiple stochastic runs can be described with L1V3 RepeatedTask, assuming the resetModel issue is addressed, I think multiple stochastic runs can be adequately described with RepeatedTask. While RepeatedTask doesn't provide the simplest possible syntax for describing multiple stochastic runs, to me, its semantics seem consistent with multiple independent stochastic runs.

It seems to me that the central point of discussion in this issue is about the semantic interpretation of the SED classes.

It seems to me that part of the reason for diverging opinions arises from SED currently being a middle ground between the two extremes outlined above. Most of the SED classes are focused on capturing computational operations. This is exemplified by AlgorithmParameter, which uses KiSAO terms to define its semantic meaning. But the simulation classes (SteadyState, OneStep, and UniformTimeCourse) convey specific semantic meaning about the computation. Because SED takes an intermediate approach with some degree of semantic meaning, I don't think it's 100% clear which classes/attributes have specific semantic meanings and which do not.

One place where these competing visions are particularly relevant is the discussion about how to apply SED to other modeling frameworks such as logical modeling. Similar to what Chris advocates here, #8 advocates for several additional classes for logical simulation, each of which would function substantially similarly to an existing class. As with Chris's proposal, this would create simpler syntax for specific types of simulations. However, this would come at the cost of increasing the complexity of SED. In turn, this would likely result in further fracturing of software support for SED and make it less likely that simulation experiments (SED files) can be ported from one tool to another.

To keep SED as simple as possible, to make SED as easy as possible for software developers to implement, and to maximize the portability of SED documents between tools, I would vote for keeping SED as free of semantic meaning as possible and using ontology terms (e.g., KiSAO) to describe the semantic meaning of instances of SED classes. This requires more KiSAO terms, but new terms are easy to add. For example, we've recently added many new terms.

To avoid confusion about the semantic meaning of SED classes and attributes, ideally I would also vote to remove the existing semantic meaning from SED (e.g., replace UniformTimeCourse with a Simulation class and encode initialTime, outputStartTime, etc. into AlgorithmParameters with appropriate KiSAO ids). Rather than using these classes (OneStep, SteadyState, UniformTimeCourse) to describe specific combinations of parameters that simulation tools should recognize, BioSimulators provides a place for simulation tools to advertise the KiSAO ids they support for each algorithm. This is much more flexible than what's possible with the three existing semantically-motivated simulation classes. I think removing semantic meaning from SED would also make the interpretation of the remaining classes clearer, addressing the central issue discussed here.

cjmyers commented 3 years ago

I support @jonrkarr's proposal. I think keeping the classes simple, with semantics covered in parameters, will increase the use and applicability of SED-ML. This is in line with my advocacy for using an algorithm parameter rather than a repeated task for stochastic runs. It will also help library development, which is stalled in some instances, such as Java.

jonrkarr commented 3 years ago

We have already proceeded in this direction -- combining the existing SED classes with KiSAO terms to create semantic meaning for simulations and providing a more flexible place to advertise how investigators can use the particular combinations supported by each tool. This has enabled us to use SED with a broader range of simulations:

matthiaskoenig commented 3 years ago

This issue summarizes the information about the SimpleRepeatedTask in L1V4. I updated the title of the issue accordingly to reflect this.

luciansmith commented 3 years ago

Just as a note--at this point, there are no interpreters that support the SimpleRepeatedTask (and I believe Chris is supporting his mode through the use of KiSAO terms, so even he has no real reason to support them, either). So as it stands right now, it will probably be dropped from the specification (due to MSB by the end of July).

luciansmith commented 3 years ago

Done (but reversible) with https://github.com/SED-ML/sed-ml/commit/99cdcf868f1553846594246c6638e057ea68ad0b