SED-ML / sed-ml

Simulation Experiment Description Markup Language (SED-ML)
http://sed-ml.org

Clarify the semantics of resetModel of repeatedTask and simplify its intended behavior #110

Closed jonrkarr closed 3 years ago

jonrkarr commented 3 years ago

resetModel aims to enable investigators to instruct a simulation tool to clear all model changes (rather than continue to accumulate changes) at the beginning of each iteration of a repeated task.

In addition, the specifications suggest that

There are multiple issues with this:

Suggestions for resolution:

luciansmith commented 3 years ago

Random number generators should definitely not be reset at any time, whether 'resetModel' is true or not. I just searched and couldn't even find the word 'random' in the spec at all, so I'm not sure why your impression was that the RNG should be reset. Can you post the bit that made you think that? It will definitely need to be changed.

jonrkarr commented 3 years ago

The issue is that it's not clear what state should be preserved between iterations and which should be reset. The name implies the model specification should be reset, but the description of subtasks and "lore" suggests that this refers to simulation state as well. If this refers to simulation state, where is the boundary? Does this extend as far as random number generator states?

What is resetModel=false intended to mean? Does this mean that the next task should continue from the final state of the last task? Or does this simply mean that the changes to the model should continue to accumulate and the next task should begin from the initial conditions outlined in the model specification (with the accumulated changes)?

luciansmith commented 3 years ago

The description of subtasks is:

"The required listOfSubTasks contains one or more subTasks that specify which Tasks are performed in every iteration of the RepeatedTask. All subTasks have to be carried out sequentially, each continuing from the current model state (i.e. as at the end of the previous subTask, assuming it simulates the same model), and with their results concatenated (thus appearing identical to a single complex simulation). The order in which to run multiple subTasks must be specified using the order attribute on the subTask."

Nothing in that implies to me that anything about the simulator should be reset--tolerances, the random number seed, or any other potential 'algorithm parameter' defined in SED-ML. (Also, with the exception of the random number seed, none of these values has the possibility of changing over the course of a simulation, at least not with current SED-ML constructs.) I was the one that mentioned 'lore' the other day (half tongue-in-cheek) but I don't know what about my statement there would have caused you to believe that the simulator state could potentially be reset, either.

I would love to be able to be more precise and communicate better, both when discussing things here and in the spec, so if any term seems to mean something different to me than it does to you, let's try to figure that out.

In my understanding, 'model state' refers to the value of any symbol defined in model.xml (or whatever). In SBML (and BNGL), this would mean the current mathematical values of species, parameters, compartments, and species references. In CellML, it would be the mathematical values of its variables.

(The 'time' variable is a little more complicated, but that doesn't seem to be what you're talking about here, I think. But for SBML simulators, 'time' is something that the simulator controls, and therefore does not get included in model state. In CellML, 'time' is a variable just like anything else, so (I think) would count as part of the model state. But I'm not 100% sure on this.)

((To further complicate things, simulators such as libroadrunner have the concept of a 'soft reset', where all the species values get reset to their initial values, but all the parameters are left at their current values. This has perplexed me for some time, and perhaps @hsauro can explain the history of this, and how it's useful. But this is not included in SED-ML anyway, so I mention it only because some people might want something like it in the future.))

jonrkarr commented 3 years ago

From my perspective, I see several distinct kinds of state:

Because SED has multiple mechanisms to read and write these various kinds of state, each of which is available in different contexts through different mechanisms, I'm trying to make sure I understand each one.

From one sub-task to the next, my understanding is that all state should be preserved (copied from the output of the previous sub-task to the input of the next sub-task) except any UniformTimeCourse/OneStep and AlgorithmParameter defined for the next sub-task. In practice, simulators don't need to copy state as I outlined; this is just a conceptual description of the process to emphasize the flow of information.
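As a rough sketch of this conceptual flow (all names are hypothetical, not any tool's API):

```python
# Conceptual flow of state between sub-tasks: the output state of each
# sub-task becomes the input state of the next. Names are illustrative only.
def run_subtasks(subtasks, initial_state, execute):
    state = initial_state
    for subtask in sorted(subtasks, key=lambda st: st.order):
        # per-sub-task settings (UniformTimeCourse/OneStep, AlgorithmParameter)
        # come from the sub-task's own simulation; everything else carries over
        state = execute(subtask, state)
    return state
```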

I'm less clear about what's intended to happen from one iteration of a repeated task to another. Because this iteration happens outside the simulation, it seems to me that repeated tasks should work this way:

From talking to others, I've been led to believe that iteration through repeated tasks is intended to behave more like iteration over sub-tasks, where independent and dependent simulation variables are intended to be copied/preserved between iterations.

How does reset=False between iterations compare to what's intended to happen between sub-tasks?

luciansmith commented 3 years ago

As I understand it, reset=False should behave identically to what's intended to happen between sub-tasks. The following would produce identical output (if concatenated):

repeat 100 times, reset=False: task1

repeat 50 times, reset=False: task1 task1
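A toy, runnable sketch of that equivalence (execute_task1 is a hypothetical stand-in for running task1):

```python
def execute_task1(state):
    # toy stand-in for running task1: advance the state and record an output
    new_state = state + 1
    return new_state, new_state

def run(repeats, subtasks_per_repeat, initial_state):
    state, results = initial_state, []
    for _ in range(repeats):              # reset=False: state carries across iterations
        for _ in range(subtasks_per_repeat):
            state, output = execute_task1(state)
            results.append(output)
    return results

assert run(100, 1, 0) == run(50, 2, 0)    # identical output when concatenated
```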

Here's how my understanding of your 'how things should work' compares to my understanding of how things actually work. I believe you are accurate in your assessment of what happens when resetModel=True, but not when it is False:

Here's some pseudocode that describes what I think should happen:

for r in repeatedTask.range1:  # other range values, if any, change in lock step with this one
    for change in repeatedTask.changes:
        apply(change)
    for subtask in sorted(repeatedTask.subtasks, key=lambda s: s.order):
        subtask.execute()  # here's where 'time' is reset; tolerances, etc. are always the same, but the RNG seed is not reset
    if repeatedTask.resetModel:
        for model in repeatedTask.findModels():
            model.variables = findOrig(model).variables  # i.e. all species, parameters, etc.

Does that make sense?

(Also, if any other SED-ML veteran wants to jump in here, please do, particularly if I'm off in some way.)

jonrkarr commented 3 years ago

This is the behavior that I outlined at the beginning of this issue.

If this is the intended meaning, I think the specifications could be clearer. I think a resetSimulations attribute would also help.

luciansmith commented 3 years ago

I completely agree we can make the spec clearer here.

I don't understand your resetModel/resetSimulation suggestion. I don't know what you mean by resetting the 'specification': the only thing I see that can be reset is the model state. What sorts of things are you thinking that could change from the 'specification' that would not be in the 'state'?

The section of the spec that attempts to describe what you want is in the definition of 'resetModel' in section 2.2.8.3:

"The repeatedTask has a required attribute resetModel of type boolean. It specifies whether the model should be reset to the initial state before processing an iteration of the defined subTasks. Here initial state refers to the state of the model as given in the listOfModels."

I can take a stab at re-writing this, but I might be too close to see what might throw people off. Do you have a suggestion for an alternative wording, or if not, can you point out what phrase or word choice gave you the wrong impression?

jonrkarr commented 3 years ago

My interpretation is that when resetModel=False, changes can accumulate across iterations (i.e., SetValue calculations can depend on values changed in previous iterations). Because these changes are made with SetValue, they can apply to any attribute of a model specification, including values of parameters. As a result, such changes would not be reset by restoring the simulation state (e.g., concentrations of molecules).

Here's an example of a set value that could accumulate changes:

<setValue target="x">
    <listOfParameters> 
        <parameter id="a" value="2" />
    </listOfParameters>
    <listOfVariables>
        <variable id="x" target="x" />
    </listOfVariables>
    <math>
        a * x
    </math>
</setValue>

The change causes model variable "x" to double. If the model specification is preserved (not reset) between successive iterations, "x" will double each iteration. If the value of "x" was 1 in the XML file for the model, its value would become 2, 4, 8, 16, ... at successive iterations.

I've assumed that reset=True means that the specification should be reset. In that case, the value of "x" gets reset to 1 at the beginning of each iteration. The set value is then applied, and the set value always changes the value of "x" to 2.
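A quick Python check of that arithmetic (toy values only, not a SED-ML implementation):

```python
x_file = 1.0                  # value of "x" in the model's XML file

# resetModel=False: the specification is not restored, so the SetValue compounds
x, history = x_file, []
for _ in range(4):
    x = 2 * x                 # the SetValue math "a * x" with a = 2
    history.append(x)
print(history)                # [2.0, 4.0, 8.0, 16.0]

# resetModel=True: "x" is restored to the file value before each iteration
history = []
for _ in range(4):
    x = x_file
    x = 2 * x
    history.append(x)
print(history)                # [2.0, 2.0, 2.0, 2.0]
```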

Is the specification intended to be reset even when resetModel=False?

jonrkarr commented 3 years ago

This discussion with Chris Myers expresses similar concerns about the meaning of resetModel. Chris also appears to think that resetModel=True means that ALL state should be reset, including that of the random number generator(s). It seems that part of the confusion is also about the meaning of "model" in "resetModel", because "model" there actually means simulation state in addition to the model specification.

hsauro commented 3 years ago

> ((To further complicate things, simulators such as libroadrunner have the concept of a 'soft reset', where all the species values get reset to their initial values, but all the parameters are left at their current values. This has perplexed me for some time, and perhaps @hsauro https://github.com/hsauro can explain the history of this, and how it's useful. But this is not included in SED-ML anyway, so I mention it only because some people might want something like it in the future.))

Because there is a common use case (at least common for me) where one has changed parameters but needs to reset the initial conditions without touching the parameter values.

Herbert


luciansmith commented 3 years ago

So by 'specification', you mean 'the mathematical values of the Parameters in an SBML model'? All of them? Or just the ones marked 'constant=true'?

SED-ML makes no distinction between parameters and species values (nor between compartments and species references). CellML makes no distinction between them internally, either--everything is a 'variable'. If the species values are reset, the parameter values are reset. If the species values remain as they were at the end of the previous simulation, the parameter values remain as they were at the end of the previous simulation. In your example, if 'x' had a rate rule that caused it to increase by 5 over the course of the simulation, successive iterations of x in a repeated task like you describe would be 2, 14, 38, 86, etc. If the repeated task was set 'resetModel=true', the values would, as you suspected, be 2, 2, 2, 2.
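Spelling that arithmetic out (a toy script where the rate rule is approximated as 'add 5 per simulation'):

```python
x, values = 1.0, []           # x = 1 as defined in the model file
for _ in range(4):            # resetModel=false
    x = 2 * x                 # SetValue: x -> a * x with a = 2
    values.append(x)
    x = x + 5                 # rate rule adds 5 over the course of the simulation
print(values)                 # [2.0, 14.0, 38.0, 86.0]
```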

If you wanted a regularly-increasing value for 'x' while the model was otherwise reset each time, I would use the SetValue to set x to be some function of a range value. It should be possible to accomplish whatever you need this way.

It would be possible to add something to SED-ML that would allow it to distinguish between types of variables in a model, change subgroups of them en masse, and allow the situation that Herbert describes. I worry that this would be very language-dependent, however: for SBML, the definition of such parameters might be 'Parameters and SpeciesReferences that are set Constant', but I don't think BNGL distinguishes between constant and variable symbols (though it would in the translated SBML, I suppose?), and it would be impossible in CellML to distinguish between those two types of variables because everything is a 'variable', and they don't even have a 'constant' attribute. Everything is just a symbol that might change or not.

So the upshot is that for now, SED-ML does not distinguish between SBML Parameters, Species, Compartments, nor Species References, and if 'resetModel=true', they all get reset, and if 'resetModel=false', they all retain their previous values. SED-ML treats 'the model' as one giant set of symbols with values, and does not attempt to distinguish between them.

jonrkarr commented 3 years ago

By specification I mean anything in the model file.

From my vantage point, SED does make a distinction between values in model specifications and values that are produced by simulations. The former can be queried out of the specification of a model. The latter have to be produced by a simulation algorithm. As a result, these different types of information are available at different points in the execution of SED documents. Because they are treated differently by some parts of SED, I'm trying to make sure I understand how each part of SED handles these two distinct types of information.

jonrkarr commented 3 years ago

One thing that might help me clarify this discussion is what is the intended data structure for the output of a repeated task? Is this intended to be concatenated end-to-end, similar to what's described for sub-tasks? Or is the output intended to be multi-dimensional?

My interpretation is that the results are intended to be thought of as multi-dimensional, at least when resetModel=True.

If the intention of repeated tasks is to encode algorithms (e.g., describing an algorithm as the composition of multiple iterations of a repeated task and possibly multiple subtasks), then one would expect the output to have the same dimensions as a basic Task. In this case, repeated tasks would be awkward for things such as a parameter scan or a set of independent runs of a stochastic simulation.

luciansmith commented 3 years ago

I don't see anywhere in the SED-ML specification where it can tell the difference between symbol 'x' defined in a model and symbol 'y' defined in a model. Maybe x is a species and y is a parameter; maybe x is constant and y varies. Maybe y is a boundary species and maybe x is a parameter set 'constant=false' but it doesn't actually change. The only thing it knows is that the model defines the symbol, and must be queried if you want to know what its current value is.

Every such symbol in the model is either reset or not according to the 'resetModel' attribute on a repeated task.

In contrast, Algorithm Parameters are values that SED-ML knows about and controls itself. None of them are reset by the 'resetModel' flag, because they are not part of the model; they are part of the algorithm.
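To sketch that boundary (hypothetical names, not an API):

```python
# Hypothetical sketch: resetModel restores every model-defined symbol, while
# algorithm parameters (tolerances, seed, ...) live outside the model.
def reset_model(model, original_model):
    for symbol_id, value in original_model.symbols.items():
        model.symbols[symbol_id] = value  # species, parameters, compartments, ...
    # note: nothing here touches the simulation's AlgorithmParameters
```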

The only thing 'produced by a simulation algorithm' that is not in the model would be, I suppose, a DataGenerator?

For your other question:

For a very long time, the form of the output of a RepeatedTask was ambiguous. A perennial question was "If I run repeated stochastic time course simulations, how do I tell SED-ML that I want to collect the means and standard deviations of the output variables at each time point?" And there was no good answer, and we spent a lot of hours in discussions at various meetings trying to come up with a way to do this.

In the end, we came up with the RemainingDimensions construct to assist with things like the 'avg' and 'stdev' functions, so you could say 'the average over all repeats', 'the average over the time course', or 'the average over all repeats and all time courses'. With that, I think we finally cemented the idea that RepeatedTask output was indeed multidimensional, as you suspected. But there was a long time when the state of the art was sort of 'do what makes sense in context', which mostly worked, but was less than ideal.
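To make that concrete, here's a numpy sketch of the multidimensional view (illustrative only; not SED-ML syntax):

```python
import numpy as np

# 100 repeats of a 51-point stochastic time course: repeats x time points
results = np.random.rand(100, 51)

mean_per_timepoint = results.mean(axis=0)  # average over all repeats
std_per_timepoint = results.std(axis=0)    # stdev over all repeats
mean_per_repeat = results.mean(axis=1)     # average over the time course
grand_mean = results.mean()                # over all repeats and all time points
```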

Repeated Tasks were used from day 1 to perform parameter scans and independent runs of stochastic simulations. So whatever makes sense in that context was always the intent of the construct.

jonrkarr commented 3 years ago

Yes, the variables of data generators collect simulation outputs that are not part of models. This is what I think of as "simulation" state. Although this can be addressed by pseudo-XPaths, I see this as distinct from things that can be queried directly out of model files, such as kinetic parameters and initial conditions.

It sounds like the consensus view of repeated tasks is that simulation tools are afforded some freedom to decide how to handle their results.

matthiaskoenig commented 3 years ago

Some input:

There are definitely clarifications missing about what repeated tasks do. I would add the following clarifications to the specifications, which will make things much clearer in general and help implementation:

Every repeated task creates an additional dimension in the results. The results of individual repeats are separate, i.e., they are never concatenated in any manner. A single repeated task results in

[M(0), M(1), M(2), ..., M(r1-1)]

with r1 being the length of the repeated task, i.e., a vector of results. A repeated task of repeated tasks results in a matrix of task results

[M(0,0),    M(0,1),    M(0,2),    ..., M(0,r2-1),
 M(1,0),    M(1,1),    M(1,2),    ..., M(1,r2-1),
 ...
 M(r1-1,0), M(r1-1,1), M(r1-1,2), ..., M(r1-1,r2-1)]

The subtasks are concatenated!
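In array terms, the above might be sketched like this (illustrative numpy shapes, not SED-ML syntax):

```python
import numpy as np

r1, r2, n_time = 10, 5, 51

# one repeated task adds one axis -> [M(0), ..., M(r1-1)]
single = np.empty((r1, n_time))

# a repeated task of repeated tasks: a matrix of task results M(i, j)
nested = np.empty((r1, r2, n_time))

# sub-tasks, by contrast, are concatenated along the innermost (time) axis
subtask_a = np.empty((r1, 20))
subtask_b = np.empty((r1, 31))
combined = np.concatenate([subtask_a, subtask_b], axis=-1)  # shape (10, 51)
```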

luciansmith commented 3 years ago

There have been a few updates to the spec itself since the above discussion, most importantly the addition of a 'concatenate' attribute. But here's what I changed in the description of the 'what happens when' bit of the RepeatedTask section to reflect the above discussion:

"The order of activities within each iteration of a RepeatedTask is as follows: ˆ Any AlgorithmParameter children of the associated Simulation are applied (with the possible exception of the seed; see Section 2.2.7.2). ˆ The entire Model state is reset if specified by the resetModel attribute. ˆ Any changes to the model specified by SetValue objects in the listOfChanges are applied to the Model. Then, for each SubTask child of the RepeatedTask, in the order specified by its order attribute: ˆ Any AlgorithmParameter of the associated Simulation are applied. ˆ Any SetValue children of the SubTask are applied to the relevant Model. ˆ The referenced Task of the SubTask is executed.

Does this cover what everyone wants to clarify?

jonrkarr commented 3 years ago

That seems right except for Step (1). I think this should be removed. My interpretation is that algorithm parameters are set in Step (4) and that this is sufficient.

Also, I would swap the order of steps (4) and (5). Currently, my understanding is that steps (4) and (5) are independent because SetValue can only change model elements. However, I think it would be useful for SetValue to be able to change simulation parameters as well. Under this schema, the simulation configuration should be evaluated first (with SetValue) and then applied -- step (5) then step (4).

My understanding of 'concatenate' is that it only affects the shape of the results. As such, I don't think it needs to be mentioned in this sketch of the order of execution of repeated tasks.

luciansmith commented 3 years ago

I don't see how switching 4 and 5 would work how you want it to work--if you set an algorithm parameter with a SetValue first, wouldn't it then be overwritten by the general algorithm parameter setting in the current step 4? Maybe I'm envisioning things differently than you are. I agree that if they aren't independent, it would be nice to be able to override general algorithm parameters with something more particular.

(Perhaps you're envisioning an algorithm parameter with the value like 'the smallest concentration in the model divided by 100', in which case you'd need to reset the smallest concentration in the model first?)

However, I really don't see a way with current SED-ML where you can use a SetValue to change an algorithm parameter. As such, I kind of hate to make this suggestion, but what if we add algorithm parameters to SubTask, that override the ones on the Simulation?

Your suggestions about step 1 and concatenation seem good to me!

jonrkarr commented 3 years ago

Agreed, it's not currently possible to use SetValue to change an algorithm parameter. As a result, steps (4) and (5) are independent in the current version of SED-ML.

The reason I suggest swapping the order of (4) and (5) is to anticipate a future version in which it is possible to change the value of an algorithm parameter within an iteration of a repeated task. Because applying SetValue to simulations, algorithms, and algorithm parameters would require multiple changes, I suggest we leave this to a future version of SED-ML, aside from avoiding language that would complicate such a future addition to the language.

If it were possible to apply SetValue to simulations, algorithms, and algorithm parameters, this order would make sense to me:

  1. Apply SetValue to instances of sedml:Simulation, sedml:Algorithm, or sedml:AlgorithmParameter
  2. A simulation tool can interpret the changed simulation configuration

This is what I thought was meant by steps (5) then (4), but maybe this isn't what you intended to communicate.

luciansmith commented 3 years ago

OK, so I took out the first bit, which leaves us with:

The order of activities within each iteration of a RepeatedTask is as follows:

1. The entire Model state is reset if specified by the resetModel attribute.
2. Any changes to the model specified by SetValue objects in the listOfChanges are applied to the Model.

Then, for each SubTask child of the RepeatedTask, in the order specified by its order attribute:

3. Any AlgorithmParameter children of the associated Simulation are applied (with the possible exception of the seed; see Section 2.2.7.2).
4. Any SetValue children of the SubTask are applied to the relevant Model.
5. The referenced Task of the SubTask is executed.

I don't see a clearer way to say for step 3 that you're applying the general AlgorithmParameters that are found in the Simulation, and then for step 4 you're applying SetValues. In the future, if SetValues could change AlgorithmParameters, you'd want them to be applied after the ones found in the Simulation (step 3).

Is there a better way to phrase this that would remove the different interpretation you had?

jonrkarr commented 3 years ago

The language for the individual steps seems fine to me. It's the order of the steps that I would change.

My interpretation of "are applied" in step (3) is that the simulation tool interprets instances of sedml:Simulation, sedml:Algorithm, and sedml:AlgorithmParameter. Is that what you intended? To me, such interpretation has to come after any changes that might be made to this information by a future version of SetValue.

luciansmith commented 3 years ago

Oh! I finally get it. I was envisioning SetValue directly setting the active algorithm parameter. In that scheme, if you set that first, and then overwrite it with the parameter from the Simulation, that's not very helpful.

But instead, you're envisioning SetValue setting the SED-ML algorithm parameter, which would then need to be applied to the task next.

In pseudoscript form, here's what I was thinking:

for subtask in subtasks:
    task = subtask.getTask()
    sim = task.getSimulation()
    model = task.getModel()
    sim.applyParams(sim.getAlgParams())    # My step 3
    for change in subtask.changes:         # My step 4
        if change.type == Model:
            model.applyChange(change)
        elif change.type == AlgParam:
            sim.applyChange(change)
    task.execute()                         # My step 5

I think this scheme is slightly more in line with two other aspects of SED-ML: first, the fact that the ModelChange list for the Model is applied first, and then subsequent changes are applied with SetValue. Second, the fact that global AlgorithmParameters apply first, and can then be overwritten by 'local' Simulation AlgorithmParameters.
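In dict terms, the precedence I have in mind might look like this (a hypothetical sketch; the SetValue layer does not exist yet):

```python
# precedence: general settings apply first, more specific settings overwrite them
document_defaults = {"rel_tol": 1e-6, "abs_tol": 1e-9, "seed": None}
simulation_params = {"abs_tol": 1e-12}    # AlgorithmParameters on the Simulation
setvalue_params = {}                      # hypothetical future SetValue overrides

active = dict(document_defaults)
active.update(simulation_params)          # my step 3
active.update(setvalue_params)            # would be applied in my step 4
```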

In the end, it's a tiny thing that, as you say, doesn't make a difference yet because SetValue cannot apply to algorithm parameters. But I'm happy to change the wording to make things extra clear.

jonrkarr commented 3 years ago

Providing pseudocode would be helpful. I think such pseudocode would go a long way to breaking down the barrier to further adoption of SED-ML.

To be really clear, I think the pseudocode needs to address how reset is intended to work, including with nested repeated tasks. Also, it makes sense to me to describe a repeated task processing its own changes, rather than processing the changes of repeated tasks nested within it. In addition, I think it's important to clearly articulate what processing can happen solely internal to SED-ML vs. what simulation tools have to do using model files.

How about this?

def execute_repeated_task(sed_repeated_task, doc):
    # make a copy of the document so that any changes are insulated from the
    # processing of the parent SED-ML file or repeated task
    copy_doc = deepcopy(doc)

    # iterate over iterations
    for iteration in sed_repeated_task.getIterations():

        # reset to the state of the parent document if requested
        if sed_repeated_task.reset:
            copy_doc = deepcopy(doc)

        # apply changes
        for sed_change in sed_repeated_task.changes:
            if sed_change.type == Model:
                sed_model = sed_getModel(copy_doc, sed_change.getTarget())
                sed_applyModelChange(sed_change, sed_model)           # edits the sed_model data structure; change is reflected in copy_doc

            elif sed_change.type in [Simulation, Alg, AlgParam]:
                sed_sim = sed_getSimulation(copy_doc, sed_change.getTarget())
                sed_applySimulationChange(sed_change, sed_sim)        # edits the sed_sim data structure; change is reflected in copy_doc

        # execute subtasks
        for sed_subtask in sed_repeated_task.getSubtasks():
            sed_task = sed_getTask(copy_doc, sed_subtask.getTask())
            sed_model = sed_getModel(copy_doc, sed_task.getModel())
            sed_sim = sed_getSimulation(copy_doc, sed_task.getSimulation())

            if isinstance(sed_task, SedTask):
                simulator_model = simulator_initModel(sed_model)
                simulator_sim = simulator_initSim(sed_sim)            # My step 3
                simulator_executeTask(simulator_model, simulator_sim) # My step 5

            else:
                execute_repeated_task(sed_task, copy_doc)

luciansmith commented 3 years ago

This is starting to diverge from the original fix, but maybe? It seems to me that the most appropriate place for 'pseudocode that implements SED-ML' would be an appendix. Does that seem doable?

jonrkarr commented 3 years ago

An appendix or other linked document works.

I think the pseudocode makes things much more concrete. This removes a lot of the work needed to figure out how to implement SED-ML.

jonrkarr commented 3 years ago

Related to this, another thing I think needs to be clearer is how resetModel works with respect to nested repeated tasks, i.e., how far back should the resetting go? The pseudocode provides an opportunity to make this very clear.

My interpretation is that iterations should reset to the state of the parent repeated task, rather than to the original model definition. This is what's outlined in the pseudocode above, and this is what the 14 BioSimulators tools do. @luciansmith as another reference point, what does tellurium do?
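Concretely, my interpretation is the following distilled sketch (hypothetical names; same idea as the pseudocode above):

```python
from copy import deepcopy

def execute_repeated(task, state):
    snapshot = deepcopy(state)          # the state handed down by the parent
    for _ in range(task.num_iterations):
        if task.reset_model:
            state = deepcopy(snapshot)  # reset to the parent's state, not the file
        for subtask in task.subtasks:
            state = subtask.execute(state)  # may itself be a nested repeated task
    return state
```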

fbergmann commented 3 years ago

I've looked at the draft section of the repeatedTask and the clarifications listed there look fine to me and are what I would expect. At this point I would not add further clarifications/changes, and would save them for L2.