Open dilpath opened 3 months ago
I can see the advantages and I think that it is a viable solution that should be explored further.
In general the long format for the condition table means that conditions with many parameters are not as simple and clearly arranged as they are in the wide format, but I think that the flexibility in the long format, also w.r.t. further columns, is the stronger argument. (It should also be trivial to create the conditions table in wide format and then switch using the `v1` to `v2` converter.)
What I am critical about is the potential ambiguity.
Rethinking conditions and time courses (sequences of conditions) as the same input-value-(time) structure is logical but requires a higher level of abstraction from the user. It is also a bit misleading because different rows can still have different interpretations.
For instance, re. the rows in the newly suggested table in Example 2: `cond1`-`cond3` are used to set parameters while measurements can only be assigned to `tc1`.
That is also inconsistent with Example 1 where there is no explicit time course. If that means that every condition is by default a time course starting at `time = 0` we end up with a mix of implicitly and explicitly defined time courses.
I am also considering whether too much flexibility could deter new/basic users because it would be more difficult to figure out (1.) which way to correctly specify a PEtab problem and (2.) understand if there are undesired consequences of specifying a PEtab problem in one way, e.g.: If Example 2 `cond1` is also an implicit time course, would it be simulated?
Thanks @dilpath , great to see efforts to make the tables more normalized! But similar to @m-philipps , I think putting rows with different meaning in the same table should be avoided. My suggestion would be something like
experimentId | targetId | time | value | isDelta |
---|---|---|---|---|
S1_knockout | S1 | 0 | 0 | 0 |
S1_knockdown | S1 | 0 | s1kd_init | 0 |
BolusA | drugA | 10 | 5 | 1 |
BolusA | drugA | 20 | 5 | 1 |
InfusionA | kInDrugA | 10 | 1 | 0 |
InfusionA | kInDrugA | 20 | 0 | 0 |
Here, we have four different experiments. `targetId` is probably what you called `inputId`. It can be a species or parameter (or compartment, I guess). `value` can be numeric or a parameter from the parameter table. By default, the `targetId` is set to `value`. However, if `isDelta`, the value is added to the `targetId`. For example, at t=20 not all `drugA` might be degraded, but the patient takes the next pill.
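A minimal sketch of the `isDelta` semantics described above. The function name `apply_row` and the dict-based state are assumptions made for illustration only; they are not part of PEtab or any PEtab tool.

```python
def apply_row(state, target_id, value, is_delta):
    """Return a new state after applying one table row.

    If is_delta is true, the value is added to the target;
    otherwise the target is overwritten with the value.
    """
    new_state = dict(state)
    if is_delta:
        new_state[target_id] = new_state.get(target_id, 0.0) + value
    else:
        new_state[target_id] = value
    return new_state


# Hypothetical amount of drugA left just before the second pill at t=20.
state = {"drugA": 1.5}
state = apply_row(state, "drugA", 5, is_delta=True)
print(state["drugA"])  # 6.5
```

With `is_delta=False` the same call would reset `drugA` to 5 regardless of what was left over.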
One problem with this table is that you might have to change the model if you want to do infusion dosing (i.e. the model might not have a `kInDrugA` parameter, when it was originally not constructed for infusion dosing). To facilitate reuse of models, there could just be another column `isRate`.
experimentId | targetId | time | value | isDelta | isRate |
---|---|---|---|---|---|
S1_knockout | S1 | 0 | 0 | 0 | 0 |
S1_knockdown | S1 | 0 | s1kd_init | 0 | 0 |
BolusA | drugA | 10 | 5 | 1 | 0 |
BolusA | drugA | 20 | 5 | 1 | 0 |
InfusionA | drugA | 10 | 1 | 0 | 1 |
InfusionA | drugA | 20 | 0 | 0 | 1 |
One issue I see is that you could accidentally do
experimentId | targetId | time | value | isDelta | isRate |
---|---|---|---|---|---|
InfusionA | drugA | 10 | 1 | 0 | 1 |
InfusionA | drugA | 20 | 0 | 0 | 0 |
which should not switch off the infusion imo, just set drugA to zero, while the infusion keeps going.
Of course, to save rows you could also add an optional `repeat` column that contains `delta_t` and `n_repeats`. For example, `5; 10` would mean: repeat what the row does every 5 time units, 10 times.
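As a rough sketch of how such a `repeat` cell could be expanded into plain rows: the helper `expand_repeat` is hypothetical, and it assumes that `n_repeats` counts the total number of applications, including the one at the row's own time. The thread does not settle that detail.

```python
def expand_repeat(row):
    """Expand one row with a 'delta_t; n_repeats' repeat cell into plain rows.

    Assumption: n_repeats is the total number of applications,
    including the first one at the row's own time.
    """
    base = {key: val for key, val in row.items() if key != "repeat"}
    repeat = str(row.get("repeat") or "").strip()
    if not repeat:
        return [base]
    delta_t_text, n_text = repeat.split(";")
    delta_t, n_repeats = float(delta_t_text), int(n_text)
    expanded = []
    for i in range(n_repeats):
        new_row = dict(base)
        new_row["time"] = row["time"] + i * delta_t
        expanded.append(new_row)
    return expanded


row = {"experimentId": "BolusA", "targetId": "drugA", "time": 10,
       "value": 5, "isDelta": 1, "repeat": "5; 10"}
rows = expand_repeat(row)
print([r["time"] for r in rows])
```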
Lmk what you think. I have not been involved with v2 yet, so maybe I'm missing crucial bits here.
Thanks for the quick feedback. I changed the first post to address some of it.
@m-philipps
It should also be trivial to create the conditions table in wide format and then switch using the v1 to v2 converter.
This will be available in the v1<->v2 converter that we will supply in libpetab-python.
That is also inconsistent with Example 1 where there is no explicit time course.
For instance, re. the rows in the newly suggested table in Example 2: cond1-cond3 are used to set parameters while measurements can only be assigned to tc1.
Yes, this is a discussion point -- should we have a default `t0=0` or require a user to specify this? Specifying it explicitly is fine with me. Conditions in PEtab v1 are implicitly a constant timecourse with `t0=0` -- we could make that explicit in PEtab v2.
it would be more difficult to figure out (1.) which way to correctly specify a PEtab problem
Yes, the flexibility makes things a little more complicated. However, the examples the users see in the docs can be rather basic and in two "separate" tables, e.g. Example 2 can be specified like
normalized_conditions.tsv

experimentId | inputId | value |
---|---|---|
cond1 | k0 | 1 |
cond2 | k0 | 2 |
cond3 | k0 | 3 |

normalized_timecourses.tsv

experimentId | inputId | time |
---|---|---|
tc1 | cond1 | 0 |
tc1 | cond2 | 10 |
tc1 | cond3 | 250 |

This is equivalent to separate normalized formats for the conditions and timecourses table. However, they would be combined into a single experiments table in the PEtab YAML, because they're all just timecourses to the tool.
...
problems:
  - experiment_files:
      - normalized_conditions.tsv
      - normalized_timecourses.tsv
    measurement_files:
      - ....tsv
...
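To illustrate how a tool might treat the two files as one experiments table, here is a small sketch using only the standard library. The function name `read_experiments` and the fixed column list are assumptions; this is not how libpetab-python reads files.

```python
import csv
import io

# The two long tables from above, as tab-separated text.
CONDITIONS_TSV = (
    "experimentId\tinputId\tvalue\n"
    "cond1\tk0\t1\n"
    "cond2\tk0\t2\n"
    "cond3\tk0\t3\n"
)
TIMECOURSES_TSV = (
    "experimentId\tinputId\ttime\n"
    "tc1\tcond1\t0\n"
    "tc1\tcond2\t10\n"
    "tc1\tcond3\t250\n"
)

# Assumed union of columns for the combined experiments table.
COLUMNS = ["experimentId", "inputId", "value", "time"]


def read_experiments(*tsv_texts):
    """Concatenate several experiment tables, leaving absent columns empty."""
    rows = []
    for text in tsv_texts:
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for record in reader:
            rows.append({col: record.get(col) or "" for col in COLUMNS})
    return rows


experiments = read_experiments(CONDITIONS_TSV, TIMECOURSES_TSV)
for row in experiments:
    print(row)
```

Rows from the conditions file end up with an empty `time`, and rows from the timecourses file with an empty `value`, which is the sense in which both are "just timecourses to the tool".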
and (2.) understand if there are undesired consequences of specifying a PEtab problem in one way, e.g.: If Example 2 cond1 is also an implicit time course, would it be simulated?
It's currently up to the tool whether it chooses to perform simulations that have no measurements...
@paulflang
putting rows with different meaning in the same table should be avoided
To me, the rows do not have a different meaning. Every row describes the (piecewise-)constant input function of a dynamical system. The only difference between rows is whether they update a single, or multiple, model parameters, but hopefully this is made clear through the use of `experimentId`s in the `inputId` column.
My suggestion would be something like
This will be supported. i.e. your suggestion
experimentId | targetId | time | value | isDelta |
---|---|---|---|---|
BolusA | drugA | 10 | 5 | 1 |
can be expressed like so in PEtab v2 (regardless of which table format we go with, expressions will be supported...)
experimentId | inputId | time | value |
---|---|---|---|
BolusA | drugA | 10 | drugA+5 |
We opted for expressions over a column like `isDelta` since it was requested (e.g. https://github.com/PEtab-dev/PEtab/issues/577) and is more flexible. However, this sets the value of `drugA` at `t=10`, which then remains constant during the next timecourse period.
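A toy sketch of what evaluating a value expression such as `drugA+5` against the current state could look like. The helper `evaluate_value` is hypothetical, and Python's `eval` is used purely for brevity; it is neither a safe parser nor the PEtab math language.

```python
def evaluate_value(expression, state):
    """Evaluate a value cell against the current model state.

    Assumption: the cell is a number or a simple arithmetic expression
    over names found in `state`. eval() is for illustration only.
    """
    return eval(str(expression), {"__builtins__": {}}, dict(state))


state = {"drugA": 1.5}
state["drugA"] = evaluate_value("drugA+5", state)
print(state["drugA"])  # 6.5
```

This reproduces the `isDelta` behaviour at the time of assignment; what happens to `drugA` afterwards is the open question raised above.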
you might have to change the model if you want to do infusion dosing
Would be fine for me to include this `isRate` column :+1: This then really requires a tool to modify the mathematical model, rather than requesting that the simulator simply simulate with different parameters, so I guess most tools won't support this, but also fine since it's optional. The PEtab library can do that work too. It also means the input function can now be piecewise-linear.
One issue I see is that you could accidentally do
experimentId | targetId | time | value | isDelta | isRate |
---|---|---|---|---|---|
InfusionA | drugA | 10 | 1 | 0 | 1 |
InfusionA | drugA | 20 | 0 | 0 | 0 |
which should not switch off the infusion imo, just set drugA to zero, while the infusion keeps going.
I guess instead of `value` and `isRate`, columns like `value` and `rate` might be more intuitive, e.g. set `inputId=value`, and set `rateOf(inputId)=rate`. Then your suggestion becomes explicit
experimentId | inputId | time | value | rate |
---|---|---|---|---|
InfusionA | drugA | 10 | | 1 |
InfusionA | drugA | 20 | 0 | 1 |
or we use a `priority` column, which I would include anyway to resolve multiple changes to the input function at the same time point.
experimentId | inputId | time | value | isRate | priority |
---|---|---|---|---|---|
InfusionA | drugA | 10 | 1 | 1 | 1 |
InfusionA | drugA | 20 | 0 | 0 | 0 |
InfusionA | drugA | 20 | 1 | 1 | 1 |
To me, the rows do not have a different meaning. Every row describes the (piecewise-)constant input function of a dynamical system.
Ok. I see what you mean now. But I also think that PEtab should stick to the classic relational database model, where foreign keys point to rows in other tables. Then you can just go with object-relational mapping (table=class, column=class attribute, row=instance) to represent everything in (Python) objects. Otherwise the format specification would break the principle of least astonishment. But I'm definitely not a database expert, so I might be wrong here.
Problem with the relational system is, that for n levels of nesting, you require n tables. And you don't know how deep a user wants to nest. But imo the last table before the "Nesting" headline looks good to me. And tbh, for most reasonable biological applications the number of rows will still be something reasonable.
We opted for expressions over a column like isDelta since it was requested (e.g. https://github.com/PEtab-dev/PEtab/issues/577) and is more flexible.
OK, that's a good point.
However, this sets the value of drugA at t=10, which then remains constant during the next timecourse period.
Hmm. That's a problem. I think three cases need to be supported:
So perhaps instead of using a Rate column, we need a `valueType` column that tells us what type that value is. Allowed cases could be:
Then you can just go with object-relational mapping (table=class, column=class attribute, row=instance) to represent everything in (Python) objects.
Also not an expert, but I think this is already supported. I am currently working with this proposed table and importing it into Python objects using pydantic without issue.
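For readers who want to see the "table = class, column = attribute, row = instance" mapping spelled out, here is a small sketch. It uses standard-library dataclasses as a stand-in for pydantic, and the class name `ExperimentRow` and its fields are assumptions based on the columns discussed in this thread.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExperimentRow:
    """One row of the proposed experiments table (columns become attributes)."""
    experimentId: str
    inputId: str
    value: Optional[str] = None
    time: Optional[float] = None


def load_rows(tsv_text):
    """Map every table row to one ExperimentRow instance."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for record in reader:
        time_text = record.get("time") or ""
        rows.append(ExperimentRow(
            experimentId=record["experimentId"],
            inputId=record["inputId"],
            value=record.get("value") or None,
            time=float(time_text) if time_text else None,
        ))
    return rows


tsv = (
    "experimentId\tinputId\tvalue\ttime\n"
    "cond1\tk0\t1\t\n"
    "tc1\tcond1\t\t0\n"
)
rows = load_rows(tsv)
print(rows)
```

Note that `inputId` here can hold either a model entity (`k0`) or another `experimentId` (`cond1`), which is exactly the point being debated with respect to foreign keys.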
Problem with the relational system is, that for n levels of nesting, you require n tables. And you don't know how deep a user wants to nest. But imo the last table before the "Nesting" headline looks good to me. And tbh, for most reasonable biological applications the number of rows will still be something reasonable.
Agreed, IMO the nice thing about the proposed format is that it supports the very intuitive "condition-only long table" and "timecourse-only long table" (see updated Example 2), while also supporting the arbitrary nesting that avoids multiple, or very long, tables.
So perhaps instead of using a Rate column, we need a valueType column, that tells us what type that value is.
Sounds good to me!
Thanks for the quick feedback. I changed the first post to address some of it.
Thanks for bringing up this discussion. For some joining in late to this discussion it would have been helpful to have some kind of markup in the post to see changes.
Thanks for the quick feedback. I changed the first post to address some of it.
Thanks for bringing up this discussion. For some joining in late to this discussion it would have been helpful to have some kind of markup in the post to see changes.
Agreed, but I didn't make any changes to the format of the tables in the first post yet (e.g. all columns and their meaning have remained the same so far). I only added explanatory text or re-arranged the order of some things so people can hopefully understand the proposal better.
I now "quote" any tables in the old formats (e.g. the old conditions and timecourses tables) in the first post, so it's clear which tables are in the proposed format.
I guess instead of value and isRate, columns like value and rate might be more intuitive. e.g. set inputId=value, and set rateOf(inputId)=rate. Then your suggestion becomes explicit
I like this proposal quite a bit as it would resolve some of the ambiguity in the assignment that we currently have, which requires a pretty detailed understanding of SBML semantics. Would be great to also differentiate between `value` and `initialValue`.
I am a bit unhappy about the `experimentId` notation. On the one hand the meaning of that column will be semantically quite difficult to describe. On the other hand, it creates a lot of room for weird corner cases. For example, what happens if there are nested assignments with `time` specified? What is supposed to happen when you assign a condition as `rate`, but the condition also assigns `initialValue`? My impression is that a lot of issues could be avoided and explanations simplified if there are two separate files even if there is some overlap in the columns.
I am a bit unhappy about the `experimentId` notation. On the one hand the meaning of that column will be semantically quite difficult to describe.
To me, it simply means "input function", but I probably misunderstood your point...
On the other hand, it creates a lot of room for weird corner cases. For example, what happens if there are nested assignments with `time` specified?
Yes, I left an explanation of these corner cases out of the first post so far :see_no_evil: But this is how I would do it:
Each experiment has a `t0` for the start time of the simulation/input function. This is defined as the lowest value in the `time` column for all rows belonging to that `experimentId`. If the user leaves `time` empty, `t0` defaults to `0`.
To your question: although a user specifies `time` absolutely, these times are all considered as relative to the `t0` of the experiment. Then, when nesting occurs, we simply change the `t0`, and all subsequent periods are updated to maintain the same "relative" time delta to the new `t0`. Here's an example:
Consider the `switchSequence` in Example 3. Using this "relative to `t0`" definition, I am able to construct a timecourse that intuitively adjusts the `time` correctly, after every `repeatEvery` repetition of the nested timecourse `experiment1`. If I didn't use this "relative to `t0`" definition, then all `switch=1` occurrences would be at `time=0`, which is undesirable. Here's the result of denesting `experiment1` in Example 3:
experiment_id | input_id | input_value | time |
---|---|---|---|
experiment1 | switch | 1 | 0 |
experiment1 | switch | 0 | 5 |
experiment1 | switch | 1 | 10 |
experiment1 | switch | 0 | 15 |
experiment1 | switch | 1 | 20 |
experiment1 | switch | 0 | 25 |
experiment1 | switch | 1 | 30 |
... | |||
experiment1 | switch | 0 | 100 |
e.g. the first `switchSequence` repeat is at `t0=0`, and hence `switch=0 @ time=5`. The second repeat is at `t0=repeatEvery=10`, and hence `switch=0 @ time=15`. Counting from n=0, the nth repeat is at `t0=repeatEvery*n=10*n` and hence `switch=0 @ time=10*n+5`.
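The "relative to `t0`" rule can be sketched in a few lines. The function `denest_repeating` and its arguments are hypothetical names, and dropping rows at or after `end` is an assumption (that the next period's start time ends the repetition); the sequence values are read off the denested table above.

```python
def denest_repeating(sequence, repeat_every, start, end):
    """Flatten a repeating nested timecourse into absolute (time, value) rows.

    sequence: (relative_time, value) pairs, relative to the sequence's t0.
    Repeat n uses t0 = start + n * repeat_every (n = 0, 1, 2, ...).
    Assumption: rows at or after `end` are dropped, because the next
    period of the outer experiment starts there.
    """
    rows = []
    t0 = start
    while t0 < end:
        for relative_time, value in sequence:
            time = t0 + relative_time
            if time < end:
                rows.append((time, value))
        t0 += repeat_every
    return rows


# switchSequence: switch=1 at t0, switch=0 five time units later.
switch_sequence = [(0, 1), (5, 0)]
rows = denest_repeating(switch_sequence, repeat_every=10, start=0, end=100)
rows.append((100, 0))  # the final switchOff period of experiment1
for time, value in rows[:4]:
    print(time, value)
```

The first four rows printed are `(0, 1)`, `(5, 0)`, `(10, 1)`, `(15, 0)`, matching the start of the denested table.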
This is useful for me to be able to concisely define a radiation therapy involving "radiation on" then "radiation off" repeatedly until some end time point. It denests into something that can be easily verified to have the intended timecourse, and I would include a plotting function in `libpetab-python` to visualize these input functions so users can have a visual check too. The denested version is also valid because of the "single table" in this proposal, which allows model parameters to be directly defined in timecourses, instead of via "conditions".
What is supposed to happen when you assign a condition as `rate`, but the condition also assigns `initialValue`. My impression is that a lot of issues could be avoided and explanations simplified if there are two separate files even if there is some overlap in the columns.
I haven't thought too much about these additional columns that mean things like SBML `initialAssignmentRule` (`initialValue`), `assignmentRule` and `rateRule` (`rate`). To avoid contradictions, specifying both `assignmentRule`+`rateRule` should be excluded, as well as `initialAssignmentRule`+`assignmentRule`.
But I don't see an issue with `initialAssignmentRule`+`rateRule`, since then this is just saying the "input function" is an IVP with `dInputFunctionVariable/dt = rate`, `InputFunctionVariable(t=t0) = initialValue`, right?
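A sketch of the two ideas in this reply: rejecting the contradictory column combinations, and reading `initialValue`+`rate` as an initial value problem. Both function names are hypothetical, and the closed-form solution assumes the rate is a constant over the period.

```python
def check_rule_combination(initial_value=None, value=None, rate=None):
    """Reject the two contradictory column combinations named above."""
    if value is not None and rate is not None:
        raise ValueError("value (assignmentRule) and rate (rateRule) conflict")
    if initial_value is not None and value is not None:
        raise ValueError("initialValue and value (assignmentRule) conflict")


def input_at(t, t0, initial_value, rate):
    """Solve dx/dt = rate, x(t0) = initial_value, assuming a constant rate."""
    return initial_value + rate * (t - t0)


check_rule_combination(initial_value=2.0, rate=0.5)  # allowed combination
print(input_at(t=14.0, t0=10.0, initial_value=2.0, rate=0.5))  # 4.0
```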
I am a bit unhappy about the experimentId notation. On the one hand the meaning of that column will be semantically quite difficult to describe.
This is because it nests three different concepts, as @dilpath mentioned:
- switchOn and switchOff are like PEtab v1 conditions
- switchSequence is like a timecourse as in https://github.com/PEtab-dev/PEtab/pull/581
- experiment1 is a nested timecourse where switchSequence is repeated every 10 time units to simulate the repeated toggling of the switch, until t=100.
Of course, they can be considered as one concept ("input function", as Dilan suggested), but still, input functions are such diverse things that they almost deserve their separate tables.
That said, we would not have these problems if we just did not support nesting. Dilan's denested table above looks very clean and easy to understand. I don't see big disadvantages. Pandas, DataFrames.jl and even Excel allow you to easily create such denested tables. Maybe file size could become a bit of a pain if you exceed the GitHub file size limit and have to start using Git LFS at some point, but this should be super rare. On the other hand the advantages of the denested form seem much much larger for me:
(And if anyone wants to have a more concise notation, maybe simply allow both, scalars and start:step:end strings in the time column -> should do 90% of the conciseness with 10% of the confusion)
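A possible reading of the `start:step:end` idea, as a sketch. The helper `parse_time_cell` is hypothetical, and treating `end` as inclusive when it lies on the grid is an assumption that the comment above does not specify.

```python
def parse_time_cell(cell):
    """Parse a time cell that is either a scalar or 'start:step:end'.

    Assumption: `end` is included when it falls exactly on the grid.
    """
    text = str(cell).strip()
    if ":" not in text:
        return [float(text)]
    start, step, end = (float(part) for part in text.split(":"))
    if step <= 0:
        raise ValueError("step must be positive")
    count = int((end - start) // step) + 1
    return [start + i * step for i in range(count)]


print(parse_time_cell("5"))
print(parse_time_cell("0:10:100"))
```

Each returned time would then produce one denested row with the same `experimentId`, `inputId` and `value`.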
I am a bit unhappy about the `experimentId` notation. On the one hand the meaning of that column will be semantically quite difficult to describe.
To me, it simply means "input function", but I probably misunderstood your point...
Well, we probably want something a bit more biology oriented. In this thread we often refer to "experiment", but with one-layer nesting, it's probably more appropriate to call respective entries something like "experimental phase" and I am not sure what to call 3-layer nesting or why it would be necessary. But as Paul mentioned, if we already introduce such notation why not make our lives easier and just separate them in individual files?
What is supposed to happen when you assign a condition as rate, but the condition also assigns initialValue. My impression is that a lot of issues could be avoided and explanations simplified if there are two separate files even if there is some overlap in the columns.
I haven't thought too much about these additional columns that mean things like SBML initialAssignmentRule (initialValue), assignmentRule and rateRule (rate). To avoid contradictions, specifying both assignmentRule+rateRule should be excluded, as well as initialAssignmentRule+assignmentRule.
I am not worried about combinations of assignment, but using this in combination with nesting.
denested table above looks very clean and easy to understand. I don't see big disadvantages.
Agreed, this should be allowed regardless if we adopt the long format; this means users do not need to define conditions, because they can specify model parameters in this experiments table directly.
On the other hand the advantages of the denested form seem much much larger for me
Thanks, nice points, I can see the benefits. No ambiguity is definitely a big advantage.
Well, we probably want something a bit more biology oriented.
Makes sense. I have been thinking in the context of drug regimens for a couple of applications so far, e.g. maybe one only applies a radiation therapy between 9 am and 5 pm (first level timecourse); and only on weekdays (second level timecourse); and only on the first week of the month (third level timecourse); and only for six months (fourth level timecourse); and then measures tumor response at the seventh month. Some of this I made up -- in a current application I am only looking at a third level timecourse, but I think these larger nesting applications are plausible. The nesting allows me to define one "covariate condition" per patient, and define the "radiation therapy" nested timecourse once, and then combine "covariate condition"+"radiation therapy timecourse" to create patient-specific "experiments". I am not sure of a suitable biology term here though, apart from "experiment" or "protocol". This makes for a neat specification for my problem, though.
But as Paul mentioned, if we already introduce such notation why not make our lives easier and just separate them in individual files?
If we end up only supporting "one-layer nesting" (i.e. equivalent to a conditions and timecourses file), then individual files is completely fine for me. But I think there is a similar complexity cost (in terms of user comprehension) from introducing too many tables, compared to the complexity of understanding nested experiments.
Alright, one last attempt to see if I can make this nesting intuitive for new users. What if we say that, if any row in the experiments table is missing a `time` value, then that experiment is a "floating" experiment that cannot be used directly?
e.g.
experimentId | inputId | value | time | repeatEvery |
---|---|---|---|---|
switchOn | switch | 1 | | |
switchOff | switch | 0 | | |
switchSequence | switchOn | | | |
switchSequence | switchOff | | 5 | |
experiment1 | switchSequence | | 0 | 10 |
experiment1 | switchOff | | 100 | |
Here, `switchOn` and `switchOff` are floating (classic PEtab conditions), so must be used in another experiment. `switchSequence` is missing a time in its first row, so is also floating. Its timepoints are relative to `t0`, which doesn't exist until it's used in another experiment that defines `t0`. Floating experiments can only have non-negative or empty `time` values. In `experiment1`, `switchSequence` is given a `t0=0`, so `experiment1` can be used in the measurements table.
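The "floating" rule is easy to check mechanically. A minimal sketch, assuming rows are held as dicts and that an absent or empty `time` counts as missing; the function name `floating_experiments` is hypothetical.

```python
def floating_experiments(rows):
    """Return the ids of experiments with at least one row lacking a time."""
    floating = set()
    for row in rows:
        if row.get("time") in (None, ""):
            floating.add(row["experimentId"])
    return floating


rows = [
    {"experimentId": "switchOn", "inputId": "switch", "value": 1},
    {"experimentId": "switchOff", "inputId": "switch", "value": 0},
    {"experimentId": "switchSequence", "inputId": "switchOn"},
    {"experimentId": "switchSequence", "inputId": "switchOff", "time": 5},
    {"experimentId": "experiment1", "inputId": "switchSequence",
     "time": 0, "repeatEvery": 10},
    {"experimentId": "experiment1", "inputId": "switchOff", "time": 100},
]
print(sorted(floating_experiments(rows)))
```

Only `experiment1` has a time in every row, so under this rule it is the only experiment that could appear in the measurements table.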
If we can agree that this is sensible, then I would try to make a case for nested `initialAssignment`/`assignment`/`rate` rules.
But if you think it's still too confusing/ambiguous, then I'll open a new issue and we can move forward with long-format versions of the conditions and timecourses tables, since I guess we will agree on those. I can then design a third table for nested timecourses, since this has been requested by a couple of users to implement repeating timecourses like `experiment1`. The problem is, I don't see how having a third table will reduce ambiguity about nesting... does it?
(And if anyone wants to have a more concise notation, maybe simply allow both, scalars and start:step:end strings in the time column -> should do 90% of the conciseness with 10% of the confusion)
I don't think `experiment1` can be specified like this, can it? I think this ends up being the same as `repeatEvery`: `start` is already `time`, `step` is `repeatEvery`, and `end` is defined by the start of the next period.
A bit late to the game, but I agree with Paul here that we should avoid nesting.
I find the nesting to be confusing, while the long format is more intuitive and I think overall easier to understand (and `isRate` is a good suggestion, can be a bit tricky to implement, but doable). I see the point of one layer nesting, and think it is good to have two files (one with when each condition is applied, and the second what a condition specifies).
Well, this did not go the way I expected! But there were some good additional suggestions, thanks for the feedback :slightly_smiling_face: Unless someone says otherwise, I'll make a suggestion for two simpler "conditions" and "timecourses" long-format tables in the next week, with the additional columns like `isRate`, and no nesting.
Thanks Dilan, that sounds good to me. However, since you mentioned earlier
the value of drugA at t=10, which then remains constant during the next timecourse period.
I think we have three options of how the value could be interpreted:
- `eventAssignment` (useful for modeling bolus dosing)
- `rateRule` (useful for modeling patient infusions)
- `constant` (i.e. held constant; useful for cells in bioreactors)

So instead of `isRate`, it is maybe better to have an optional column that is called something like `valueType`, where cells are allowed to contain either `assignment` (default), `rate` or `constant` or sth like that.
@matthiaskoenig and others have suggested we make our tables more normalized.
Instead of a timecourses table (#581), I would suggest an experiments table, which merges the conditions and timecourses tables into a single table. The main idea is that conditions/timecourses all describe the "input" function of the dynamical system. Combining this "input function information" into a single table enables some additional operations.
All "unquoted" tables in this post are in the new proposed format.
Conditions table -> experiments table
The following columns are sufficient to define a "normalized" PEtab v1 conditions table.
Example 1: classic conditions table as experiments table
This PEtab v1 conditions table
This enables additional optional columns, e.g. for units.
Timecourses table -> experiments table
This experiments table can be extended to support timecourses like #581, with the following optional column:
Example 2: timecourses table as experiments table
This timecourse in the currently-proposed format (#581)
is now specified in these long formats for the conditions and timecourses
normalized_conditions.tsv
normalized_timecourses.tsv
which are specified in the PEtab YAML like
Here, you might notice the trick. The two tables are combined into a single `experiments` table, i.e., those two long tables, and the joint table below, are equivalent tables in the exact same format -- all are valid tables in the proposed format.
This joint table enables a lot more flexibility, e.g. the following two features.
(1) Timecourses can be specified in terms of model parameters directly, e.g. the above joint table is equivalent to
(2) Nesting is now possible, for easier specification of periodic timecourses.
Nested timecourses
We already agreed that repeating timecourse specification is useful. I would add nested timecourses too, since I already have a use case. Hence the following optional column:
Example 3: Nested and repeating timecourse
This describes an experiment where a switch is toggled on/off every 5 time units until `t=100`.
- `switchOn` and `switchOff` are like PEtab v1 conditions
- `switchSequence` is like a timecourse as in #581
- `experiment1` is a nested timecourse where `switchSequence` is repeated every 10 time units to simulate the repeated toggling of the switch, until `t=100`.

Pros
- ... (`timecourse1 = 0:condition1`) timecourse table to convert their PEtab v1 problems into v2, and can instead use any condition/timecourse/nested timecourse `experimentId` in the measurements table. I think this is more intuitive for users.
- ... `cond1` with 1000 input variables at just one of its input variables like

i.e., I think this format future-proofs PEtab v2 by supporting many features/operations on conditions. In the end, these can all be "denested" easily into things that look like PEtab v1 conditions applied at specific time points (or, SBML events), so it makes no difference to PEtab-compatible tools.
Cons