PEtab-dev / PEtab

PEtab - an SBML and TSV based data format for parameter estimation problems in systems biology

https://petab.readthedocs.io

MIT License


Use somewhat normalized "experiments" table instead of conditions/timecourses #585

Open dilpath opened 3 months ago

dilpath commented 3 months ago

@matthiaskoenig and others have suggested we make our tables more normalized.

Instead of a timecourses table (#581), I would suggest an experiments table, which merges the conditions and timecourses tables into a single table. The main idea is that conditions/timecourses all describe the "input" function of the dynamical system. Combining this "input function information" into a single table enables some additional operations.

All "unquoted" tables in this post are in the new proposed format.

Tables quoted like this are in old formats.

Conditions table -> experiments table

The following columns are sufficient to define a "normalized" PEtab v1 conditions table.

experimentId
    The experiment ID.
inputId
    The input variable ID.
    For example, an experimental condition parameter in the SBML model.
value
    The value that `inputId` takes.

Example 1: classic conditions table as experiments table

This PEtab v1 conditions table

conditionId	k0	k2
cond1	5	3

is now this PEtab v2 experiments table

experimentId	inputId	value
cond1	k0	5
cond1	k2	3
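
The wide-to-long conversion in Example 1 can be sketched in a few lines. This is only an illustration using the Python standard library, not the planned libpetab-python converter; the function name `wide_to_long` is made up here, and the sketch assumes every column other than `conditionId` and `conditionName` is an input.

```python
import csv
import io

# Columns of a PEtab v1 conditions table that are not inputs.
NON_INPUT_COLUMNS = {"conditionId", "conditionName"}


def wide_to_long(wide_tsv: str) -> str:
    """Convert a wide v1-style conditions table to the proposed long format."""
    reader = csv.DictReader(io.StringIO(wide_tsv), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["experimentId", "inputId", "value"])
    for row in reader:
        for column, value in row.items():
            # Skip ID/name columns and empty cells.
            if column in NON_INPUT_COLUMNS or value in (None, ""):
                continue
            writer.writerow([row["conditionId"], column, value])
    return out.getvalue()


wide = "conditionId\tk0\tk2\ncond1\t5\t3\n"
long_table = wide_to_long(wide)
print(long_table)
```

Each non-empty cell of the wide table becomes one row of the long table, which is why additional per-row columns (e.g. units) become possible.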

This enables additional optional columns, e.g. for units.

Timecourses table -> experiments table

This experiments table can be extended to support timecourses like #581, with the following optional column:

time
    The time at which the condition is applied.
    The earliest time of the `experimentId` is its `t0` for simulation.

Example 2: timecourses table as experiments table

This timecourse in the currently-proposed format (#581)

conditionId	k0
cond1	1
cond2	2
cond3	3

timecourseId	timecourse
tc1	0:cond1; 10:cond2; 250:cond3

is now specified in these long formats for the conditions and timecourses

`normalized_conditions.tsv`

experimentId	inputId	value
cond1	k0	1
cond2	k0	2
cond3	k0	3

`normalized_timecourses.tsv`

experimentId	inputId	time
tc1	cond1	0
tc1	cond2	10
tc1	cond3	250

which are specified in the PEtab YAML like

...
problems:
- experiment_files:
  - normalized_conditions.tsv
  - normalized_timecourses.tsv
  measurement_files:
  - ....tsv
  ...

Here, you might notice the trick. The two tables are combined into a single experiments table, i.e., those two long tables, and the joint table below, are equivalent tables in the exact same format -- all are valid tables in the proposed format.

experimentId	inputId	value	time
cond1	k0	1
cond2	k0	2
cond3	k0	3
tc1	cond1		0
tc1	cond2		10
tc1	cond3		250

This joint table enables a lot more flexibility, e.g. the following two features.

(1) Timecourses can be specified in terms of model parameters directly, e.g. the above joint table is equivalent to

experimentId	inputId	value	time
tc1	k0	1	0
tc1	k0	2	10
tc1	k0	3	250
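
To make the claimed equivalence concrete, here is a minimal sketch of how a tool might replace condition references with direct parameter assignments. It is not part of the proposal: the function name `flatten` is hypothetical, an `inputId` is treated as a reference whenever it matches some `experimentId`, and only untimed, non-nested conditions are handled.

```python
from collections import defaultdict

# (experimentId, inputId, value, time); None marks an empty cell.
JOINT_TABLE = [
    ("cond1", "k0", 1, None),
    ("cond2", "k0", 2, None),
    ("cond3", "k0", 3, None),
    ("tc1", "cond1", None, 0),
    ("tc1", "cond2", None, 10),
    ("tc1", "cond3", None, 250),
]


def flatten(table, experiment_id):
    """Replace references to untimed conditions by their parameter rows."""
    by_experiment = defaultdict(list)
    for exp_id, input_id, value, time in table:
        by_experiment[exp_id].append((input_id, value, time))

    flat = []
    for input_id, value, time in by_experiment[experiment_id]:
        if input_id not in by_experiment:
            # Already a direct assignment to a model parameter.
            flat.append((experiment_id, input_id, value, time))
            continue
        # Reference: copy the condition's assignments to this row's time.
        for ref_input, ref_value, ref_time in by_experiment[input_id]:
            if ref_time is not None or ref_input in by_experiment:
                raise ValueError("sketch handles untimed, non-nested conditions only")
            flat.append((experiment_id, ref_input, ref_value, time))
    return flat


flat_tc1 = flatten(JOINT_TABLE, "tc1")
for row in flat_tc1:
    print(*row, sep="\t")
```

The printed rows have the same content as the three-row `tc1` table above, with columns in the order experimentId, inputId, value, time.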

(2) Nesting is now possible, for easier specification of periodic timecourses.

Nested timecourses

We already agreed that repeating timecourse specification is useful. I would add nested timecourses too, since I already have a use case. Hence the following optional column:

repeatEvery
    The `inputId` is repeated (reapplied/restarted from its `t0`) every `repeatEvery` time units.

Example 3: Nested and repeating timecourse

This describes an experiment where a switch is toggled on/off every 5 time units until t=100.

switchOn and switchOff are like PEtab v1 conditions
switchSequence is like a timecourse as in #581
experiment1 is a nested timecourse where switchSequence is repeated every 10 time units to simulate the repeated toggling of the switch, until t=100.

experimentId	inputId	value	time	repeatEvery
switchOn	switch	1
switchOff	switch	0
switchSequence	switchOn		0
switchSequence	switchOff		5
experiment1	switchSequence		0	10
experiment1	switchOff		100

Pros

most users do not need timecourses, but v2 currently requires a timecourses table. This combined experiments table means users don't need a dummy (timecourse1 = 0:condition1) timecourse table to convert their PEtab v1 problems into v2, and can instead use any condition/timecourse/nested timecourse experimentId in the measurements table. I think this is more intuitive for users.
this experiments table defines inputs, and then supports merging/repeating/concatenating into timecourses and nested timecourses. i.e. all "input" information that PEtab core intends to support is in a single table.
some basic operations on experiments are possible. For example, one could modify some complicated condition cond1 with 1000 input variables at just one of its input variables like

experimentId	inputId	value
cond2	cond1
cond2	k999	3
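
A possible reading of this "inherit, then override" table, as a sketch only. The proposal does not state a precedence rule, so this assumes that direct assignments in an experiment win over values inherited from a referenced experiment. The function name `resolve_condition` and the `cond1` values are made up for illustration, and cyclic references are not detected.

```python
def resolve_condition(table, experiment_id):
    """Return {inputId: value} for an untimed experiment, following references."""
    experiment_ids = {row[0] for row in table}
    inherited, direct = {}, {}
    for exp_id, input_id, value in table:
        if exp_id != experiment_id:
            continue
        if input_id in experiment_ids:
            # Reference to another experiment: inherit all of its values.
            inherited.update(resolve_condition(table, input_id))
        else:
            direct[input_id] = value
    # Assumption: direct assignments override inherited ones.
    return {**inherited, **direct}


# Hypothetical, shortened cond1 (two inputs instead of 1000).
table = [
    ("cond1", "k0", 5),
    ("cond1", "k999", 1),
    ("cond2", "cond1", None),
    ("cond2", "k999", 3),
]
cond2 = resolve_condition(table, "cond2")
print(cond2)
```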

i.e., I think this format future-proofs PEtab v2 by supporting many features/operations on conditions. In the end, these can all be "denested" easily into things that look like PEtab v1 conditions applied at specific time points (or, SBML events), so it makes no difference to PEtab-compatible tools.

Cons

users should see Example 1 in the docs, and the optional columns in Examples 2 and 3 should be presented carefully since they are irrelevant (and potentially confusing) to most.

m-philipps commented 3 months ago

I can see the advantages and I think that it is a viable solution that should be explored further.

In general the long format for the condition table means that conditions with many parameters are not as simple and clearly arranged as they are in the wide format, but I think that the flexibility in the long format, also w.r.t. further columns, is the stronger argument. (It should also be trivial to create the conditions table in wide format and then switch using the v1 to v2 converter.)

What I am critical about is the potential ambiguity. Rethinking conditions and time courses (sequences of conditions) as the same input-value-(time) structure is logical but requires a higher level of abstraction from the user. It is also a bit misleading because different rows can still have different interpretations. For instance, re. the rows in the newly suggested table in Example 2: cond1-cond3 are used to set parameters while measurements can only be assigned to tc1. That is also inconsistent with Example 1 where there is no explicit time course. If that means that every condition is by default a time course starting at time = 0 we end up with a mix of implicitly and explicitly defined time courses.

I am also considering whether too much flexibility could deter new/basic users because it would be more difficult to figure out (1.) which way to correctly specify a PEtab problem and (2.) understand if there are undesired consequences of specifying a PEtab problem in one way, e.g.: If Example 2 cond1 is also an implicit time course, would it be simulated?

paulflang commented 3 months ago

Thanks @dilpath , great to see efforts to make the tables more normalized! But similar to @m-philipps , I think putting rows with different meaning in the same table should be avoided. My suggestion would be something like

experimentId	targetId	time	value	isDelta
S1_knockout	S1	0	0	0
S1_knockdown	S1	0	s1kd_init	0
BolusA	drugA	10	5	1
BolusA	drugA	20	5	1
InfusionA	kInDrugA	10	1	0
InfusionA	kInDrugA	20	0	0

Here, we have four different experiments. targetId is probably what you called inputId. It can be a species or parameter (or compartment, I guess). value can be numeric or a parameter from the parameter table. By default, the targetId is set to value. However, if isDelta, the value is added to the targetId. For example, at t=20 not all drugA might be degraded, but the patient takes the next pill.

One problem with this table is that you might have to change the model if you want to do infusion dosing (i.e. the model might not have a kInDrugA parameter, when it was originally not constructed for infusion dosing). To facilitate reuse of models, there could just be another column isRate.

experimentId	targetId	time	value	isDelta	isRate
S1_knockout	S1	0	0	0	0
S1_knockdown	S1	0	s1kd_init	0	0
BolusA	drugA	10	5	1	0
BolusA	drugA	20	5	1	0
InfusionA	drugA	10	1	0	1
InfusionA	drugA	20	0	0	1

One issue I see is that you could accidentally do

experimentId	targetId	time	value	isDelta	isRate
InfusionA	drugA	10	1	0	1
InfusionA	drugA	20	0	0	0

which should not switch off the infusion imo, just set drugA to zero, while the infusion keeps going.

Of course, to save rows you could also add an optional repeat column that contains delta_t and n_repeats. For example 5; 10 would mean repeat what the row does every 5 time units 10 times.

Lmk what you think. I have not been involved with v2 yet, so maybe I'm missing crucial bits here.

dilpath commented 3 months ago

Thanks for the quick feedback. I changed the first post to address some of it.

@m-philipps

It should also be trivial to create the conditions table in wide format and then switch using the v1 to v2 converter.

This will be available in the v1<->v2 converter that we will supply in libpetab-python.

That is also inconsistent with Example 1 where there is no explicit time course.

For instance, re. the rows in the newly suggested table in Example 2: cond1-cond3 are used to set parameters while measurements can only be assigned to tc1.

Yes, this is a discussion point -- should we have a default t0=0 or require a user to specify this? Specifying it explicitly is fine with me. Conditions in PEtab v1 are implicitly a constant timecourse with t0=0 -- we could make that explicit in PEtab v2.

it would be more difficult to figure out (1.) which way to correctly specify a PEtab problem

Yes, the flexibility makes things a little more complicated. However, the examples the users see in the docs can be rather basic and in two "separate" tables, e.g. Example 2 can be specified like

`normalized_conditions.tsv`

experimentId	inputId	value
cond1	k0	1
cond2	k0	2
cond3	k0	3

`normalized_timecourses.tsv`

experimentId	inputId	time
tc1	cond1	0
tc1	cond2	10
tc1	cond3	250

This is equivalent to separate normalized formats for the conditions and timecourses table. However, they would be combined into a single experiments table in the PEtab YAML, because they're all just timecourses to the tool.

...
problems:
- experiment_files:
  - normalized_conditions.tsv
  - normalized_timecourses.tsv
  measurement_files:
  - ....tsv
  ...

and (2.) understand if there are undesired consequences of specifying a PEtab problem in one way, e.g.: If Example 2 cond1 is also an implicit time course, would it be simulated?

It's currently up to the tool whether it chooses to perform simulations that have no measurements...

dilpath commented 3 months ago

@paulflang

putting rows with different meaning in the same table should be avoided

To me, the rows do not have a different meaning. Every row describes the (piecewise-)constant input function of a dynamical system. The only difference between rows is whether they update a single, or multiple, model parameters, but hopefully this is made clear through the use of experimentIds in the inputId column.

My suggestion would be something like

This will be supported. i.e. your suggestion

experimentId	targetId	time	value	isDelta
BolusA	drugA	10	5	1

can be expressed like so in PEtab v2 (regardless of which table format we go with, expressions will be supported...)

experimentId	inputId	time	value
BolusA	drugA	10	drugA+5

We opted for expressions over a column like isDelta since it was requested (e.g. https://github.com/PEtab-dev/PEtab/issues/577) and is more flexible. However, this sets the value of drugA at t=10, which then remains constant during the next timecourse period.

you might have to change the model if you want to do infusion dosing

Would be fine for me to include this isRate column :+1: This then really requires a tool to modify the mathematical model, rather than requesting that the simulator simply simulate with different parameters, so I guess most tools won't support this, but also fine since it's optional. The PEtab library can do that work too. It also means the input function can now be piecewise-linear.

One issue I see is that you could accidentally do

experimentId	targetId	time	value	isDelta	isRate
InfusionA	drugA	10	1	0	1
InfusionA	drugA	20	0	0	0

which should not switch off the infusion imo, just set drugA to zero, while the infusion keeps going.

I guess instead of value and isRate, columns like value and rate might be more intuitive. e.g. set inputId=value, and set rateOf(inputId)=rate. Then your suggestion becomes explicit

experimentId	inputId	time	value	rate
InfusionA	drugA	10		1
InfusionA	drugA	20	0	1

or we use a priority column, which I would include anyway to resolve multiple changes to the input function at the same time point.

experimentId	inputId	time	value	isRate	priority
InfusionA	drugA	10	1	1	1
InfusionA	drugA	20	0	0	0
InfusionA	drugA	20	1	1	1

paulflang commented 2 months ago

To me, the rows do not have a different meaning. Every row describes the (piecewise-)constant input function of a dynamical system.

Ok. I see what you mean now. But I also think that PEtab should stick to the classic relational database model, where foreign keys point to rows in other tables. Then you can just go with object-relational mapping (table=class, column=class attribute, row=instance) to represent everything in (Python) objects. Otherwise the format specification would break the principle of least astonishment. But I'm definitely not a database expert, so I might be wrong here.

Problem with the relational system is, that for n levels of nesting, you require n tables. And you don't know how deep a user wants to nest. But imo the last table before the "Nesting" headline looks good to me. And tbh, for most reasonable biological applications the number of rows will still be something reasonable.

We opted for expressions over a column like isDelta since it was requested (e.g. https://github.com/PEtab-dev/PEtab/issues/577) and is more flexible.

OK, that's a good point.

However, this sets the value of drugA at t=10, which then remains constant during the next timecourse period.

Hmm. That's a problem. I think at least three cases need to be supported:

chemostat (drug set to a constant value)
infusion (drug influx set to a constant value)
bolus (drug added, but may decay after)

So perhaps instead of using a Rate column, we need a valueType column, that tells us what type that value is. Allowed cases could be:

constant
rate
assignment

dilpath commented 2 months ago

Then you can just go with object-relational mapping (table=class, column=class attribute, row=instance) to represent everything in (Python) objects.

Also not an expert, but I think this is already supported. I am currently working with this proposed table and importing it into Python objects using pydantic without issue.

Problem with the relational system is, that for n levels of nesting, you require n tables. And you don't know how deep a user wants to nest. But imo the last table before the "Nesting" headline looks good to me. And tbh, for most reasonable biological applications the number of rows will still be something reasonable.

Agreed, IMO the nice thing about the proposed format is that it supports the very intuitive "condition-only long table" and "timecourse-only long table" (see updated Example 2), while also supporting the arbitrary nesting that avoids multiple, or very long, tables.

So perhaps instead of using a Rate column, we need a valueType column, that tells us what type that value is.

Sounds good to me!

FFroehlich commented 2 months ago

Thanks for the quick feedback. I changed the first post to address some of it.

Thanks for bringing up this discussion. For someone joining this discussion late, it would have been helpful to have some kind of markup in the post to see changes.

dilpath commented 2 months ago

Thanks for the quick feedback. I changed the first post to address some of it.

Thanks for bringing up this discussion. For someone joining this discussion late, it would have been helpful to have some kind of markup in the post to see changes.

Agreed, but I didn't make any changes to the format of the tables in the first post yet (e.g. all columns and their meaning have remained the same so far). I only added explanatory text or re-arranged the order of some things so people can hopefully understand the proposal better.

I now "quote" any tables in the old formats (e.g. the old conditions and timecourses tables) in the first post, so it's clear which tables are in the proposed format.

FFroehlich commented 2 months ago

I guess instead of value and isRate, columns like value and rate might be more intuitive. e.g. set inputId=value, and set rateOf(inputId)=rate. Then your suggestion becomes explicit

I like this proposal quite a bit as it would resolve some of the ambiguity in the assignment that we currently have, which requires pretty detailed understanding of SBML semantics. Would be great to also differentiate between value and initialValue.

I am a bit unhappy about the experimentId notation. On the one hand the meaning of that column will be semantically quite difficult to describe. On the other hand, it creates a lot of room for weird corner cases. For example, what happens when there are nested assignments with time specified? What is supposed to happen when you assign a condition as rate, but the condition also assigns initialValue? My impression is that a lot of issues could be avoided and explanations simplified if there are two separate files even if there is some overlap in the columns.

dilpath commented 2 months ago

I am a bit unhappy about the experimentId notation. On the one hand the meaning of that column will be semantically quite difficult to describe.

To me, it simply means "input function", but I probably misunderstood your point...

On the other hand, it creates a lot of room for weird corner cases. For example, what happens when there are nested assignments with time specified?

Yes, I left an explanation of these corner cases out of the first post so far :see_no_evil: But this is how I would do it: Each experiment has a t0 for the start time of the simulation/input function. This is defined as the lowest value in the time column for all rows belonging to that experimentId. If the user leaves time empty, t0 defaults to 0.

To your question: although a user specifies time absolutely, these times are all considered as relative to the t0 of the experiment. Then, when nesting occurs, we simply change the t0, and all subsequent periods are updated to maintain the same "relative" time delta to the new t0. Here's an example:

Consider the switchSequence in Example 3. Using this "relative to t0" definition, I am able to construct a timecourse that intuitively adjusts the time correctly, after every repeatEvery repetition of the nested timecourse experiment1. If I didn't use this "relative to t0" definition, then all switch=1 occurrences would be at time=0, which is undesirable. Here's the result of denesting experiment1 in Example 3:

experiment_id	input_id	input_value	time
experiment1	switch	1	0
experiment1	switch	0	5
experiment1	switch	1	10
experiment1	switch	0	15
experiment1	switch	1	20
experiment1	switch	0	25
experiment1	switch	1	30
...
experiment1	switch	0	100

e.g. the first switchSequence repeat is at t0=0, and hence switch=0 @ time=5. The second repeat is at t0=repeatEvery=10, and hence switch=0 @ time=15. The nth repeat is at t0=repeatEvery*n=10*n and hence switch=0 @ time=10*n+5.
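
The "relative to t0" denesting described above can be sketched as follows. This is only one possible implementation under stated assumptions, not the libpetab-python one: the function name `denest` is hypothetical, an empty time defaults to 0, a repeating row restarts until the next row of the same experiment begins, and nested events that would fall at or after that next row are dropped.

```python
from collections import defaultdict

# (experimentId, inputId, value, time, repeatEvery); None marks an empty cell.
EXAMPLE_3 = [
    ("switchOn", "switch", 1, None, None),
    ("switchOff", "switch", 0, None, None),
    ("switchSequence", "switchOn", None, 0, None),
    ("switchSequence", "switchOff", None, 5, None),
    ("experiment1", "switchSequence", None, 0, 10),
    ("experiment1", "switchOff", None, 100, None),
]


def denest(table, experiment_id):
    """Expand an experiment into (time, inputId, value) rows relative to its t0."""
    rows_by_id = defaultdict(list)
    for exp_id, input_id, value, time, repeat_every in table:
        rows_by_id[exp_id].append((input_id, value, time or 0, repeat_every))

    def expand(exp_id):
        rows = sorted(rows_by_id[exp_id], key=lambda row: row[2])
        t0 = rows[0][2]  # earliest time of this experiment
        result = []
        for index, (input_id, value, time, repeat_every) in enumerate(rows):
            # A period ends where the next later row of this experiment begins.
            later = [r[2] for r in rows[index + 1:] if r[2] > time]
            end = later[0] if later else None
            if input_id not in rows_by_id:
                result.append((time - t0, input_id, value))
                continue
            if repeat_every is None:
                starts = [time]
            elif end is None:
                raise ValueError("a repeating row needs a later row to end it")
            else:
                starts, start = [], time
                while start < end:
                    starts.append(start)
                    start += repeat_every
            nested = expand(input_id)
            for start in starts:
                for rel_time, nested_input, nested_value in nested:
                    shifted = start - t0 + rel_time
                    if end is None or shifted < end:
                        result.append((shifted, nested_input, nested_value))
        return result

    return expand(experiment_id)


denested = denest(EXAMPLE_3, "experiment1")
for time, input_id, value in denested:
    print("experiment1", input_id, value, time, sep="\t")
```

For Example 3 this yields ten on/off pairs starting at t=0, 10, ..., 90, followed by the final switch=0 at t=100.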

This is useful for me to be able to concisely define a radiation therapy involving "radiation on" then "radiation off" repeatedly until some end time point. It denests into something that can be easily verified to have the intended timecourse, and I would include a plotting function in libpetab-python to visualize these input functions so users can have a visual check too. The denested version is also valid because of the "single table" in this proposal, which allows model parameters to be directly defined in timecourses, instead of via "conditions".

What is supposed to happen when you assign a condition as rate, but the condition also assigns initialValue. My impression is that a lot of issues could be avoided and explanations simplified if there are two separate files even if there is some overlap in the columns.

I haven't thought too much about these additional columns that mean things like SBML initialAssignmentRule (initialValue), assignmentRule and rateRule (rate). To avoid contradictions, specifying both assignmentRule+rateRule should be excluded, as well as initialAssignmentRule+assignmentRule.

But I don't see an issue with initialAssignmentRule+rateRule, since then this is just saying the "input function" is an IVP with dInputFunctionVariable/dt = rate, InputFunctionVariable(t=t0) = initialValue, right?

paulflang commented 2 months ago

I am a bit unhappy about the experimentId notation. On the one hand the meaning of that column will be semantically quite difficult to describe.

This is because it nests three different concepts, as @dilpath mentioned:

switchOn and switchOff are like PEtab v1 conditions

switchSequence is like a timecourse as in https://github.com/PEtab-dev/PEtab/pull/581

experiment1 is a nested timecourse where switchSequence is repeated every 10 time units to simulate the repeated toggling of the switch, until t=100.

Of course, they can be considered as one concept ("input function", as Dilan suggested), but still, input functions are so diverse things that they almost deserve their separate tables.

That said, we would not have these problems if we just would not support nesting. Dilan's denested table above looks very clean and easy to understand. I don't see big disadvantages. Pandas, DataFrames.jl and even Excel allow you to easily create such denested tables. Maybe file size could become a little pain if you exceed GitHub file size and have to start using Git-lfs at some point, but this should be super rare. On the other hand the advantages of the denested form seem much much larger for me:

increased community uptake due to simplicity
increased software support due to easier implementation
increased robustness to human error
no ambiguity

(And if anyone wants to have a more concise notation, maybe simply allow both, scalars and start:step:end strings in the time column -> should do 90% of the conciseness with 10% of the confusion)
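
A sketch of what expanding such a time cell could look like. The `start:step:end` syntax is only the suggestion above, not an agreed PEtab feature; the function name `expand_time_cell` is hypothetical, and treating `end` as inclusive is an assumption. `Fraction` is used so that decimal steps such as 0.1 do not lose the last time point to floating-point rounding.

```python
from fractions import Fraction


def expand_time_cell(cell: str) -> list[float]:
    """Expand a time cell that is a scalar or a 'start:step:end' string."""
    parts = [part.strip() for part in cell.split(":")]
    if len(parts) == 1:
        return [float(parts[0])]
    if len(parts) != 3:
        raise ValueError(f"expected a scalar or start:step:end, got {cell!r}")
    start, step, end = (Fraction(part) for part in parts)
    if step <= 0:
        raise ValueError("step must be positive")
    # Assumption: `end` is included when it lies exactly on the grid.
    count = (end - start) // step + 1
    return [float(start + i * step) for i in range(count)]


print(expand_time_cell("5"))
print(expand_time_cell("10:10:20"))
```

With this, the two BolusA rows at t=10 and t=20 from the earlier table could be written as a single row with time `10:10:20`.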

FFroehlich commented 2 months ago

I am a bit unhappy about the experimentId notation. On the one hand the meaning of that column will be semantically quite difficult to describe.

To me, it simply means "input function", but I probably misunderstood your point...

Well, we probably want something a bit more biology oriented. In this thread we often refer to "experiment", but with one-layer nesting, it's probably more appropriate to call respective entries something like "experimental phase" and I am not sure what to call 3-layer nesting or why it would be necessary. But as Paul mentioned, if we already introduce such notation why not make our lives easier and just separate them in individual files?

What is supposed to happen when you assign a condition as rate, but the condition also assigns initialValue. My impression is that a lot of issues could be avoided and explanations simplified if there are two separate files even if there is some overlap in the columns. I haven't thought too much about these additional columns that mean things like SBML initialAssignmentRule (initialValue), assignmentRule and rateRule (rate). To avoid contradictions, specifying both assignmentRule+rateRule should be excluded, as well as initialAssignmentRule+assignmentRule.

I am not worried about combinations of assignment, but using this in combination with nesting.

dilpath commented 2 months ago

denested table above looks very clean and easy to understand. I don't see big disadvantages.

Agreed, this should be allowed regardless if we adopt the long format; this means users do not need to define conditions, because they can specify model parameters in this experiments table directly.

On the other hand the advantages of the denested form seem much much larger for me

Thanks, nice points, I can see the benefits. No ambiguity is definitely a big advantage.

Well, we probably want something a bit more biology oriented.

Makes sense. I have been thinking in the context of drug regimens for a couple of applications so far. e.g. maybe one only applies a radiation therapy between 9 am and 5 pm (first level timecourse); and only on weekdays (second level timecourse); and only on the first week of the month (third level timecourse); and only for six months (fourth level timecourse); and then measures tumor response at the seventh month. Some of this I made up -- in a current application I am only looking at a third level timecourse, but I think these larger nesting applications are plausible. The nesting allows me to define one "covariate condition" per patient, and define the "radiation therapy" nested timecourse once, and then combine "covariate condition"+"radiation therapy timecourse" to create patient-specific "experiments". I am not sure of a suitable biology term here though, apart from "experiment" or "protocol". This makes for a neat specification for my problem, though.

But as Paul mentioned, if we already introduce such notation why not make our lives easier and just separate them in individual files?

If we end up only supporting "one-layer nesting" (i.e. equivalent to a conditions and timecourses file), then individual files is completely fine for me. But I think there is a similar complexity cost (in terms of user comprehension) from introducing too many tables, compared to the complexity of understanding nested experiments.

Alright, one last attempt to see if I can make this nesting intuitive for new users. What if we say that, if any row in the experiments table is missing a time value, then that experiment is a "floating" experiment that cannot be used directly?

e.g.

experimentId	inputId	value	time	repeatEvery
switchOn	switch	1
switchOff	switch	0
switchSequence	switchOn
switchSequence	switchOff		5
experiment1	switchSequence		0	10
experiment1	switchOff		100

Here, switchOn and switchOff are floating (classic PEtab conditions), so must be used in another experiment. switchSequence is missing a time in its first row, so is also floating. Its timepoints are relative to t0, which doesn't exist until it's used in another experiment that defines t0. Floating experiments can only have non-negative or empty time values. In experiment1, switchSequence is given a t0=0, so experiment1 can be used in the measurements table.
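
The "floating" rule as stated here is easy to check mechanically. A minimal sketch, assuming the table has already been read into tuples with `None` for empty cells; the function name `floating_experiments` is made up.

```python
# (experimentId, inputId, value, time, repeatEvery); None marks an empty cell.
TABLE = [
    ("switchOn", "switch", 1, None, None),
    ("switchOff", "switch", 0, None, None),
    ("switchSequence", "switchOn", None, None, None),
    ("switchSequence", "switchOff", None, 5, None),
    ("experiment1", "switchSequence", None, 0, 10),
    ("experiment1", "switchOff", None, 100, None),
]


def floating_experiments(table):
    """Return experimentIds that have at least one row without a time value."""
    floating = []
    for exp_id, _input_id, _value, time, _repeat_every in table:
        if time is None and exp_id not in floating:
            floating.append(exp_id)
    return floating


floating = floating_experiments(TABLE)
# Only non-floating experiments may be referenced from the measurements table.
usable = sorted({row[0] for row in TABLE} - set(floating))
print(floating, usable)
```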

If we can agree that this is sensible, then I would try to make a case for nested initialAssignment/assignment/rate rules.

But if you think it's still too confusing/ambiguous, then I'll open a new issue and we can move forward with long-format versions of the conditions and timecourses tables, since I guess we will agree on those. I can then design a third table for nested timecourses, since this has been requested by a couple of users to implement repeating timecourses like experiment1. The problem is, I don't see how having a third table will reduce ambiguity about nesting... does it?

(And if anyone wants to have a more concise notation, maybe simply allow both, scalars and start:step:end strings in the time column -> should do 90% of the conciseness with 10% of the confusion)

I don't think experiment1 can be specified like this, can it? I think this ends up being the same as repeatEvery: start is already time, step is repeatEvery, and end is defined by the start of the next period.

sebapersson commented 2 months ago

A bit late to the game, but I agree with Paul here that we should avoid nesting.

I find the nesting to be confusing, while the long format is more intuitive and I think overall easier to understand (and isRate is a good suggestion, can be a bit tricky to implement, but doable). I see the point of one layer nesting, and think it is good to have two files (one with when each condition is applied, and the second what a condition specifies).

dilpath commented 2 months ago

Well, this did not go the way I expected! But there were some good additional suggestions, thanks for the feedback :slightly_smiling_face: Unless someone says otherwise, I'll make a suggestion for two simpler "conditions" and "timecourses" long-format tables in the next week, with the additional columns like isRate, and no nesting.

paulflang commented 2 months ago

Thanks Dilan, that sounds good to me. However, since you mentioned earlier

the value of drugA at t=10, which then remains constant during the next timecourse period.

I think we have three options of how the value could be interpreted:

like an SBML eventAssignment (useful for modeling bolus dosing)
like an SBML rateRule (useful for modeling patient infusions)
like an SBML constant (i.e. held constant; useful for cells in bioreactors)

So instead of isRate, it is maybe better to have an optional column that is called something like valueType, where cells are allowed to contain either assignment (default), rate or constant or sth like that.
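
To illustrate how the three interpretations differ, here is a toy explicit-Euler simulation. Everything in it is an assumption for illustration only: the one-state model dx/dt = -k_decay * x (a drug that decays), the function names `simulate` and `value_at`, the parameter values, and the exact semantics given to each valueType.

```python
def simulate(events, value_type, k_decay=0.1, dt=0.01, t_end=30.0):
    """Euler simulation of dx/dt = influx - k_decay * x with (time, value) events."""
    x, influx, held = 0.0, 0.0, None
    pending = sorted(events)
    trajectory = []
    for step in range(round(t_end / dt) + 1):
        t = step * dt
        while pending and pending[0][0] <= t + 1e-9:
            _, value = pending.pop(0)
            if value_type == "assignment":
                x = value        # bolus-like: set once, then free to decay
            elif value_type == "rate":
                influx = value   # infusion-like: constant influx term
            elif value_type == "constant":
                held = value     # chemostat-like: clamp the state
            else:
                raise ValueError(f"unknown valueType: {value_type}")
        if held is not None:
            x = held
        trajectory.append((t, x))
        x += dt * (influx - k_decay * x)
    return trajectory


def value_at(trajectory, time, dt=0.01):
    """Look up the state at a time that lies on the simulation grid."""
    return trajectory[round(time / dt)][1]


bolus = simulate([(10, 5.0)], "assignment")
infusion = simulate([(10, 1.0)], "rate")
chemostat = simulate([(10, 5.0)], "constant")
for name, run in [("assignment", bolus), ("rate", infusion), ("constant", chemostat)]:
    print(name, round(value_at(run, 20), 3))
```

The same (time, value) row gives three different trajectories: the assigned value decays after t=10, the rate drives the state towards influx/k_decay, and the constant stays fixed.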