Tests for model output - Githubissues

tahorst commented 5 years ago

As discussed in the wholecell meeting, we would like to create a framework that tests outputs of the model against expected values or historical distributions as a way to catch model implementation bugs that might not produce code failures that would be caught by unit tests or CI failures. This should make identifying and troubleshooting errors easier and more automated.

The first step in implementing such tests would be creating a list of simulation outputs that we feel will most accurately and sensitively catch unwanted model changes (eg doubling time, mass fractions, ribosome/RNAP elongation rates and active fractions). Please add comments with suggestions that you often look at to verify output.

After these have been identified, we can create a new task/runscript that will load in data from a simulation, calculate the desired outputs, compare to expected values (possibly hard coded ranges or distributions within the repo or historical distributions) and alert us to any values that do not meet the expected targets as part of our CI.

Implementation notes:

If we have hard coded values, it will make it easier to update if a new modeling change has a desired impact on one of the targets while storing a historical distribution will require filtering negative results and allowing a reset if new improvements change one of the metrics significantly.
We can get historical values from daily builds that we have verified produce the outputs want (or just use the most recent one) after writing the script to read and calculate values.
If we use distributions, we could give an overall probability that all of the values take on expected values.

prismofeverything commented 5 years ago

After the discussion today we emerged with the following metrics to test model behavior:

RNA polymerase and ribosome active fractions and concentrations.
Various mass fractions (protein, mRNA, etc)
Doubling time
Elongation rates
The set of limiting metabolites
Fluxome/proteome/transcriptome correlation

The other goal is that we want these to support development, but not constrain it. Along these lines, the idea is that these metrics will be checked against data in the repository, so if you expect that you are changing model behavior you can update the affected data along with it. This will also act as a declaration that you are expecting model behavior to change.

Some additional ideas that came up:

Find more metrics by going through listeners and finding stable distributions?
Use unsupervised learning to make predictions of model behavior?

Any additional thoughts or ideas welcome!

1fish2 commented 4 years ago

This will be invaluable for noticing progress and problems in the overall simulation and in each process. But I'm unclear on some key questions.

Take test_library_performance as a simple example of such metrics, if that experience is relevant. We've found it useful to measure the speed of a handful of library operations like dot product and ODE integration, then get notified in the daily build if any of them got slower than specific threshold limits. We want to find out if that happened unexpectedly after upgrading Python, OpenBLAS, NumPy and SciPy, or even NFS.

There's a lot of measurement noise, so picking a failure cutoff is an unhappy tradeoff. The initial values came from manual runs but we had to increase the thresholds sometimes when a nightly build exceeded the thresholds by a small amount.

What do you want to happen when the measurement is just a tad beyond the cutoff?

Furthermore if something speeds up, we won't know about it or tighten up the thresholds since the tests will merely keep passing.

What'd be better is to plot daily performance metrics over time. Keep the alert when a metric is worse than some threshold -- maybe an adaptive threshold like 5-day moving average -- but much of the value would come from glancing at the chart.

Bio-simulation metrics

Are the most useful metrics scalars like:
- doubling time,
- max elongation rate, and
- protein mass fraction averaged over a cell's life cycle?
Are they functions of time over the cell's life cycle?
Are they aggregates over multiple cells with different seeds and generations?
Is it helpful to declare pass/fail for each metric?
Are the reference thresholds:
- derived from the flat file input data?
- tuned manually from experience with model outputs (and occasionally adjusted after a test barely?
- adaptive thresholds from recent data?
Or do we really just need to see a day-by-day chart with reference guidelines or adaptive guidelines?
Should the output indicate how closely the metrics fit within expected bounds.
- Hard bounds?
- Aggregate fuzzy bounds, i.e. a weighted average of the metric fitness values?
Measure per-process metrics by running each process:
- in isolation?
- within some kind of reference framework?
- in the context of all the other processes and their interdependencies?

tahorst commented 4 years ago

Are the most useful metrics scalars like: doubling time, max elongation rate, and protein mass fraction averaged over a cell's life cycle? Are they functions of time over the cell's life cycle?

I'd say they are mostly scalars that should not change as a function of time but we could consider each time point - compare distribution to expected distribution (mean and stdev) or compare each point to min/max cutoff. Comparing values that change over time would be much more complicated with noise in our sims.

Are they aggregates over multiple cells with different seeds and generations?

Ideally it would consider several seeds and generations to get a better idea of data distributions since noise in only one generation could lead to erroneous comparisons.

Is it helpful to declare pass/fail for each metric? Are the reference thresholds: derived from the flat file input data? tuned manually from experience with model outputs (and occasionally adjusted after a test barely? adaptive thresholds from recent data?

I think it's helpful to declare a pass/fail for each one. To start, it is probably simpler to define them in a flat file until we get a better idea of how they change with updates to the model and how often the limits need to change (I don't think it should be that often but might depend how stringent our cutoffs are). These cutoffs should come from historical data and a script to generate them from a set of sims would be helpful. Adaptive thresholds seems like a nice idea but probably not necessary at first.

Or do we really just need to see a day-by-day chart with reference guidelines or adaptive guidelines? Should the output indicate how closely the metrics fit within expected bounds. Hard bounds? Aggregate fuzzy bounds, i.e. a weighted average of the metric fitness values?

Both would be nice to have but not critical at first.

Measure per-process metrics by running each process: in isolation? within some kind of reference framework? in the context of all the other processes and their interdependencies?

This could be useful as a quicker test instead of needing to run a large set of sims to see if the metrics pass. We could save a few reference simulation states and pass them to either all the processes or individual processes. This could be created as a separate issue - getting processes to run on their own and saving and loading a single timestep.

U8NWXD commented 4 years ago

A couple of questions about the proposed metrics:

Fluxome/proteome/transcriptome correlation

Does this mean the correlation among the fluxome, proteome, and transcriptome activity levels, or does it mean correlating those activity levels to previous runs?
The set of limiting metabolites

Which metabolites? In the GrowthLimits table I see pool, used, and allocation counts for GTP, amino acids, and NTPs. Are these what we're interested in?

tahorst commented 4 years ago

Fluxome/proteome/transcriptome correlation

Each of these datasets has a validation dataset so we can compare simulation averages for each flux/protein/transcript to measured values for each flux/protein/transcript. Comparing the simulation to measured values within each dataset can give us a correlation statistic (Pearson r or R^2 value) for each simulation run (or set of sims if we average over many generations). Analysis scripts calculate this correlation for us like for the fluxome or proteome.

The set of limiting metabolites

This is another one that is derived from data that we get out of sims and not directly saved. This script does the analysis that we were thinking of to get the set of metabolites. Basically, metabolites that are not produced for a certain number of timesteps are classified as limiting so we would want to do the same analysis and compare it to the metabolites that we know are often limiting at some point in our sims.

U8NWXD commented 4 years ago

@tahorst in the script, it looks like we are filtering for metabolites whose levels are constant within some window of time:

https://github.com/CovertLab/wcEcoli/blob/aa2cffe020b3fc9e146f2070e04aeedda9c1312c/models/ecoli/analysis/multigen/limitedMetabolites.py#L66

Wouldn't this identify metabolites that are limited, not those that are limiting?

Similarly, when you say that

metabolites that are not produced for a certain number of timesteps are classified as limiting

it seems to me that if a metabolite is not being produced (presumably due to a lack of the needed reactants), it is limited, but it is the missing reactant that is limiting. Am I understanding this terminology correctly?

tahorst commented 4 years ago

There isn't a real distinction between limiting and limited here. These metabolites are limiting growth because enzymes are not available to produce them so their production is limited. It could be the case that some metabolite upstream is also limited but we don't have too many metabolite concentrations so more often than not it is the enzyme that is limiting the production of metabolite. There are often multiple enzymes in pathways that could be 0 so looking at downstream metabolites that are limited in their doubling is a good way to capture the overall state of the metabolic network and find anything limiting growth.

U8NWXD commented 4 years ago

@tahorst for transcriptome correlation, what reference data set were you thinking of comparing against?

tahorst commented 4 years ago

It isn't a true validation reference set but you could make the same comparison as in this analysis. I think @heejochoi worked with other transcriptome data sets so she might have a reference to make another comparison to that isn't used in model construction.

CovertLab / wcEcoli

Tests for model output #675