CovertLab / arrow

Stochastic simulations in python
MIT License

Negative counts from StochasticSystem #39

Open tahorst opened 5 years ago

tahorst commented 5 years ago

I've noticed that under certain conditions, StochasticSystem.evolve() can produce negative counts, which should not be possible if I'm not mistaken. I've pushed a branch (neg-counts) with a script to reproduce this behavior - running python neg_counts.py should have the system evolve until negative counts are reached. There are also very few reactions occurring even over an extremely long time, which might be a related issue.

jmason42 commented 5 years ago

I haven't run the system but I did inspect the data file - guessing this is adapted from the complexation stoichiometry. Barring an implementation issue, I'd guess we're looking at a numerical problem, probably downstream of the fact that there are very large numbers of reactants in some reactions - for example, there's one where there are 130 reactants, 120 of which are the same species. Depending on the implementation, "n choose 120" could easily overflow (e.g. if you try to compute 120 factorial). Just a guess at this point, I'll try to take a look. Thanks for the example code.


Going to edit with findings until I have a better idea of what's happening. The first clue is that the 'time' is NaN immediately following the first step. At minimum, that should be caught and an error thrown early.

Additionally, the only reaction that occurs is the first reaction, over and over until the system fails.

Hmm, 'choose' seems OK. It spits out what look like reasonable numbers. They're strictly positive, which I think is because anything that would be zero gets filtered before the call to 'choose'.

Yes: many of these reaction propensities are extreme. E.g. at the final step, the propensity for reaction 84 is 1.59e91 reactions per second. That's not even the largest, and there is at least one reaction with a NaN propensity. These problems exist from the first simulated step.
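
To get a feel for the scale involved, here is a minimal sketch (not arrow's actual propensity code, with made-up counts and a made-up rate constant) of how a propensity of the form k * C(n1, s1) * C(n2, s2) * ... behaves when one reactant appears ~120 times in a reaction:

```python
# A minimal sketch, not arrow's propensity code: the combination count for a
# reaction that consumes 120 copies of an abundant species is finite as an exact
# integer but overflows float64, and once a propensity hits inf, a NaN is only
# one subtraction or division away. Count and rate constant are made-up values.
import math

count = 1_000_000   # hypothetical copy number of the repeated reactant
copies = 120        # stoichiometric coefficient for that reactant
rate = 10.0         # hypothetical stochastic rate constant

exact = math.comb(count, copies)   # exact integer: finite, hundreds of digits long
print(len(str(exact)))

approx = 1.0
for i in range(copies):            # the same product evaluated in float64
    approx *= (count - i) / (i + 1)
print(approx)                      # inf -- float64 tops out near 1.8e308

propensity = rate * approx
print(propensity, propensity - propensity)   # inf, and inf - inf gives nan
```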

jmason42 commented 5 years ago

There are several issues coming together at once here:

1. Extreme numbers of reactants participating in reactions (several types, several copies of a single type).
2. Large numbers of molecules.
3. A single, large stochastic rate constant being utilized for every reaction.

So, effectively, you have several big numbers (sometimes very big) multiplied together, resulting in some NaNs. The exact consequence of these NaNs isn't really worth tracking down but you can imagine how NaN propagation throws a wrench into the Gillespie SSA update procedure.
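
As a concrete picture of that wrench, here is a toy direct-method SSA update (not the obsidian implementation; the propensity values are fabricated to mirror the numbers above):

```python
# A toy direct-method Gillespie step, not the obsidian implementation, showing
# how one inf/NaN propensity corrupts both the time increment and the reaction
# selection. The propensity values below are fabricated.
import numpy as np

rng = np.random.default_rng(0)
propensities = np.array([2.0, 1.59e91, np.inf, np.nan])

total = propensities.sum()                    # nan; even without the nan entry it would be inf
dt = -np.log(rng.random()) / total            # nan time step -> the simulated 'time' becomes NaN
cumulative = np.cumsum(propensities) / total  # nan entries throughout
choice = np.searchsorted(cumulative, rng.random())
print(dt, choice)  # NaN comparisons can make the search silently land on index 0
```

That failure mode is at least consistent with the NaN 'time' and with the same first reaction firing over and over, though the C implementation's selection logic may differ.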

The remediation isn't clear. Obviously this package needs some error checking, but this is more of a garbage-in, garbage-out issue. Ideally I'd recommend the following:

- Disable the complexation reactions that have no impact on the model.
- Expand all (n>2)-meric complexation reactions into dimeric reactions. This could and should be done computationally.

I don't see any other approach that wouldn't be inferior to simply reverting complexation to my old procedure.

prismofeverything commented 5 years ago

Interesting, so we have found the practical bounds of this algorithm. Thanks for the analysis @jmason42!

Having the code detect these conditions is desirable and can be added without too much trouble, but as @jmason42 points out, it does not solve the underlying problem. Reducing the number of simultaneous reactants per reaction would do it ("dimerization" of the stoichiometry), and the stoichiometry could be decomposed internally to this library so that the decomposition is transparent to the end user. That would extend the effective bounds within which this algorithm is applicable to a problem, though by how much would need to be tested.

prismofeverything commented 5 years ago

> Expand all (n>2)-meric complexation reactions into dimeric reactions. This could and should be done computationally.

@jmason42 Did you have an implementation in mind here? The main choices I see are either a tree form where you split an aggregate reaction into successive "halves" and complexation works through recursive pairing, or the simple linear approach where you track each N-mer and apply one monomer at a time. The tree would keep the assembly depth logarithmic if this needs to support reactions with 1000s of simultaneous reactants. Either way it could all be done in Python before passing the stoichiometry into the obsidian layer, and it would have to reassemble the original shape of the counts vector once it gets its results back from evolve.
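
For the tree form, a rough sketch (placeholder names and (reactant, reactant, product) tuples, not arrow's stoichiometry format) of recursively splitting the subunit list into halves:

```python
# A sketch of the tree option: recursively split the subunit list in half so
# assembly proceeds by pairing progressively larger sub-complexes. Names and
# reaction tuples are placeholders; depth grows as log2(n) in the subunit count.
def tree_decomposition(subunits):
    """Return (complex_name, reactions) for a balanced pairwise assembly."""
    if len(subunits) == 1:
        return subunits[0], []
    mid = len(subunits) // 2
    left, left_reactions = tree_decomposition(subunits[:mid])
    right, right_reactions = tree_decomposition(subunits[mid:])
    product = f'({left}.{right})'  # intermediate named after its two halves
    return product, left_reactions + right_reactions + [(left, right, product)]

complex_name, reactions = tree_decomposition(list('ABCDEFGH'))
for left, right, product in reactions:
    print(f'{left} + {right} -> {product}')
# e.g. A + B -> (A.B), C + D -> (C.D), (A.B) + (C.D) -> ((A.B).(C.D)), ...
```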

jmason42 commented 5 years ago

Yes, I see this as an issue external to this library. My basic idea would be to totally reformulate the system such that the single-monomer-addition intermediates are real species, tracked by the simulation. This is fine in the small cases, e.g. A + B + C -> ABC would require introducing and tracking the AB, BC, and AC species (and replacing the modeled reaction with A + B -> AB, B + C -> BC, A + C -> AC, AB + C -> ABC, BC + A -> ABC, AC + B -> ABC).

Easy to expand this computationally, but the number of possibilities grows catastrophically in many cases. I think a 10-mer is probably around the biggest reaction you'd be comfortable decomposing, hence my comment about disabling all the complexation that has no impact in the model. This still leaves important things like ribosomes, which have dozens of subunits.
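
To put numbers on "grows catastrophically", here is a small sketch (placeholder subunit names, distinct subunits only, not tied to arrow's data structures) that enumerates every partial complex and every single-monomer-addition reaction for an n-mer:

```python
# A sketch counting the all-pathways decomposition: every proper sub-complex
# becomes a tracked species and every (sub-complex + monomer) addition becomes
# a reaction. Subunits are assumed distinct; repeated copies would be worse.
from itertools import combinations

def full_expansion(subunits):
    """Return (intermediates, reactions) for the all-pathways decomposition."""
    n = len(subunits)
    intermediates = [frozenset(c)
                     for size in range(2, n)
                     for c in combinations(subunits, size)]
    reactions = set()
    for size in range(1, n):
        for partial in combinations(subunits, size):
            for monomer in set(subunits) - set(partial):
                reactants = frozenset([frozenset(partial), frozenset([monomer])])
                reactions.add((reactants, frozenset(partial) | {monomer}))
    return intermediates, reactions

for n in (3, 10):
    intermediates, reactions = full_expansion([f'S{i}' for i in range(n)])
    print(n, len(intermediates), len(reactions))
# 3  ->    3 intermediate species,    6 reactions (matches the A + B + C example above)
# 10 -> 1012 intermediate species, 5065 reactions
```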

So, running with the ribosome idea - your first option is to model their formation somewhere else, using something like the old complexation procedure. Reasonable if it's not likely to be important for the rest of the model, though unsatisfying because you start having several different realizations of similar physical processes.

Let's say that's not acceptable, and furthermore, the explicit representation of all different subunit combinations is computationally infeasible. At this point you can apply some physics/biology, and start culling down the number of possible assembly pathways to a handful. After all, it's not easy for a bunch of protein/RNA blobs to all be in contact with each other at once. This would reduce the problem considerably but I think there are some profound data issues here - @mialydefelice would know better. Even if the data is available it would be quite a chore to implement all the particular reactions.

Stepping back, you could keep all the possible assembly pathways, but rethink the underlying representation. Maybe there's a new "partial complex" unique molecule that could be used to represent those non-functional intermediates - this way you don't need a new entry in bulk molecules for all combinatorial possibilities. Furthermore you could consider collapsing reactions; e.g. maybe you believe that AB + C -> ABC and B + C -> BC are equivalent in terms of kinetics, so there's just a generic B + C -> BC reaction you need to update. (This is a similar idea to what we talked about with @heejochoi and the stochastic modeling of polypeptide elongation.)

So, those are some approaches. I don't really have a clear idea of what approach would be best, but practically speaking it would be easiest to pick one strict assembly pathway (even if the biological evidence is weak to nonexistent). That would keep the number of intermediate species linear w.r.t. the number of subunits. More explicit, detailed modeling can come later. The one major downside I see to this is that going from A to AB, to ABC, to ABCD, etc. would be slower than going from {A, C} to {AB, CD}, to ABCD - if you follow. Again, troublesome for the larger cases, but without more data on what those actually are I can't recommend a single appropriate course of action. E.g. there's no sense wracking our brains trying to accommodate special cases like flagella formation.
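
A sketch of that single strict assembly pathway (placeholder names, not arrow's input format), where both the intermediates and the reactions stay linear in the number of subunits:

```python
# A sketch of one fixed assembly order, A -> AB -> ABC -> ABCD: a linear number
# of intermediate species and reactions, at the cost of serial assembly.
def strict_pathway(subunits):
    """Return the chain of single-monomer-addition reactions for one fixed order."""
    reactions = []
    partial = subunits[0]
    for monomer in subunits[1:]:
        product = partial + monomer     # e.g. 'AB' + 'C' -> 'ABC'
        reactions.append((partial, monomer, product))
        partial = product
    return reactions

for reactant, monomer, product in strict_pathway(['A', 'B', 'C', 'D']):
    print(f'{reactant} + {monomer} -> {product}')
# A + B -> AB
# AB + C -> ABC
# ABC + D -> ABCD
```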

That was a big pile of information - in summary:

- Model large-complex formation (e.g. ribosomes) somewhere else, with something like the old complexation procedure.
- Explicitly represent every possible intermediate - infeasible beyond roughly 10-mers.
- Cull the assembly pathways using physics/biology, possibly with a new "partial complex" representation - data and implementation effort are the bottlenecks.
- Pick one strict assembly pathway per complex (my practical suggestion), keeping intermediates linear in the number of subunits.


One more thing I neglected to mention: if you are modeling intermediates, you probably need to model reverse reactions, too. Otherwise you'll risk silly things like an accumulation of AB and BC that can't be taken apart and reassembled to make the terminal ABC complex. We usually think of intermediates as being unstable/highly dynamic, so in most cases, reversibility makes a lot of sense.
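
A toy illustration of the dead end that reversibility avoids (made-up counts and a trivial reaction-firing helper, nothing from this library):

```python
# With counts A=1, B=2, C=1, the irreversible pathway can consume everything
# into AB and BC, leaving no way to finish ABC without reverse reactions.
counts = {'A': 1, 'B': 2, 'C': 1, 'AB': 0, 'BC': 0, 'ABC': 0}

def fire(counts, reactants, products):
    # Apply one reaction occurrence if every reactant is available.
    if all(counts[r] > 0 for r in reactants):
        for r in reactants:
            counts[r] -= 1
        for p in products:
            counts[p] += 1

fire(counts, ['A', 'B'], ['AB'])   # A + B -> AB
fire(counts, ['B', 'C'], ['BC'])   # B + C -> BC
print(counts)   # A, B, C all 0; AB and BC stranded at 1; ABC never formed
# Without AB -> A + B or BC -> B + C, no remaining reaction can produce ABC.
```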