Time-step reversion (for unique molecules)

jmason42 commented 6 years ago

Towards the end of dynamic time steps, it will be necessary to detect when simulation steps have failed (e.g. does not meet some minimal accuracy criterion), and if so, revert the state of the system back to the previous step. A few things make this challenging:

This is a major paradigm shift - the simulation has always evaluated forward.
As a time-step progresses, various calculations are 'locked in'. In particular, we would need to worry about synchronizing I/O and associated features such that we can cleanly 'revert' the system.
Our biggest immediate concern: While the BulkMolecules state does not finalize a step until its .merge method is called, the underlying UniqueMolecules object is directly modified by processes.

The latter was done by myself as a lazy optimization - if processes directly modified their allocated portion of the state, we wouldn't need to worry about the overhead associated with updating and maintaining the very large UniqueObjectsContainer object. However, reversion requires that we have some way to recover the old state as needed. I'd appreciate @1fish2's ideas on how to approach this, but I have three of my own:

Keep two UniqueObjectsContainers in memory. One will be the previous state, and the other will be the (to be) modified state. If the step is accepted, we copy the modified state into the previous state, and vice-versa if the step is rejected. This is a great short-term solution, but I could see this having a lot of overhead (memory and CPU time). If we're careful we may be able to write one container into another's memory, which could be more efficient (less garbage collection from discarding old containers).
Only maintain one UniqueObjectsContainer objects, but track updates as little 'diffs'. These 'diffs' can then be undone as needed. I like this idea because it keeps the usual, desired behavior (accept and apply) fast, at the cost of the hopefully exceptional behavior (reject and undo) being the slower, harder operation.
Instead of maintaining the states in memory for reversion, just read and recover from disk when needed. This could be very slow. Also, we don't currently write the UniqueObjectsContainer to disk because it's so large (and not useful for analysis).

1fish2 commented 6 years ago

How UniqueObjectsContainer works and its usage pattern are not obvious to me. It looks vaguely like a Python-implemented dict of ndarrays. (If it's frequently accessed and can't do the heavy lifting via built-in types, it could be worth writing a Cython extension type.) We can discuss it before Thursday's WC team meeting.

Guessing on that part, my going-in thought is the approach that keeps previous + next instances of UniqueObjectsContainer (or of its guts) in memory: snapshot the current state, modify one of the two, then accept one by making a fresh copy of the other or else by copying in the other's contents, whichever turns out to be faster. This isolates the revert feature in code that's easy to test and could be implemented with an optimized loop.

If the ndarrays get replaced rather than modified, the collection could be snapshotted via a fast shallow copy like a_dict.copy().

Tracking diffs is viable but would be more complicated. That's OK if all the changes are done via a small number of methods, and it could save a lot of memory, but if code outside this class has to help with diff-tracking then it'd be a "leaky abstraction," hard to get right, and hard to maintain. If memory is a problem then I'd first look at making the collection more compact, e.g. _entryState is an np.int64 but maybe only holds a boolean?

jmason42 commented 6 years ago

You're correct, it's a dictionary of ndarrays. More specifically, the ndarrays are 'structured' arrays, and the arrays themselves are resized to accommodate more elements (I don't remember the exact pattern, but it tries to expand in blocks rather than an element at a time so that it's not constantly creating huge new arrays). I'm not sure how much value Cython would add, as the frequently used operations should all be vectorized.

I think you're right about the 'diff' approach, I could see it breaking down if the interface isn't really tight. The operations should be pretty limited and manageable but it does mean checking that no part of the code is sneaking into the class's private attributes. Copying will probably be the way to go.

prismofeverything commented 5 years ago

I have a plan for this that involves encoding updates as "messages" (much like your "diffs" above), then the list of messages can be processed to resolve any conflicts and finally applied once they pass through the resolver. Resolution can be a series of independent custom "filters" that basically take the list of updates and emit a new list of (possibly) altered updates based on each resolver's own internal logic.

Once this is complete then we will be able to access the container in "read-only" mode from any process and we can shed the cumbersome request structure and avoid ever mutating the unique objects in place.

jmason42 commented 5 years ago

Hopefully the resolutions can be largely generic rather than case specific. Trying to think of some useful test cases, here's what I got:

Two processes update the mass of some unique molecule (e.g. it undergoes two independent modifications). The resolution would be to simply add the masses together.
Two processes both flip a boolean attribute from False to True. Depending on what that flag means, this could be an error or just fall through. E.g. if the flag represents a modification, that would be an error (double counting modification reactions). If the flag represents some sort of abstract information ('time to divide'), it could fall through.
Two processes move a molecule along DNA/RNA (e.g. a ribosome updates its progress). The resolution would be to throw an error, since otherwise we'd be double-counting some reactions and miscounting (discounting?) others.
One process modifies a molecule, while another process destroys it. I think this is the hardest case to resolve.
- If the modification doesn't modify the structure/mass of the molecule in any fashion, then the deletion can simply happen.
- Otherwise, resolution is hard. You would probably need a custom solution (reaching back into multiple processes), which is unsatisfying. I can think of other resolution modes but none that fit within the current framework (e.g. would require a post-evolve process step, or asynchronous updates), other than defaulting back to partitioning for the deletion cases (e.g. a process decides that it will delete some molecule, thus earmarking it and not allowing any other process to modify it).

Minor note: if you're thinking about applying updates as resolved diffs, you could also consider saving those diffs instead of saving the container each time step. Presumably it would use less disk space (so long as most unique molecules live for at least a few time steps).

prismofeverything commented 5 years ago

@jmason42 Great considerations, thank you. Yes I went through a similar exercise in my mind and most conflicts resolve easily. In your case with simultaneous alteration and deletion, I assert we either find a copacetic resolution (which I think is possible in this case, we would have to explore the details) or in the extreme case where it can't be resolved we discard all the updates and trigger the recomputation of a shorter time step (as described in https://github.com/CovertLab/wcEcoli/issues/401). This is what we would do in the case of negative molecule counts in bulk molecules which is basically the same situation: a state that cannot be resolved given the current set of updates.

jmason42 commented 5 years ago

It's a bit harder than making the time step shorter, due to the biased dice re-rolling metaphor we discussed a few days back. But at its heart, the idea is correct. Ideally it would work something like this: 1) The two processes calculate their updates for the full time step.

Process A: simulate the molecule's update for 100% of the time step.
Process B: 70% of the way through time step, delete the molecule. 2) At merge, the conflict is detected. 3) Process A is recomputed for only 70% of the time step, exactly up to the time of deletion. 4) Then Process B performs its deletion.

This requires a little more information (when did the deletion occur?) and some infrastructure that we don't currently yet have.

prismofeverything commented 5 years ago

Just verified this function never gets called (at least under normal conditions for a single generation), which simplifies things a bit: https://github.com/CovertLab/wcEcoli/blob/master/wholecell/containers/unique_objects_container.py#L797

CovertLab / wcEcoli

Time-step reversion (for unique molecules) #151