CovertLab / vEcoli

Whole cell model of E. coli implemented with Vivarium
https://covertlab.github.io/vEcoli/
MIT License

Numpy conversion #194

Closed thalassemia closed 1 year ago

thalassemia commented 1 year ago

Changes

  1. Bulk molecules are now stored in a single structured array, with one column for molecule names, another for counts, and the rest for the various masses (e.g. protein, rRNA, etc.). Unique molecules each have their own structured array with a column for each attribute plus a metadata column called _entryState that marks whether a given row contains an active unique molecule.
  2. Since processes now connect to the whole bulk array, their port schemas no longer indicate which molecules each process uses. Worse, to access specific molecule counts, each process must look up and store the indices of those molecules within the structured array at its first timestep. Here is an example: https://github.com/CovertLab/vivarium-ecoli/blob/5e29fc7c777dea1b747ed9425319d379f4abd2fa/ecoli/processes/rna_degradation.py#L202-L217
  3. For efficiency, wcEcoli deletes molecules by marking their corresponding rows as inactive in the _entryState column. It can then add new molecules in those inactive rows instead of growing the array. One downside to this approach is that updates that modify unique molecule attributes must be applied before those that add or delete molecules. For example, if you first deleted a molecule and then added a new molecule in that row, a subsequent update trying to modify that row would have no idea that the molecule it was expecting at that position no longer exists, resulting in the improper modification of a completely unrelated molecule. wcEcoli gets around this by accumulating updates until all evolvers have run, then sorting them before applying them. To mimic this functionality, I made a hacky updater that is actually a method of a class, allowing me to accumulate updates using instance variables. https://github.com/CovertLab/vivarium-ecoli/blob/5e29fc7c777dea1b747ed9425319d379f4abd2fa/ecoli/library/schema.py#L188-L240 Then I added a Step called UniqueUpdate that runs after all the processes but before any other Steps. This Step signals to each unique molecule updater that it is time to apply all the accumulated updates and clear the instance variables. This gets complicated with cases like ChromosomeStructure, a Step that sends updates to unique molecule stores. For now, I’ve added some boilerplate in ecoli_master to automatically add a UniqueUpdate Step after every other Step in the composite (right now we technically only need one more after ChromosomeStructure, but I figured this solution is more futureproof and has minimal performance impact). https://github.com/CovertLab/vivarium-ecoli/blob/5e29fc7c777dea1b747ed9425319d379f4abd2fa/ecoli/composites/ecoli_master.py#L244-L250
  4. New dividers for unique molecules adapted from wcEcoli
  5. Migration tests all updated and simplified
  6. Reworked EngineProcess to get initial state for inner simulation by calling initial_state method of the Composite generated by calling generate on self.parameters['inner_composer']
  7. Offloaded division threshold calculations to CellDivision process. Fixes issue in colony simulations where threshold was calculated using the dry mass at ('listeners', 'dry_mass'), which is not 100% accurate immediately after division (before MassListener has run). This entailed adding a new DivisionDetector Step to the inner simulation which sets a flag telling EngineProcess it is time to divide.
  8. Separated the build_ecoli and run methods of EcoliSim. Users should now call build_ecoli before calling run (more flexibility to modify composite before running).
  9. Added Clock process for MassListener to calculate expected_mass_fold_change
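
To illustrate points 1 and 2, here is a minimal sketch of a bulk structured array and of a process that caches molecule indices on its first timestep. It is not the code in this PR: the molecule names, the mass value, and the names `name_to_idx`, `ToyDegradation`, and `apply_bulk_update` are all hypothetical, and the update format is an assumption made for the example.

```python
import numpy as np

# Hypothetical bulk array: one column for names, one for counts,
# and one (of possibly several) mass columns. Values are made up.
bulk = np.array(
    [("WATER", 1000, 0.0), ("ATP", 500, 0.0), ("RNASE_E", 20, 1.0e-4)],
    dtype=[("id", "U20"), ("count", np.int64), ("protein_mass", np.float64)],
)


def name_to_idx(names, bulk_ids):
    """Return the row index of each name, in the order the names are given."""
    lookup = {str(name): i for i, name in enumerate(bulk_ids)}
    return np.array([lookup[name] for name in names], dtype=np.int64)


class ToyDegradation:
    """Toy process that removes at most one copy of each molecule it tracks."""

    def __init__(self, molecule_names):
        self.molecule_names = molecule_names
        self.idx = None  # filled in on the first timestep

    def next_update(self, bulk):
        # The name lookup is paid once; later timesteps reuse the indices.
        if self.idx is None:
            self.idx = name_to_idx(self.molecule_names, bulk["id"])
        counts = bulk["count"][self.idx]
        return {"bulk": [(self.idx, -np.minimum(counts, 1))]}


def apply_bulk_update(bulk, update):
    """Apply (indices, deltas) pairs in place; assumes indices are unique."""
    for idx, delta in update["bulk"]:
        bulk["count"][idx] += delta


process = ToyDegradation(["RNASE_E", "ATP"])
update = process.next_update(bulk)
apply_bulk_update(bulk, update)
```

The trade-off described in point 2 is visible here: the process only sees the whole array, so the molecules it depends on are known from `molecule_names` rather than from its port schema.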

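The ordering problem in point 3 can be sketched as follows. This is a simplified illustration under assumed names, not the actual updater from this PR: the class `DeferredUniqueUpdater`, the update keys `set`, `delete`, and `add`, and the `coordinate` attribute are hypothetical. Updates are accumulated and then applied in a fixed order (attribute modifications, then deletions, then additions), so a modification can never land on a row that has been reused.

```python
import numpy as np


class DeferredUniqueUpdater:
    """Accumulates unique molecule updates and applies them in a safe order."""

    def __init__(self):
        self.pending = []

    def accumulate(self, update):
        self.pending.append(update)

    def flush(self, molecules):
        # 1. Attribute modifications go first, while row indices still
        #    refer to the molecules the processes saw.
        for update in self.pending:
            for attr, (rows, values) in update.get("set", {}).items():
                molecules[attr][rows] = values
        # 2. Deletions only mark rows as inactive.
        for update in self.pending:
            if "delete" in update:
                molecules["_entryState"][update["delete"]] = 0
        # 3. Additions reuse inactive rows, growing the array only if needed.
        for update in self.pending:
            for new_molecule in update.get("add", []):
                free = np.flatnonzero(molecules["_entryState"] == 0)
                if free.size == 0:
                    extra = np.zeros(1, dtype=molecules.dtype)
                    molecules = np.concatenate([molecules, extra])
                    row = len(molecules) - 1
                else:
                    row = free[0]
                for attr, value in new_molecule.items():
                    molecules[attr][row] = value
                molecules["_entryState"][row] = 1
        self.pending.clear()
        return molecules


molecules = np.array(
    [(1, 10), (1, 20)],
    dtype=[("_entryState", np.int8), ("coordinate", np.int64)],
)
updater = DeferredUniqueUpdater()
# One process deletes row 1 and adds a new molecule ...
updater.accumulate({"delete": [1], "add": [{"coordinate": 99}]})
# ... and a later process modifies the molecule it still expects in row 1.
updater.accumulate({"set": {"coordinate": ([1], [25])}})
result = updater.flush(molecules)
```

If the two updates were applied in arrival order instead, the second one would overwrite the coordinate of the newly added molecule in row 1.
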
Results

Runtime is down to <10 minutes per cell cycle, and simulation results perfectly match wcEcoli branch commit c4261d97 (compare to #174). Notably, while this matches the runtime of the reference wcEcoli commit, vivarium-ecoli becomes noticeably (~35%) faster when simulation results are stored in RAM instead of being emitted to MongoDB on disk at every timestep. Further optimization may be possible by using bulk insert operations (insert_many instead of insert_one).
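
As a rough sketch of the bulk insert idea (an assumption about how it could look, not something implemented in this PR), emits could be buffered and written in batches. `BufferedEmitter` and `InMemoryCollection` are hypothetical names; the stand-in collection exists only so the example runs without a database, and it mirrors the `insert_many` method name of a pymongo collection.

```python
class BufferedEmitter:
    """Collects documents and writes them in batches via insert_many."""

    def __init__(self, collection, batch_size=100):
        self.collection = collection
        self.batch_size = batch_size
        self.buffer = []

    def emit(self, document):
        self.buffer.append(document)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write whatever is buffered; call once more at the end of a run.
        if self.buffer:
            self.collection.insert_many(self.buffer)
            self.buffer = []


class InMemoryCollection:
    """Stand-in for a database collection that counts write calls."""

    def __init__(self):
        self.documents = []
        self.calls = 0

    def insert_many(self, documents):
        self.documents.extend(documents)
        self.calls += 1


collection = InMemoryCollection()
emitter = BufferedEmitter(collection, batch_size=100)
for timestep in range(250):
    emitter.emit({"time": timestep})
emitter.flush()
```

Here 250 emits result in 3 write calls instead of 250. The cost is that buffered timesteps are lost if the simulation crashes before a flush.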

Todo

eagmon commented 1 year ago

This is a fantastic optimization! I'm kind of sad we lose the ability to use keys into the state to retrieve values, but it is definitely worth this runtime improvement. Numpy everywhere! Regarding points 2 and 3, I think the performance gains make this worth it, so I am in support. Please comment the reasoning behind that UniqueNumpyUpdater very well so future readers can follow and understand why this is required.