CovertLab / wcEcoli

Whole Cell Model of E. coli

Optional Features Failure #829

Open tahorst opened 4 years ago

tahorst commented 4 years ago

@prismofeverything mind investigating why arrow failed?

Git hash: 9e9f3bb066

Command:

DESC="Causality Network" BUILD_CAUSALITY_NETWORK=1 N_GENS=2 SEED=6691 \
  PARALLEL_PARCA=1 SINGLE_DAUGHTERS=1 COMPRESS_OUTPUT=1 RAISE_ON_TIME_LIMIT=1 \
  WC_ANALYZE_FAST=1 \
  python runscripts/fireworks/fw_queue.py

Trace:


 3848.91    470.53        1.443        1.445        1.459        1.431     1.462
Traceback (most recent call last):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket.py", line 262, in run
    m_action = t.run_task(my_spec)
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/fireworks/firetasks/simulationDaughter.py", line 78, in run_task
    sim.run()
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/sim/simulation.py", line 235, in run
    self.run_incremental(self._lengthSec + self.initialTime())
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/sim/simulation.py", line 267, in run_incremental
    self._evolveState(processes)
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/sim/simulation.py", line 326, in _evolveState
    process.calculateRequest()
  File "/scratch/groups/mcovert/jenkins/workspace/models/ecoli/processes/complexation.py", line 59, in calculateRequest
    result = self.system.evolve(self._sim.timeStepSec(), moleculeCounts)
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/arrow/arrow.py", line 134, in evolve
    steps, time, events, outcome = self.obsidian.evolve(duration, state)
TypeError: 'NoneType' object is not iterable
arrow.obsidian.evolve - failed to allocate memory: 12
Simulation finished:
 - Sim length: 0:24:07
 - Sim end time: 1:04:10
 - Runtime: 0:49:42

prismofeverything commented 4 years ago

Yeah, it failed to allocate memory, which means the Python process hit some kind of memory limit. Was this on sherlock? Do you know what the memory limit for processes is there?

Is this an intermittent issue, or does it happen the same way each time?

tahorst commented 4 years ago

> Was this on sherlock?

Yes, the Jenkins builds are on Sherlock.

> Do you know what the memory limit for processes is there?

It's currently 48 GB shared between all running jobs.

> Is this an intermittent issue, or does it happen the same way each time?

This is the first failure I've seen, so I was hoping you could investigate, confirm whether it's reproducible, and identify the problem.

prismofeverything commented 4 years ago

Sure, I'll run it outside of Sherlock and see if it fails in the same way. Based on the error, though, this is a memory failure, and arrow just happened to be the one trying to allocate memory when the process hit the limit. If the 48 GB are shared, I expect this error to be transient. If it is instead hitting a separate per-process limit, it should fail at the same place each time until the limit is raised.
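One way to look for a per-process limit is to log the process's rlimits at startup. The following is a minimal sketch using the standard `resource` module (Unix only). `process_memory_limits` is a hypothetical helper, not part of wcEcoli, and a cluster scheduler may enforce memory limits through other mechanisms (for example cgroups) that this check would not show.

```python
import resource


def process_memory_limits():
    """Return the (soft, hard) rlimits related to memory for this process.

    A value equal to resource.RLIM_INFINITY means no limit is set at the
    rlimit level. Hypothetical helper for illustration only.
    """
    names = ("RLIMIT_AS", "RLIMIT_DATA")
    limits = {}
    for name in names:
        # getrlimit returns a (soft, hard) tuple for the given resource.
        soft, hard = resource.getrlimit(getattr(resource, name))
        limits[name] = (soft, hard)
    return limits


if __name__ == "__main__":
    for name, (soft, hard) in sorted(process_memory_limits().items()):
        print("%s: soft=%s hard=%s" % (name, soft, hard))
```

If both limits come back unlimited, the failure is more likely due to the shared allocation being exhausted than to a fixed per-process cap, though that is only a hint.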

1fish2 commented 4 years ago

It's terrific that the native code handled the allocation error gracefully and got the right error information out! That makes it easy to look into getting more memory for the process or optimizing the code to reduce memory usage.

Two details would help further: raise a MemoryError rather than a TypeError, and, at the base of the stack, catch any MemoryError and print additional memory stats such as the total amount of process memory in use.
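A minimal sketch of both suggestions, assuming the native call returns None on allocation failure (which is what the TypeError in the trace suggests). `checked_evolve`, `report_memory`, and `run_with_memory_report` are hypothetical names, not part of arrow or wcEcoli.

```python
import resource
import sys


def checked_evolve(native_evolve, duration, state):
    """Call a native evolve function; raise MemoryError if it returns None.

    Assumption: the native call signals allocation failure by returning
    None, which otherwise surfaces as a TypeError when unpacking.
    """
    result = native_evolve(duration, state)
    if result is None:
        raise MemoryError("native evolve failed to allocate memory")
    steps, time, events, outcome = result
    return steps, time, events, outcome


def report_memory(stream=sys.stderr):
    """Write the peak resident set size of this process (Unix only)."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    stream.write("peak RSS (ru_maxrss): %d\n" % usage.ru_maxrss)
    return usage.ru_maxrss


def run_with_memory_report(task):
    """Run task() at the base of the stack; report memory on MemoryError."""
    try:
        return task()
    except MemoryError:
        report_memory()
        raise
```

The wrapper re-raises after reporting, so the task still fails visibly and the original traceback is preserved.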

prismofeverything commented 4 years ago

Status on this: it is a similar failure to https://github.com/CovertLab/arrow/issues/39. The repeated multiplication of large numbers is overflowing the 64-bit floating-point range. This happens when a reaction has a large number of simultaneous elements in its stoichiometry and also has large counts. A few possible improvements have come up in conversations with @tahorst:

This is happening on generation 2 of seed 6691, if anyone wants to reproduce it. As the code changes, this condition may no longer trigger. We haven't seen this with any other seed/generation combinations so far, but they are far from exhaustively tested.
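To illustrate the kind of overflow described in this comment, here is a minimal sketch. It is not arrow's actual code. It assumes a propensity-like quantity formed as a rate times a product of binomial coefficients `C(count, s)` over a reaction's reactants. The function names and the log-space variant are illustrative assumptions, not one of the improvements discussed in this issue.

```python
import math


def naive_propensity(rate, counts, stoichiometry):
    """Multiply rate by C(count, s) for each reactant using 64-bit floats."""
    value = float(rate)
    for count, s in zip(counts, stoichiometry):
        # Build C(count, s) by repeated multiplication and division.
        for i in range(s):
            value *= float(count - i) / (i + 1)
    return value


def log_propensity(rate, counts, stoichiometry):
    """Log of the same quantity, so large products do not overflow.

    Requires rate > 0 and count >= s for every reactant.
    """
    total = math.log(rate)
    for count, s in zip(counts, stoichiometry):
        # log C(count, s) expressed with log-gamma terms.
        total += (math.lgamma(count + 1) - math.lgamma(s + 1)
                  - math.lgamma(count - s + 1))
    return total


if __name__ == "__main__":
    counts = [10 ** 6] * 60   # many reactants, each with a large count
    stoich = [1] * 60
    print(naive_propensity(1.0, counts, stoich))  # overflows to inf
    print(log_propensity(1.0, counts, stoich))    # stays finite
```

With 60 reactants at a count of 10**6 each, the naive product exceeds the largest 64-bit float and becomes `inf`, while the log-space value remains finite. Whether a log-space formulation fits the stochastic simulation in arrow is a separate design question.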