CovertLab / wcEcoli

Whole Cell Model of E. coli
Other
18 stars 4 forks source link

sim seed 28: GLP_INFEAS: infeasible #635

Open 1fish2 opened 5 years ago

1fish2 commented 5 years ago
python runscripts/manual/runSim.py -s 28

prints

{'Arguments': {'d_period_division': False,
               'generations': 1,
               'growth_rate_noise': False,
               'init_sims': 1,
               'length_sec': 10800,
               'mass_distribution': True,
               'require_variants': False,
               'seed': 28,
               'sim_dir': None,
               'sim_path': '/Users/jerry/dev/wcEcoli/out/manual',
               'timeline': '0 minimal',
               'timestep_max': 0.9,
               'timestep_safety_frac': 1.3,
               'timestep_update_freq': 5,
               'total_gens': 1,
               'translation_supply': True,
               'trna_charging': False,
               'variable_elongation_transcription': False,
               'variable_elongation_translation': False,
               'variant': ['wildtype', '0', '0'],
               'verbose': False}}
Fri Aug  9 20:18:07 2019: Creating sim_data for Variant: wildtype, Index: 0
Variant short name: wildtype
Fri Aug  9 20:18:09 2019: Running simulation
Time (s)  Dry mass     Dry mass      Protein          RNA    Small mol     Expected
              (fg)  fold change  fold change  fold change  fold change  fold change
========  ========  ===========  ===========  ===========  ===========  ===========
    0.00    358.11        1.000        1.000        1.000        1.000        1.000
Warning - converting 'rnaIds' attribute from ndarray to list for JSON serialization.
Warning - converting 'rnaIds' attribute from ndarray to list for JSON serialization.
Warning - converting 'rnaIds' attribute from ndarray to list for JSON serialization.
Warning - converting 'uncharged_trna_ids' attribute from ndarray to list for JSON serialization.
    0.20    358.19        1.000        1.000        1.000        1.000        1.000
    0.40    358.25        1.000        1.000        1.000        1.001        1.000
    0.60    358.29        1.001        1.000        1.000        1.001        1.000
    0.80    358.34        1.001        1.000        1.000        1.001        1.000
    1.00    358.52        1.001        1.000        1.000        1.002        1.000
    1.89    358.57        1.001        1.001        1.001        1.002        1.000
    2.79    358.82        1.002        1.001        1.001        1.004        1.001
    3.68    358.93        1.002        1.001        1.001        1.004        1.001
    4.57    359.10        1.003        1.001        1.001        1.005        1.001
    5.46    359.23        1.003        1.002        1.001        1.005        1.001
    6.36    359.45        1.004        1.002        1.001        1.007        1.002
    7.25    359.50        1.004        1.002        1.002        1.007        1.002
    8.14    359.73        1.005        1.002        1.002        1.008        1.002
    9.04    359.79        1.005        1.002        1.002        1.008        1.002
    9.93    360.01        1.005        1.003        1.002        1.009        1.003
Warning: numerical instability (primal simplex, phase I)
Warning: numerical instability (primal simplex, phase II)    <-- 4860 times
...
Warning: could not find solution with primal method, switching to dual

Simulation finished:
 - Sim length: 0:00:11
 - Sim end time: 0:00:11
 - Runtime: 0:00:15

Traceback (most recent call last):
  File "runscripts/manual/runSim.py", line 278, in <module>
    script.cli()
  File "/Users/jerry/dev/wcEcoli/wholecell/utils/scriptBase.py", line 295, in cli
    self.run(args)
  File "runscripts/manual/runSim.py", line 273, in run
    task.run_task({})
  File "/Users/jerry/dev/wcEcoli/wholecell/fireworks/firetasks/simulation.py", line 49, in run_task
    sim.run()
  File "/Users/jerry/dev/wcEcoli/wholecell/sim/simulation.py", line 224, in run
    self.run_incremental(self._lengthSec + self.initialTime())
  File "/Users/jerry/dev/wcEcoli/wholecell/sim/simulation.py", line 247, in run_incremental
    self._evolveState()
  File "/Users/jerry/dev/wcEcoli/wholecell/sim/simulation.py", line 311, in _evolveState
    process.evolveState()
  File "/Users/jerry/dev/wcEcoli/models/ecoli/processes/metabolism.py", line 355, in evolveState
    deltaMetabolites = (1 / countsToMolar) * (CONC_UNITS * self.fba.getOutputMoleculeLevelsChange())
  File "/Users/jerry/dev/wcEcoli/wholecell/utils/modular_fba.py", line 1101, in getOutputMoleculeLevelsChange
    flowRates = self._solver.getFlowRates(stoich.viewkeys())
  File "/Users/jerry/dev/wcEcoli/wholecell/utils/_netflow/nf_glpk.py", line 334, in getFlowRates
    self._solve()
  File "/Users/jerry/dev/wcEcoli/wholecell/utils/_netflow/nf_glpk.py", line 471, in _solve
    raise RuntimeError(self.status_string)
RuntimeError: GLP_INFEAS: infeasible

In Sisyphus, this triggered a cascade of problems. It retried this task many times across many workers, they all uploaded a log to the same storage file, this exceeded a Cloud Storage API rate-limit, causing another exception. Eventually I deleted the 17 workers that were still trying to run this task.

1fish2 commented 5 years ago

Temporary, fragile workaround in wholecell/sim/simulation.py:

if self._seed == 28:  # temp workaround for bug #635
    self._seed += 200

Note: Seed 1028 gets stuck in a hard loop after time 1757.47! It doesn't even respond to ^C! Is it stuck in a C library? That also needs debugging if the fix for 28 doesn't fix it.

prismofeverything commented 5 years ago

Interesting. As a point of data, it does not do the same thing with seed 28 in the release-paper branch. Are certain seeds known to be unstable? I feel like I've heard suggestion of this at some point, but not come across it myself.

Either way, good we can reproduce. I would rather catch this error directly than add a list of bad seeds to the code, especially if they seem to be transient to code changes.

As for the retry loop, I have identified it as Sisyphus sending a "complete" event after sending its "error" event, so the workflow tries to continue. Easy enough to fix, testing the solution now.

1fish2 commented 5 years ago

Since it's a random number seed, the impact is highly sensitive to the code. Any code or data change anywhere in the project and libraries that alters the number of calls to the random number generator before the problem occurs will change which seeds trigger the problem.

That's why the +200 change is a fleetingly temporary workaround.

The only thing we can say about "bad seeds" is with the current code + the temporary workaround, seeds [0 .. 99] are OK.

Symptoms to debug:

  1. 4861 times Warning: numerical instability trying to run step 17
  2. could not find solution with primal method, switching to dual
  3. declared Simulation finished early, at time 0:00:11
  4. RuntimeError: GLP_INFEAS: infeasible (from modular_fba calling nf_glpk).
  5. Seed 1028 gets stuck in a hard loop after time 1757.47! It doesn't even respond to ^C!

What would catching the exception do besides print and exit, which it already does?

Here's the full log: simulation_Var0_Seed28_Gen0_Cell0.log

prismofeverything commented 5 years ago

That's why the +200 change is a fleetingly temporary workaround.

If we did 500 sims, wouldn't we end up doing 228 twice in this case?

What would catching the exception do besides print and exit, which it already does?

Right, not an exception, I mean detect the numerical instability condition and only retry .... 10 times? instead of 4000+. I guess we have to pick a number, but 4000 is maybe too many.

1fish2 commented 5 years ago

If we did 500 sims, wouldn't we end up doing 228 twice in this case?

Yes. Don't do that. Debug the problem and fix it before running lots of seeds.

I'll add symptom 5 to that list of symptoms: Seed 1028 freezes the entire process.

I mean detect the numerical instability condition and only retry .... 10 times? instead of 4000+. I guess we have to pick a number, but 4000 is maybe too many.

Its a GLPK problem. Would you like to parse stderr to detect it?

prismofeverything commented 5 years ago

Its a GLPK problem. Would you like to parse stderr to detect it?

It looks like something we are setting: https://github.com/CovertLab/wcEcoli/blob/master/wholecell/utils/_netflow/nf_glpk.py#L111

Though I suppose there was some rationale behind setting this value to 10000 at some point?

tahorst commented 5 years ago

It looks like something we are setting: https://github.com/CovertLab/wcEcoli/blob/master/wholecell/utils/_netflow/nf_glpk.py#L111 Though I suppose there was some rationale behind setting this value to 10000 at some point?

289 - default is basically an infinite loop, any lower and we might not get an accurate result on some time points.

tahorst commented 5 years ago

Maybe we should set up CI to test random seeds in addition to one consistent one? That way we'll have a better idea of the robustness of the model and maybe catch some more issues that need to be fixed

prismofeverything commented 5 years ago

Maybe we should set up CI to test random seeds in addition to one consistent one?

This is a great idea. Have we only been testing seed 0 this whole time? Yeah seed 0 + random seed each time sounds like a good plan.

1fish2 commented 5 years ago

I like this idea.

In past experience, how much did these failures also vary with the inherited state?

Does CI need to run many seeds x multiple gens to locate each day's game-of-battleship mines? Or is the main point to do sample testing to measure robustness?

Since the sim init creates several random generators, only a fraction of the code affects any particular solver's random numbers. So it's less fickle than I feared.

tahorst commented 5 years ago

Have we only been testing seed 0 this whole time?

We've only done seed 0 except for when we run larger sets like for the paper which is generally where we've seen some issues come up.

In past experience, how much did these failures also vary with the inherited state?

I'm not sure how much failures depend on the inherited state but from what I remember in the paper runs, there was a rather equal distribution among the different generations.

Does CI need to run many seeds x multiple gens to locate each day's game-of-battleship mines? Or is the main point to do sample testing to measure robustness?

We could probably pick a new 4 random seeds a day and run 4 gens each or something like that with the idea being we'll eventually discover some new bug that might eventually creep into sims we might want to run. That way we can do a better job ensuring the model remains robust and won't have to troubleshoot a bunch of bugs we find if we scale up runs for a paper. I don't think we need to red flag certain seeds for each commit but just understand failure modes a little better.

Since the sim init creates several random generators, only a fraction of the code affects any particular solver's random numbers. So it's less fickle than I feared.

The problem is that most of these will depend on the molecules available so as soon as one random generator is affected, it will likely affect the molecules present and therefore the other generators.

tahorst commented 5 years ago

For clarification, seed 1028 gets stuck after 1838.73 sec (your log must not have flushed with it getting stuck in the loop).

It happens in the calculateRequest function in complexation with the call to arrow: https://github.com/CovertLab/wcEcoli/blob/5503af00e7b33e93861c0a6c19b7850d7c5388fa/models/ecoli/processes/complexation.py#L59

prismofeverything commented 5 years ago

It happens in the calculateRequest function in complexation with the call to arrow:

Ah okay, a new corner case for arrow! I'll take a look.

1fish2 commented 5 years ago

Please do try to debug why it didn't respond to ^C.

Does the Python interface catch every Exception then retry? That'd do it, which is why catching Exception without re-raising it is a no-no.

prismofeverything commented 5 years ago

Please do try to debug why it didn't respond to ^C.

Looking into it now. It's funny because CPU is not pegged, actually it is flat. So it is not doing anything, just sitting there not responding.

Does the Python interface catch every Exception then retry?

Nope, no try's at all in the python.