1fish2 opened this issue 5 years ago
Temporary, fragile workaround in `wholecell/sim/simulation.py`:

```python
if self._seed == 28:  # temp workaround for bug #635
    self._seed += 200
```
Note: Seed 1028 gets stuck in a hard loop after time 1757.47! It doesn't even respond to ^C! Is it stuck in a C library? That also needs debugging if the fix for 28 doesn't fix it.
Interesting. As a data point, it does not do the same thing with seed 28 on the release-paper branch. Are certain seeds known to be unstable? I feel like I've heard suggestions of this at some point, but haven't come across it myself.
Either way, good we can reproduce. I would rather catch this error directly than add a list of bad seeds to the code, especially if they seem to be transient to code changes.
As for the retry loop, I have identified it as Sisyphus sending a "complete" event after sending its "error" event, so the workflow tries to continue. Easy enough to fix, testing the solution now.
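The fix being described can be sketched as a small event guard: once a task has emitted an "error" event, any later "complete" event is suppressed so the workflow doesn't try to continue past a failed step. This is a hypothetical illustration, not the actual Sisyphus code; the class and event names are made up for the example.

```python
# Hypothetical sketch of the event-ordering fix (not actual Sisyphus code):
# after an "error" event, swallow any subsequent "complete" event.
class TaskEvents:
    def __init__(self):
        self.errored = False
        self.sent = []  # events actually emitted, for illustration

    def send(self, event):
        if event == "error":
            self.errored = True
        elif event == "complete" and self.errored:
            return  # drop the spurious "complete" that follows an "error"
        self.sent.append(event)

events = TaskEvents()
events.send("error")
events.send("complete")  # suppressed; the workflow sees only the error
```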
Since it's a random number seed, the impact is highly sensitive to the code. Any code or data change anywhere in the project and libraries that alters the number of calls to the random number generator before the problem occurs will change which seeds trigger the problem.
That's why the +200 change is a fleetingly temporary workaround.
The only thing we can say about "bad seeds" is with the current code + the temporary workaround, seeds [0 .. 99] are OK.
Symptoms to debug:

1. `Warning: numerical instability`
2. While trying to run step 17: `could not find solution with primal method, switching to dual`
3. `Simulation finished early, at time 0:00:11`
4. `RuntimeError: GLP_INFEAS: infeasible` (from `modular_fba` calling `nf_glpk`).

What would catching the exception do besides print and exit, which it already does?

Here's the full log: simulation_Var0_Seed28_Gen0_Cell0.log
> That's why the +200 change is a fleetingly temporary workaround.
If we did 500 sims, wouldn't we end up doing 228 twice in this case?
> What would catching the exception do besides print and exit, which it already does?
Right, not an exception. I mean detect the numerical instability condition and only retry, say, 10 times instead of 4000+. I guess we have to pick a number, but 4000 is maybe too many.
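The retry cap being proposed could look roughly like this. It's a hedged sketch, not the workflow's real code: `run_step` and `NumericalInstability` are hypothetical stand-ins for whatever the sim actually raises or reports.

```python
# Sketch of a bounded retry: give up after a small fixed number of
# attempts instead of retrying thousands of times.
class NumericalInstability(Exception):
    """Hypothetical stand-in for the detected instability condition."""

def run_with_retries(run_step, max_retries=10):
    for attempt in range(max_retries):
        try:
            return run_step()
        except NumericalInstability:
            if attempt == max_retries - 1:
                raise  # hit the cap; surface the failure instead of looping
```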
> If we did 500 sims, wouldn't we end up doing 228 twice in this case?
Yes. Don't do that. Debug the problem and fix it before running lots of seeds.
I'll add symptom 5 to that list: seed 1028 freezes the entire process.
> I mean detect the numerical instability condition and only retry .... 10 times? instead of 4000+. I guess we have to pick a number, but 4000 is maybe too many.
It's a GLPK problem. Would you like to parse stderr to detect it?
> It's a GLPK problem. Would you like to parse stderr to detect it?
It looks like something we are setting: https://github.com/CovertLab/wcEcoli/blob/master/wholecell/utils/_netflow/nf_glpk.py#L111
Though I suppose there was some rationale behind setting this value to 10000 at some point?
Maybe we should set up CI to test random seeds in addition to one consistent one? That way we'll have a better idea of the model's robustness and maybe catch more issues that need fixing.
> Maybe we should set up CI to test random seeds in addition to one consistent one?
This is a great idea. Have we only been testing seed 0 this whole time? Yeah, seed 0 plus a random seed each time sounds like a good plan.
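The "seed 0 + random seed each time" plan can be sketched in a few lines. This is an illustrative helper, not anything that exists in the repo; `ci_seeds` is a hypothetical name, and the upper bound is an arbitrary 31-bit range.

```python
# Sketch of the CI seed selection: always run the fixed regression
# seed 0, plus one fresh random seed per build so coverage drifts
# over time instead of only ever exercising seed 0.
import random

def ci_seeds(rng=None):
    rng = rng or random.Random()
    extra = rng.randrange(1, 2**31)  # start at 1 to avoid duplicating seed 0
    return [0, extra]
```

Passing an explicit `random.Random(build_number)` instead of the default would make a given build's seed choice reproducible after the fact.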
I like this idea.
In past experience, how much did these failures also vary with the inherited state?
Does CI need to run many seeds x multiple gens to locate each day's game-of-battleship mines? Or is the main point to do sample testing to measure robustness?
Since the sim init creates several random generators, only a fraction of the code affects any particular solver's random numbers. So it's less fickle than I feared.
> Have we only been testing seed 0 this whole time?
We've only done seed 0, except when we run larger sets like for the paper, which is generally where we've seen issues come up.
> In past experience, how much did these failures also vary with the inherited state?
I'm not sure how much the failures depend on the inherited state, but from what I remember of the paper runs, they were distributed fairly evenly across the generations.
> Does CI need to run many seeds x multiple gens to locate each day's game-of-battleship mines? Or is the main point to do sample testing to measure robustness?
We could probably pick 4 new random seeds a day and run 4 gens each, or something like that, with the idea that we'll eventually discover new bugs that might otherwise creep into sims we want to run. That way we can do a better job of ensuring the model remains robust, and we won't have to troubleshoot a bunch of bugs if we scale up runs for a paper. I don't think we need to red-flag certain seeds for each commit, just understand the failure modes a little better.
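One way to make that daily sampling reproducible is to derive the day's seeds deterministically from the date, so a failing nightly run can be rerun exactly. This is a sketch under that assumption; `daily_seeds` is a hypothetical helper, and the constants mirror the "4 seeds a day" suggestion above.

```python
# Sketch: derive 4 seeds per calendar day by hashing the date, so the
# seed choice varies day to day but any day's run is reproducible.
import hashlib

def daily_seeds(date_str, count=4):
    seeds = []
    for i in range(count):
        digest = hashlib.sha256(f"{date_str}:{i}".encode()).digest()
        seeds.append(int.from_bytes(digest[:4], "big"))  # 32-bit seed
    return seeds
```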
> Since the sim init creates several random generators, only a fraction of the code affects any particular solver's random numbers. So it's less fickle than I feared.
The problem is that most of these depend on the molecules available, so as soon as one random generator is affected, it will likely change the molecules present and therefore affect the other generators.
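The several-generators pattern under discussion can be illustrated like this: each process gets its own stream derived from the base sim seed, so a change in one process's draw count doesn't shift another process's stream directly (though, per the point above, they still couple through shared molecule counts). The function and process names here are examples, not the sim's actual code.

```python
# Illustration of per-process random streams derived from one base seed.
# process_rng is a hypothetical helper, not part of wcEcoli.
import hashlib
import random

def process_rng(base_seed, process_name):
    # Hash (seed, name) so each process gets an independent, stable stream.
    digest = hashlib.sha256(f"{base_seed}:{process_name}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

complexation_rng = process_rng(28, "complexation")
metabolism_rng = process_rng(28, "metabolism")
```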
For clarification, seed 1028 gets stuck after 1838.73 sec (your log must not have flushed with it getting stuck in the loop).
It happens in the calculateRequest function in complexation with the call to arrow: https://github.com/CovertLab/wcEcoli/blob/5503af00e7b33e93861c0a6c19b7850d7c5388fa/models/ecoli/processes/complexation.py#L59
> It happens in the calculateRequest function in complexation with the call to arrow
Ah okay, a new corner case for arrow! I'll take a look.
Please do try to debug why it didn't respond to ^C.
Does the Python interface catch every `Exception` and then retry? That'd do it, which is why catching `Exception` without re-raising it is a no-no.
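The hazard being described is the classic swallow-everything wrapper. A minimal sketch of the anti-pattern and its fix (the wrapper names are made up for the example; note that a *bare* `except:` even catches `KeyboardInterrupt`, which would explain a process ignoring ^C):

```python
def bad_wrapper(step):
    try:
        return step()
    except:  # bare except even swallows KeyboardInterrupt (^C) -- don't do this
        pass  # error silently discarded; a retry loop around this spins forever

def good_wrapper(step):
    try:
        return step()
    except Exception:
        # log or clean up here if needed, then...
        raise  # re-raise so the caller sees the failure and ^C still works
```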
> Please do try to debug why it didn't respond to ^C.
Looking into it now. It's funny because the CPU is not pegged; actually it's flat. So it's not doing anything, just sitting there not responding.
> Does the Python interface catch every `Exception` then retry?
Nope, no `try`s at all in the Python. It just prints.
In Sisyphus, this triggered a cascade of problems. It retried this task many times across many workers, they all uploaded a log to the same storage file, this exceeded a Cloud Storage API rate limit, causing another exception. Eventually I deleted the 17 workers that were still trying to run this task.