CovertLab / wcEcoli

Whole Cell Model of E. coli

Random seg faults on Jenkins builds #764

Open tahorst opened 4 years ago

tahorst commented 4 years ago

Really weird problem with some of the overnight builds. The minimal, with AA, and optional features builds all failed with a seg fault: two during multigen analysis and one during a sim.

Minimal email at 5:19.

2020-01-08 03:18:55,056 INFO Added a workflow.

AA email at 3:10.

2020-01-08 01:00:04,467 INFO Added a workflow.

Optional features email at 5:42:

2020-01-08 03:31:53,546 INFO Added a workflow.

Minimal:

22773.45    442.20        1.209        1.208        1.205        1.210        1.220
22774.35    442.29        1.209        1.208        1.206        1.210        1runscripts/jenkins/ecoli-glucose-minimal.sh: line 22: 34414 Segmentation fault      PYTHONPATH=$PWD rlaunch rapidfire --nlaunches 0

AA:

2020-01-08 03:05:35,891 INFO Task started: {{wholecell.fireworks.firetasks.analysisMultiGen.AnalysisMultiGenTask}}.
INFO:rocket.launcher:Task started: {{wholecell.fireworks.firetasks.analysisMultiGen.AnalysisMultiGenTask}}.
/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/axes/_base.py:3443: UserWarning: Attempting to set identical bottom==top results
in singular transformations; automatically expanding.
bottom=24.63692038495188, top=24.63692038495188
  'bottom=%s, top=%s') % (bottom, top))
runscripts/jenkins/ecoli-with-aa.sh: line 22:  4601 Segmentation fault      PYTHONPATH=$PWD rlaunch rapidfire --nlaunches 0

Optional features:

2020-01-08 05:37:05,415 INFO Task started: {{wholecell.fireworks.firetasks.analysisMultiGen.AnalysisMultiGenTask}}.
INFO:rocket.launcher:Task started: {{wholecell.fireworks.firetasks.analysisMultiGen.AnalysisMultiGenTask}}.
/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/axes/_base.py:3443: UserWarning: Attempting to set identical bottom==top results
in singular transformations; automatically expanding.
bottom=24.63692038495188, top=24.63692038495188
  'bottom=%s, top=%s') % (bottom, top))
runscripts/jenkins/ecoli-optional-features.sh: line 22: 36098 Segmentation fault      PYTHONPATH=$PWD rlaunch rapidfire --nlaunches 0

I'm rerunning the builds now to see if it is reproducible but thought it would be good to document and see if anyone has ideas.

1fish2 commented 4 years ago

It looks like the faulthandler pip is a good place to start. Per faulthandler doc:

This module contains functions to dump Python tracebacks explicitly, on a fault, after a timeout, or on a user signal. Call faulthandler.enable() to install fault handlers for the SIGSEGV, SIGFPE, SIGABRT, SIGBUS, and SIGILL signals.

Once we enable it, future segfaults will print some stack traceback info, limited by the fact that it can't allocate memory amidst a catastrophic failure.
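
A minimal sketch of what enabling it could look like near the top of an entry point. It uses the `faulthandler` API as found in the Python 3 standard library; the pip backport for 2.7 should expose the same `enable()` call. The helper name `enable_fault_tracebacks` and its optional `log_path` argument are made up for illustration and are not part of wcEcoli.

```python
import faulthandler

# Module-level reference so the log file stays open while the handler is installed.
_fault_log = None


def enable_fault_tracebacks(log_path=None):
    """Install handlers that dump Python tracebacks on SIGSEGV and similar fatal signals.

    log_path is a hypothetical option: with None the dump goes to stderr,
    otherwise it is appended to the given file.
    """
    global _fault_log
    if log_path is None:
        faulthandler.enable(all_threads=True)
    else:
        _fault_log = open(log_path, "a")
        faulthandler.enable(file=_fault_log, all_threads=True)
    return faulthandler.is_enabled()
```

Calling this once, before any long-running work starts, should be enough; the handler stays installed for the life of the process.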

One experiment to try is upgrading to Python 2.7.17, which IIRC fixes some potential crashes. (Supposedly there's a final 2.7.18 bugfix release coming.) How confident we can be that it helped depends on whether the problem is reproducible.

tahorst commented 4 years ago

faulthandler looks like it could be useful here. The Jenkins builds failed again 2 hr and 10 min after starting, but I was not able to reproduce the seg fault by running the same command and workflow interactively from a compute node.

prismofeverything commented 4 years ago

I just restarted Jenkins to rule out bad startup state or whatever else that could mean.

tahorst commented 4 years ago

A PR test of singleshot vs rapidfire showed that the issue occurs once rapidfire has been running for 2 hr and 10 min. singleshot on a loop can execute the same workflow successfully. Jerry, when you installed the new version of Fireworks in the other pyenv, do you think it could have changed some configuration or shared file? Or maybe something weird is going on with sleep or threading, since 3 of the 4 threads are calling time.sleep()?

Stack trace with faulthandler:

Fatal Python error: Segmentation fault

Thread 0x00007fd6db375700 <rlaunch> (most recent call first):
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 359 in wait
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 614 in wait
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket.py", line 54 in ping_launch
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 754 in run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 801 in __bootstrap_inner
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 774 in __bootstrap

Thread 0x00007fd6dab74700 <rlaunch> (most recent call first):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/pymongo/periodic_executor.py", line 128 in _run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 754 in run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 801 in __bootstrap_inner
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 774 in __bootstrap

Thread 0x00007fd6da373700 <rlaunch> (most recent call first):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/pymongo/periodic_executor.py", line 128 in _run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 754 in run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 801 in __bootstrap_inner
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 774 in __bootstrap

Current thread 0x00007fd6eb62a740 <rlaunch> (most recent call first):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/cbook/__init__.py", line 317 in __init__
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/cm.py", line 192 in __init__
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/collections.py", line 123 in __init__
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/collections.py", line 1252 in __init__
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/collections.py", line 1422 in __init__
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 1257 in eventplot
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/__init__.py", line 1855 in inner
  File "/scratch/groups/mcovert/jenkins/workspace/models/ecoli/analysis/multigen/transcriptionEvents.py", line 130 in do_plot
  File "/scratch/groups/mcovert/jenkins/workspace/models/ecoli/analysis/analysisPlot.py", line 117 in plot
  File "/scratch/groups/mcovert/jenkins/workspace/models/ecoli/analysis/analysisPlot.py", line 127 in main
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/fireworks/firetasks/analysisBase.py", line 163 in run_task
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket.py", line 262 in run
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket_launcher.py", line 58 in launch_rocket
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket_launcher.py", line 108 in rapidfire
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/scripts/rlaunch_run.py", line 142 in rlaunch
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/bin/rlaunch", line 10 in <module>
runscripts/jenkins/ecoli-pull-request.sh: line 32: 148380 Segmentation fault      PYTHONPATH=$PWD rlaunch rapidfire --nlaunches 0
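
When comparing launch modes like this, one way to tell a seg fault apart from an ordinary failure is the child's return code. The following is a rough sketch, not the actual Jenkins script: `run_in_fresh_processes` and `describe_exit` are hypothetical helpers that run a command repeatedly, each time in a new process (the way singleshot on a loop does), and report the first abnormal exit. It relies on the POSIX behavior that `subprocess` reports death by signal as a negative return code.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, -N means the child was killed by signal N.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return "signal %d" % -returncode
    return "exit %d" % returncode


def run_in_fresh_processes(cmd, max_launches):
    """Run cmd up to max_launches times, each in its own process.

    Returns (number of launches performed, label of the last exit) and
    stops at the first launch that does not exit cleanly.
    """
    for launch in range(1, max_launches + 1):
        status = describe_exit(subprocess.call(cmd))
        if status != "ok":
            return launch, status
    return max_launches, "ok"
```

With the command from the logs this might be called as `run_in_fresh_processes(["rlaunch", "singleshot"], max_launches=50)`, where 50 is an arbitrary cap; a seg fault would then show up as `SIGSEGV`.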

1fish2 commented 4 years ago

I do hope updating Fireworks in pyenv wcEcoli3 doesn't alter wcEcoli2 but can't rule it out. We could install the previous pip but that might not restore some collateral damage. We could build a new pyenv wcEcoli4.

These stack traces are a start but not super revealing. matplotlib/cbook/__init__.py:317 is just self._func_cid_map = {}, and the previous line is just self._cid = 0. I'm not concerned about the waiting threads.

Did/should you add the faulthandler pip to requirements.txt and to pyenv wcEcoli2?

tahorst commented 4 years ago

I installed it in wcEcoli2 but have not yet added it to requirements.txt. I thought I might remove it after this investigation is complete. We could consider adding it to requirements.txt (and the enable code to firetasks as well) if you think it's useful moving forward.

tahorst commented 4 years ago

With the change from rapidfire to singleshot in the Jenkins scripts (#770), most builds have passed but the minimal media build is seg faulting on multigen analysis. This time the seg fault is coming 1 hr and 20 minutes after the task starts even though launching with singleshot successfully completed multigen analysis with 1 hr and 31 minutes of execution time a few builds ago (before failing with an environment mix up). Maybe it's worth rebuilding an environment to test?

1fish2 commented 4 years ago

Sure. Give it a go or let me know and I'll do it. When is a minimally disruptive time for that?

tahorst commented 4 years ago

I'll create a new pyenv (wcEcoli-seg) now and we can use #768 to test it to prevent disruption for other builds that use wcEcoli2

tahorst commented 4 years ago

Update on tests that still haven't resolved the issue with a long sleep in KnowledgeBaseEcoli():

Thread 0x00007fb77b5ef700 (most recent call first):
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 359 in wait
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 614 in wait
  File "/home/groups/mcovert/pyenv/versions/wcEcoli-seg/lib/python2.7/site-packages/fireworks/core/rocket.py", line 54 in ping_launch
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 754 in run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 801 in __bootstrap_inner
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 774 in __bootstrap

Thread 0x00007fb77adee700 (most recent call first):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli-seg/lib/python2.7/site-packages/pymongo/periodic_executor.py", line 128 in _run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 754 in run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 801 in __bootstrap_inner
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 774 in __bootstrap

Thread 0x00007fb77a5ed700 (most recent call first):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli-seg/lib/python2.7/site-packages/pymongo/periodic_executor.py", line 128 in _run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 754 in run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 801 in __bootstrap_inner
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 774 in __bootstrap

Current thread 0x00007fb78b8a4740 (most recent call first):
  File "/scratch/groups/mcovert/jenkins/workspace@3/reconstruction/spreadsheets.py", line 58 in next
  File "/scratch/groups/mcovert/jenkins/workspace@3/reconstruction/ecoli/knowledge_base_raw.py", line 128 in _load_tsv
  File "/scratch/groups/mcovert/jenkins/workspace@3/reconstruction/ecoli/knowledge_base_raw.py", line 108 in __init__
  File "/scratch/groups/mcovert/jenkins/workspace@3/wholecell/fireworks/firetasks/initRawData.py", line 18 in run_task
  File "/home/groups/mcovert/pyenv/versions/wcEcoli-seg/lib/python2.7/site-packages/fireworks/core/rocket.py", line 262 in run
  File "/home/groups/mcovert/pyenv/versions/wcEcoli-seg/lib/python2.7/site-packages/fireworks/core/rocket_launcher.py", line 58 in launch_rocket
  File "/home/groups/mcovert/pyenv/versions/wcEcoli-seg/lib/python2.7/site-packages/fireworks/scripts/rlaunch_run.py", line 155 in rlaunch
  File "/home/groups/mcovert/pyenv/versions/wcEcoli-seg/bin/rlaunch", line 8 in <module>
runscripts/jenkins/ecoli-pull-request.sh: line 28: 58924 Segmentation fault      PYTHONPATH=$PWD rlaunch singleshot

- running manual script > 2 hr 10 min (`PYTHONPATH=$PWD python runscripts/manual/runParca.py`): seg fault

Fatal Python error: Segmentation fault

Current thread 0x00007f653a5bf740 (most recent call first):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli-seg/lib/python2.7/site-packages/unum/__init__.py", line 116 in __init__
  File "/home/groups/mcovert/pyenv/versions/wcEcoli-seg/lib/python2.7/site-packages/unum/__init__.py", line 515 in coerceToUnum
  File "/home/groups/mcovert/pyenv/versions/wcEcoli-seg/lib/python2.7/site-packages/unum/__init__.py", line 423 in __rdiv__
  File "<string>", line 1 in <module>
  File "/scratch/groups/mcovert/jenkins/workspace/reconstruction/spreadsheets.py", line 59 in next
  File "/scratch/groups/mcovert/jenkins/workspace/reconstruction/ecoli/knowledge_base_raw.py", line 128 in _load_tsv
  File "/scratch/groups/mcovert/jenkins/workspace/reconstruction/ecoli/knowledge_base_raw.py", line 108 in __init__
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/fireworks/firetasks/initRawData.py", line 18 in run_task
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/fireworks/firetasks/parca.py", line 69 in run_task
  File "runscripts/manual/runParca.py", line 64 in run
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/utils/scriptBase.py", line 544 in cli
  File "runscripts/manual/runParca.py", line 69 in <module>
runscripts/jenkins/ecoli-pull-request.sh: line 28: 1885 Segmentation fault      PYTHONPATH=$PWD python runscripts/manual/runParca.py



#774 did some work to cut the time spent in multigen analysis in half for the minimal daily build, and multigen has completed (still waiting for all other tasks to finish). With that, it looks like all builds have a workaround to succeed for now even though this problem still occurs.

tahorst commented 4 years ago

More info from the latest PR test that seems to point to an issue with FiretaskBase or ScriptBase in the context of Jenkins:

1fish2 commented 4 years ago

Good work on this hard debugging problem!

I wish I had more testable hypotheses. A weak one is to try FireWorks==1.9.5 (wcEcoli3 on Sherlock has it) since FiretaskBase is involved.

tahorst commented 4 years ago

> A weak one is to try FireWorks==1.9.5 (wcEcoli3 on Sherlock has it) since FiretaskBase is involved.

Nice idea, but this still seg faulted.

prismofeverything commented 4 years ago

I bet it's the ulimit. Rapid-fire mode is retaining residual memory from previous runs, then we hit the ulimit at a predictable time. Sherlock may have been messing with ulimits over the break, or we started doing enough stuff that we just tipped over the previous ulimit. Current theory!
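
One way to check this theory is to print the limits the process actually runs under, once from a Jenkins build and once from an interactive session, and compare. A rough sketch using the standard `resource` module (Unix only); the helper names and the choice of which limits to show are assumptions for illustration, not anything already in the repo.

```python
import resource

# Labels are illustrative; the constants come from the standard `resource` module.
LIMITS = {
    "address space, bytes (ulimit -v)": resource.RLIMIT_AS,
    "data segment, bytes (ulimit -d)": resource.RLIMIT_DATA,
    "stack, bytes (ulimit -s)": resource.RLIMIT_STACK,
    "CPU time, seconds (ulimit -t)": resource.RLIMIT_CPU,
}


def _fmt(value):
    """Show RLIM_INFINITY as 'unlimited', anything else as the raw number."""
    return "unlimited" if value == resource.RLIM_INFINITY else value


def report_limits():
    """Return {label: (soft, hard)} for the limits listed above."""
    return {
        label: tuple(_fmt(v) for v in resource.getrlimit(limit))
        for label, limit in LIMITS.items()
    }


def peak_rss():
    """Peak resident set size of this process so far.

    The unit is kilobytes on Linux (bytes on macOS).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```

Logging `report_limits()` at startup and `peak_rss()` after each task would show whether memory keeps climbing across rapidfire launches and how close it gets to a limit.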