tahorst opened this issue 4 years ago
It looks like the faulthandler pip package is a good place to start. Per the faulthandler docs:

> This module contains functions to dump Python tracebacks explicitly, on a fault, after a timeout, or on a user signal. Call `faulthandler.enable()` to install fault handlers for the `SIGSEGV`, `SIGFPE`, `SIGABRT`, `SIGBUS`, and `SIGILL` signals.

Once we enable it, future segfaults will print some stack traceback info, limited by the fact that the handler can't allocate memory in the middle of a catastrophic failure.
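For reference, a minimal sketch of enabling it at process startup (writing to stderr is an assumption; it could also go to a dedicated log file handle):

```python
import faulthandler
import sys

# Install handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS, and SIGILL so a
# crash dumps the traceback of every thread to stderr.
faulthandler.enable(file=sys.stderr, all_threads=True)

assert faulthandler.is_enabled()
```

The `file` argument must be a real file object with a file descriptor, since the handler writes directly to the fd rather than allocating Python objects at crash time.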
One experiment to try is upgrading to Python 2.7.17, which IIRC fixes some potential crashes. (Supposedly a 2.7.18 final bugfix release is coming.) Confidence in whether that helped depends on whether the problem is reproducible.
faulthandler looks like it could be useful here. The Jenkins builds failed again 2 hr and 10 min after starting, but I was not able to reproduce the seg fault by running the same command and workflow interactively from a compute node.
I just restarted Jenkins to rule out bad startup state or whatever else that could mean.
The PR test for singleshot vs rapidfire showed the issue is when rapidfire runs for 2 hr and 10 min; singleshot in a loop can execute the same workflow successfully. Jerry, when you installed the new version of FireWorks in the other pyenv, do you think it could have changed some configuration or shared file? Or maybe something weird is going on with sleep or threading, since 3 of the 4 threads are calling `time.sleep()`?
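Since the failure lands at a predictable elapsed time, one option (a sketch, not something we've wired in) is faulthandler's timeout watchdog, which dumps every thread's traceback after a given interval even if the process hasn't faulted yet, so we could see what the threads are doing just before the crash window:

```python
import faulthandler
import sys

# Arm a watchdog: if the process is still running after `timeout` seconds,
# dump all threads' tracebacks to stderr, then re-arm (repeat=True).
# 7200 s is an assumed value, chosen to land just before the observed
# ~2 hr 10 min failure point.
faulthandler.dump_traceback_later(timeout=7200, repeat=True, file=sys.stderr)

# ... run the workflow ...

# Disarm once the run finishes normally.
faulthandler.cancel_dump_traceback_later()
```

Like `enable()`, this writes straight to a file descriptor, so it stays usable even when the interpreter is in a bad state.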
Stack trace with faulthandler:

```
Fatal Python error: Segmentation fault

Thread 0x00007fd6db375700 <rlaunch> (most recent call first):
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 359 in wait
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 614 in wait
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket.py", line 54 in ping_launch
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 754 in run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 801 in __bootstrap_inner
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 774 in __bootstrap

Thread 0x00007fd6dab74700 <rlaunch> (most recent call first):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/pymongo/periodic_executor.py", line 128 in _run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 754 in run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 801 in __bootstrap_inner
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 774 in __bootstrap

Thread 0x00007fd6da373700 <rlaunch> (most recent call first):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/pymongo/periodic_executor.py", line 128 in _run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 754 in run
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 801 in __bootstrap_inner
  File "/home/groups/mcovert/pyenv/versions/2.7.16/lib/python2.7/threading.py", line 774 in __bootstrap

Current thread 0x00007fd6eb62a740 <rlaunch> (most recent call first):
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/cbook/__init__.py", line 317 in __init__
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/cm.py", line 192 in __init__
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/collections.py", line 123 in __init__
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/collections.py", line 1252 in __init__
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/collections.py", line 1422 in __init__
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/axes/_axes.py", line 1257 in eventplot
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/matplotlib/__init__.py", line 1855 in inner
  File "/scratch/groups/mcovert/jenkins/workspace/models/ecoli/analysis/multigen/transcriptionEvents.py", line 130 in do_plot
  File "/scratch/groups/mcovert/jenkins/workspace/models/ecoli/analysis/analysisPlot.py", line 117 in plot
  File "/scratch/groups/mcovert/jenkins/workspace/models/ecoli/analysis/analysisPlot.py", line 127 in main
  File "/scratch/groups/mcovert/jenkins/workspace/wholecell/fireworks/firetasks/analysisBase.py", line 163 in run_task
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket.py", line 262 in run
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket_launcher.py", line 58 in launch_rocket
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/core/rocket_launcher.py", line 108 in rapidfire
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/lib/python2.7/site-packages/fireworks/scripts/rlaunch_run.py", line 142 in rlaunch
  File "/home/groups/mcovert/pyenv/versions/wcEcoli2/bin/rlaunch", line 10 in <module>
```

```
runscripts/jenkins/ecoli-pull-request.sh: line 32: 148380 Segmentation fault PYTHONPATH=$PWD rlaunch rapidfire --nlaunches 0
```
I do hope updating FireWorks in pyenv `wcEcoli3` doesn't alter `wcEcoli2`, but I can't rule it out. We could install the previous pip release, but that might not restore some collateral damage. We could build a new pyenv `wcEcoli4`.
These stack traces are a start but not super revealing. `matplotlib/cbook/__init__.py:317` is just `self._func_cid_map = {}`, and the previous line is just `self._cid = 0`. I'm not concerned about the waiting threads.
Did/should you add the faulthandler pip to requirements.txt and to pyenv `wcEcoli2`?
I installed it in `wcEcoli2` but have not yet added it to requirements.txt. I thought I might remove it after this investigation is complete. We could consider adding it to requirements.txt (and the enable code to the firetasks as well) if you think it's useful moving forward.
With the change from rapidfire to singleshot in the Jenkins scripts (#770), most builds have passed, but the minimal media build is seg faulting on multigen analysis. This time the seg fault comes 1 hr and 20 min after the task starts, even though launching with singleshot successfully completed multigen analysis in 1 hr and 31 min of execution time a few builds ago (before failing with an environment mix-up). Maybe it's worth rebuilding an environment to test?
Sure. Give it a go or let me know and I'll do it. When is a minimally disruptive time for that?
I'll create a new pyenv (`wcEcoli-seg`) now, and we can use #768 to test it to prevent disruption for other builds that use `wcEcoli2`.
Update on tests that still haven't resolved the issue:

- with a long sleep in `KnowledgeBaseEcoli()`: seg fault

  ```
  Fatal Python error: Segmentation fault

  Thread 0x00007fb77b5ef700
  Thread 0x00007fb77adee700
  Thread 0x00007fb77a5ed700
  Current thread 0x00007fb78b8a4740
  ```

- running the manual script > 2 hr 10 min (`PYTHONPATH=$PWD python runscripts/manual/runParca.py`): seg fault

  ```
  Fatal Python error: Segmentation fault

  Current thread 0x00007f653a5bf740
  ```
#774 did some work to cut the time spent in multigen analysis in half for the minimal daily build, and multigen has completed (still waiting for all other tasks to finish). With that, it looks like all builds have a workaround to succeed for now, even though this problem still occurs.
More info from the latest PR test that seems to point to an issue with `FiretaskBase` or `ScriptBase` in the context of Jenkins:

- `PYTHONPATH=$PWD python -c 'import reconstruction.ecoli.knowledge_base_raw as rd; rd.KnowledgeBaseEcoli()'` succeeds
- `PYTHONPATH=$PWD python runscripts/manual/runParca.py` fails on the same function (`KnowledgeBaseEcoli()`)

Good work on this hard debugging problem!
I wish I had more testable hypotheses. A weak one is to try `FireWorks==1.9.5` (`wcEcoli3` on Sherlock has it) since `FiretaskBase` is involved.
> A weak one is to try `FireWorks==1.9.5` (`wcEcoli3` on Sherlock has it) since `FiretaskBase` is involved.
Nice idea, but this still seg faulted.
I bet it's the `ulimit`. Rapid-fire mode retains residual memory from previous runs, so we hit the ulimit at a predictable time. Sherlock may have been messing with `ulimit`s over the break, or we started doing enough stuff that we just tipped over the previous `ulimit`. Current theory!
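One way to check this theory (a sketch, assuming the job inherits its limits from the Jenkins agent) is to log the process's resource limits from inside the job and compare them against an interactive shell on a compute node:

```python
import resource

# Log the soft/hard limits most relevant to a memory-exhaustion crash.
# A value of resource.RLIM_INFINITY (-1) means "unlimited".
for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_STACK", "RLIMIT_CPU"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print("%s: soft=%s hard=%s" % (name, soft, hard))
```

The same information is available from the shell via `ulimit -a`; a difference between the Jenkins environment and an interactive node would support (or rule out) this hypothesis.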
Really weird problem for some builds overnight. The minimal, with AA, and optional features builds all failed with a seg fault: two during multigen analysis and one during a sim.
All of them seg faulted on `rlaunch rapidfire`, but interestingly, the anaerobic build, which uses a singleshot loop instead, succeeded. They also all failed 2 hours and 10 minutes after adding a workflow, so maybe it is a FireWorks bug or some Sherlock configuration change.
- Minimal email at 5:19.
- AA email at 3:10.
- Optional features email at 5:42.
I couldn't reproduce the sim seg fault with a manual run from the same generation.
Clipped traces from Jenkins emails:
Minimal:
AA:
Optional features:
I'm rerunning the builds now to see if it is reproducible but thought it would be good to document and see if anyone has ideas.