Closed by RickKessler 1 week ago
It sounds like 4 jobs died over the weekend. Were there other concurrent jobs (or jobs at other times during the weekend) that completed successfully?
There's nothing in the NERSC outage log to indicate that there was a problem. The first step is to look at the log and potentially use the job ids to see what condition caused the terminations. How many of the 4 really hung - versus died? And by "hung" it required the user to terminate the job or did it just terminate when the time expired?
While you can run these interactively, what would happen if you resubmitted the job to batch? Would it complete or would you see the same termination/hanging?
Where do these *.LOCK files get written and presumably removed?
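To illustrate the question, here is a minimal sketch of the usual lock-file pattern (NOT SNANA's actual code; the file names just mimic the log messages below): each merge attempt creates a BUSY_MERGE_*.LOCK on entry and removes it on exit, so concurrent merge attempts see the lock and back off rather than merging twice. If that is the pattern, a script killed mid-merge would leave a stale lock behind.

```shell
#!/bin/sh
# Sketch of a lock-file guard (hypothetical, not SNANA's real implementation).
outdir=$(mktemp -d)                     # stand-in for the SPLIT_JOBS directory
lock="$outdir/BUSY_MERGE_CPU0004.LOCK"

if [ -e "$lock" ]; then
    # This branch matches the "Found existing BUSY_MERGE_*.LOCK" log line.
    echo "Found existing $lock --> exit merge process."
else
    touch "$lock"                       # truly atomic creation would use mkdir
    # ... merge work would happen here ...
    rm -f "$lock"                       # removed on normal exit; a SIGKILL
                                        # before this line leaves a stale lock
    echo "merge done, lock removed"
fi
```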
Trying to figure out what I can from the logs, starting with CPU0004_JOBLIST_cori07.LOG
PT005_SPLIT005.DONE
Run merge_driver monitor task.
# ===========================================
# merge_driver: Begin at 2021-12-10 06:52:09 (24728)
# merge_driver: read SUBMIT.INFO
# merge_driver: Found existing BUSY_MERGE_CPU0019.LOCK --> exit merge process.
1
Wait for snlc_fit.exe if SNANA make is in progress
snlc_fit.exe exists -> continue
Wait for ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0023/FITOPT000.FITRES
../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0023/FITOPT000.FITRES exists -> continue
and I see that ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0023/FITOPT000.FITRES exists
Is it at that point that it hung? What was supposed to happen next?
Similarly for CPU0005_JOBLIST_cori07.LOG, I see the file mentioned at the end of the log:
../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029/FITOPT000.FITRES
# merge_driver: Found existing BUSY_MERGE_CPU0014.LOCK --> exit merge process.
1
Wait for snlc_fit.exe if SNANA make is in progress
snlc_fit.exe exists -> continue
Wait for ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029/FITOPT000.FITRES
../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029/FITOPT000.FITRES exists -> continue
Actually @RickKessler, @Catarina-Alves's job failures may be associated with exhausting her NERSC Compute allocation - she doesn't have any hours left to run batch jobs. I sent a note over the weekend, if needed, she can apply for more NERSC time:
If you need more compute time for DESC, there is a form on this Confluence page to make a request: https://confluence.slac.stanford.edu/display/LSSTDESC/How+To+Apply+For+NERSC+Computing+Time
Any request should be associated with an approved DESC Project which are listed in the PubDB. Please include an estimate of the computing needs for your project including how you determined how much more time you need. If you need any help putting in a request or have any other questions, just let us know.
Hi, Ayan here. For the jobs I ran/submitted, I don't believe I have an allocation-hour issue; I should still have sufficient hours left. To answer some of your questions: overall, I submitted the job 3 times, and every time it hung at some point. It was not exactly the same point each time, but it hung nevertheless.
As for your question about how the ends of the LOG files should look, something like this:
# merge_driver: Found existing BUSY_MERGE_CPU0020.LOCK --> exit merge process.
1
Wait for snlc_fit.exe if SNANA make is in progress
snlc_fit.exe exists -> continue
Wait for ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0022/FITOPT000.FITRES
../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0022/FITOPT000.FITRES exists -> continue
Finished snlc_fit.exe -> create PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0022_FITOPT005_SPLIT005.DONE
Run merge_driver monitor task.
Overall, there was no crash and no error message; the job simply ran until the wall time expired. This happened 3 times in total, submitted with different wall times and other job configurations.
I used the terms "hung" and "died" because I cannot distinguish between them. To be a bit more clear, the CPU*.CMD script never got control back after the snlc_fit.exe program was executed. If the snlc_fit.exe program aborts or crashes, control returns to the CPU*.CMD script, which immediately calls the merge process that reports ABORT or an unknown error. I don't know a mechanism by which the snlc_fit.exe program can fail in such a way that it also kills the CPU*.CMD script that launched it ... except for using too much memory, but that should have left an error message. Regarding BUSY files: those are set only by the merge process; BUSY files can delay submitting jobs, but cannot interfere with a snlc_fit.exe job that is running.
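The control flow described above can be sketched as follows (a simplified, hypothetical script; run_fit and the DEMO_* names are illustrative, not SNANA's real code). The key point is that the fit binary runs in the foreground, so even a crash or abort returns control to the script, which touches the DONE file and hands off to the merge process; the only way those lines get skipped is if the script itself is killed.

```shell
#!/bin/sh
# Hypothetical sketch of the CPU*.CMD flow, not the real SNANA script.
run_fit() {
    fit_exe=$1; prefix=$2
    "$fit_exe" > "$prefix.LOG" 2>&1     # even a nonzero exit returns here
    touch "$prefix.DONE"                # so the DONE marker always appears
    echo "merge_driver would run next for $prefix"
}
cd "$(mktemp -d)"
run_fit /bin/false DEMO_SPLIT001        # stand-in for a crashing snlc_fit.exe
```

Note that the DONE file and the merge call happen even for the failing /bin/false stand-in; only an external SIGKILL of the whole script would prevent them.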
I'm trying to go through your *BATCH and LOG files - I see there were 40 similar jobs run and I gather 4 failed and continue to fail even when resubmitted to batch - but those 4 jobs run from beginning to completion if run on a login node?
Can you run just CPU0004 (for example) on its own for this test? I'm wondering if there's any possibility that running all the jobs concurrently causes some filesystem overload.
When resubmitted to batch - it doesn't sound like the jobs fail at precisely the same point... but that may also depend on how the output is flushed to the logs?
Is there any more detailed "debug" mode for snlc_fit.exe? Can you resubmit a job where you attempt to just rerun the specific snlc_fit.exe command that appears to fail for these jobs? I can locate those points for all but CPU0005.
The *.BATCH files look similar, using the shared queue, running on 1 Haswell node, and indicating 1 task:
#SBATCH -J FIT_PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA.nml-CPU0000
#SBATCH -q shared
#SBATCH -A m1727
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 36:00:00
#SBATCH -L SCRATCH,project
#SBATCH -C haswell
#SBATCH --output=CPU0000_JOBLIST_cori07.LOG
#SBATCH --mem-per-cpu=2000
Just a note that likely has nothing to do with this issue: in your *.BATCH files, project is no longer a valid license request - see https://docs.nersc.gov/jobs/best-practices/#available-licenses-on-cori. Likely you want cfs, if you need CFS.
I'd probably try a run where you don't use shared to see if that changes the result.
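For reference, one possible variant of the header shown above for such a test might look like the following. This is only a sketch: regular as the queue name and --exclusive are my assumptions about the intended test, --mem-per-cpu is dropped since an exclusive job gets the whole node's memory, and the license is swapped to cfs per the note above.

```shell
#SBATCH -J FIT_PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA.nml-CPU0000
#SBATCH -q regular
#SBATCH -A m1727
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 36:00:00
#SBATCH -L SCRATCH,cfs
#SBATCH -C haswell
#SBATCH --output=CPU0000_JOBLIST_cori07.LOG
#SBATCH --exclusive
```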
Interestingly, CPU0004 and CPU0005 ran concurrently on the same node, nid00558, and none of your other jobs used that node. CPU0023 used nid00539; CPU0032 used nid00562, as did the successful job CPU0026.
You can learn a little bit by using sacct -j <jobId>, and it doesn't look like a memory issue.
CPU0004 ran for 08:01:24 hours, so it didn't exceed the time limit. The reported exit signal in all cases is 9.
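To make that signal easier to read: sacct reports ExitCode in the form "code:signal", and a signal of 9 (SIGKILL) means the process was killed from outside (e.g. the OOM killer, scancel, or the batch system), not that it crashed on its own. A small sketch of decoding it, where decode_exit is a hypothetical helper rather than a NERSC or SNANA tool:

```shell
#!/bin/sh
# A typical query on Cori would be:
#   sacct -j <jobId> --format=JobID,Elapsed,MaxRSS,State,ExitCode
# decode_exit is a hypothetical helper for the "code:signal" ExitCode format.
decode_exit() {
    code=${1%%:*}
    sig=${1##*:}
    if [ "$sig" != "0" ]; then
        echo "killed by signal $sig"        # 9 = SIGKILL, an external kill
    elif [ "$code" != "0" ]; then
        echo "exited with status $code"     # program returned nonzero itself
    else
        echo "completed normally"
    fi
}
decode_exit "0:9"                           # the case reported for these jobs
```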
Looking at the *CMD files, such as CPU0004_JOBLIST_cori07.CMD: repeated calls to snlc_fit.exe followed by merge-driver calls like /global/cfs/cdirs/lsst/groups/SN/snana/SNANA/util/submit_batch/submit_batch_jobs.py FIT_PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA.nml -m -t 80973 --cpunum 4
Focusing on CPU0004_JOBLIST_cori07.LOG, which terminates with:
Wait for snlc_fit.exe if SNANA make is in progress
snlc_fit.exe exists -> continue
Wait for ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0023/FITOPT000.FITRES
../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0023/FITOPT000.FITRES exists -> continue
Poking around in CPU0004_JOBLIST_cori07.CMD where this file is referenced, I see the job terminates while writing the log PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0023_FITOPT005_SPLIT005.LOG, ending with:
MINUIT WARNING IN HESSE
============== Negative diagonal element 5 in Error Matrix
MINUIT WARNING IN HESSE
============== 0.37E+06 added to diagonal of error matrix
EIGENVALUES OF SECOND-DERIVATIVE MATRIX:
-0.1513E+02 0.9780E+00 0.9956E+00 0.1023E+01 0.1713E+02
MINUIT WARNING IN HESSE
============== MATRIX FORCED POS-DEF BY ADDING 15.143 TO DIAGONAL.
When you successfully run on the login nodes, do the logs look similar and then get beyond this point?
For CPU0005, I don't see any indication that the job was within a snlc_fit.exe step; there are no partial output files at the point where I think the log is indicating it is failing:
echo '../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029/FITOPT000.FITRES exists -> continue'
snlc_fit.exe FIT_PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA.nml \
VERSION_PHOTOMETRY PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029 \
JOBSPLIT 6 40 \
TEXTFILE_PREFIX PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029_FITOPT011_SPLIT006 \
ROOTFILE_OUT PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029_FITOPT011_SPLIT006.ROOT \
MAGOBS_SHIFT_ZP 'Y .005' \
SNCID_LIST_FILE ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029/FITOPT000.FITRES \
OPT_SNCID_LIST 1 \
&> PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029_FITOPT011_SPLIT006.LOG
touch PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029_FITOPT011_SPLIT006.DONE
and CPU0032 also has a truncated log: PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0030_FITOPT004_SPLIT033.LOG
For CPU0023 I see the log where the job seemed to terminate:
PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0005_FITOPT004_SPLIT024.LOG
CALL MNFIT_DRIVER for ISN= 297 CID=1144567 NFITPAR= 9 iter=1
############################################################
MINUIT RELEASE 96.03 INITIALIZED. DIMENSIONS 100/110 EPSMAC= 0.89E-15
Set MINUIT STRATEGY = 2
------------------------------------------------
MNFIT_DRIVER: begin fit with MINIMIZE
MINUIT WARNING IN MIGRAD
============== Negative diagonal element 2 in Error Matrix
MINUIT WARNING IN MIGRAD
============== 0.10E+01 added to diagonal of error matrix
You could consider rerunning one of those jobs and, if possible, consistently running the particular execution of snlc_fit.exe that causes the problem, then attempting to log into the compute node where it is running: https://docs.nersc.gov/jobs/monitoring/#how-to-log-into-compute-nodes-running-your-jobs
I think that this was eventually traced to other jobs on the same node taking too much memory. Requesting one full node solves the problem (for making alerts), but then one has to wait a few days in the queue. We need a better solution longer-term.
This issue is stale because it has been 90 days since the last activity. If no further activities take place, this issue will be closed in 14 days. Add the label keep to keep this issue open.
Description
Batch jobs mysteriously hang or die. See batch log files below.
To Reproduce
cd /global/cscratch1/sd/kessler/PIPPIN_OUTPUT/PLASTICC_COMBINED_UNFINISHED/2_LCFIT/FIT_DATA_PHOTOZ_PLASTICC_SIMDATA/output/SPLIT_JOBS_LCFIT
CPU0004_JOBLIST_cori07.LOG CPU0005_JOBLIST_cori07.LOG CPU0023_JOBLIST_cori07.LOG CPU0032_JOBLIST_cori07.LOG
To see the actual BATCH jobs, replace the ".LOG" suffix with ".BATCH". I ran two of the hung jobs interactively and they ran fine. I watched memory with "top" to ensure there was no memory leak. This mysterious batch pathology occurred with two different users over the weekend running different jobs: 1) light curve fitting jobs run by Ayan Mitra (shown above), and 2) simulation jobs run by C.Alves (not shown in this issue).