LSSTDESC / desc-help

DESC Computing Requests

long jobs sometimes hang or die on batch node #75

Closed RickKessler closed 1 week ago

RickKessler commented 2 years ago

Description

Batch jobs mysteriously hang or die. See batch log files below.


The output directory is

cd /global/cscratch1/sd/kessler/PIPPIN_OUTPUT/PLASTICC_COMBINED_UNFINISHED/2_LCFIT/FIT_DATA_PHOTOZ_PLASTICC_SIMDATA/output/SPLIT_JOBS_LCFIT

and here are batch logs for jobs that hung or died:

CPU0004_JOBLIST_cori07.LOG CPU0005_JOBLIST_cori07.LOG CPU0023_JOBLIST_cori07.LOG CPU0032_JOBLIST_cori07.LOG

To see the actual BATCH jobs, replace the ".LOG" suffix with ".BATCH". I ran two of the hung jobs interactively and they run fine; I watched memory with "top" to ensure there is no memory leak.

This mysterious batch pathology occurred with two different users over the weekend running different jobs: 1) light curve fitting jobs run by Ayan Mitra (shown above), and 2) simulation jobs run by C. Alves (not shown in this issue).

heather999 commented 2 years ago

It sounds like 4 jobs died over the weekend. Were there other concurrent jobs (or jobs at other times during the weekend) that completed successfully?

There's nothing in the NERSC outage log to indicate that there was a problem. The first step is to look at the logs and potentially use the job ids to see what condition caused the terminations. How many of the 4 really hung, versus died? And by "hung", do you mean the user had to terminate the job, or did it just terminate when the time limit expired?

heather999 commented 2 years ago

While you can run these interactively, what would happen if you resubmitted the job to batch? Would it complete or would you see the same termination/hanging?

Where do these *.LOCK files get written and presumably removed?
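
For what it's worth, a quick way to check whether any of those lock files are still lying around - assuming they live under the SPLIT_JOBS_LCFIT directory from the issue description - would be something like:

find /global/cscratch1/sd/kessler/PIPPIN_OUTPUT/PLASTICC_COMBINED_UNFINISHED/2_LCFIT/FIT_DATA_PHOTOZ_PLASTICC_SIMDATA/output/SPLIT_JOBS_LCFIT -name 'BUSY_MERGE_*.LOCK' -ls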

Trying to figure out what I can from the logs, starting with CPU0004_JOBLIST_cori07.LOG

PT005_SPLIT005.DONE
Run merge_driver monitor task.

# =========================================== 
# merge_driver: Begin at 2021-12-10 06:52:09 (24728)
# merge_driver: read SUBMIT.INFO

# merge_driver: Found existing BUSY_MERGE_CPU0019.LOCK --> exit merge process.
1
Wait for snlc_fit.exe if SNANA make is in progress
snlc_fit.exe exists -> continue
Wait for ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0023/FITOPT000.FITRES
../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0023/FITOPT000.FITRES exists -> continue

and I see that ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0023/FITOPT000.FITRES exists. Is it at that point that it hung? What was supposed to happen next?

Similarly for CPU0005_JOBLIST_cori07.LOG, I see the file mentioned at the end of the log: ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029/FITOPT000.FITRES

# merge_driver: Found existing BUSY_MERGE_CPU0014.LOCK --> exit merge process.
1
Wait for snlc_fit.exe if SNANA make is in progress
snlc_fit.exe exists -> continue
Wait for ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029/FITOPT000.FITRES
../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029/FITOPT000.FITRES exists -> continue

heather999 commented 2 years ago

Actually @RickKessler, @Catarina-Alves's job failures may be associated with exhausting her NERSC compute allocation - she doesn't have any hours left to run batch jobs. I sent her a note over the weekend; if needed, she can apply for more NERSC time:

If you need more compute time for DESC, there is a form on this Confluence page to make a request: https://confluence.slac.stanford.edu/display/LSSTDESC/How+To+Apply+For+NERSC+Computing+Time

Any request should be associated with an approved DESC Project; these are listed in the PubDB. Please include an estimate of the computing needs for your project, including how you determined how much more time you need. If you need any help putting in a request or have any other questions, just let us know.

am610 commented 2 years ago

Hi, Ayan here. For the jobs I ran/submitted, I don't think I have an allocation-hours issue; I should still have sufficient hours left. To answer some of your concerns: overall I submitted the job 3 times, and every time it hung at some point - not exactly the same point, but it hung nevertheless.

About your question as to how the end of the LOG files should look, it is something like this:

# merge_driver: Found existing BUSY_MERGE_CPU0020.LOCK --> exit merge process.
1
Wait for snlc_fit.exe if SNANA make is in progress
snlc_fit.exe exists -> continue
Wait for ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0022/FITOPT000.FITRES
../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0022/FITOPT000.FITRES exists -> continue
Finished snlc_fit.exe -> create PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0022_FITOPT005_SPLIT005.DONE
Run merge_driver monitor task.

Overall there was no crash and no error message; the job simply expired when the wall time ran out. This happened 3 times in total, submitted with different wall times and other job configurations.

RickKessler commented 2 years ago

I used the terms "hung" or "died" because I cannot distinguish between them. To be a bit more clear: the CPU*.CMD script never got control back after the snlc_fit.exe program was executed. If the snlc_fit.exe program aborts or crashes, control returns to the CPU*.CMD script, which immediately calls the merge process; the merge process then reports ABORT or an unknown error. I don't know of a mechanism by which the snlc_fit.exe program can fail in such a way that it also kills the CPU*.CMD script that launched it ... except for using too much memory, but that should have left an error message.

Regarding BUSY files: those are set only by the merge process. BUSY files can delay submitting jobs, but they cannot interfere with a snlc_fit.exe job that is running.
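
To illustrate the flow I'm describing, here is a rough sketch of one block of a generated CPU*.CMD script (placeholders, not the literal code):

# one fit job in a CPU*.CMD script; any abort message lands in the LOG
snlc_fit.exe FIT_<name>.nml <job-specific arguments>  &>  <job>.LOG
# the CMD script regains control here, even if snlc_fit.exe aborted
touch <job>.DONE
# the merge/monitor step runs next and would report ABORT or an unknown error
submit_batch_jobs.py FIT_<name>.nml -m -t <tid> --cpunum <n>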

heather999 commented 2 years ago

I'm trying to go through your *.BATCH and *.LOG files. I see there were 40 similar jobs run, and I gather 4 failed and continue to fail even when resubmitted to batch - but those 4 jobs run from beginning to completion if run on a login node?

Can you run just CPU0004 (for example) on its own for this test? I'm wondering if there's any possibility that running all the jobs concurrently causes some filesystem overload.

When resubmitted to batch - it doesn't sound like the jobs fail at precisely the same point... but that may also depend on how the output is flushed to the logs?

Is there any more detailed "debug" mode for snlc_fit.exe?

Can you resubmit a job where you attempt to just rerun the specific snlc_fit.exe command that appears to fail for these jobs? I can locate those points for all but CPU0005.
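
For example - just a sketch, using the split that CPU0004 appears to die in per the log excerpt further down - you could pull the relevant command block out of the CMD file with something like:

grep -n -B 2 -A 12 'FITOPT005_SPLIT005' CPU0004_JOBLIST_cori07.CMD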

The *.BATCH files look similar, using the shared queue and running on 1 Haswell node, and indicating 1 task:

#SBATCH -J FIT_PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA.nml-CPU0000
#SBATCH -q shared
#SBATCH -A m1727
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 36:00:00
#SBATCH -L SCRATCH,project
#SBATCH -C haswell
#SBATCH --output=CPU0000_JOBLIST_cori07.LOG
#SBATCH --mem-per-cpu=2000

Just a note that likely has nothing to do with this issue: in your *.BATCH files, "project" is no longer a valid license request - see https://docs.nersc.gov/jobs/best-practices/#available-licenses-on-cori - likely you want "cfs", if you need CFS.
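
If so, the license line would simply become something like (assuming you do still need scratch plus CFS):

#SBATCH -L SCRATCH,cfs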

I'd probably try a run where you don't use shared to see if that changes the result.
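
For instance - a sketch only, queue names and limits are whatever NERSC currently documents - switching the QOS so the job gets a whole Haswell node to itself:

#SBATCH -q regular
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -C haswell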

Interestingly, CPU0004 and CPU0005 ran concurrently on the same node, nid00558, and none of your other jobs used that node. CPU0023 used nid00539; CPU0032 used nid00562, as did the successful job CPU0026.

You can learn a little bit by using sacct -j <jobId>, and it doesn't look like a memory issue. CPU0004 ran for 08:01:24, so it didn't exceed the time limit. The reported exit signal in all cases is 9.
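
For reference, a sacct call along these lines (standard Slurm format fields) shows the state, exit code, elapsed time, and peak memory:

sacct -j <jobId> --format=JobID,JobName%30,State,ExitCode,Elapsed,Timelimit,MaxRSS,NodeList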

Looking at the *.CMD files, such as CPU0004_JOBLIST_cori07.CMD, I see repeated calls to snlc_fit.exe followed by merge driver calls like

/global/cfs/cdirs/lsst/groups/SN/snana/SNANA/util/submit_batch/submit_batch_jobs.py FIT_PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA.nml -m -t 80973 --cpunum 4

Focusing on CPU0004_JOBLIST_cori07.LOG which terminates with

Wait for snlc_fit.exe if SNANA make is in progress
snlc_fit.exe exists -> continue
Wait for ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0023/FITOPT000.FITRES
../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0023/FITOPT000.FITRES exists -> continue

Poking around in CPU0004_JOBLIST_cori07.CMD where this file is referenced, I see the job terminates while writing the log PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0023_FITOPT005_SPLIT005.LOG, which ends with:

 MINUIT WARNING IN HESSE   
 ============== Negative diagonal element  5 in Error Matrix
 MINUIT WARNING IN HESSE   
 ==============     0.37E+06 added to diagonal of error matrix
 EIGENVALUES OF SECOND-DERIVATIVE MATRIX:
        -0.1513E+02  0.9780E+00  0.9956E+00  0.1023E+01  0.1713E+02
 MINUIT WARNING IN HESSE   
 ============== MATRIX FORCED POS-DEF BY ADDING   15.143     TO DIAGONAL.

When you successfully run on the login nodes, do the logs look similar and then get beyond this point?

For CPU0005, I don't see any indication that the job was within a snlc_fit.exe step; there are no partial output files at the point where I think the log indicates it is failing:

echo '../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029/FITOPT000.FITRES exists -> continue' 

snlc_fit.exe FIT_PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA.nml \
     VERSION_PHOTOMETRY PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029 \
     JOBSPLIT 6 40 \
     TEXTFILE_PREFIX  PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029_FITOPT011_SPLIT006 \
     ROOTFILE_OUT     PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029_FITOPT011_SPLIT006.ROOT \
   MAGOBS_SHIFT_ZP 'Y .005' \
   SNCID_LIST_FILE ../PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029/FITOPT000.FITRES \
   OPT_SNCID_LIST 1 \
  &>  PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029_FITOPT011_SPLIT006.LOG
touch PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0029_FITOPT011_SPLIT006.DONE

and CPU0032 also has a truncated log: PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0030_FITOPT004_SPLIT033.LOG

For CPU0023 I see the log where the job seemed to terminate: PIP_PLASTICC_COMBINED_PLASTICC_SIMDATA-0005_FITOPT004_SPLIT024.LOG

   CALL MNFIT_DRIVER for ISN=  297   CID=1144567   NFITPAR=  9   iter=1

 ############################################################

  MINUIT RELEASE 96.03  INITIALIZED.   DIMENSIONS 100/110  EPSMAC=  0.89E-15
   Set MINUIT STRATEGY =            2

  ------------------------------------------------ 
    MNFIT_DRIVER: begin fit with MINIMIZE 
 MINUIT WARNING IN MIGRAD  
 ============== Negative diagonal element  2 in Error Matrix
 MINUIT WARNING IN MIGRAD  
 ==============     0.10E+01 added to diagonal of error matrix

heather999 commented 2 years ago

You could consider rerunning one of those jobs - if possible, the particular execution of snlc_fit.exe that consistently causes the problem - and attempting to log into the compute node where it is running: https://docs.nersc.gov/jobs/monitoring/#how-to-log-into-compute-nodes-running-your-jobs
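
Roughly, per the NERSC doc above, that amounts to finding the node from a login node and then attaching to it, e.g.:

squeue -u $USER -o "%.10i %.9P %.20j %.8T %R"   # find the job id and its node (e.g. nid00558)
ssh nid00558                                    # from a Cori login node, once the job is running there
top -u $USER                                    # watch the snlc_fit.exe memory use in place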

RickKessler commented 2 years ago

I think this was eventually traced to other jobs on the same node taking too much memory. Requesting a full node solves the problem (for making alerts), but we have to wait a few days in the queue. We need a better solution longer-term.

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been 90 days since last activity. If no further activities take place, this issue will be closed in 14 days.
Add label keep to keep this issue.