job 7082041 terminated by root?

jchodera commented 8 years ago

Job 7082041 (started 18 Apr 2016) seems to have been killed with a "Terminated" message, but unlike normal jobs that exceed batch queue limits, no explanation was written to the end of the spool file. Instead, I have:

[chodera@mskcc-ln1 ~]$ cat cluster-CK2.o7082041 
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Mon Apr 18 14:48:05 EDT 2016
24
Mon Apr 18 14:48:05 EDT 2016
loading reference topology...
Initializing featurizer...
Obtaining file info: 100% (558/558) [##############################] eta 00:01 -There are 1006751 frames total in 558 trajectories.
Clustering...
18-04-16 14:48:44 pyemma.coordinates.clustering.uniform_time.UniformTimeClustering[0] INFO     number of threads obtained from env variable 'OMP_NUM_THREADS'=24
Elapsed time 1528.390 s
getting output of UniformTimeClustering:  20% (117/558) [#      ] eta 28:34:54 \Terminated

The logs suggest that the job was terminated at the request of root:

04/18/2016 22:55:25;0008;PBS_Server.12149;Job;7082041.hal-sched1.local;Job deleted at request of root@hal-sched1
04/18/2016 22:55:25;000d;PBS_Server.12149;Job;7082041.hal-sched1.local;preparing to send 'd' mail for job 7082041.hal-sched1.local to chodera@mskcc-ln1.local (Job deleted at request of root@hal-sched1

Any idea what happened?
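
For anyone retracing this, here is a minimal Python sketch that splits a Torque server log record like the ones above into fields and works out how long the job had been running when it was deleted. The field names are my own labels for what look like six semicolon-separated fields, and the helper names are hypothetical. It also assumes the `date` output in the spool file and the Torque log timestamp are in the same local time zone, so the zone abbreviation is simply dropped.

```python
from datetime import datetime


def parse_torque_record(line):
    """Split one Torque server log line into its semicolon-separated fields."""
    # maxsplit=5 keeps any semicolons inside the free-text message intact.
    timestamp, event, daemon, obj_type, obj_id, message = line.split(";", 5)
    return {
        "time": datetime.strptime(timestamp, "%m/%d/%Y %H:%M:%S"),
        "event": event,
        "daemon": daemon,
        "type": obj_type,
        "id": obj_id,
        "message": message,
    }


def parse_date_output(text):
    """Parse `date` output such as 'Mon Apr 18 14:48:05 EDT 2016'."""
    parts = text.split()
    # Drop the zone abbreviation (assumption: both timestamps share one zone).
    del parts[4]
    return datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")


record = parse_torque_record(
    "04/18/2016 22:55:25;0008;PBS_Server.12149;Job;"
    "7082041.hal-sched1.local;Job deleted at request of root@hal-sched1"
)
started = parse_date_output("Mon Apr 18 14:48:05 EDT 2016")
elapsed = record["time"] - started

print(record["id"], "-", record["message"])
print("elapsed before deletion:", elapsed)
```

The elapsed time can then be compared against whatever wallclock limit the job requested.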

tatarsky commented 8 years ago

Well, I'm the only person that would actually issue such a command and I did not. So let me look closer at the Moab logs.

tatarsky commented 8 years ago

Moab seems to think the job hit a few of its resource limits, and ended it. If I'm reading this right, the wallclock limit was the one that actually resulted in the "kill".

Do you have a spool file with the "e" prefix? You are showing the stdout above unless you combined them.

2016-04-18T22:55:25.205-0400    39633   TRACE1  MLimit.c:MLimitEnforceAll:524   0               job 7082041 violates requested MEM soft limit (3397 > 3072)
2016-04-18T22:55:25.205-0400    39633   TRACE1  MLimit.c:MLimitEnforceAll:524   0               job 7082041 violates requested MEM hard limit (3397 > 3379)
2016-04-18T22:55:25.205-0400    39633   TRACE1  MSys.c:MSysRegEvent:453 0               MSysRegEvent(JOBRESVIOLATION:  job '7082041' in state Running has exceeded MEM resource hard limit (3397 > 3379) (action CANCEL  will be taken)  job start time: Mon Apr 18 14:48:05,0,1)
MPBSJobCancel(7082041,MSKCC,CMsg,EMsg,job 7082041 exceeded MEM usage hard limit (3397 > 3379))
2016-04-18T22:55:25.255-0400    39633   INFO    MPBSI.c:MPBSJobCancel:6292      0x1000065       job:7082041,rm:MSKCC    Job 7082041 was canceled. 
2016-04-18T22:55:25.255-0400    39633   TRACE1  MObject.c:MOWriteEvent:61       0               MOWriteEvent(O,job,JOBCANCEL,job 7082041 exceeded MEM usage hard limit (3397 > 3379),FP,NULL)
0x1000065       job:7082041     Job 7082041 was canceled. job 7082041 exceeded MEM usage hard limit (3397 > 3379)
2016-04-18T22:55:25.256-0400    39633   INFO    MRMJob.c:MRMJobCancel:1060      0x1000065       job:7082041     Job 7082041 was canceled. job 7082041 exceeded MEM usage hard limit (3397 > 3379)
2016-04-18T22:55:25.256-0400    39633   TRACE1  MLimit.c:MLimitCancelJob:96     0               job '7082041' has been cancelled for exceeding its wallclock limit. Setting effective end time to current time to stop further charges.
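
To pull the limit violations out of a Moab log excerpt like this one, something along these lines might help. It is only a sketch: the regular expression is written against the message wording seen in these particular lines, not against any documented Moab log format, and the function and field names are hypothetical. The log does not state units for the values, so none are assumed.

```python
import re

# Matches messages of the form seen above, e.g.
# "job 7082041 violates requested MEM soft limit (3397 > 3072)".
LIMIT_RE = re.compile(
    r"job (?P<job>\d+) violates requested (?P<resource>\w+) "
    r"(?P<kind>soft|hard) limit \((?P<used>\d+) > (?P<limit>\d+)\)"
)


def find_limit_violations(lines):
    """Return one dict per 'violates requested ... limit' message found."""
    violations = []
    for line in lines:
        match = LIMIT_RE.search(line)
        if match is None:
            continue
        used = int(match.group("used"))
        limit = int(match.group("limit"))
        violations.append({
            "job": match.group("job"),
            "resource": match.group("resource"),
            "kind": match.group("kind"),
            "used": used,
            "limit": limit,
            "over": used - limit,  # how far past the limit the job was
        })
    return violations


sample = [
    "2016-04-18T22:55:25.205-0400    39633   TRACE1  MLimit.c:MLimitEnforceAll:524   0"
    "               job 7082041 violates requested MEM soft limit (3397 > 3072)",
    "2016-04-18T22:55:25.205-0400    39633   TRACE1  MLimit.c:MLimitEnforceAll:524   0"
    "               job 7082041 violates requested MEM hard limit (3397 > 3379)",
]

violations = find_limit_violations(sample)
for v in violations:
    print(v["job"], v["resource"], v["kind"], "limit exceeded by", v["over"])
```

In the excerpt above, the CANCEL action is logged against the MEM hard limit, while only the final line mentions the wallclock limit, so listing the violations side by side makes it easier to see which limit was actually exceeded.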

jchodera commented 8 years ago

I believe I used the option that combines stdout and stderr into a single spool file, which is why I was surprised by the lack of a termination explanation message.

Thanks for finding the relevant section of the Torque log! I had grepped it, but must have missed those lines in reading through the result!

tatarsky commented 8 years ago

That is actually the Moab log. I will either add a wiki section on accessing it on the hal-sched1 machine or set up a similar replication of the log over to hal.

jchodera commented 8 years ago

Ah, OK!

jchodera commented 8 years ago

FYI, Torque log access documentation is here. Would be great to add Moab log access documentation there too once a scheme is implemented.

tatarsky commented 8 years ago

Please note: I'm just trying to figure out whether I can tell Moab to use a datestamp in the log filename. That would make it more generically useful.
