Closed jchodera closed 8 years ago
Well, I'm the only person that would actually issue such a command and I did not. So let me look closer at the Moab logs.
Moab seems to think the job hit a few of its resource limits and ended it. If I'm reading this right, the wallclock limit was the one that actually resulted in the "kill".
Do you have a spool file with the "e" prefix? You are showing the stdout above unless you combined them.
2016-04-18T22:55:25.205-0400 39633 TRACE1 MLimit.c:MLimitEnforceAll:524 0 job 7082041 violates requested MEM soft limit (3397 > 3072)
2016-04-18T22:55:25.205-0400 39633 TRACE1 MLimit.c:MLimitEnforceAll:524 0 job 7082041 violates requested MEM hard limit (3397 > 3379)
2016-04-18T22:55:25.205-0400 39633 TRACE1 MSys.c:MSysRegEvent:453 0 MSysRegEvent(JOBRESVIOLATION: job '7082041' in state Running has exceeded MEM resource hard limit (3397 > 3379) (action CANCEL will be taken) job start time: Mon Apr 18 14:48:05,0,1)
MPBSJobCancel(7082041,MSKCC,CMsg,EMsg,job 7082041 exceeded MEM usage hard limit (3397 > 3379))
2016-04-18T22:55:25.255-0400 39633 INFO MPBSI.c:MPBSJobCancel:6292 0x1000065 job:7082041,rm:MSKCC Job 7082041 was canceled.
2016-04-18T22:55:25.255-0400 39633 TRACE1 MObject.c:MOWriteEvent:61 0 MOWriteEvent(O,job,JOBCANCEL,job 7082041 exceeded MEM usage hard limit (3397 > 3379),FP,NULL)
0x1000065 job:7082041 Job 7082041 was canceled. job 7082041 exceeded MEM usage hard limit (3397 > 3379)
2016-04-18T22:55:25.256-0400 39633 INFO MRMJob.c:MRMJobCancel:1060 0x1000065 job:7082041 Job 7082041 was canceled. job 7082041 exceeded MEM usage hard limit (3397 > 3379)
2016-04-18T22:55:25.256-0400 39633 TRACE1 MLimit.c:MLimitCancelJob:96 0 job '7082041' has been cancelled for exceeding its wallclock limit. Setting effective end time to current time to stop further charges.
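For anyone scanning a Moab log for the same kind of event, here is a minimal Python sketch that pulls the "violates requested ... limit" entries for one job out of a list of log lines. The regular expression is based only on the two lines quoted above, so the format is an assumption and may not match other Moab versions or configurations. The names `VIOLATION` and `find_violations` are made up for illustration.

```python
import re

# Assumed format, taken from the two "violates requested" lines quoted above.
VIOLATION = re.compile(
    r"job (?P<job>\d+) violates requested (?P<resource>\w+) "
    r"(?P<kind>soft|hard) limit \((?P<used>\d+) > (?P<limit>\d+)\)"
)


def find_violations(lines, job_id):
    """Return (resource, kind, used, limit) tuples logged for one job."""
    found = []
    for line in lines:
        match = VIOLATION.search(line)
        if match and match.group("job") == str(job_id):
            found.append((
                match.group("resource"),
                match.group("kind"),
                int(match.group("used")),
                int(match.group("limit")),
            ))
    return found


# The two limit-violation lines from the log excerpt above.
sample = [
    "2016-04-18T22:55:25.205-0400 39633 TRACE1 MLimit.c:MLimitEnforceAll:524 0 "
    "job 7082041 violates requested MEM soft limit (3397 > 3072)",
    "2016-04-18T22:55:25.205-0400 39633 TRACE1 MLimit.c:MLimitEnforceAll:524 0 "
    "job 7082041 violates requested MEM hard limit (3397 > 3379)",
]

for violation in find_violations(sample, 7082041):
    print(violation)
```

To run it against a real log, pass an open file object instead of `sample`. The path to the Moab log depends on the site and is not given here.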
I believe I used the option that combines stdout and stderr into a single spool file, which is why I was surprised by the lack of a termination explanation message.
Thanks for finding the relevant section of the Torque log! I had grepped it, but must have missed those lines in reading through the result!
That actually is the Moab log. I will add a wiki section on accessing that on the hal-sched1 machine or I will do a similar replication of the log over to hal.
Ah, OK!
FYI, Torque log access documentation is here. Would be great to add Moab log access documentation there too once a scheme is implemented.
Please note that I'm just trying to figure out whether I can tell Moab to use a datestamp in the log filename. That would make it more generically useful.
Job 7082041 (started 18 Apr 2016) seems to have been killed with a Terminated message, but unlike normal jobs that exceed batch queue limits, no explanation was dumped to the end of the spool file. Instead, I have
The logs suggest that the job was terminated at the request of root:
Any idea what happened?