Closed belforte closed 2 years ago
Background info:
The example job reported in the e-log entry is clusterId 66825886 on vocms0106.
Belonging to 220316_151230:dmarckx_crab_RunIISummer20UL18MiniAOD-106X_upgrade2018_realistic_v11_L1v1-v1_singlelepton_MC_2018_ULHZY_pileupId
at https://cmsweb.cern.ch/crabserver/ui/task/220316_151230%3Admarckx_crab_RunIISummer20UL18MiniAOD-106X_upgrade2018_realistic_v11_L1v1-v1_singlelepton_MC_2018_ULHZY_pileupId
with CRAB id 21
and retry number 2.
The ClassAds logged in SPOOL_DIR/job_log are shown in [1] and contain
RemoveReason = "removed by SYSTEM_PERIODIC_REMOVE due to ** CPU usage over limit"
The submission JDL contains
+PeriodicRemoveReason = ifThenElse(time() - EnteredCurrentStatus > 7*24*60*60 && isUndefined(MemoryUsage),
"Removed due to idle time limit",
ifThenElse(time() > x509UserProxyExpiration, "Removed job due to proxy expiration",
ifThenElse(MemoryUsage > RequestMemory, "Removed due to memory use",
ifThenElse(MaxWallTimeMinsRun*60 < time() - EnteredCurrentStatus, "Removed due to wall clock limit",
ifThenElse(DiskUsage > 20000000, "Removed due to disk usage",
ifThenElse(time() > CRAB_TaskEndTime, "Removed due to reached CRAB_TaskEndTime",
"Removed due to job being held"))))))
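For readability, here is the same nested ifThenElse chain rewritten as a stand-alone Python function. This is only a mimic for illustration (HTCondor evaluates the ClassAd expression itself); attribute names follow the JDL above:

```python
import time

def periodic_remove_reason(ad):
    """Mimic of the PeriodicRemoveReason ClassAd chain; `ad` is a dict of job ad attributes."""
    now = time.time()
    idle_over_week = now - ad["EnteredCurrentStatus"] > 7 * 24 * 60 * 60
    if idle_over_week and ad.get("MemoryUsage") is None:  # isUndefined(MemoryUsage)
        return "Removed due to idle time limit"
    if now > ad["x509UserProxyExpiration"]:
        return "Removed job due to proxy expiration"
    # in ClassAds an undefined MemoryUsage makes the comparison undefined (falls through);
    # a default of 0 approximates that here
    if ad.get("MemoryUsage", 0) > ad["RequestMemory"]:
        return "Removed due to memory use"
    if ad["MaxWallTimeMinsRun"] * 60 < now - ad["EnteredCurrentStatus"]:
        return "Removed due to wall clock limit"
    if ad["DiskUsage"] > 20000000:
        return "Removed due to disk usage"
    if now > ad["CRAB_TaskEndTime"]:
        return "Removed due to reached CRAB_TaskEndTime"
    return "Removed due to job being held"
```

Note that no branch of this chain mentions CPU usage, which is consistent with the observed RemoveReason ("CPU usage over limit") coming from a SYSTEM_PERIODIC_REMOVE expression defined elsewhere, not from this JDL.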
and it is indeed defined in DagmanCreator as Saqib indicated: https://github.com/dmwm/CRABServer/blob/7543003201d04d7292e557f445fc55af44dcda04/src/python/TaskWorker/Actions/DagmanCreator.py#L167
Indeed, PostJob does not manage to work out why the job failed, and reports in https://cmsweb.cern.ch:8443/scheddmon/0106/dmarckx/220316_151230:dmarckx_crab_RunIISummer20UL18MiniAOD-106X_upgrade2018_realistic_v11_L1v1-v1_singlelepton_MC_2018_ULHZY_pileupId/postjob.21.2.txt
Thu, 17 Mar 2022 15:51:10 CET(+0100):DEBUG:PostJob Dashboard exit code already set on the worker node. Continuing normally.
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob status: FAILED. ClassAd attributes to set are: {'CRAB_PostJobStatus': '"FAILED"'}
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob -----> Started attempt 1 out of 5 -----
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob -----> Finished attempt 1 out of 5 -----
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob ====== Finished to update job ClassAd.
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob ====== Finished to analyze job exit status.
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob ====== Starting to prepare error report.
Thu, 17 Mar 2022 15:51:10 CET(+0100):DEBUG:PostJob Acquiring lock on error summary file.
Thu, 17 Mar 2022 15:51:10 CET(+0100):DEBUG:PostJob Acquired lock on error summary file.
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob 'exitCode' key not found in the report
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob Updating error summary for jobid 21 retry 2 with following information:
'exit code' = -1
'exit message' = The framework job report could be loaded, but no error message was found there.
Thu, 17 Mar 2022 15:51:10 CET(+0100):DEBUG:PostJob Writing error summary file
Thu, 17 Mar 2022 15:51:10 CET(+0100):DEBUG:PostJob Written error summary file
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob ====== Finished to prepare error report.
This is quite strange; it has been a long time since I last saw exit code = -1 !!
[1]
[crabtw@vocms0106 cluster66789620.proc0.subproc0]$ condor_q 66825886 -userlog ./job_log -l
ClusterId = 66825886
Cmd = "???"
CompletionDate = 0
EnteredCurrentStatus = 1647528666
GlobalJobId = "localhost#66825886.0#1647523686"
ImageSize = 5945756
JobCurrentStartDate = 1647528353
JOBGLIDEIN_CMSSite = "$$([ifThenElse(GLIDEIN_CMSSite is undefined, \"Unknown\", GLIDEIN_CMSSite)])"
JobPrio = 0
JobStatus = 3
LastJobStatus = 2
LastSuspensionTime = 0
LocalSysCpu = 0.0
LocalUserCpu = 0.0
MATCH_EXP_JOBGLIDEIN_CMSSite = "T2_US_Caltech"
MyType = "Job"
NumJobStarts = 1
NumSystemHolds = 0
Owner = "crabtw"
ProcId = 0
ProportionalSetSize = 1228637
QDate = 1647523686
RemoteHost = "<172.17.0.2:35779?CCBID=128.142.194.25:9618%3faddrs%3d128.142.194.25-9618+[2001-1458-301-cd--100-364]-9618%26alias%3dvocms0813.cern.ch%26noUDP%26sock%3dcollector9701#3236487%20131.225.189.210:9618%3faddrs%3d131.225.189.210-9618+[2620-6a-0-8421--cd]-9618%26alias%3dcmssrvz03.fnal.gov%26noUDP%26sock%3dcollector9677#1834863&PrivNet=cmspilot-2949160.0-blade-18.tier2&addrs=172.17.0.2-35779&alias=cmspilot-2949160.0-blade-18.tier2&noUDP>"
RemoteSysCpu = 11.0
RemoteUserCpu = 451.0
RemoteWallClockTime = 0.0
RemoveReason = "Job removed by SYSTEM_PERIODIC_REMOVE due to ** CPU usage over limit"
ResidentSetSize = 1240224
ServerTime = 1647533124
StartdIpAddr = "<172.17.0.2:35779?CCBID=128.142.194.25:9618%3faddrs%3d128.142.194.25-9618+[2001-1458-301-cd--100-364]-9618%26alias%3dvocms0813.cern.ch%26noUDP%26sock%3dcollector9701#3236487%20131.225.189.210:9618%3faddrs%3d131.225.189.210-9618+[2620-6a-0-8421--cd]-9618%26alias%3dcmssrvz03.fnal.gov%26noUDP%26sock%3dcollector9677#1834863&PrivNet=cmspilot-2949160.0-blade-18.tier2&addrs=172.17.0.2-35779&alias=cmspilot-2949160.0-blade-18.tier2&noUDP>"
SubmitHost = "<188.184.80.88:4080?addrs=188.184.80.88-4080+[2001-1458-201-e3--100-52]-4080&alias=vocms0106.cern.ch&noUDP&sock=schedd_3623348_ede7>"
TargetType = "Machine"
TotalSuspensions = 0
UserLog = "./job_log"
Also, the file job_fjr.21.2.json has this very strange content:
{"postjob": {"exitCode": 2, "exitMsg": ""}}
Did PostJob write that? Let's see what this line in PJ corresponds to code-wise:
INFO:PostJob Updating error summary for jobid 21 retry 2 with following information:
'exit code' = -1
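The log lines above amount to a fallback of the following shape. This is a minimal sketch, not the actual PostJob implementation; the function name is an assumption:

```python
import json

def exit_info_from_fjr(fjr_path):
    # Hypothetical helper mirroring the log messages above, NOT the real PostJob code.
    try:
        with open(fjr_path) as fd:
            report = json.load(fd)
    except (IOError, ValueError):
        return None, "The framework job report could not be loaded."
    if "exitCode" not in report:
        # The branch the log shows: the report loaded, but no top-level exitCode was found.
        return -1, ("The framework job report could be loaded, "
                    "but no error message was found there.")
    return report["exitCode"], report.get("exitMsg", "")
```

Note that in the FJR quoted above the exitCode sits inside a "postjob" sub-dictionary, which would explain why a top-level 'exitCode' lookup fails and the error summary falls back to -1.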
After looking at the code, I think this needs to be taken care of in RetryJob.py, with a new exit code defined, similarly to https://github.com/dmwm/CRABServer/blob/db170c6254b493b0c3a17e1cb3792957bdec2fc9/src/python/TaskWorker/Actions/RetryJob.py#L184-L212
could this be linked to the old issue https://github.com/dmwm/CRABServer/issues/5807 ?
Indeed it can be dealt with in RetryJob.py. We already have exit code 50663 defined:
50663 - Application terminated by wrapper because using too much CPU time
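A sketch of the kind of check that could be added, modelled on the RetryJob checks linked above. The constant, exception, and function names are assumptions for illustration, not the actual patch:

```python
# Hypothetical RetryJob-style check: map a SYSTEM_PERIODIC_REMOVE reason that
# mentions CPU over-use to the existing CRAB exit code 50663 and treat it as fatal.
EXCESSIVE_CPU = 50663  # "Application terminated by wrapper because using too much CPU time"

class FatalError(Exception):
    """Raised when the job must not be retried."""

def check_remove_reason(ad):
    """Inspect the job ad's RemoveReason and raise FatalError for CPU-over-limit removals."""
    reason = ad.get("RemoveReason", "")
    if "SYSTEM_PERIODIC_REMOVE" in reason and "CPU usage over limit" in reason:
        # message matches the crab status output quoted later in this issue
        raise FatalError("Not retrying job due CPU limit (using more CPU than Wall Clock)")
```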
I have found that RetryJob and PostJob do not print useful/correct information if the periodic remove hits after the job has terminated, so that there is already an FJR; in that case the error is considered recoverable and the job is retried as per
https://github.com/dmwm/CRABServer/blob/48be5e1722f4cbcfbe2191b2132410d77d919681/src/python/TaskWorker/Actions/RetryJob.py#L473-L478
I think we should also catch the other cases of jobs killed by SYSTEM_PERIODIC_REMOVE
and have a clean place where we decide whether to retry or not. That is for another issue...
With the above commit https://github.com/belforte/CRABServer/commit/dbeb67de98c430fcbe5cb97aedcd917e9ce6f38a crab status reports:
1 jobs failed with exit code 50663:
1 jobs failed with following error message: (for example, job 1)
Not retrying job due CPU limit (using more CPU than Wall Clock)
See this e-log entry from Saqib: https://cms-logbook.cern.ch/elog/GlideInWMS/7950