dmwm / CRABServer

15 stars 38 forks source link

add support for new PeriodicRemoval reason CPU over limit #7141

Closed belforte closed 2 years ago

belforte commented 2 years ago

see this e-log entry from Saqib: https://cms-logbook.cern.ch/elog/GlideInWMS/7950

belforte commented 2 years ago

background info. The example job reported in the e-log entry is clusterId 66825886 on vocms0106. Belonging to 220316_151230:dmarckx_crab_RunIISummer20UL18MiniAOD-106X_upgrade2018_realistic_v11_L1v1-v1_singlelepton_MC_2018_ULHZY_pileupId at https://cmsweb.cern.ch/crabserver/ui/task/220316_151230%3Admarckx_crab_RunIISummer20UL18MiniAOD-106X_upgrade2018_realistic_v11_L1v1-v1_singlelepton_MC_2018_ULHZY_pileupId

with CRAB id 21 and retry number 2.

the ads logged in SPOOL_DIR/job_log are [1] and contain

RemoveReason = "removed by SYSTEM_PERIODIC_REMOVE due to ** CPU usage over limit"

The submission JDL contains

+PeriodicRemoveReason = ifThenElse(time() - EnteredCurrentStatus > 7*24*60*60 && isUndefined(MemoryUsage),
 "Removed due to idle time limit", 
 ifThenElse(time() > x509UserProxyExpiration, "Removed job due to proxy expiration",  
 ifThenElse(MemoryUsage > RequestMemory, "Removed due to memory use",
 ifThenElse(MaxWallTimeMinsRun*60 < time() - EnteredCurrentStatus, "Removed due to wall clock limit",
 ifThenElse(DiskUsage >  20000000, "Removed due to disk usage",
 ifThenElse(time() > CRAB_TaskEndTime, "Removed due to reached CRAB_TaskEndTime",
 "Removed due to job being held"))))))

and it is indeed defined in DagmanCreator as Saqib indicated: https://github.com/dmwm/CRABServer/blob/7543003201d04d7292e557f445fc55af44dcda04/src/python/TaskWorker/Actions/DagmanCreator.py#L167

Indeed PostJob does not manage to understand why it failed and reports in https://cmsweb.cern.ch:8443/scheddmon/0106/dmarckx/220316_151230:dmarckx_crab_RunIISummer20UL18MiniAOD-106X_upgrade2018_realistic_v11_L1v1-v1_singlelepton_MC_2018_ULHZY_pileupId/postjob.21.2.txt


Thu, 17 Mar 2022 15:51:10 CET(+0100):DEBUG:PostJob Dashboard exit code already set on the worker node. Continuing normally.
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob status: FAILED. ClassAd attributes to set are: {'CRAB_PostJobStatus': '"FAILED"'}
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob        -----> Started attempt 1 out of 5 -----
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob        -----> Finished attempt 1 out of 5 -----
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob ====== Finished to update job ClassAd.
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob ====== Finished to analyze job exit status.
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob ====== Starting to prepare error report.
Thu, 17 Mar 2022 15:51:10 CET(+0100):DEBUG:PostJob Acquiring lock on error summary file.
Thu, 17 Mar 2022 15:51:10 CET(+0100):DEBUG:PostJob Acquired lock on error summary file.
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob 'exitCode' key not found in the report
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob Updating error summary for jobid 21 retry 2 with following information:
'exit code' = -1
'exit message' = The framework job report could be loaded, but no error message was found there.
Thu, 17 Mar 2022 15:51:10 CET(+0100):DEBUG:PostJob Writing error summary file
Thu, 17 Mar 2022 15:51:10 CET(+0100):DEBUG:PostJob Written error summary file
Thu, 17 Mar 2022 15:51:10 CET(+0100):INFO:PostJob ====== Finished to prepare error report.

This is quite strange, long time since I saw exit code = -1 last time !!

[1]

[crabtw@vocms0106 cluster66789620.proc0.subproc0]$ condor_q 66825886 -userlog ./job_log -l
ClusterId = 66825886
Cmd = "???"
CompletionDate = 0
EnteredCurrentStatus = 1647528666
GlobalJobId = "localhost#66825886.0#1647523686"
ImageSize = 5945756
JobCurrentStartDate = 1647528353
JOBGLIDEIN_CMSSite = "$$([ifThenElse(GLIDEIN_CMSSite is undefined, \"Unknown\", GLIDEIN_CMSSite)])"
JobPrio = 0
JobStatus = 3
LastJobStatus = 2
LastSuspensionTime = 0
LocalSysCpu = 0.0
LocalUserCpu = 0.0
MATCH_EXP_JOBGLIDEIN_CMSSite = "T2_US_Caltech"
MyType = "Job"
NumJobStarts = 1
NumSystemHolds = 0
Owner = "crabtw"
ProcId = 0
ProportionalSetSize = 1228637
QDate = 1647523686
RemoteHost = "<172.17.0.2:35779?CCBID=128.142.194.25:9618%3faddrs%3d128.142.194.25-9618+[2001-1458-301-cd--100-364]-9618%26alias%3dvocms0813.cern.ch%26noUDP%26sock%3dcollector9701#3236487%20131.225.189.210:9618%3faddrs%3d131.225.189.210-9618+[2620-6a-0-8421--cd]-9618%26alias%3dcmssrvz03.fnal.gov%26noUDP%26sock%3dcollector9677#1834863&PrivNet=cmspilot-2949160.0-blade-18.tier2&addrs=172.17.0.2-35779&alias=cmspilot-2949160.0-blade-18.tier2&noUDP>"
RemoteSysCpu = 11.0
RemoteUserCpu = 451.0
RemoteWallClockTime = 0.0
RemoveReason = "Job removed by SYSTEM_PERIODIC_REMOVE due to ** CPU usage over limit"
ResidentSetSize = 1240224
ServerTime = 1647533124
StartdIpAddr = "<172.17.0.2:35779?CCBID=128.142.194.25:9618%3faddrs%3d128.142.194.25-9618+[2001-1458-301-cd--100-364]-9618%26alias%3dvocms0813.cern.ch%26noUDP%26sock%3dcollector9701#3236487%20131.225.189.210:9618%3faddrs%3d131.225.189.210-9618+[2620-6a-0-8421--cd]-9618%26alias%3dcmssrvz03.fnal.gov%26noUDP%26sock%3dcollector9677#1834863&PrivNet=cmspilot-2949160.0-blade-18.tier2&addrs=172.17.0.2-35779&alias=cmspilot-2949160.0-blade-18.tier2&noUDP>"
SubmitHost = "<188.184.80.88:4080?addrs=188.184.80.88-4080+[2001-1458-201-e3--100-52]-4080&alias=vocms0106.cern.ch&noUDP&sock=schedd_3623348_ede7>"
TargetType = "Machine"
TotalSuspensions = 0
UserLog = "./job_log"
belforte commented 2 years ago

also the file job_fjr.21.2.json has this very strange content

{"postjob": {"exitCode": 2, "exitMsg": ""}}

did PostJob write that ? Let's see what this line in PJ correspond to code-wise

INFO:PostJob Updating error summary for jobid 21 retry 2 with following information:
'exit code' = -1
belforte commented 2 years ago

after lookin at code, I think that this needs to be taken care of in RetryJob.py and a new exit code defined, similarly to https://github.com/dmwm/CRABServer/blob/db170c6254b493b0c3a17e1cb3792957bdec2fc9/src/python/TaskWorker/Actions/RetryJob.py#L184-L212

belforte commented 2 years ago

could this be linked to the old issue https://github.com/dmwm/CRABServer/issues/5807 ?

belforte commented 2 years ago

Indeed it can be dealt with in RetryJob.py We already have exit code 50663 defined

 50663 - Application terminated by wrapper because using too much CPU time 
belforte commented 2 years ago

I have found that RetryJob and PostJob do not print useful/correct information if period remove hits after the job terminated so that there is already an FJR, in that case the error is considered recoverable and job is retried as per https://github.com/dmwm/CRABServer/blob/48be5e1722f4cbcfbe2191b2132410d77d919681/src/python/TaskWorker/Actions/RetryJob.py#L473-L478 I think we should catch also the other cases of job killed by system_periodic_remove and have a clean place where to decide if to retry or not. For another issue...

belforte commented 2 years ago

with above commit https://github.com/belforte/CRABServer/commit/dbeb67de98c430fcbe5cb97aedcd917e9ce6f38a crab status is

1 jobs failed with exit code 50663:

    1 jobs failed with following error message: (for example, job 1)

        Not retrying job due CPU limit (using more CPU than Wall Clock)