dmwm / CRABServer


A perfect job considered as failed because of excessive memory use #4241

Closed pkonst closed 9 years ago

pkonst commented 10 years ago

There are probably other jobs like this but let's look at just one.

The task is 140311_175214_vocms20:pkonst_crab_postfeb_scale_PK_6h_01 and its Glidemon page is: http://glidemon.web.cern.ch/glidemon/jobs.php?taskname=140311_175214_vocms20%3Apkonst_crab_postfeb_scale_PK_6h_01&showResubmits=1

and the job is number 46.

There are two problems, possibly related.

Problem 1. Looking at the job stdout http://glidemon.web.cern.ch/glidemon/show.php?log=http://vocms20.cern.ch/mon/cms535/140311_175214_vocms20:pkonst_crab_postfeb_scale_PK_6h_01/job_out.46.0.txt one sees that it finished perfectly.

Yet the postjob considers it to be an unrecoverable failure because of too much memory used. Quoting from the postjob stdout at http://glidemon.web.cern.ch/glidemon/show.php?log=http://vocms20.cern.ch/mon/cms535/140311_175214_vocms20:pkonst_crab_postfeb_scale_PK_6h_01/postjob.46.0.txt:

Not retrying job due to excessive memory use (job killed by CRAB3 watchdog)
2014-03-12 04:39:13,619:INFO:PostJob The retry handler indicated this was a fatal error.
2014-03-12 04:39:13,619:INFO:PostJob Setting Dashboard state to Aborted.

And indeed the job is not retried.

Taking the condor job id 863008.0 from Glidemon (hoping it is correct) and doing condor_history -l 863008.0 on vocms20, we indeed see:

RemoveReason = "Removed due to memory use"

We also have:

MATCH_EXP_JOB_GLIDEIN_Memory = "2500"
MATCH_Memory = 2500
RequestMemory = 2000

Maybe the rounding of RequestMemory to achieve better job clustering for the matching is playing a role here. But in any case, according to its stdout the job does not seem to have been removed by condor.

Problem 2. Glidemon says the job ran at T2_US_MIT and this comes from the job classads. condor_history -l 863008.0 shows:

MATCH_GLIDEIN_Site = "MIT"
MATCH_EXP_JOB_GLIDEIN_Entry_Name = "CMS_T2_US_MIT_ce02"
MATCH_GLIDEIN_CMSSite = "T2_US_MIT"
MATCH_GLIDEIN_Entry_Name = "CMS_T2_US_MIT_ce02"
MATCH_EXP_JOB_GLIDEIN_CMSSite = "T2_US_MIT"
MATCH_EXP_JOB_CMSSite = "T2_US_MIT"
LastRemoteHost = "glidein_14760@T2BAT0216.CMSAF.MIT.EDU"
MATCH_EXP_JOB_GLIDEIN_Site = "MIT"
MATCH_EXP_JOB_Site = "MIT"
MATCH_EXP_JOBGLIDEIN_CMSSite = "T2_US_MIT"

So far so good, but in the job stdout we see:

Execution site: T2_US_Florida

and the local stageout goes to Florida too:

Stage Out Successful: {'SEName': 'srm.ihepa.ufl.edu', 'LFN': '/store/temp/user/pkonst.0d6c0ae7ff2f278e23a483dc3e15d3995c788519/GenericTTbar/140311_175214_crab_postfeb_scale_PK_6h_01/140311_175214/0000/MyTTBarTauolaTest_46.root', 'StageOutCommand': 'cp', 'PFN': '/cms/data/store/temp/user/pkonst.0d6c0ae7ff2f278e23a483dc3e15d3995c788519/GenericTTbar/140311_175214_crab_postfeb_scale_PK_6h_01/140311_175214/0000/MyTTBarTauolaTest_46.root'}

In the classad printout on the WN, the site T2_FR_IPHC appears too:

== JOB AD: JOB_GLIDEIN_Entry_Name = "CMS_T2_US_Florida_iogw1"
== JOB AD: JOB_CMSSite = "T2_US_Florida"
== JOB AD: MATCH_EXP_JOB_GLIDEIN_Entry_Name = "CMS_T2_FR_IPHC_sbgce2"
== JOB AD: JOBGLIDEIN_CMSSite = "T2_US_Florida"
== JOB AD: MATCH_GLIDEIN_CMSSite = "T2_FR_IPHC"
== JOB AD: MATCH_GLIDEIN_Entry_Name = "CMS_T2_FR_IPHC_sbgce2"
== JOB AD: MATCH_EXP_JOB_GLIDEIN_CMSSite = "T2_FR_IPHC"

So how does this happen?

We had a similar case before and thought classads from different retries were mixed but this time there is only one retry.

belforte commented 10 years ago

Interesting. If you look at the stdout pointed to above, the classads clearly say: Florida.

== JOB AD: JOB_GLIDEIN_Entry_Name = "CMS_T2_US_Florida_iogw1"
== JOB AD: CRAB_ReqName = "140311_175214_vocms20:pkonst_crab_postfeb_scale_PK_6h_01"
== JOB AD: JOB_CMSSite = "T2_US_Florida"

but then there are more ADs:

belforte@lxplus423/job> curl -s http://glidemon.web.cern.ch/glidemon/show.php?log=http://vocms20.cern.ch/mon/cms535/140311_175214_vocms20:pkonst_crab_postfeb_scale_PK_6h_01/job_out.46.0.txt|grep CMSSite
== JOB AD: JOB_CMSSite = "T2_US_Florida"
== JOB AD: JOBGLIDEIN_CMSSite = "T2_US_Florida"
== JOB AD: MATCH_GLIDEIN_CMSSite = "T2_FR_IPHC"
== JOB AD: MATCH_EXP_JOB_GLIDEIN_CMSSite = "T2_FR_IPHC"
== JOB AD: JobAdInformationAttrs = "MATCH_EXP_JOBGLIDEIN_CMSSite, JOBGLIDEIN_CMSSite, RemoteSysCpu, RemoteUserCpu"
== JOB AD: MATCH_EXP_JOB_CMSSite = "T2_FR_IPHC"
== JOB AD: JOB_GLIDEIN_CMSSite = "T2_US_Florida"
== JOB AD: Requirements = ( ( ( target.IS_GLIDEIN =!= true ) || ( target.GLIDEIN_CMSSite =!= undefined ) ) ) && ( ( Memory >= 1 ) && ( Disk >= 1 ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer )
== JOB AD: MATCH_EXP_JOBGLIDEIN_CMSSite = "T2_FR_IPHC"
== ENV: GLIDEIN_CMSSite=T2_US_Florida
belforte@lxplus423/job>

mmascher commented 10 years ago

Hi Preslav,

can I ask you to add a piece of information to this? What's the result of:

condor_q -debug -userlog job_log.NNN NNN

That's how the postjob gets the job ad: https://github.com/juztas/CRABServer/blob/master/src/python/TaskWorker/Actions/RetryJob.py#L57
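
For illustration, here is a minimal sketch (not the actual RetryJob.py code; the function name is made up) of how such an ad can be read back from the user log by shelling out to condor_q and parsing its long-format "Attr = value" output, the same format shown in the dumps pasted in this ticket:

import subprocess

def job_ad_from_userlog(userlog, jobid):
    """Sketch only: rebuild the job classad from the condor user log.

    Assumes condor_q -l prints one 'Attr = value' line per attribute.
    """
    out = subprocess.check_output(
        ["condor_q", "-l", "-userlog", userlog, str(jobid)],
        universal_newlines=True)
    ad = {}
    for line in out.splitlines():
        attr, sep, value = line.partition(" = ")
        if sep:
            ad[attr.strip()] = value.strip().strip('"')
    return ad

# hypothetical usage on the schedd, in the job spool directory:
# ad = job_ad_from_userlog("job_log", "863008")
# print(ad.get("RemoveReason"), ad.get("MATCH_EXP_JOBGLIDEIN_CMSSite"))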

My gut tells me it will probably be different from the job.ad printed in the job stdout.

Thanks,

Marco

pkonst commented 10 years ago

Hi Marco,

Here is the result of:

cd  /data01/glidecondor/condor_local/spool/2767/0/cluster862767.proc0.subproc0
condor_q -debug -l -userlog job_log 863008

(slightly different from what you asked for, as there is no file job_log.863008 in the job working directory on the schedd now)

[vocms20] /data01/glidecondor/condor_local/spool/2767/0/cluster862767.proc0.subproc0 > condor_q -debug -l -userlog job_log 863008
03/13/14 12:08:33 Can't open directory "/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
EnteredCurrentStatus = 1394595529
ImageSize = 2727460
LocalSysCpu = 0.0
NumJobStarts = 3
CompletionDate = 0
LocalUserCpu = 0.0
RemoteUserCpu = 4.000000
StartdIpAddr = "<18.12.8.216:56220?CCBID=169.228.38.39:9976#155538%20169.228.130.23:9976#45412&noUDP>"
LastVacateReason = "Assertion ERROR on (result)"
JobStatus = 3
Owner = "cms535"
ProportionalSetSize = -1
UserLog = "job_log"
MyType = "Job"
Cmd = "???"
JobCurrentStartDate = 1394595516
GlobalJobId = "localhost#863008.0#1394560535"
RemoteSysCpu = 0.0
LastSuspensionTime = 0
LastJobStatus = 2
TotalSuspensions = 0
RemoteHost = "<18.12.8.216:56220?CCBID=169.228.38.39:9976#155538%20169.228.130.23:9976#45412&noUDP>"
ResidentSetSize = 2018592
CurrentTime = time()
QDate = 1394560535
SubmitHost = "<128.142.229.43:4080?sock=3486_9443_16>"
LastVacateTime = 1394595217
ClusterId = 863008
JOBGLIDEIN_CMSSite = "$$([ifThenElse(GLIDEIN_CMSSite is undefined, \"Unknown\", GLIDEIN_CMSSite)])"
RemoteWallClockTime = 0.0
TargetType = "Machine"
NumSystemHolds = 0
RemoveReason = "Removed due to memory use"
JobPrio = 0
ProcId = 0
ServerTime = 1394708913
MATCH_EXP_JOBGLIDEIN_CMSSite = "T2_US_MIT"

As you suspected, this is different from what we have on stdout: MATCH_EXP_JOBGLIDEIN_CMSSite = "T2_FR_IPHC"

And

condor_history -l 863008.0

has MATCH_EXP_JOBGLIDEIN_CMSSite = "T2_US_MIT"

And yet in fact the job ran in Florida.

Also note that in both the classads given by condor_history and those printed by the job to stdout we have:

ClusterId = 863008
DAGNodeName = "Job46"
CRAB_Workflow = "140311_175214_vocms20:pkonst_crab_postfeb_scale_PK_6h_01"

Could this be a result of a factory misconfiguration? I think a factory operator can set up an entry that matches T2_US_MIT in the FE, but sends the pilot (which announces itself in its classads as running at T2_US_MIT) to a Florida CE, so that on the collector it matches a job that has T2_US_MIT in its list of desired sites. Still, that would not explain the appearance of T2_FR_IPHC.

Preslav

mmascher commented 10 years ago

Thanks Preslav, I think this goes beyond my area of expertise. The postjob is simply reading the information coming from this condor_q, which does not match the job output. It seems to me it's not a CRAB3 bug and needs external help for the investigation. Maybe we can ask Alison?

amccrea commented 10 years ago

My theory is that the job was matched to and started running on an MIT glidein, was booted because its image size was too big (2.7GB), and was then rematched to a Florida glidein, where indeed the normal nodes have 4GB (I believe). The job took the classad attributes from the MIT glidein where it was originally matched, and I guess once the job gets these attributes they never change; therefore the job has "MIT" all over it but actually ran at Florida.

(P.S. It's seriously hard to tell these attributes apart, with all their similar names. I don't know where each comes from. I'll ask Igor at the meeting today if there is some document/person who can clarify exactly what each means.)

pkonst commented 10 years ago

Thanks, Alison, for the explanation.

I'm posting below the full output of condor_history -l 863008.0 for glidein experts to examine.

Note the telltale NumJobStarts = 3, which would explain why T2_FR_IPHC is also involved. I guess the job was sent there the second time and also failed, before finally succeeding at Florida.

The postjob read the RemoveReason = "Removed due to memory use" classad from the first attempt, considered the job an unrecoverable failure, and did not retry it.

Preslav

CRAB_SaveLogsFlag = 0
JOB_GLIDEIN_ProcId = "$$(GLIDEIN_ProcId:Unknown)"
StreamOut = false
JOB_GLIDEIN_Entry_Name = "$$(GLIDEIN_Entry_Name:Unknown)"
CRAB_ReqName = "140311_175214_vocms20:pkonst_crab_postfeb_scale_PK_6h_01"
JOB_CMSSite = "$$(GLIDEIN_CMSSite:Unknown)"
CRAB_SiteBlacklist = {  }
CRAB_UserRole = undefined
TaskType = "Job"
NumRestarts = 0
MATCH_GLIDEIN_Schedd = "schedd_glideins1@gfactory-1.t2.ucsd.edu"
SubmitEventNotes = "DAG Node: Job46"
x509UserProxyVOName = "cms"
DAGParentNodeNames = ""
MATCH_GLIDEIN_Site = "MIT"
LastVacateTime = 1394600000
LocalUserCpu = 0.0
RemoteUserCpu = 4.000000
MATCH_GLIDEIN_Max_Walltime = 84600
MATCH_EXP_JOB_GLIDEIN_ClusterId = "1796851"
CRAB_StageoutPolicy = "local,remote"
CRAB_Workflow = "140311_175214_vocms20:pkonst_crab_postfeb_scale_PK_6h_01"
CurrentHosts = 0
MATCH_GLIDEIN_ProcId = 1
x509UserProxyExpiration = 1395165177
Iwd = "/data01/glidecondor/condor_local/spool/2767/0/cluster862767.proc0.subproc0"
MATCH_EXP_JOB_GLIDEIN_Entry_Name = "CMS_T2_US_MIT_ce02"
NumShadowStarts = 7
JobPrio = 10
DiskUsage = 1250000
StartdPrincipal = "execute-side@matchsession/18.12.8.216"
JOB_GLIDEIN_ToDie = "$$(GLIDEIN_ToDie:Unknown)"
JobRunCount = 7
CurrentTime = time()
MachineAttrSlotWeight0 = 1
JOB_Site = "$$(GLIDEIN_Site:Unknown)"
WantCheckpoint = false
RequestDisk = 1
TotalSuspensions = 0
DAGNodeName = "Job46"
LastPublicClaimId = "<18.12.8.216:56220>#1394574543#1#..."
PeriodicRemove = ( ( JobStatus =?= 5 ) && ( time() - EnteredCurrentStatus > 7 * 60 ) ) || ( ( JobStatus =?= 2 ) && ( ( MemoryUsage > RequestMemory ) || ( MaxWallTimeMins * 60 < time() - EnteredCurrentStatus ) ) )
JOBGLIDEIN_CMSSite = "$$([ifThenElse(GLIDEIN_CMSSite is undefined, \"Unknown\", GLIDEIN_CMSSite)])"
MATCH_GLIDEIN_CMSSite = "T2_US_MIT"
RemoteSysCpu = 0.0
CRAB_Retry = 0
MyType = "Job"
CRAB_JobType = "analysis"
PeriodicHold = false
ResidentSetSize_RAW = 2018592
JOB_GLIDEIN_Job_Max_Time = "$$(GLIDEIN_Job_Max_Time:Unknown)"
EnvDelim = ";"
MATCH_EXP_JOB_GLIDEIN_Memory = "2500"
Owner = "cms535"
MATCH_GLIDEIN_Entry_Name = "CMS_T2_US_MIT_ce02"
LastJobLeaseRenewal = 1394595529
MATCH_EXP_JOB_GLIDEIN_CMSSite = "T2_US_MIT"
CRAB_AdditionalOutputFiles = {  }
OnExitHold = false
CRAB_ASOURL = "https://cmsweb-testbed.cern.ch/couchdb"
AccountingGroup = "pkonst"
WantRemoteSyscalls = false
ExitStatus = 0
User = "cms535@cmsanalysis"
JobLeaseDuration = 1200
MATCH_GLIDEIN_SEs = "se01.cmsaf.mit.edu"
JOB_Gatekeeper = ifthenelse(substr(Used_Gatekeeper,0,1) =!= "$",Used_Gatekeeper,ifthenelse(MATCH_GLIDEIN_Gatekeeper =!= undefined,MATCH_GLIDEIN_Gatekeeper,"Unknown"))
MATCH_Memory = 2500
CompletionDate = 0
WhenToTransferOutput = "ON_EXIT"
RequestCpus = 1
ExecutableSize = 5
x509UserProxyFirstFQAN = "/cms/Role=NULL/Capability=NULL"
CommittedSuspensionTime = 0
MATCH_GLIDEIN_Factory = "SDSC"
GlobalJobId = "vocms20.cern.ch#863008.0#1394560535"
CRAB_ISB = "https://cmsweb.cern.ch/crabcache/file"
StreamErr = false
DAGManNodesLog = "/data01/glidecondor/condor_local/spool/2767/0/cluster862767.proc0.subproc0/RunJobs.dag.nodes.log"
Rank = 0.0
TransferInput = "CMSRunAnalysis.sh,cmscp.py,CMSRunAnalysis.tar.gz,sandbox.tar.gz"
JobUniverse = 5
MATCH_GLIDEIN_ClusterId = 1796851
PeriodicRelease = ( HoldReasonCode == 28 ) || ( HoldReasonCode == 30 ) || ( HoldReasonCode == 13 ) || ( HoldReasonCode == 6 )
MATCH_EXP_JOB_GLIDEIN_Job_Max_Time = "34800"
JobCurrentStartExecutingDate = 1394595516
CRAB_oneEventMode = 0
x509userproxy = "/data01/glidecondor/condor_local/spool/2767/0/cluster862767.proc0.subproc0/188cb4f967148c26081018493b593e73a67389e6"
MATCH_EXP_JOB_GLIDEIN_ToRetire = "1394623404"
MATCH_EXP_JOB_GLIDEIN_Factory = "SDSC"
JOB_GLIDEIN_SEs = "$$(GLIDEIN_SEs:Unknown)"
JobNotification = 0
CRAB_DBSURL = "https://cmsweb.cern.ch/dbs/prod/global/DBSReader"
RemoveReason = "Removed due to memory use"
ProcId = 0
JOB_GLIDEIN_MaxMemMBs = "$$(GLIDEIN_MaxMemMBs:Unknown)"
Used_Gatekeeper = "$$(GLIDEIN_Gatekeeper:Unknown)"
CondorVersion = "$CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $"
BytesRecvd = 7850727.000000
Arguments = "-a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache/file --jobNumber=46 --cmsswVersion=CMSSW_5_3_4 --scramArch=slc5_amd64_gcc462 --inputFile=[\"MCFakeFile\"] --runAndLumis={} --lheInputFiles=None --firstEvent=540001 --firstLumi=46 --lastEvent=552001 --firstRun=1 --seeding=AutomaticSeeding -o {}"
ShouldTransferFiles = "YES"
Out = "job_out.46"
JOB_GLIDEIN_Memory = "$$(Memory:Unknown)"
NumJobMatches = 7
CumulativeSlotTime = 33457.000000
OnExitRemove = true
ResidentSetSize = 2250000
JOB_GLIDEIN_Max_Walltime = "$$(GLIDEIN_Max_Walltime:Unknown)"
JobAdInformationAttrs = "MATCH_EXP_JOBGLIDEIN_CMSSite, JOBGLIDEIN_CMSSite, RemoteSysCpu, RemoteUserCpu"
In = "/dev/null"
LastJobStatus = 2
CumulativeSuspensionTime = 0
MemoryUsage = ( ( ResidentSetSize + 1023 ) / 1024 )
MATCH_EXP_JOB_CMSSite = "T2_US_MIT"
CRAB_TaskWorker = "vocms245"
OrigMaxHosts = 1
TransferIn = false
CRAB_Id = 46
JOB_GLIDEIN_Name = "$$(GLIDEIN_Name:Unknown)"
WantRemoteIO = true
MATCH_EXP_JOB_GLIDEIN_MaxMemMBs = "2500"
MATCH_GLIDEIN_ToRetire = 1394623404
ImageSize = 2750000
ExecutableSize_RAW = 5
JobCurrentStartDate = 1394595478
x509userproxysubject = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=pkonst/CN=592300/CN=Preslav Borislavov Konstantinov"
NumJobStarts = 3
AutoClusterAttrs = "DESIRED_Gatekeepers,DESIRED_SEs,DESIRED_Sites,DESIRES_HTPC,GLIDEIN_Is_HTPC,JOB_Is_ITB,LastVacateTime,MaxWallTimeMins,NormMaxWallTimeMins,Owner,RequestMemory,JobUniverse,LastCheckpointPlatform,NumCkpts,ImageSize,DynamicSlot,PartitionableSlot,Slot1_TotalTimeClaimedBusy,Slot1_TotalTimeUnclaimedIdle,x509userproxyfirstfqan,x509userproxysubject,Slot2_TotalTimeClaimedBusy,Slot3_TotalTimeClaimedBusy,Slot4_TotalTimeClaimedBusy,Slot5_TotalTimeClaimedBusy,Slot6_TotalTimeClaimedBusy,Slot7_TotalTimeClaimedBusy,Slot8_TotalTimeClaimedBusy,Slot2_TotalTimeUnclaimedIdle,Slot3_TotalTimeUnclaimedIdle,Slot4_TotalTimeUnclaimedIdle,Slot5_TotalTimeUnclaimedIdle,Slot6_TotalTimeUnclaimedIdle,Slot7_TotalTimeUnclaimedIdle,Slot8_TotalTimeUnclaimedIdle,CMS_ALLOW_OVERFLOW,MemoryUsage,RequestDisk,Requirements,NiceUser,ConcurrencyLimits"
Cmd = "/data01/glidecondor/condor_local/spool/2767/0/cluster862767.proc0.subproc0/gWMS-CMSRunAnalysis.sh"
JobStartDate = 1394561501
LastMatchTime = 1394595478
MATCH_EXP_JOB_GLIDEIN_ToDie = "1394658204"
JOB_GLIDEIN_CMSSite = "$$(GLIDEIN_CMSSite:Unknown)"
CoreSize = 0
MATCH_EXP_JOB_GLIDEIN_Schedd = "schedd_glideins1@gfactory-1.t2.ucsd.edu"
TargetType = "Machine"
TransferOutput = "jobReport.json.46"
job_ad_information_attrs = MATCH_GLIDEIN_Gatekeeper
CommittedSlotTime = 0
JobStatus = 3
x509UserProxyEmail = "Preslav.Konstantinov@cern.ch"
DAGManJobId = 862767
RemoteWallClockTime = 33457.000000
NumSystemHolds = 0
CRAB_UserDN = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=pkonst/CN=592300/CN=Preslav Borislavov Konstantinov"
LastRemoteHost = "glidein_14760@T2BAT0216.CMSAF.MIT.EDU"
MATCH_EXP_JOB_GLIDEIN_Name = "v3_0"
JOB_GLIDEIN_Site = "$$(GLIDEIN_Site:Unknown)"
Requirements = ( ( ( target.IS_GLIDEIN =!= true ) || ( target.GLIDEIN_CMSSite =!= undefined ) ) ) && ( ( Memory >= 1 ) && ( Disk >= 1 ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer )
CRAB_EDMOutputFiles = { "MyTTBarTauolaTest.root" }
DESIRED_SITES = "T3_US_PuertoRico,T2_FI_HIP,T2_UK_SGrid_RALPP,T2_FR_GRIF_LLR,T3_US_Baylor,T3_UK_London_QMUL,T3_TW_NTU_HEP,T3_US_Omaha,T2_KR_KNU,T2_RU_SINP,T3_US_UMD,T2_CH_CERN_AI,T1_TW_ASGC,T3_US_Colorado,T3_US_UB,T1_UK_RAL_Disk,T3_IT_Napoli,T3_NZ_UOA,T2_TH_CUNSTDA,T3_US_Kansas,T3_US_ParrotTest,T3_GR_IASA,T3_US_Parrot,T2_IT_Bari,T2_US_UCSD,T1_RU_JINR,T3_US_Vanderbilt_EC2,T2_RU_IHEP,T2_RU_RRC_KI,T2_CH_CERN,T3_BY_NCPHEP,T3_US_TTU,T3_GR_Demokritos,T3_US_UTENN,T3_US_UCR,T3_TW_NCU,T2_CH_CSCS,T2_UA_KIPT,T3_RU_FIAN,T2_RU_PNPI,T2_IN_TIFR,T3_UK_London_UCL,T3_US_Brown,T3_US_UCD,T3_CO_Uniandes,T3_KR_KNU,T2_FR_IPHC,T3_US_OSU,T3_US_TAMU,T1_US_FNAL,T2_IT_Rome,T2_UK_London_Brunel,T3_IN_PUHEP,T3_IT_Trieste,T2_EE_Estonia,T3_UK_ScotGrid_ECDF,T2_CN_Beijing,T2_US_Florida,T3_US_Princeton_ICSE,T3_IT_MIB,T3_US_FNALXEN,T3_US_Rutgers,T1_DE_KIT,T3_IR_IPM,T2_US_Wisconsin,T2_HU_Budapest,T2_DE_RWTH,T2_US_Vanderbilt,T2_BR_SPRACE,T3_UK_SGrid_Oxford,T3_US_NU,T2_BR_UERJ,T3_MX_Cinvestav,T3_US_FNALLPC,T1_US_FNAL_Disk,T3_US_UIowa,T3_IT_Firenze,T3_US_Cornell,T2_ES_IFCA,T3_US_UVA,T3_ES_Oviedo,T3_US_NotreDame,T2_DE_DESY,T1_UK_RAL,T2_US_Caltech,T3_FR_IPNL,T2_TW_Taiwan,T3_US_NEU,T3_UK_London_RHUL,T0_CH_CERN,T1_RU_JINR_Disk,T3_CN_PKU,T2_UK_London_IC,T2_US_Nebraska,T2_ES_CIEMAT,T3_US_Princeton,T2_PK_NCP,T2_CH_CERN_T0,T3_US_FSU,T3_KR_UOS,T3_IT_Perugia,T1_IT_CNAF,T3_US_Minnesota,T2_TR_METU,T2_AT_Vienna,T2_US_Purdue,T3_US_Rice,T3_HR_IRB,T2_BE_UCL,T3_US_FIT,T2_UK_SGrid_Bristol,T2_PT_NCG_Lisbon,T1_ES_PIC,T3_US_JHU,T2_IT_Legnaro,T2_RU_INR,T3_US_FIU,T3_EU_Parrot,T2_RU_JINR,T2_IT_Pisa,T2_GR_Ioannina,T3_UK_ScotGrid_GLA,T3_US_MIT,T2_CH_CERN_HLT,T2_MY_UPM_BIRUNI,T1_FR_CCIN2P3,T2_FR_GRIF_IRFU,T3_US_UMiss,T2_FR_CCIN2P3,T2_PL_Warsaw,T3_AS_Parrot,T2_US_MIT,T2_BE_IIHE,T2_RU_ITEP,T1_CH_CERN,T3_CH_PSI,T3_IT_Bologna"
NumCkpts = 0
CRAB_AsyncDestSE = "stormfe1.pi.infn.it"
DiskUsage_RAW = 1076296
DAGManNodesMask = "0,1,2,4,5,7,9,10,11,12,13,16,17,24,27"
MATCH_EXP_JOB_GLIDEIN_ProcId = "1"
CRAB_localOutputFiles = "MyTTBarTauolaTest.root=MyTTBarTauolaTest_46.root"
MaxHosts = 1
CRAB_UserHN = "pkonst"
MATCH_EXP_JOB_GLIDEIN_Max_Walltime = "84600"
MATCH_EXP_JOB_GLIDEIN_SEs = "se01.cmsaf.mit.edu"
CRAB_InputData = "/GenericTTbar/HC-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO"
CondorPlatform = "$CondorPlatform: x86_64_RedHat5 $"
MATCH_GLIDEIN_Job_Max_Time = 34800
ImageSize_RAW = 2727460
MATCH_EXP_Used_Gatekeeper = "ce02.cmsaf.mit.edu:2119/jobmanager-condor"
LocalSysCpu = 0.0
LastSuspensionTime = 0
MATCH_GLIDEIN_Gatekeeper = "ce02.cmsaf.mit.edu:2119/jobmanager-condor"
MATCH_EXP_JOB_GLIDEIN_Site = "MIT"
UserLog = "/data01/glidecondor/condor_local/spool/2767/0/cluster862767.proc0.subproc0/job_log"
CRAB_DataBlock = "MCFackBlock"
Env = "CRAB_TASKMANAGER_TARBALL=local;SCRAM_ARCH=slc5_amd64_gcc462;CRAB_RUNTIME_TARBALL=local"
CRAB_BlacklistT1 = 0
JOB_GLIDEIN_Factory = "$$(GLIDEIN_Factory:Unknown)"
TransferInputSizeMB = 2
MachineAttrCpus0 = 1
CRAB_JobArch = "slc5_amd64_gcc462"
QDate = 1394560535
CRAB_OutputData = "140311_175214_crab_postfeb_scale_PK_6h_01-ed42d010850f24b9614161fd19b49e93"
CRAB_PublishDBSURL = undefined
x509UserProxyFQAN = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=pkonst/CN=592300/CN=Preslav Borislavov Konstantinov,/cms/Role=NULL/Capability=NULL"
Err = "job_err.46"
CRAB_SiteWhitelist = {  }
CRAB_Destination = "srm://stormfe1.pi.infn.it:8444/srm/managerv2?SFN=/cms/store/user/pkonst/GenericTTbar/140311_175214_crab_postfeb_scale_PK_6h_01/140311_175214/0000/log/cmsRun_46.log.tar.gz, srm://stormfe1.pi.infn.it:8444/srm/managerv2?SFN=/cms/store/user/pkonst/GenericTTbar/140311_175214_crab_postfeb_scale_PK_6h_01/140311_175214/0000/MyTTBarTauolaTest_46.root"
CRAB_TFileOutputFiles = {  }
AutoClusterId = 1410
ExitBySignal = false
CRAB_UserGroup = undefined
PeriodicRemoveReason = ifThenElse(MemoryUsage > RequestMemory,"Removed due to memory use",ifThenElse(MaxWallTimeMins * 60 < time() - EnteredCurrentStatus,"Removed due to wall clock limit","Removed due to job being held"))
MATCH_EXP_JOB_Site = "MIT"
BufferBlockSize = 32768
CRAB_AsyncDest = "T2_IT_Pisa"
ClusterId = 863008
BytesSent = 2356900.000000
CRAB_Dest = "/store/temp/user/pkonst.0d6c0ae7ff2f278e23a483dc3e15d3995c788519/GenericTTbar/140311_175214_crab_postfeb_scale_PK_6h_01/140311_175214"
MATCH_EXP_JOBGLIDEIN_CMSSite = "T2_US_MIT"
JobLastStartDate = 1394595409
MATCH_GLIDEIN_MaxMemMBs = 2500
RequestMemory = 2000
EnteredCurrentStatus = 1394595529
JOB_GLIDEIN_ToRetire = "$$(GLIDEIN_ToRetire:Unknown)"
CRAB_JobSW = "CMSSW_5_3_4"
BufferSize = 524288
JOB_GLIDEIN_Schedd = "$$(GLIDEIN_Schedd:Unknown)"
MaxWallTimeMins = 1315
LeaveJobInQueue = false
NumCkpts_RAW = 0
JobFinishedHookDone = 1394595529
MinHosts = 1
NumShadowExceptions = 3
JOB_GLIDEIN_ClusterId = "$$(GLIDEIN_ClusterId:Unknown)"
MATCH_GLIDEIN_Name = "v3_0"
MATCH_GLIDEIN_ToDie = 1394658204
NiceUser = false
RootDir = "/"
LastVacateTime_RAW = 1394595529
CommittedTime = 0
belforte commented 10 years ago

there is a "known" stickiness of some matching classads when a job is restarted, but this was all solved for Crab2. Could Crab3 please have a look at what was done there and avoid restarting from scratch? I am getting a bit annoyed by all the "not clear here, not properly reported there, discrepancy with dashboard in another place". Since Crab2 uses glideinWMS and all those problems were solved long ago, I would like Crab3 to deal with things silently, copy what is proper, criticize Crab2 where useful, and simply give us the same information. OTOH it is good that others learn the condor/gWMS internals and limitations.

Also, MIT and Florida could be a matter of overflow, where we let jobs match Florida even if they had requested MIT etc. What is really puzzling to me is the presence of classads referring to the French site.

belforte commented 10 years ago

@pkonst about this: "The postjob read the RemoveReason = "Removed due to memory use" classad from the first attempt and considered the job an unrecoverable failure and did not retry it." But you say it was restarted 3 times. RemoveReason should not be looked at unless the last job status is "removed".
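
For illustration, a minimal sketch (not the actual CRAB3 code) of the kind of guard Stefano suggests, assuming the job ad has been parsed into a dict of attribute strings (e.g. with the earlier sketch); in HTCondor, JobStatus 3 means Removed:

REMOVED = 3  # HTCondor JobStatus value for "removed"

def removed_for_memory(ad):
    """Sketch only: trust RemoveReason only when the job really ended up removed."""
    if int(ad.get("JobStatus", 0)) != REMOVED:
        return False  # job not removed: RemoveReason may be stale from an earlier attempt
    return ad.get("RemoveReason", "").startswith("Removed due to memory use")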

by the way, in Crab2 we use Job_Gatekeeper to tell the execution site; somehow I do not see it in your history dump. https://github.com/dmwm/ProdCommon/blob/master/src/python/ProdCommon/BossLite/Scheduler/SchedulerRemoteglidein.py#L485

belforte commented 10 years ago

also this will tell useful things: grep 863008 job_log, in particular:

root@vocms20/cluster862767.proc0.subproc0> grep 863008 job_log|grep executing
001 (863008.000.000) 03/11 19:21:10 Job executing on host: 10.13.69.39:46104?CCBID=169.228.38.39:9893#156259%20169.228.130.23:9893#44557&noUDP
001 (863008.000.000) 03/12 04:37:02 Job executing on host: 131.169.161.235:44115?CCBID=169.228.38.39:9811#156183%20169.228.130.23:9811#45472&noUDP
001 (863008.000.000) 03/12 04:38:36 Job executing on host: 18.12.8.216:56220?CCBID=169.228.38.39:9976#155538%20169.228.130.23:9976#45412&noUDP
root@vocms20/cluster862767.proc0.subproc0>

So the job started execution on 10.13.69.39 (dunno where that is), then the shadow disconnected (network, WN died... whatever) and the job was restarted on 131.169.161.235 = grid-wn0356.desy.de, where there was a shadow exception, so it was restarted again on 18.12.8.216 = t2bat0216.cmsaf.mit.edu.
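
For reference, a small sketch of the same extraction done programmatically, assuming only the standard user-log wording shown above (the function name is made up):

import re

# One line per "Job executing on host:" event in the condor user log.
EXEC_RE = re.compile(
    r"\((?P<job>\d+\.\d+)\.\d+\)\s+(?P<when>\d\d/\d\d \d\d:\d\d:\d\d)\s+"
    r"Job executing on host: <?(?P<host>[\d.]+)")

def execution_history(userlog_path, jobid="863008.000"):
    """Sketch only: return [(timestamp, host IP), ...] for one job."""
    history = []
    with open(userlog_path) as log:
        for line in log:
            match = EXEC_RE.search(line)
            if match and match.group("job") == jobid:
                history.append((match.group("when"), match.group("host")))
    return history

# For the log above this gives 10.13.69.39, then 131.169.161.235 (DESY),
# then 18.12.8.216 (MIT).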

Only at this point did condor sort of catch up and notice that when the job was restarted the first time it had used 2.7GB of memory, and it decided to remove it.

So... this seems a different job from the one that succeeded in Florida. About the removal because of too much memory: we already know by now that ImageSize is not the right thing to use (it is VSS, not RSS), and in any case I think we should review this setup where condor will not restart failed jobs if they used more than X GB of RAM, since this is taken care of by the crab watchdog.

Main worry for me is .. where's Florida ?

belforte commented 10 years ago

about this line in the initial posting:

Not retrying job due to excessive memory use (job killed by CRAB3 watchdog)

is this printed by Crab? It seems a bug; there is likely no indication that the CRAB3 watchdog kicked in.

belforte commented 10 years ago

I went through this on the phone with Marco, let me recap. By the way: 10.13.69.39 seems to be Florida (inside NAT of course), since the stdout "of the job which ran in Florida" has: JOB AD: StartdIpAddr = "<10.13.69.39

so here's the history from job_log, i.e. the condor log; times are CERN, i.e. the timestamps in the condor log. Of course it would be easier if all timestamps everywhere used the same time zone (UTC). We have kept failing at that since forever :-(

03/11 19:21:10 the job starts in Florida. The job's image size grows bigger and bigger, until:

03/12 02:07:02 Image size of job updated: 2727460

Then the image size stops updating. I guess stageout happened, and worked, and that's how Preslav got the "good stdout". But before condor could finally close the job:

03/12 04:33:37 Job disconnected, attempting to reconnect
03/12 04:33:37 Job reconnection failed

so condor (the schedd) decided to restart the job, but then things happened very fast, in less than a minute:

03/12 04:37:02 Job executing on host: <131.169.161 i.e. DESY
03/12 04:37:48 Shadow exception!
03/12 04:38:36 Job executing on host: <18.12.8.216 i.e. MIT
03/12 04:38:49 Job was aborted by the user.

the last line means killed by condor, not by Preslav, and IMHO it refers to the schedd removing the job because it noticed that it had used 2.7GB and did not like it.

So far ... so good. The important things are:

  1. the job in Florida did NOT finish correctly. The crab wrapper did, but condor did not; it does not matter which file is on the SE or on disk: the final "all OK" message did not come, and the job has to be resent. I know users do not like it, but that's it, all OK or nothing.
  2. so the MOST IMPORTANT thing is that you make sure that everybody: crab status, glidemon, dashboard, reported the job as FAILED. Preslav?
  3. somehow I would have expected that condor removes the job before starting it at DESY and then MIT. But maybe here it is just an optimization: the schedd starts the jobs quickly to save time, and only later checks the START expression, decides it's no good, and takes the job away.
  4. It seems bad to me that Job_Gatekeeper is missing in the log, as it is usually the thing that best tracks where the job really started the last time it went Running.
  5. finally there's the fact that Preslav's job was NOT using 2.7GB of RAM, and this is what Brian had independently discovered 3 days ago: ImageSize, used by our START expression, measures virtual size, not real size (see the numeric sketch just after this list). Thanks to Brian's report, the START expression was fixed by James.
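
A small numeric sketch of point 5, using the values from the condor_history dump above (ImageSize and ResidentSetSize are in KiB, consistent with the MemoryUsage formula in the dump; the glidein limit is in MB):

# Values taken from the classad dump earlier in this ticket.
image_size_kib = 2727460        # ImageSize_RAW: virtual memory (VSS)
resident_kib = 2018592          # ResidentSetSize_RAW: real memory (RSS)
glidein_max_mem_mb = 2500       # MATCH_GLIDEIN_MaxMemMBs

old_check = image_size_kib <= glidein_max_mem_mb * 1024    # ~2663 MB of VSS -> False
new_check = resident_kib / 1024 <= glidein_max_mem_mb      # ~1971 MB of RSS -> True

# An ImageSize-based START expression refuses to restart this job, even
# though its real memory use was below the 2500 MB glidein limit.
print(old_check, new_check)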

for the lucky ones among you who are not reading all the messages on the glideinWMS support list, here are the original messages about problem 5:

Hi Stefano,

The job doesn't fail, but it never runs again after it is interrupted (think network disconnect). Seems to happen at about a 1% level. See:

root@submit-4 ~# condor_q -const 'JobStatus =?= 1 && ImageSize > 1024*1024*2' | wc -l
952

I suspect these users just get frustrated by "indefinitely idle jobs" (like I did; that's why I looked into it) and resubmit.

Brian

On Mar 10, 2014, at 11:39 AM, Stefano Belforte stefano.belforte@cern.ch wrote:

odd that it did not bite earlier. I am sure we have plenty of jobs with large VSS. Wondering what could have happened to those jobs, did they ever run? Have we been ignoring lots of failures? stefano

On 03/10/2014 05:25 PM, Igor Sfiligoi wrote:

Indeed, wrong list.

Re-posting it on the CMS AnaOps FE list.

Igor

On 03/10/2014 06:40 AM, Brian Bockelman wrote:

Hi,

(perpetually confused about when to email gfactory versus FE...)

I noticed the following in CMS glidein START expressions:

( ImageSize<= ( GLIDEIN_MaxMemMBs * 1024 ) )

This is wrong - ImageSize is VSS, which is fairly meaningless here. [In fact, CMSSW adopted a version of malloc which causes ImageSize to grow unbounded... in the case of my jobs, it is 12GB for a job using .75GB RAM].

Might I suggest the following:

ifThenElse(MemoryUsage isnt null, MemoryUsage<= GLIDEIN_MaxMemMBs, true)

MemoryUsage may be undefined before the job is run the first time; however, that case is also covered by an existing clause ( RequestMemory<= GLIDEIN_MaxMemMBs ).

How quickly can this be fixed? Seems to be nothing I can do from my side except to periodically condor_qedit jobs.

Thanks,

Brian



pkonst commented 10 years ago

Hi Stefano,

Thanks for the analysis of the sequence of events. Furthermore, in the condor_history we have

NumShadowStarts = 7
JobRunCount = 7
NumJobMatches = 7

about this line in the initial posting:

Not retrying job due to excessive memory use (job killed by CRAB3 watchdog)

is this printed by Crab? It seems a bug; there is likely no indication that the CRAB3 watchdog kicked in.

This is printed on the stdout of the postjob (see http://glidemon.web.cern.ch/glidemon/show.php?log=http://vocms20.cern.ch/mon/cms535/140311_175214_vocms20:pkonst_crab_postfeb_scale_PK_6h_01/postjob.46.0.txt) based on the value of the RemoveReason="Removed due to memory use" classad. It comes from here: https://github.com/dmwm/CRABServer/blob/master/src/python/TaskWorker/Actions/RetryJob.py#L141

So condor restarted the job several times (I wonder whether 3 or 7) at different sites, but for DAGMan it ran only once and was not retried. Glidemon shows the same.

This is considered an unrecoverable error and the postjob exits in such a way that DAGMan does not retry the job.
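
As a sketch of that mechanism (the real exit codes live in the CRAB3 PostJob/RetryJob code; the values and names below are made up for illustration), DAGMan can be told via RETRY ... UNLESS-EXIT not to retry a node whose post script returned a designated fatal code:

import sys

# Hypothetical exit codes, for illustration only; CRAB3 defines its own.
OK, RECOVERABLE, FATAL = 0, 1, 3

def finish_postjob(classification):
    """Sketch only: map the retry-handler verdict to the post script exit code."""
    if classification == "fatal":
        # With e.g. "RETRY Job46 2 UNLESS-EXIT 3" in the DAG description,
        # DAGMan gives up on the node instead of retrying it.
        sys.exit(FATAL)
    sys.exit(OK if classification == "ok" else RECOVERABLE)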

so the MOST IMPORTANT thing is that you make sure that everybody: crab status, glidemon, dashboard, reported the job as FAILED. Preslav?

Yes, the job is reported as failed in all these places. The only discrepancy is that dashboard shows it failed at T2_DE_DESY and glidemon and crab status -l show T2_US_MIT.

When I saw the job listed as 'failed' in Glidemon, I opened its stdout to try to understand what went wrong. And there I saw the job ran at Florida without any problem at all. That's where the confusion came from.

Not sure how such confusion can be avoided. Maybe the postjob could make a more thorough analysis of the history from job_log, as you did.

Cheers, Preslav

belforte commented 10 years ago

thanks Preslav, let me only comment on the bottom lines:

On 03/13/2014 06:04 PM, pkonst wrote:

Yes, the job is reported as failed in all these places. The only discrepancy is that dashboard shows it failed at T2_DE_DESY and glidemon and crab status -l show T2_US_MIT.

this means that the job wrapper started at DESY and sent a message to dashboard, then the one at MIT was killed by condor before the crab wrapper reached that point. Given that the timestamps are a few tens of seconds apart, it is quite possible. So everybody did as expected, except condor. We should follow up with condor to understand if it is OK that the job was restarted even though the START expression evaluates to false. Then I fear whoever parses the condor_history output needs to get more sophisticated.

When I saw the job listed as 'failed' in Glidemon, I opened its stdout to try to understand what went wrong. And there I saw the job ran at Florida without any problem at all. That's where the confusion came from.

Yes. I think now we understand that. "crab" should make it clear to users that in cases like this:

  1. the job failed. period.
  2. the job was attempted a few times.
  3. the stdout, if any, on disk may not refer to the last attempt.

This is not something new. Crab2 always had a fraction of jobs aborted but with exit code=0 (more with gLite, less with glideinWMS), but the important thing is that users see ABORTED first and foremost.

stefano

belforte commented 10 years ago

this is simply a wrong line. I've opened a GitHub issue: https://github.com/dmwm/CRABServer/issues/4243

On 03/13/2014 06:04 PM, pkonst wrote:

This is printed on the stdout of the postjob (see http://glidemon.web.cern.ch/glidemon/show.php?log=http://vocms20.cern.ch/mon/cms535/140311_175214_vocms20:pkonst_crab_postfeb_scale_PK_6h_01/postjob.46.0.txt) based on the value of the RemoveReason="Removed due to memory use" classad. It comes from here: https://github.com/dmwm/CRABServer/blob/master/src/python/TaskWorker/Actions/RetryJob.py#L141

bbockelm commented 10 years ago

Hi,

I think the underlying issue is the way we have set up the killing due to the memory limit: no stdout is brought back when this triggers.

If stdout had been returned, it would have overwritten the stdout of the incomplete job from Florida with the stdout from MIT.

We would also know whether or not the job sent in appropriate info to the dashboard (which should have overwritten the info from DESY).

There's no human readable form, but the full job history (including Florida, DESY, and MIT in this case) should be available from the output of "crab status -l --json".

Brian

bbockelm commented 10 years ago

If we can't figure out a way to get stdout, another thing we could do is prevent jobs from having multiple starts and have DAG do a full retry.

belforte commented 10 years ago

@bbockelm the memory limit goes into the START expression to prevent the job from running, so it will never return a stdout. We do something ugly: we prevent the job from rerunning instead of calling it failed to begin with. This is a hack invented by Igor when he had no other way. We should remove it and rely on the watchdog.

pkonst commented 10 years ago

Indeed "crab status -l --json" has the full job history. For this job I get:

'46': {
  'Retries': 0,
  'WallDurations': [33147.0, 46.0, 14.0],
  'StartTimes': [1394562070.0, 1394595422.0, 1394595516.0],
  'SubmitTimes': [1394560535.0, -1, -1],
  'JobIds': ['863008.0', '863008.0', '863008.0'],
  'EndTimes': [1394595217.0, 1394595468.0],
  'Restarts': 2,
  'RecordedSite': True,
  'State': 'failed',
  'ResidentSetSize': [1940796, 0, 2046344],
  'TotalUserCpuTimeHistory': [24343.0, 24343.0, 4.0],
  'SiteHistory': ['T2_US_Florida', 'T2_DE_DESY', 'T2_US_MIT'],
  'TotalSysCpuTimeHistory': [332.0, 332.0, 0.0]
}
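
For completeness, a small sketch of how such a record (as returned by crab status -l --json) could be summarized per attempt; the function name is made up:

def summarize_job(record):
    """Sketch only: print one line per start, using the structure shown above."""
    for i, site in enumerate(record.get("SiteHistory", [])):
        start = record["StartTimes"][i] if i < len(record["StartTimes"]) else "?"
        wall = record["WallDurations"][i] if i < len(record["WallDurations"]) else "?"
        print("attempt %d: site=%-14s start=%s wall=%ss" % (i, site, start, wall))
    print("final state: %s (%d condor restart(s))" %
          (record.get("State"), record.get("Restarts", 0)))

# For job '46' above this prints Florida, then DESY, then MIT,
# with final state 'failed' and 2 restarts.
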
belforte commented 10 years ago

ah.. this is really nice, thanks! I love that SiteHistory. I am curious how crab (Brian?) could get it, since for me it took comparing the IP in the crab log with the IP from the stdout.

bbockelm commented 10 years ago

Looks like it got the CPU usage wrong :(

I'll think about how we could make this user friendly. Maybe open a new ticket?



belforte commented 10 years ago

at this level of detail, and for such rare cases, I think it is enough to have it expert-friendly, i.e. do not bother to do better until it becomes a common problem.