dmwm / CRABServer

16 stars 38 forks source link

when job wrapper can't find python, should exit with 10034, not -1 #5029

Closed belforte closed 8 years ago

belforte commented 9 years ago

see e.g. https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/1002.html where job stdout has

Error: Unable to find a CMS version of python.
CRAB3 requires the CMS version of python to function.
sh: ./DashboardFailure.sh: No such file or directory

the wrapper appears particularly schizophrenic here since a few lines later

Found python2.6 at:
/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc481/external/python/2.6.8-comp9/bin/python2.6
belforte commented 9 years ago

since this is not critical, may be looked at when info from that task is long gone from disk. So for reference here's one jobs stdout

======== gWMS-CMSRunAnalysis.sh STARTING at Tue Oct  6 14:53:40 GMT 2015 on compute-2-2.local ========
Local time : Tue Oct  6 10:53:40 EDT 2015
Current system : Linux compute-2-2.local 2.6.32-504.30.3.el6.centos.plus.x86_64 #1 SMP Wed Jul 15 00:52:19 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Arguments are -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=10 --cmsswVersion=CMSSW_7_1_12_patch1 --scramArch=slc6_amd64_gcc481 --inputFile=job_input_file_list_10.txt --runAndLumis=job_lumis_10.json --lheInputFiles=True --firstEvent=18001 --firstLumi=181 --lastEvent=20001 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=runScript.sh --eventsPerLumi=100 --scriptArgs=[] -o {}
SCRAM_ARCH=slc6_amd64_gcc481
======== HTCONDOR JOB SUMMARY at Tue Oct  6 14:53:40 GMT 2015 START ========
CRAB ID: 10
Execution site: T3_US_FIT
Current hostname: compute-2-2.local
Destination temp dir: /store/temp/user/mhamer.622048dbeda2b51850112c460bc3f99a18e01eaf/CRAB_PrivateMC/LLG_1000_100_Spring15_v1/151006_140607
Output files: step0.root=step0_10.root
==== HTCONDOR JOB AD CONTENTS START ====
== JOB AD: MaxWallTimeMins_RAW = 1400
== JOB AD: RemoteSlotID = 1
== JOB AD: CRAB_SaveLogsFlag = 0
== JOB AD: JOB_GLIDEIN_ProcId = "0"
== JOB AD: StreamOut = false
== JOB AD: JOB_GLIDEIN_Entry_Name = "CMS_T3_US_FIT_uscms1"
== JOB AD: CRAB_ReqName = "151006_140607:mhamer_crab_LLG_1000_100_Spring15_v1"
== JOB AD: use_x509userproxy = true
== JOB AD: JOB_CMSSite = "T3_US_FIT"
== JOB AD: ShadowBday = 1444143205
== JOB AD: CRAB_SiteBlacklist = {  }
== JOB AD: CRAB_UserRole = undefined
== JOB AD: MyAddress = "<128.142.177.81:4080?addrs=128.142.177.81-4080&noUDP&sock=1715668_1000_64115>"
== JOB AD: TaskType = "Job"
== JOB AD: ShadowVersion = "$CondorVersion: 8.3.8 Aug 27 2015 BuildID: 338845 $"
== JOB AD: NumRestarts = 0
== JOB AD: SubmitEventNotes = "DAG Node: Job10"
== JOB AD: x509UserProxyVOName = "cms"
== JOB AD: DAGParentNodeNames = ""
== JOB AD: LocalUserCpu = 0.0
== JOB AD: RemoteUserCpu = 0.0
== JOB AD: JOB_GLIDEIN_SiteWMS_Queue = "uscms1.fltech-grid3.fit.edu"
== JOB AD: CRAB_StageoutPolicy = "local,remote"
== JOB AD: CRAB_Workflow = "151006_140607:mhamer_crab_LLG_1000_100_Spring15_v1"
== JOB AD: CurrentHosts = 1
== JOB AD: OrigCmd = "/data/srv/glidecondor/condor_local/spool/5701/0/cluster14265701.proc0.subproc0/gWMS-CMSRunAnalysis.sh"
== JOB AD: x509UserProxyExpiration = 1444745318
== JOB AD: Iwd = "/var/lib/condor/execute/dir_3218/glide_hmb5Y3/execute/dir_24398"
== JOB AD: NumShadowStarts = 1
== JOB AD: JobPrio = 10
== JOB AD: DiskUsage = 125000
== JOB AD: CRAB_ASOTimeout = 86400
== JOB AD: StartdPrincipal = "execute-side@matchsession/163.118.42.1"
== JOB AD: JOB_GLIDEIN_ToDie = "1444310798"
== JOB AD: JobRunCount = 1
== JOB AD: CurrentTime = time()
== JOB AD: MachineAttrSlotWeight0 = 1
== JOB AD: JOB_Site = "FIT"
== JOB AD: WantCheckpoint = false
== JOB AD: RequestDisk = 100000
== JOB AD: TotalSuspensions = 0
== JOB AD: DAGNodeName = "Job10"
== JOB AD: RequestDisk_RAW = 1
== JOB AD: PeriodicRemove = ( ( JobStatus =?= 5 ) && ( time() - EnteredCurrentStatus > 7 * 60 ) ) || ( ( JobStatus =?= 1 ) && ( time() - EnteredCurrentStatus > 7 * 24 * 60 * 60 ) ) || ( ( JobStatus =?= 2 ) && ( ( MemoryUsage > RequestMemory ) || ( MaxWallTimeMins * 60 < time() - EnteredCurrentStatus ) || ( DiskUsage > 100000000 ) ) ) || ( ( JobStatus =?= 1 ) && ( time() > ( x509UserProxyExpiration + 86400 ) ) )
== JOB AD: JOBGLIDEIN_CMSSite = "T3_US_FIT"
== JOB AD: RemoteSysCpu = 0.0
== JOB AD: CRAB_Retry = 2
== JOB AD: MyType = "Job"
== JOB AD: CRAB_JobType = "analysis"
== JOB AD: PeriodicHold = false
== JOB AD: JOB_GLIDEIN_Job_Max_Time = "34800"
== JOB AD: EnvDelim = ";"
== JOB AD: CRAB_RestHost = "cmsweb.cern.ch"
== JOB AD: Owner = "cms1774"
== JOB AD: JOB_GLIDEIN_SiteWMS_JobId = "985690.0"
== JOB AD: MemoryProvisioned = 2500
== JOB AD: RemoteHost = "glidein_3222_21085887@compute-2-2.local"
== JOB AD: LastJobLeaseRenewal = 1444143208
== JOB AD: CRAB_AdditionalOutputFiles = {  }
== JOB AD: OnExitHold = false
== JOB AD: CRAB_ASOURL = "https://cmsweb.cern.ch/couchdb"
== JOB AD: CRAB_NumAutomJobRetries = 2
== JOB AD: AccountingGroup = "analysis.mhamer"
== JOB AD: WantRemoteSyscalls = false
== JOB AD: ExitStatus = 0
== JOB AD: User = "cms1774@cms"
== JOB AD: JobLeaseDuration = 2400
== JOB AD: JOB_Gatekeeper = ifthenelse(substr(Used_Gatekeeper,0,1) =!= "$",Used_Gatekeeper,ifthenelse(MATCH_GLIDEIN_Gatekeeper =!= undefined,MATCH_GLIDEIN_Gatekeeper,"Unknown"))
== JOB AD: DESIRED_OpSyses = "LINUX"
== JOB AD: DiskProvisioned = 12915636
== JOB AD: CompletionDate = 0
== JOB AD: WhenToTransferOutput = "ON_EXIT_OR_EVICT"
== JOB AD: RequestCpus = 1
== JOB AD: ExecutableSize = 7
== JOB AD: x509UserProxyFirstFQAN = "/cms/Role=NULL/Capability=NULL"
== JOB AD: CommittedSuspensionTime = 0
== JOB AD: PreJobPrio1 = 1
== JOB AD: GlobalJobId = "crab3test-3@vocms096.cern.ch#14271084.0#1444143195"
== JOB AD: CRAB_ISB = "https://cmsweb.cern.ch/crabcache"
== JOB AD: StreamErr = false
== JOB AD: TransferSocket = "<128.142.177.81:4080?addrs=128.142.177.81-4080&noUDP&sock=1715668_1000_64115>"
== JOB AD: DAGManNodesLog = "/data/srv/glidecondor/condor_local/spool/5701/0/cluster14265701.proc0.subproc0/RunJobs.dag.nodes.log"
== JOB AD: Rank = 0.0
== JOB AD: JOB_GLIDEIN_SiteWMS = "HTCondor"
== JOB AD: TransferInput = "CMSRunAnalysis.sh,cmscp.py,CMSRunAnalysis.tar.gz,sandbox.tar.gz,run_and_lumis.tar.gz,input_files.tar.gz"
== JOB AD: JobUniverse = 5
== JOB AD: PeriodicRelease = ( HoldReasonCode == 28 ) || ( HoldReasonCode == 30 ) || ( HoldReasonCode == 13 ) || ( HoldReasonCode == 6 )
== JOB AD: CRAB_oneEventMode = 0
== JOB AD: x509userproxy = "/data/srv/glidecondor/condor_local/spool/5701/0/cluster14265701.proc0.subproc0/391d3289e9138330eb9a6b8c62cdf184825093bc"
== JOB AD: JOB_GLIDEIN_SEs = "uscms1-se.fltech-grid3.fit.edu"
== JOB AD: JobNotification = 0
== JOB AD: CRAB_DBSURL = "https://cmsweb.cern.ch/dbs/prod/global/DBSReader"
== JOB AD: ProcId = 0
== JOB AD: JOB_GLIDEIN_MaxMemMBs = "2500"
== JOB AD: DESIRED_CMSDataLocations = "T3_US_PuertoRico,T2_FI_HIP,T2_UK_SGrid_RALPP,T2_FR_GRIF_LLR,T3_US_Baylor,T3_UK_London_QMUL,T3_TW_NTU_HEP,T3_US_Omaha,T2_KR_KNU,T2_RU_SINP,T3_US_UMD,T2_CH_CERN_AI,T3_US_Colorado,T3_IT_Napoli,T3_NZ_UOA,T2_TH_CUNSTDA,T3_US_Kansas,T3_GR_IASA,T2_IT_Bari,T2_US_UCSD,T1_RU_JINR,T3_US_Vanderbilt_EC2,T2_RU_IHEP,T2_RU_RRC_KI,T2_CH_CERN,T3_BY_NCPHEP,T2_US_Vanderbilt,T3_GR_Demokritos,T3_US_UTENN,T3_US_UCR,T3_TW_NCU,T2_CH_CSCS,T2_UA_KIPT,T3_IT_Firenze,T2_RU_PNPI,T3_UK_ScotGrid_ECDF,T3_UK_London_UCL,T3_US_Brown,T3_US_UCD,T3_CO_Uniandes,T3_KR_KNU,T2_FR_IPHC,T3_US_OSU,T3_US_TAMU,T1_US_FNAL,T2_IT_Rome,T2_UK_London_Brunel,T3_US_FIU,T3_IN_PUHEP,T3_IT_Trieste,T2_EE_Estonia,T2_IN_TIFR,T2_CN_Beijing,T2_US_Florida,T3_US_Princeton_ICSE,T3_IT_MIB,T3_US_FNALXEN,T3_US_Rutgers,T1_DE_KIT,T3_IR_IPM,T2_US_Wisconsin,T2_HU_Budapest,T2_DE_RWTH,T2_BR_SPRACE,T3_UK_SGrid_Oxford,T3_US_NU,T3_US_UMiss,T2_BR_UERJ,T3_MX_Cinvestav,T3_US_FNALLPC,T3_US_UIowa,T3_RU_FIAN,T3_US_Cornell,T2_ES_IFCA,T3_US_UVA,T3_ES_Oviedo,T3_US_NotreDame,T2_DE_DESY,T3_KR_KISTI,T3_HU_Debrecen,T2_US_Caltech,T3_FR_IPNL,T3_UK_London_RHUL,T3_CN_PKU,T2_UK_London_IC,T2_US_Nebraska,T2_ES_CIEMAT,T3_US_Princeton,T2_PK_NCP,T3_US_FSU,T3_KR_UOS,T3_IT_Perugia,T1_IT_CNAF,T3_US_Minnesota,T2_TR_METU,T2_AT_Vienna,T2_US_Purdue,T3_US_Rice,T3_HR_IRB,T2_BE_UCL,T3_US_FIT,T2_UK_SGrid_Bristol,T2_PT_NCG_Lisbon,T1_ES_PIC,T3_US_JHU,T2_IT_Legnaro,T2_RU_INR,T3_US_SDSC,T1_UK_RAL,T2_RU_JINR,T2_IT_Pisa,T2_GR_Ioannina,T3_UK_ScotGrid_GLA,T3_US_MIT,T3_US_Princeton_ARM,T3_US_UCSB,T2_MY_UPM_BIRUNI,T1_FR_CCIN2P3,T2_FR_GRIF_IRFU,T3_US_NERSC,T2_FR_CCIN2P3,T2_CH_CERN_HLT,T2_PL_Warsaw,T3_US_TTU,T2_US_MIT,T2_BE_IIHE,T2_RU_ITEP,T2_PL_Swierk,T3_CH_PSI,T3_IT_Bologna"
== JOB AD: PublicClaimId = "<10.1.255.242:52379>#1444141059#1#..."
== JOB AD: ShadowIpAddr = "<128.142.177.81:4080?addrs=128.142.177.81-4080&noUDP&sock=1715668_1000_64115>"
== JOB AD: Used_Gatekeeper = "uscms1.fltech-grid3.fit.edu:2119/jobmanager-condor"
== JOB AD: CondorVersion = "$CondorVersion: 8.3.8 Aug 27 2015 BuildID: 338845 $"
== JOB AD: Arguments = "-a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=10 --cmsswVersion=CMSSW_7_1_12_patch1 --scramArch=slc6_amd64_gcc481 --inputFile=job_input_file_list_10.txt --runAndLumis=job_lumis_10.json --lheInputFiles=True --firstEvent=18001 --firstLumi=181 --lastEvent=20001 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=runScript.sh --eventsPerLumi=100 --scriptArgs=[] -o {}"
== JOB AD: ShouldTransferFiles = "YES"
== JOB AD: Out = "_condor_stdout"
== JOB AD: JOB_GLIDEIN_Memory = "2500"
== JOB AD: NumJobMatches = 1
== JOB AD: CumulativeSlotTime = 0
== JOB AD: OnExitRemove = true
== JOB AD: SpoolOnEvict = false
== JOB AD: JOB_GLIDEIN_Max_Walltime = "171000"
== JOB AD: JobAdInformationAttrs = "MATCH_EXP_JOBGLIDEIN_CMSSite, JOBGLIDEIN_CMSSite, RemoteSysCpu, RemoteUserCpu"
== JOB AD: In = "/dev/null"
== JOB AD: LastJobStatus = 1
== JOB AD: CumulativeSuspensionTime = 0
== JOB AD: ServerTime = 1444143205
== JOB AD: CRAB_TaskWorker = "vocms052"
== JOB AD: OrigMaxHosts = 1
== JOB AD: TransferIn = false
== JOB AD: CRAB_Id = 10
== JOB AD: JOB_GLIDEIN_Name = "gfactory_instance"
== JOB AD: WantRemoteIO = true
== JOB AD: OrigIwd = "/data/srv/glidecondor/condor_local/spool/5701/0/cluster14265701.proc0.subproc0"
== JOB AD: ImageSize = 7
== JOB AD: ExecutableSize_RAW = 7
== JOB AD: JobCurrentStartDate = 1444143205
== JOB AD: x509userproxysubject = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mhamer/CN=681642/CN=Matthias Hamer"
== JOB AD: NumJobStarts = 0
== JOB AD: DESIRED_Overflow_Region = regexps("_US_",DESIRED_Sites,"US")
== JOB AD: Cmd = "/data/srv/glidecondor/condor_local/spool/5701/0/cluster14265701.proc0.subproc0/gWMS-CMSRunAnalysis.sh"
== JOB AD: JobStartDate = 1444143205
== JOB AD: LastMatchTime = 1444143205
== JOB AD: JOB_GLIDEIN_CMSSite = "T3_US_FIT"
== JOB AD: CoreSize = -1
== JOB AD: TargetType = "Machine"
== JOB AD: TransferOutput = "jobReport.json.10"
== JOB AD: job_ad_information_attrs = MATCH_GLIDEIN_Gatekeeper
== JOB AD: CommittedSlotTime = 0
== JOB AD: JobStatus = 2
== JOB AD: x509UserProxyEmail = "Matthias.Hamer@cern.ch"
== JOB AD: DAGManJobId = 14265701
== JOB AD: RemoteWallClockTime = 0.0
== JOB AD: NumSystemHolds = 0
== JOB AD: CRAB_UserDN = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mhamer/CN=681642/CN=Matthias Hamer"
== JOB AD: JOB_GLIDEIN_Site = "FIT"
== JOB AD: AcctGroup = "analysis"
== JOB AD: Requirements = ( ( ( target.IS_GLIDEIN =!= true ) || ( target.GLIDEIN_CMSSite =!= undefined ) ) && ( GLIDEIN_REQUIRED_OS =?= "rhel6" || OpSysMajorVer =?= 6 ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer )
== JOB AD: CRAB_EDMOutputFiles = { "step0.root" }
== JOB AD: DESIRED_SITES = "T3_US_PuertoRico,T2_FI_HIP,T2_UK_SGrid_RALPP,T2_FR_GRIF_LLR,T3_IT_Trieste,T3_US_Omaha,T2_KR_KNU,T3_US_UMD,T3_US_Colorado,T3_NZ_UOA,T2_TH_CUNSTDA,T3_US_Kansas,T2_IT_Bari,T2_US_UCSD,T1_RU_JINR,T3_US_Vanderbilt_EC2,T2_RU_IHEP,T2_CH_CERN,T3_BY_NCPHEP,T3_US_TTU,T3_GR_Demokritos,T3_US_UTENN,T3_US_UCR,T3_TW_NCU,T2_CH_CSCS,T2_UA_KIPT,T2_IN_TIFR,T3_UK_London_UCL,T3_US_Brown,T3_CO_Uniandes,T3_KR_KNU,T2_FR_IPHC,T3_US_OSU,T3_US_TAMU,T1_US_FNAL,T2_IT_Rome,T2_UK_London_Brunel,T3_US_FIU,T3_IN_PUHEP,T2_US_Vanderbilt,T2_EE_Estonia,T3_UK_ScotGrid_ECDF,T2_CN_Beijing,T2_US_Florida,T3_IT_MIB,T3_US_FNALXEN,T3_US_Rutgers,T1_DE_KIT,T3_IR_IPM,T2_US_Wisconsin,T2_HU_Budapest,T2_DE_RWTH,T2_BR_SPRACE,T3_UK_SGrid_Oxford,T3_US_NU,T3_US_UMiss,T3_MX_Cinvestav,T3_US_FNALLPC,T3_KR_KISTI,T3_IT_Firenze,T3_US_Cornell,T2_ES_IFCA,T3_US_UVA,T3_ES_Oviedo,T2_DE_DESY,T3_US_UIowa,T3_HU_Debrecen,T2_US_Caltech,T3_FR_IPNL,T3_UK_London_RHUL,T3_CN_PKU,T2_UK_London_IC,T2_US_Nebraska,T2_ES_CIEMAT,T3_US_Princeton,T3_UK_ScotGrid_GLA,T3_US_FSU,T3_KR_UOS,T3_IT_Perugia,T1_IT_CNAF,T3_US_Minnesota,T2_AT_Vienna,T2_US_Purdue,T3_US_Rice,T3_HR_IRB,T2_BE_UCL,T3_US_FIT,T2_UK_SGrid_Bristol,T2_PT_NCG_Lisbon,T1_ES_PIC,T3_US_JHU,T2_IT_Legnaro,T3_US_SDSC,T1_UK_RAL,T2_RU_JINR,T2_IT_Pisa,T2_GR_Ioannina,T3_US_MIT,T3_US_Princeton_ARM,T3_US_UCSB,T1_FR_CCIN2P3,T2_FR_GRIF_IRFU,T3_US_NERSC,T2_FR_CCIN2P3,T2_US_MIT,T2_BE_IIHE,T2_PL_Swierk,T3_CH_PSI,T3_IT_Bologna"
== JOB AD: NumCkpts = 0
== JOB AD: StartdIpAddr = "<10.1.255.242:52379?CCBID=131.225.207.128:9718%3faddrs%3d131.225.207.128-9718#456024%20128.142.141.17:9718%3faddrs%3d128.142.141.17-9718#972651&noUDP>"
== JOB AD: CMS_ALLOW_OVERFLOW = "True"
== JOB AD: RequestMemory_RAW = 2000
== JOB AD: DiskUsage_RAW = 100710
== JOB AD: DAGManNodesMask = "0,1,2,4,5,7,9,10,11,12,13,16,17,24,27"
== JOB AD: CRAB_localOutputFiles = "step0.root=step0_10.root"
== JOB AD: MaxHosts = 1
== JOB AD: CRAB_UserHN = "mhamer"
== JOB AD: JOB_GLIDEIN_SiteWMS_Slot = "slot7@compute-2-2.local"
== JOB AD: CondorPlatform = "$CondorPlatform: X86_64-RedHat_6.6 $"
== JOB AD: AcctGroupUser = "cms1774"
== JOB AD: ImageSize_RAW = 7
== JOB AD: JOB_GLIDECLIENT_Name = "CMSG-v1_0.main"
== JOB AD: LocalSysCpu = 0.0
== JOB AD: LastSuspensionTime = 0
== JOB AD: CpusProvisioned = 1
== JOB AD: UserLog = "/data/srv/glidecondor/condor_local/spool/5701/0/cluster14265701.proc0.subproc0/job_log"
== JOB AD: CRAB_TransferOutputs = 1
== JOB AD: CRAB_DataBlock = "MCFackBlock"
== JOB AD: Env = "CRAB_TASKMANAGER_TARBALL=local;SCRAM_ARCH=slc6_amd64_gcc481;CRAB_RUNTIME_TARBALL=local"
== JOB AD: JOB_GLIDEIN_Factory = "OSGGOC"
== JOB AD: TransferInputSizeMB = 98
== JOB AD: MachineAttrCpus0 = 1
== JOB AD: CRAB_RestURInoAPI = "/crabserver/prod"
== JOB AD: CRAB_JobArch = "slc6_amd64_gcc481"
== JOB AD: QDate = 1444143195
== JOB AD: CRAB_PublishGroupName = 0
== JOB AD: CRAB_PublishDBSURL = "https://cmsweb.cern.ch/dbs/prod/phys03/DBSWriter"
== JOB AD: x509UserProxyFQAN = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mhamer/CN=681642/CN=Matthias Hamer,/cms/Role=NULL/Capability=NULL"
== JOB AD: Err = "_condor_stderr"
== JOB AD: CRAB_SiteWhitelist = {  }
== JOB AD: CRAB_Destination = "srm://dcache-se-cms.desy.de:8443/srm/managerv2?SFN=/pnfs/desy.de/cms/tier2/store/user/mhamer/CRAB_PrivateMC/LLG_1000_100_Spring15_v1/151006_140607/0000/log/cmsRun_10.log.tar.gz, srm://dcache-se-cms.desy.de:8443/srm/managerv2?SFN=/pnfs/desy.de/cms/tier2/store/user/mhamer/CRAB_PrivateMC/LLG_1000_100_Spring15_v1/151006_140607/0000/step0_10.root"
== JOB AD: CRAB_RetryOnASOFailures = 1
== JOB AD: CRAB_TFileOutputFiles = {  }
== JOB AD: accounting_group = analysis
== JOB AD: PostJobPrio1 = -1444140574
== JOB AD: ExitBySignal = false
== JOB AD: CRAB_UserGroup = undefined
== JOB AD: PostJobPrio2 = 2
== JOB AD: PeriodicRemoveReason = ifThenElse(time() - EnteredCurrentStatus > 7 * 24 * 60 * 60 && isUndefined(MemoryUsage),"Removed due to idle time limit",ifThenElse(time() > x509UserProxyExpiration,"Removed job due to proxy expiration",ifThenElse(MemoryUsage > RequestMemory,"Removed due to memory use",ifThenElse(MaxWallTimeMins * 60 < time() - EnteredCurrentStatus,"Removed due to wall clock limit",ifThenElse(DiskUsage > 100000000,"Removed due to disk usage","Removed due to job being held")))))
== JOB AD: BufferBlockSize = 32768
== JOB AD: CRAB_AsyncDest = "T2_DE_DESY"
== JOB AD: ClusterId = 14271084
== JOB AD: CRAB_PublishName = "LLG_1000_100_Spring15_v1-00000000000000000000000000000000"
== JOB AD: DelegatedProxyExpiration = 1444745318
== JOB AD: UidDomain = "cms"
== JOB AD: CRAB_Publish = 1
== JOB AD: CRAB_Dest = "/store/temp/user/mhamer.622048dbeda2b51850112c460bc3f99a18e01eaf/CRAB_PrivateMC/LLG_1000_100_Spring15_v1/151006_140607"
== JOB AD: ProvisionedResources = "Cpus Memory Disk Swap"
== JOB AD: RequestMemory = 2000
== JOB AD: EnteredCurrentStatus = 1444143205
== JOB AD: JOB_GLIDEIN_ToRetire = "1444275998"
== JOB AD: CRAB_JobSW = "CMSSW_7_1_12_patch1"
== JOB AD: BufferSize = 524288
== JOB AD: JOB_GLIDEIN_Schedd = "schedd_glideins5@glidein.grid.iu.edu"
== JOB AD: MaxWallTimeMins = 1400
== JOB AD: LeaveJobInQueue = false
== JOB AD: EncryptExecuteDirectory = false
== JOB AD: NumCkpts_RAW = 0
== JOB AD: DESIRED_Archs = "X86_64"
== JOB AD: MinHosts = 1
== JOB AD: DESIRED_OpSysMajorVers = "6"
== JOB AD: JOB_GLIDEIN_ClusterId = "2450170"
== JOB AD: NiceUser = false
== JOB AD: DESIRED_CMSDataset = "/CRAB_PrivateMC"
== JOB AD: RootDir = "/"
== JOB AD: CommittedTime = 0
==== HTCONDOR JOB AD CONTENTS FINISH ====
======== HTCONDOR JOB SUMMARY at Tue Oct  6 14:53:40 GMT 2015 FINISH ========
======== PROXY INFORMATION START at Tue Oct  6 14:53:40 GMT 2015 ========
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mhamer/CN=681642/CN=Matthias Hamer/CN=1986129925/CN=408323091/CN=1709066773/CN=1548830320/CN=1815289915/CN=35592693
issuer    : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mhamer/CN=681642/CN=Matthias Hamer/CN=1986129925/CN=408323091/CN=1709066773/CN=1548830320/CN=1815289915
identity  : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mhamer/CN=681642/CN=Matthias Hamer/CN=1986129925/CN=408323091/CN=1709066773/CN=1548830320/CN=1815289915
type      : RFC compliant proxy
strength  : 1024 bits
path      : /var/lib/condor/execute/dir_3218/glide_hmb5Y3/execute/dir_24398/391d3289e9138330eb9a6b8c62cdf184825093bc
timeleft  : 167:14:58
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO        : cms
subject   : /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mhamer/CN=681642/CN=Matthias Hamer
issuer    : /DC=ch/DC=cern/OU=computers/CN=lcg-voms2.cern.ch
attribute : /cms/Role=NULL/Capability=NULL
timeleft  : 167:15:04
uri       : lcg-voms2.cern.ch:15002
======== PROXY INFORMATION FINISH at Tue Oct  6 14:53:40 GMT 2015 ========
======== CMSRunAnalysis.sh at Tue Oct  6 14:53:40 GMT 2015 STARTING ========
======== CMSRunAnalysis.sh STARTING at Tue Oct  6 14:53:40 GMT 2015 ========
Local time : Tue Oct  6 10:53:40 EDT 2015
Current system : Linux compute-2-2.local 2.6.32-504.30.3.el6.centos.plus.x86_64 #1 SMP Wed Jul 15 00:52:19 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
==== CMSSW pre-execution environment bootstrap STARTING ====
+ '[' -f /cmsset_default.sh ']'
+ echo 'LCG style'
LCG style
+ set +x
+ declare -a VERSIONS
+ VERSIONS=($(ls $VO_CMS_SW_DIR/$SCRAM_ARCH/external/python | egrep '2.[67]'))
++ ls /slc6_amd64_gcc481/external/python
++ egrep '2.[67]'
ls: cannot access /slc6_amd64_gcc481/external/python: No such file or directory
+ PY_PATH=/slc6_amd64_gcc481/external/python
+ echo 'python version: '
python version: 
+ set +x
==== CMSSW pre-execution environment bootstrap FINISHING at Tue Oct  6 14:53:40 GMT 2015 ====
==== Python discovery STARTING ====
Error: Unable to find a CMS version of python.
CRAB3 requires the CMS version of python to function.
sh: ./DashboardFailure.sh: No such file or directory

real    0m0.016s
user    0m0.004s
sys 0m0.007s
CMSRunAnalysis.sh complete at Tue Oct  6 14:53:40 GMT 2015 with (short) exit status 127
======== CMSRunAnalsysis.sh at Tue Oct  6 14:53:40 GMT 2015 FINISHING ========
======== python2.6 bootstrap for stageout at Tue Oct  6 14:53:40 GMT 2015 STARTING ========
+ '[' -f /cmsset_default.sh ']'
+ set +x
+ '[' -e /COMP/slc6_amd64_gcc481/external/python/2.6.8-comp9/etc/profile.d/init.sh ']'
+ '[' -e //cmssoft/cms/COMP/slc6_amd64_gcc481/external/python/2.6.8-comp9/etc/profile.d/init.sh ']'
+ set +x
+ command -v python2.6
+ rc=0
+ set +x
Found python2.6 at:
/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc481/external/python/2.6.8-comp9/bin/python2.6
======== python2.6 bootstrap for stageout at Tue Oct  6 14:53:41 GMT 2015 FINISHING ========
======== Attempting to notify HTCondor of file stageout ========
    Error: 22 (Invalid argument)
======== Stageout at Tue Oct  6 14:53:41 GMT 2015 STARTING ========
Traceback (most recent call last):
  File "./cmscp.py", line 19, in <module>
    from ServerUtilities import cmd_exist
ImportError: No module named ServerUtilities
======== ERROR: Unable to initialize WMCore at Tue Oct  6 14:53:41 GMT 2015 ========
sh: ./DashboardFailure.sh: No such file or directory
AndresTanasijczuk commented 8 years ago

When python is not found in the WN, the job wrapper exists with 10043 (will change it to 10034). What happened in this case is that the script DashboardFailure.sh that is used to report the exit code to dashboard was not found, and therefore the exec utility exited with 127 instead. The reason why the user saw exit code -1 in the 'crab status' output is because the stageout wrapper (i.e. cmscp) also failed (it failed to import ServerUtilities). And it is cmscp who writes the exit code of the job wrapper into the JSON job report, from where then the post-job reads it and writes it into the error_summary.json file, from where 'crab status' reads it. IMO, we should add a check that the script DashboardFailure.sh exists before trying to execute it, and if it doesn't exist make sure the job wrapper exits with the exit code it was supposed to exit and not 127. @belforte What exit code should we use in case python is found, but we got an error sourcing the python environment setup?

juztas commented 8 years ago

for python: 10043: "Unable to bootstrap WMCore libraries (most likely site python is broken).", # (CRAB3). for file check: yes, 127, it is same for command not found or no such file or directory Also check this if you are missing something: https://github.com/dmwm/WMCore/pull/6427/files

AndresTanasijczuk commented 8 years ago

The PR referencing this issue was merged. But actually I am not sure that PR addresses this issue completely. Because I think in this issue there are two things that failed: 1) The job wrapper failed to find DashboardFailure.sh and therefore the job wrapper exit code 10043 (which in the PR I changed to 10034) was replaced by 127. 2) Later cmscp failed to import ServerUtilities. The merged PR fixes 1): It checks for the existence of DashboardFailure.sh before trying to execute it, avoiding this way the change of the exit code to 127 and exiting the job wrapper with 10034. But since failure 2) would have occurred anyway, that PR is not enough to fix the current issue. So what could we do about 2)? Maybe move from ServerUtilities import cmd_exist, parseJobAd to the only two places where cmd_exist and parseJobAd are used, since if the import fails we anyway can continue executing. This sounds a bit hacky, I don't know.

mmascher commented 8 years ago

Agreed with Stefano to close this as it seems that what is left is really a corner case. Let's see if it happens again in the future.