dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Stage out failure doesn't show correct error in couch #1381

Closed cinquo closed 12 years ago

cinquo commented 13 years ago

The problem on the WN seems to be related to a permission error (*), while in couch for the stageOut step I see something different (**).

(*)

    INFO:root:Beginning report processing for step cmsRun1
    ERROR:root:Direct to Merge failed due to no mergedLFNBase in output
    Storage Resource Manager (SRM) implementation version 2.0.6
    Copyright (c) 2002-2008 Fermi National Accelerator Laboratory
    Specification Version 2.0 by SRM Working Group (http://sdm.lbl.gov/srm-wg)
    Tue Apr 05 14:21:00 CEST 2011: rs.state = Failed rs.error = at Tue Apr 05 14:20:56 CEST 2011 state Pending : created RequestFileStatus#-2134488784 failed with error:[ at Tue Apr 05 14:20:57 CEST 2011 state Failed : path does not exist and user has no permissions to create it]

    Tue Apr 05 14:21:00 CEST 2011: ====> fileStatus state ==Failed
    java.io.IOException: rs.state = Failed rs.error = at Tue Apr 05 14:20:56 CEST 2011 state Pending : created RequestFileStatus#-2134488784 failed with error:[ at Tue Apr 05 14:20:57 CEST 2011 state Failed : path does not exist and user has no permissions to create it]

        at gov.fnal.srm.util.SRMPutClientV1.start(SRMPutClientV1.java:333)
        at gov.fnal.srm.util.SRMDispatcher.work(SRMDispatcher.java:795)
        at gov.fnal.srm.util.SRMDispatcher.main(SRMDispatcher.java:374)

    srm copy of at least one file failed or not completed
    srm client error: java.rmi.RemoteException: srm advisoryDelete failed; nested exception is:
    java.lang.RuntimeException: advisoryDelete(User [name=cms001, uid=4051, gid=4050, root=/],/pnfs/iihe/cms/store/user/mmascher//0000/4067E353-7B5F-E011-BD53-D8D385AE85C0.root) Error file does not exist, cannot delete

(**) details

    Error in StageOut: 99109
    StageOutFailure
    Message: Failure for fallback stage out:

    TargetPFN : srm://ingrid-se02.cism.ucl.ac.be:8443/srm/managerv1?SFN=/pnfs/cism.ucl.ac.be/data/cms/sca06//store/user/mmascher//0000/103BCB72-805F-E011-9003-D8D385AE85A6.root
    ErrorCode : 60311
    ModuleName : WMCore.Storage.StageOutError
    MethodName : __init__
    LFN : /store/user/mmascher//0000/103BCB72-805F-E011-9003-D8D385AE85A6.root
    ClassInstance : None
    FileName : /scratch/288755.cream01.iihe.ac.be/CREAM596157229/job/WMCore/Storage/StageOutError.py
    ClassName : None
    Command : srm
    LineNumber : 32
    InputPFN : /scratch/288755.cream01.iihe.ac.be/CREAM596157229/job/WMTaskSpace/cmsRun1/output.root
    ErrorNr : 0
    ErrorType : GeneralStageOutFailure

    Traceback:
    Traceback (most recent call last):
      File "/scratch/288755.cream01.iihe.ac.be/CREAM596157229/job/WMCore/Storage/StageOutMgr.py", line 250, in fallbackStageOut
        impl(fbParams['command'], localPfn, pfn, fbParams.get("option", None))
      File "/scratch/288755.cream01.iihe.ac.be/CREAM596157229/job/WMCore/Storage/StageOutImpl.py", line 215, in __call__
        time.sleep(self.retryPause)
      File "/scratch/288755.cream01.iihe.ac.be/CREAM596157229/job/WMCore/WMSpec/Steps/Executors/LogArchive.py", line 39, in alarmHandler
        raise Alarm
    Alarm

type Misc. StageOut error: 99109

DMWMBot commented 13 years ago

mnorman: Extended the alarm timeouts and attached the timeout message to the error.
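For illustration, attaching the timeout message to the error could look roughly like the sketch below. This is not the actual WMCore change: `StageOutTimeout` and `run_with_timeout` are hypothetical names, and only `Alarm` and the handler pattern are taken from the traceback above. It relies on `signal.alarm`, so it only works on Unix and in the main thread.

```python
import signal
import time


class Alarm(Exception):
    """Raised by the SIGALRM handler when the time budget runs out."""


class StageOutTimeout(Exception):
    """Hypothetical error type that carries a readable timeout message."""


def alarm_handler(signum, frame):
    # Mirrors the alarmHandler frame in the traceback: just raise Alarm.
    raise Alarm


def run_with_timeout(func, timeout_seconds):
    """Run func(); turn a bare Alarm into an error that says what happened."""
    previous = signal.signal(signal.SIGALRM, alarm_handler)
    signal.alarm(timeout_seconds)
    try:
        return func()
    except Alarm:
        # Attach the timeout message instead of letting a bare Alarm escape.
        raise StageOutTimeout(
            "stage out timed out after %d seconds" % timeout_seconds
        )
    finally:
        # Always cancel the pending alarm and restore the old handler.
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)
```

With a wrapper like this, the report would contain a message naming the timeout rather than a traceback ending in a bare `Alarm`.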

What bothers me is the 99109 error - we should be getting an actual error number in the report.
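One way to get the actual error number into the report is to prefer the code carried by the exception and fall back to the generic one only when none is available. The sketch below is an assumption about how that could be done, not the WMCore implementation: `StageOutFailure`, `error_code` and `report_error_code` are hypothetical names, and the numbers 99109 and 60311 are taken from the report above.

```python
GENERIC_STAGEOUT_CODE = 99109  # generic code seen in the report above


class StageOutFailure(Exception):
    """Hypothetical exception that can carry a specific error code."""

    def __init__(self, message, error_code=None):
        super().__init__(message)
        self.error_code = error_code


def report_error_code(exc):
    """Return the exception's own code if it has one, else the generic code."""
    code = getattr(exc, "error_code", None)
    return code if code is not None else GENERIC_STAGEOUT_CODE
```

Under these assumptions, `report_error_code(StageOutFailure("fallback stage out failed", 60311))` yields 60311, while an exception without a code still maps to 99109.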

sfoulkes commented 13 years ago

sfoulkes: (In 87052676a37ae3f60ebcd4085deec0129046d391) Fix a couple issues with stageout timeouts and error reporting. Fixes #1361, #1381, #1390.

From: Matt Norman <mnorman@fnal.gov>
Signed-off-by: Steve Foulkes <sfoulkes@fnal.gov>

sfoulkes commented 13 years ago

sfoulkes: (In 12268) Fix a couple issues with stageout timeouts and error reporting. Fixes #1361, #1381, #1390.

From: Matt Norman <mnorman@fnal.gov>
Signed-off-by: Steve Foulkes <sfoulkes@fnal.gov>