dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Submission to T1_CH_CERN (and T2_CH_CERN) #3649

Closed amaltaro closed 12 years ago

amaltaro commented 12 years ago

Trying to inject and run a workflow at CERN, I see that all Merge and LogCollect jobs are failing.

RequestName: amaltaro_RV_TEST_taskChain_ZEE_120425_152357_7966
I am using the cmsweb testbed, and the agent is at cmssrv113.

The guys in the office said it's likely due to the different protocol (xrootd) used at CERN. Samir also pointed me to a fix for the T0 WMA (#3311); however, that fix works only for CERN and would fail for all other sites.

Is there any patch available for this issue? Is it possible to run at T2_CH_CERN using EOS instead of castor?

Thanks, Alan.

amaltaro commented 12 years ago

amaltaro: I have additional information for this case:

The outputPFN for a production job (23466) was: /castor/cern.ch/cms/store/unmerged/...

However, the merge jobs (24116) are looking for files at: root://eoscms//eos/cms/store/unmerged/...

That's why it's not fully working. From my side, it would be good to have all stage-in/stage-out go to EOS.

amaltaro commented 12 years ago

amaltaro: Hi Steve, Matt, do you have any hint on what is needed to get these jobs running fine at CERN?

hufnagel commented 12 years ago

hufnagel: Non-Tier0 CERN production should not use the t0export pool. That's all the #3311 patch fixes, so ignore it; it's not relevant for this use case.

What is the full output LFN? You cut it off too early; TFC rules differ for directories further down.

hufnagel commented 12 years ago

hufnagel: Checked a bit more, the different LFN are not a problem, but a feature. Stageout for anything but

/store/unmerged/relval/

goes to castor. Reads all go through EOS; for files that are not there, it should fall back to the corresponding castor location.

Given this, you have to provide more information to debug the problem. Everything you have shown so far looks exactly as it should.

Where are you staging the files to, and what are the error messages from the failing Merge and LogCollect jobs?
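As an illustration of the routing described above, here is a minimal Python sketch. It is not the real TFC and not WMCore code: the two prefixes are copied from the PFNs quoted earlier in this thread, the function names are made up, and the assumption that an EOS stageout PFN uses the same prefix as an EOS read is mine.

```python
# Hypothetical sketch of the CERN routing described in this thread; not the real TFC.
CASTOR_PREFIX = "/castor/cern.ch/cms"    # prefix seen in the production job's output PFN
EOS_PREFIX = "root://eoscms//eos/cms"    # prefix seen in the merge job's input PFN
EOS_STAGEOUT_AREAS = ("/store/unmerged/relval/",)  # the one exception named above


def stageout_pfn(lfn: str) -> str:
    """Return where a job would write the file for this LFN."""
    if lfn.startswith(EOS_STAGEOUT_AREAS):
        # Assumption: EOS stageout uses the same prefix as EOS reads.
        return EOS_PREFIX + lfn
    return CASTOR_PREFIX + lfn


def read_pfn(lfn: str) -> str:
    """Return where a job would read the file from; reads always go via EOS."""
    return EOS_PREFIX + lfn
```

With this mapping, a file under /store/unmerged/ that is not in the relval area is written to castor but read back through EOS, so the write PFN and the read PFN for the same LFN differ by design.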

amaltaro commented 12 years ago

amaltaro: The logs were already cleaned up and I have to inject a new request in order to get this information. Once the testbed deployment is done, I'll do that.

"Checked a bit more, the different LFN are not a problem, but a feature."

Indeed! But the problem is having different PFNs. The production jobs are staging out to /castor/* and the merge jobs are looking for those unmerged files at EOS.

"Reading back goes all through EOS, for the files that are not there it should fall back to the corresponding castor location."

Then the merge jobs are looking for the files in the right place, but the fallback is not being properly handled by WMA.

I'll come back later with more debug information. Thanks for looking at it!

hufnagel commented 12 years ago

hufnagel: I mistyped, the different PFN are the feature! (Almost) all reads at CERN go via EOS now; if EOS cannot find the file on its disk servers, it reroutes to castor. That happens automatically and is a site thing; it has nothing to do with WMA code. So stageout to castor and reading back via EOS is not a bug, it's a feature.

It has been a long time since we have run non-Tier0 production at CERN though, so I am not sure we ever tried this with unmerged files.

We already override stageout for relval to go to EOS, we might have to do the same for unmerged (or adjust the read rules).

But do your test first, would be good to see the error message.

amaltaro commented 12 years ago

amaltaro: Hi Dirk,

First, some details about the '''production job''' that succeeded:

    "lfn": "/store/unmerged/600pre1_TESTtaskChain/RelValZEEFS/GEN-SIM-DIGI-RECO/v4/0000/747D1900-1093-E111-8DEB-003048F23C0E.root",
    "dataset": {
        "applicationName": "cmsRun",
        "applicationVersion": "CMSSW_6_0_0_pre1",
        "processedDataset": "600pre1_TESTtaskChain-v4",
        "dataTier": "GEN-SIM-DIGI-RECO",
        "primaryDataset": "RelValZEEFS"
    },
    "InputPFN": "/pool/grid/cmsprd/home_cream_233714504/CREAM233714504/glide_Tc2102/execute/dir_5478/job/WMTaskSpace/cmsRun1/FEVTDEBUGHLToutput.root",
    (... etc etc ...)
    "pfn": "/pool/grid/cmsprd/home_cream_233714504/CREAM233714504/glide_Tc2102/execute/dir_5478/job/WMTaskSpace/cmsRun1/FEVTDEBUGHLToutput.root",
    "catalog": "",
    "module_label": "FEVTDEBUGHLToutput",
    "inputPath": null,
    "StageOutCommand": "rfcp-CERN",
    (... etc ...)
    "OutputPFN": "/castor/cern.ch/cms/store/unmerged/600pre1_TESTtaskChain/RelValZEEFS/GEN-SIM-DIGI-RECO/v4/0000/747D1900-1093-E111-8DEB-003048F23C0E.root",

merge jobs (extracted from Futon in the local couchDB):

Now details for the '''merge job''' that failed:

    "input": {
        "input_source_class": "PoolSource",
        "input_type": "primaryFiles",
        "lfn": "/store/unmerged/600pre1_TESTtaskChain/RelValZEEFS/GEN-SIM-DIGI-RECO/v4/0000/F4A89CAD-1093-E111-9363-003048C940A2.root",
        "pfn": "root://eoscms//eos/cms/store/unmerged/600pre1_TESTtaskChain/RelValZEEFS/GEN-SIM-DIGI-RECO/v4/0000/F4A89CAD-1093-E111-9363-003048C940A2.root?svcClass=default",
        "catalog": "",
        "module_label": "source",
        "guid": "F4A89CAD-1093-E111-9363-003048C940A2",

And here is the '''error message''' for the merge job (from summary page):

Error in StageOut: 99109
StageOutFailure
Message: Failure for local stage out:
StageOutError
Message: Cannot parse directory out of targetPFN
ErrorCode : 60311
ModuleName : WMCore.Storage.StageOutError
MethodName : __init__
ErrorType : GeneralStageOutFailure
ClassInstance : None
FileName : /pool/grid/cmsprd/home_cream_026629231/CREAM026629231/glide_ZZ7292/execute/dir_6924/job/WMCore.zip/WMCore/Storage/StageOutError.py
ClassName : None
LineNumber : 32
ErrorNr : 0

Traceback:
Traceback (most recent call last):
  File "/pool/grid/cmsprd/home_cream_026629231/CREAM026629231/glide_ZZ7292/execute/dir_6924/job/WMCore.zip/WMCore/Storage/StageOutImpl.py", line 169, in __call__
    self.createOutputDirectory(targetPFN)
  File "/pool/grid/cmsprd/home_cream_026629231/CREAM026629231/glide_ZZ7292/execute/dir_6924/job/WMCore.zip/WMCore/Storage/Backends/RFCPCERNImpl.py", line 74, in createOutputDirectory
    raise StageOutError("Cannot parse directory out of targetPFN")
StageOutError: StageOutError
Message: Cannot parse directory out of targetPFN
ErrorCode : 60311
ModuleName : WMCore.Storage.StageOutError
MethodName : __init__
ErrorType : GeneralStageOutFailure
ClassInstance : None
FileName : /pool/grid/cmsprd/home_cream_026629231/CREAM026629231/glide_ZZ7292/execute/dir_6924/job/WMCore.zip/WMCore/Storage/StageOutError.py
ClassName : None
LineNumber : 32
ErrorNr : 0

TargetPFN : rfio://castorcms//castor/cern.ch/cms/store/backfill/1/600pre1_TEST_taskChain_/RelValZEEFS/DQM/v4/0000/26A72003-2093-E111-82C9-00237DDBE74C.root?svcClass=t0export&stageHost=castorcms
ErrorCode : 60311
ModuleName : WMCore.Storage.StageOutError
MethodName : __init__
LFN : /store/backfill/1/600pre1_TEST_taskChain_/RelValZEEFS/DQM/v4/0000/26A72003-2093-E111-82C9-00237DDBE74C.root
ClassInstance : None
FileName : /pool/grid/cmsprd/home_cream_026629231/CREAM026629231/glide_ZZ7292/execute/dir_6924/job/WMCore.zip/WMCore/Storage/StageOutError.py
ClassName : None
Command : rfcp-CERN
LineNumber : 32
InputPFN : /pool/grid/cmsprd/home_cream_026629231/CREAM026629231/glide_ZZ7292/execute/dir_6924/job/WMTaskSpace/cmsRun1/Merged.root
Protocol : stageout
ErrorNr : 0
ErrorType : GeneralStageOutFailure

Traceback:
Traceback (most recent call last):
  File "/pool/grid/cmsprd/home_cream_026629231/CREAM026629231/glide_ZZ7292/execute/dir_6924/job/WMCore.zip/WMCore/Storage/StageOutMgr.py", line 297, in localStageOut
    impl(protocol, localPfn, pfn, options)

hufnagel commented 12 years ago

hufnagel: Ok, this has nothing to do with the merge job not being able to read its input file. The merge job cannot stage out its output files. The reason is that you are using an old wmagent that does not support stageout at CERN. You need wmagent 0.8.27 or newer.

Alternatively, you could try to apply the patch in #3280, but I am not sure it'll apply cleanly against your wmagent version.
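To make the "Cannot parse directory out of targetPFN" failure more concrete, here is a minimal Python sketch of pulling the directory out of a target PFN that carries an rfio:// scheme and a query string, like the TargetPFN in the error report. This is not the code in RFCPCERNImpl.py and not the #3280 patch; the regular expression and the function name target_directory are assumptions made for illustration.

```python
import posixpath
import re

# Hypothetical pattern: an optional "rfio://host/" scheme, then the /castor path,
# then an optional "?key=value&..." query string.
_PFN_RE = re.compile(r"^(?:rfio://[^/]+/)?(?P<path>/castor/[^?]+)(?:\?.*)?$")


def target_directory(target_pfn: str) -> str:
    """Return the directory part of a castor PFN, ignoring scheme and query."""
    match = _PFN_RE.match(target_pfn)
    if match is None:
        raise ValueError("Cannot parse directory out of targetPFN: %r" % target_pfn)
    return posixpath.dirname(match.group("path"))
```

A parser that only expects a bare /castor/... path would reject the rfio:// form with its svcClass and stageHost options, which is one way an error like the one reported could arise.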

amaltaro commented 12 years ago

amaltaro: Thanks Dirk! Do you know if this change would "break" workflows injected to other sites? Or is it harmless, and everything should keep running fine?

hufnagel commented 12 years ago

hufnagel: The patch in #3280 changes a generic interface and therefore touches all stageout implementations, not just the one for CERN.

That being said, it only adds a new parameter that has a default and is unused except for stageout at CERN, so it should not affect stageout to other sites in any way.
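The kind of change described here, a new parameter with a default on a shared interface, can be sketched in Python as follows. The class names, the method name create_output_directory and the parameter site_hint are all hypothetical; the actual parameter added by #3280 is not named in this thread.

```python
class StageOutImplSketch:
    """Stand-in for a generic stage-out interface (not the WMCore class)."""

    def create_output_directory(self, target_pfn, site_hint=None):
        # site_hint is the hypothetical new parameter. Because it has a default,
        # existing callers that pass only target_pfn keep working unchanged.
        return "mkdir for %s" % target_pfn


class GenericBackend(StageOutImplSketch):
    """A backend for some other site: inherits the method and ignores site_hint."""


class CernBackend(StageOutImplSketch):
    """The only backend that actually looks at the new parameter."""

    def create_output_directory(self, target_pfn, site_hint=None):
        if site_hint is not None:
            return "mkdir for %s (hint=%s)" % (target_pfn, site_hint)
        return super().create_output_directory(target_pfn)
```

Every implementation's signature changes, which is why such a patch touches all backends, but behaviour only differs where the new argument is both passed and used.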

amaltaro commented 12 years ago

amaltaro: Thanks Dirk, but this patch didn't work on my 0.8.26.pre1 version. I intend to upgrade it to a newer version (maybe *44). Just in case, this was the failure:

Traceback:
Traceback (most recent call last):
  File "/pool/grid/cmsprd/home_cream_185562844/CREAM185562844/glide_JF5756/execute/dir_10833/job/WMCore.zip/WMCore/Storage/StageOutImpl.py", line 169, in __call__
    self.createOutputDirectory(targetPFN)
  File "/pool/grid/cmsprd/home_cream_185562844/CREAM185562844/glide_JF5756/execute/dir_10833/job/WMCore.zip/WMCore/Storage/Backends/RFCPCERNImpl.py", line 83, in createOutputDirectory
    self.createDir(targetDir)
  File "/pool/grid/cmsprd/home_cream_185562844/CREAM185562844/glide_JF5756/execute/dir_10833/job/WMCore.zip/WMCore/Storage/Backends/RFCPCERNImpl.py", line 217, in createDir
    execute(command)
  File "/pool/grid/cmsprd/home_cream_185562844/CREAM185562844/glide_JF5756/execute/dir_10833/job/WMCore.zip/WMCore/Storage/Execute.py", line 155, in execute
    raise StageOutError(msg, Command = command, ExitCode = exitCode)
StageOutError: StageOutError
Message: Command exited non-zero
ErrorCode : 60311
ModuleName : WMCore.Storage.StageOutError
MethodName : __init__
ErrorType : GeneralStageOutFailure
ClassInstance : None
FileName : /pool/grid/cmsprd/home_cream_185562844/CREAM185562844/glide_JF5756/execute/dir_10833/job/WMCore.zip/WMCore/Storage/StageOutError.py
ClassName : None
Command : nsmkdir -p "/castor/cern.ch/cms/store/backfill/1/524_TEST_CERNWMA/RelValZEEFS/GEN-SIM-DIGI-RECO/v3/0000"
LineNumber : 32
ErrorNr : 0
ExitCode : 1

hufnagel commented 12 years ago

hufnagel: It fails to create the output directory on castor. That could be a permission issue. I see you are running under the cmsprd account, which I think is the grid account that jobs with the production role are mapped to. I will check with Stephen to see whether that account should be able to write to the output area:

rfio://castorcms//castor/cern.ch/cms/store/backfill/1/...?svcClass=t0export&stageHost=castorcms
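For context on how a failing directory-creation command turns into the "Command exited non-zero" error above, here is a minimal Python sketch of a command wrapper that raises on a non-zero exit code and keeps the command's output for diagnosis. It is not WMCore's Execute.py; the names CommandFailed and execute are assumptions for illustration.

```python
import subprocess


class CommandFailed(Exception):
    """Raised when a command exits non-zero; keeps details for debugging."""

    def __init__(self, command, exit_code, output):
        super().__init__("Command exited non-zero")
        self.command = command
        self.exit_code = exit_code
        self.output = output


def execute(command):
    """Run a command given as a list of arguments and return its output.

    stderr is merged into stdout so that a message such as a permission
    error ends up in the exception instead of being lost.
    """
    result = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    if result.returncode != 0:
        raise CommandFailed(command, result.returncode, result.stdout)
    return result.stdout
```

Logging the captured output alongside the exit code would show directly whether a command like nsmkdir failed because of permissions or for some other reason.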

gowdy commented 12 years ago

gowdy: Dirk is correct. cmsprd can't write to that directory. It also can't write to that pool. Only phedex, relval and phedex have permission to write to most places. We could add cmsprd to the backfill area if it is really needed.

amaltaro commented 12 years ago

amaltaro: I'm commissioning RelVals on WMAgent/taskChain, so I'm running some tests on it. Then I'm gonna inject a new workflow and change it not to use backfill LFN (probably /storage/mc or /storage/relval). Thanks!

hufnagel commented 12 years ago

hufnagel: Ok, but you do not run as the relval user; you run as a normal grid user with the production role. And as things are currently set up, you won't be able to stage out to very many places at CERN that way. Ops needs to decide which accounts need to be able to write where, then Stephen can set the Castor/EOS permissions accordingly. We should stop the discussion here; this is really not a code issue.