dmwm / WMCore

Core workflow management components for CMS.

T0 Production Job Paused: StageOut error #11998

Open LinaresToine opened 1 month ago

LinaresToine commented 1 month ago

Impact of the bug: T0 Production Agent

Describe the bug: From the logs of the job, which I put in /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024D/FileOpenError, it seems that a job is failing during the stage out, and the problem seems to be tied to the file root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root. This file is not being created properly in the previous step of the job. From what @jeyserma helped us understand, there is a job that generates this file and is being marked as successful, when in reality it should be moved to the paused state.

How to reproduce it: Steps to reproduce the behavior will be added shortly.

Additional context and error message

Local Stage Out Catalog to be used is trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T0_CH_CERN/PhEDEx/storage.xml?protocol=eos
Trivial File Catalog has been loaded:
    lfn-to-pfn: protocol=eos path-match-re=/+store/t0streamer/(.*) result=root://eoscms.cern.ch//eos/cms/store/t0streamer/$1?eos.app=cmst0
    lfn-to-pfn: protocol=eos path-match-re=/+store/(.*) result=root://eoscms.cern.ch//eos/cms/tier0/store/$1?eos.app=cmst0
    lfn-to-pfn: protocol=xrootd path-match-re=/+store/(.*) result=root://xrootd-cms.infn.it//store/$1?eos.app=cmst0
    pfn-to-lfn: protocol=eos path-match-re=root://eoscms.cern.ch//eos/cms/tier0/store/(.*) result=/store/$1
    pfn-to-lfn: protocol=srmv2 path-match-re=.*\?SFN=(.*) result=$1 chain=direct
There are 0 fallback stage out definitions.

2024-05-21 21:33:55,243:INFO:StageOutMgr:==== Stageout configuration finish ====
2024-05-21 21:33:55,243:INFO:StageOut:Beginning report processing for step logArch1
2024-05-21 21:33:55,243:ERROR:StageOut:Cannot find report for step logArch1 in space /srv/job/WMTaskSpace/logArch1
2024-05-21 21:33:55,244:INFO:StageOut:Beginning report processing for step cmsRun1
2024-05-21 21:33:55,247:INFO:StageOutMgr:==>Working on file: /store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root
2024-05-21 21:33:55,247:INFO:StageOutMgr:===> Attempting Local Stage Out.
2024-05-21 21:33:55,247:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root
PFN: root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root?eos.app=cmst0
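(For reference, the lfn-to-pfn rules listed above are plain regular-expression substitutions. A minimal sketch of how the match just printed is produced, assuming nothing beyond the eos-protocol rules shown in this log:)

```python
import re

# The two eos-protocol lfn-to-pfn rules printed by the Trivial File Catalog above.
TFC_RULES = [
    (r"/+store/t0streamer/(.*)",
     r"root://eoscms.cern.ch//eos/cms/store/t0streamer/\1?eos.app=cmst0"),
    (r"/+store/(.*)",
     r"root://eoscms.cern.ch//eos/cms/tier0/store/\1?eos.app=cmst0"),
]

def lfn_to_pfn(lfn):
    """Return the PFN from the first rule whose path-match-re matches the LFN."""
    for match_re, result in TFC_RULES:
        if re.match(match_re, lfn):
            return re.sub(match_re, result, lfn)
    return None

lfn = ("/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/"
       "LogError-PromptReco-v1/000/380/932/00000/"
       "4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root")
# Prints the same PFN as reported in the log above, ending in ?eos.app=cmst0
print(lfn_to_pfn(lfn))
```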

2024-05-21 21:33:55,247:INFO:StageOutImpl:Creating output directory...
2024-05-21 21:33:55,253:INFO:StageOutImpl:Running the stage out...
2024-05-22 04:01:53,721:ERROR:StageOutImpl:Attempt 0 to stage out failed.
Automatically retrying in 300 secs
 Error details:
<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: 
    ClassName : None
    ModuleName : WMCore.Storage.StageOutError
    MethodName : __init__
    ClassInstance : None
    FileName : /srv/job/WMCore.zip/WMCore/Storage/StageOutError.py
    LineNumber : 32
    ErrorNr : 0
    Command : LOCAL_SIZE=`stat -c%s "/srv/job/WMTaskSpace/cmsRun1/SKIMStreamLogError.root"`
echo "Local File Size is: $LOCAL_SIZE"
env XRD_WRITERECOVERY=0 xrdcp --force --nopbar --cksum adler32:fbad12d5  "/srv/job/WMTaskSpace/cmsRun1/SKIMStreamLogError.root"  "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root?eos.app=cmst0" 
REMOTE_SIZE=`xrdfs 'eoscms.cern.ch' stat '/eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root' | grep Size | sed -r 's/.*Size:[ ]*([0-9]+).*/\1/'`
echo "Remote File Size is: $REMOTE_SIZE"
echo "Local File Checksum is: fbad12d5"
REMOTE_XS=`xrdfs 'eoscms.cern.ch' query checksum '/eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root' | grep -i adler32 | sed -r 's/.*[adler|ADLER]32[ ]*([0-9a-fA-F]{8}).*/\1/'`
echo "Remote File Checksum is: $REMOTE_XS"
if [ $REMOTE_SIZE ] && [ $REMOTE_XS ] && [ $LOCAL_SIZE == $REMOTE_SIZE ] && [ 'fbad12d5' == $REMOTE_XS ]; then exit 0; else echo "ERROR: Size or Checksum Mismatch between local and SE"; xrdfs 'eoscms.cern.ch' rm /eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root ; exit 60311 ; fi
    ExitCode : 60311
    ErrorCode : 60311
    ErrorType : GeneralStageOutFailure

Traceback: 
  File "/srv/job/WMCore.zip/WMCore/Storage/StageOutImpl.py", line 87, in executeCommand
    exitCode, output = runCommandWithOutput(command)

  File "/srv/job/WMCore.zip/WMCore/Storage/Execute.py", line 53, in runCommandWithOutput
    sout, serr = child.communicate()

  File "/cvmfs/cms.cern.ch/cc8_amd64_gcc9/external/python3/3.8.2-555576dae8d03b340ea3079ddb8abb0d/lib/python3.8/subprocess.py", line 1024, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)

  File "/cvmfs/cms.cern.ch/cc8_amd64_gcc9/external/python3/3.8.2-555576dae8d03b340ea3079ddb8abb0d/lib/python3.8/subprocess.py", line 1866, in _communicate
    ready = selector.select(timeout)

  File "/cvmfs/cms.cern.ch/cc8_amd64_gcc9/external/python3/3.8.2-555576dae8d03b340ea3079ddb8abb0d/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)

  File "/srv/job/WMCore.zip/WMCore/Algorithms/Alarm.py", line 39, in alarmHandler
    raise Alarm

<@---------- WMException End ----------@>

2024-05-22 04:06:53,823:INFO:StageOutImpl:Running the stage out...
2024-05-22 04:18:48,112:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 8332834
Remote File Size is: 8332834
Local File Checksum is: fbad12d5
Remote File Checksum is: fbad12d5

@jeyserma @germanfgv @amaltaro

jeyserma commented 1 month ago

Just want to add that this is the first time we have seen this, so it must be quite rare and likely due to some transient issue. We cannot reproduce it.

amaltaro commented 1 month ago

@jeyserma @LinaresToine the log provided under /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024D/FileOpenError belongs to a Merge job, which fails to read one of the input files and returns:

CMSSW Return code: 8020

Does the log snippet above belong to the merge job? Or is it actually the processing job that produced the unmerged ROOT file? If the latter, can you please share those logs as well?

LinaresToine commented 1 month ago

Hello @amaltaro, thank you for your response. The job in charge of generating the file is a PromptReco job. I put the tarball with the logs in /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024D/FileOpenError/InputJob3173771/Job_2937895.tar.bz2

amaltaro commented 1 month ago

Thank you, Antonio. The following log records from wmagentJob.log look strange to me:

2024-05-21 21:33:55,247:INFO:StageOutImpl:Creating output directory...
2024-05-21 21:33:55,253:INFO:StageOutImpl:Running the stage out...
2024-05-22 04:01:53,721:ERROR:StageOutImpl:Attempt 0 to stage out failed.
Automatically retrying in 300 secs

followed by:

2024-05-22 04:06:53,823:INFO:StageOutImpl:Running the stage out...
2024-05-22 04:18:48,112:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 8332834
Remote File Size is: 8332834
Local File Checksum is: fbad12d5
Remote File Checksum is: fbad12d5

Observations are:

If I am reading the code correctly, this is the high-level layer where the stage out is set up: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/Steps/Executors/StageOut.py#L195-L215

which then calls https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/StageOutMgr.py#L235-L267

and finally it goes here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/StageOutImpl.py#L169
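For reference, here is a simplified, self-contained sketch of the pattern at that last layer: a retry loop around the transfer command, with a SIGALRM watchdog that raises Alarm when the command hangs. This matches the traceback above, but the helper names, timeout, and retry count below are illustrative assumptions, not the actual WMCore implementation:

```python
import signal
import subprocess
import time

class Alarm(Exception):
    """Raised by the SIGALRM handler, as in WMCore/Algorithms/Alarm.py."""

def alarm_handler(signum, frame):
    raise Alarm

def run_with_watchdog(command, timeout_seconds):
    """Run a shell command; raise Alarm if it is still running after timeout_seconds."""
    signal.signal(signal.SIGALRM, alarm_handler)
    signal.alarm(timeout_seconds)
    child = subprocess.Popen(command, shell=True,
                             stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    try:
        stdout, _ = child.communicate()  # Alarm can be raised from inside this call
        return child.returncode, stdout.decode()
    finally:
        signal.alarm(0)
        # Note: if Alarm fired mid-transfer, this sketch does not kill the child
        # (e.g. xrdcp), so its connection to the storage element can linger while
        # the Python side already moves on to the next attempt.

def stage_out_with_retries(command, retries=3, pause=300, timeout=4 * 3600):
    """Retry the stage out command, sleeping `pause` seconds between failed attempts."""
    last_error = None
    for attempt in range(retries):
        try:
            rc, output = run_with_watchdog(command, timeout)
            if rc == 0:
                return output
            last_error = RuntimeError(f"attempt {attempt} exited with code {rc}")
        except Alarm:
            last_error = RuntimeError(f"attempt {attempt} timed out after {timeout}s")
        time.sleep(pause)  # "Automatically retrying in 300 secs"
    raise last_error
```

In the failing job above, the first attempt sat between "Running the stage out..." (21:33:55) and the Alarm-driven failure (04:01:53), and the immediate retry then reported a successful transfer with matching size and checksum.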

jeyserma commented 1 week ago

Hi @amaltaro, we have two more occurrences of this problem; they happened more or less at the same time (same run). The following files are missing, leading to a paused PromptReco job:

/eos/cms/tier0/store/unmerged/data/Run2024F/EGamma1/ALCARECO/HcalCalIsoTrkProducerFilter-PromptReco-v1/000/382/300/00000/85ba8d74-a4e6-4e3e-8d2a-b2d530c8e721.root
/eos/cms/tier0/store/unmerged/data/Run2024F/EGamma0/RAW-RECO/EGMJME-PromptReco-v1/000/382/300/00001/f6163d72-cf0d-45b6-943e-edbb0398388e.root

I have put the logs of the jobs that should have produced these missing files here: /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024F/missingFiles

I don't see the same behavior from the logs as you noticed previously.

One thing to note is that all three occurrences are physics skims or AlcaReco skims. Maybe that can give us a hint for understanding the problem.

Thanks, Jan

LinaresToine commented 1 week ago

Hi @amaltaro

Thank you for the analysis provided in your previous comment. Following your pointers, I am considering an intervention in https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/StageOutImpl.py#L228 so that the job is moved to the paused state. Would you know how this can be done?

There are two problems:

I think that marking the job as paused can help us debug future occurrences.

It is also strange to me that the issue affects only a few jobs; in that case, I believe it is possibly not a WMCore issue.
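Regarding the question above on how such an intervention could move the job to paused: a purely hypothetical sketch of the general idea would be to raise a dedicated error with its own exit code once all retries are exhausted, so that this failure can be told apart from a generic stage-out failure downstream. Whether and where the agent maps such a code to the paused state is exactly the open question; the class name and code value below are made up:

```python
# Hypothetical sketch only, not WMCore code: flag "retries exhausted during
# stage out" with a distinct exception and exit code, so downstream error
# handling could single these jobs out (e.g. for pausing) instead of treating
# them like any other stage out failure.
HYPOTHETICAL_EXHAUSTED_CODE = 60999  # made-up value for illustration

class StageOutRetriesExhausted(Exception):
    def __init__(self, message, exit_code=HYPOTHETICAL_EXHAUSTED_CODE):
        super().__init__(message)
        self.exit_code = exit_code

def stage_out_or_flag(attempt_once, retries=3):
    """Try the stage out up to `retries` times; raise the dedicated error on exhaustion."""
    last = None
    for _ in range(retries):
        try:
            return attempt_once()
        except Exception as exc:  # an individual attempt failed
            last = exc
    raise StageOutRetriesExhausted(f"all {retries} attempts failed, last error: {last}")
```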

amaltaro commented 4 days ago

@LinaresToine Antonio, I briefly mentioned this to @anpicci today: I created the debug patch https://github.com/dmwm/WMCore/pull/12035 because I would like to test the signal handling/alarm configuration in the stage out package.

Could you please start a very small T0 replay, with this patch applied, such that we can try to reproduce this problem? Note that I expect all jobs to fail, as I am intentionally hitting the Alarm exception during the stage out process.

germanfgv commented 3 days ago

After checking with the CERN IT folks, this is what they found:

240625 10:13:26 3416068 XrdLink: Unable to send to pcms02.5322:3023@c01-018-171.gridka.de; broken pipe
240625 10:13:26 3416068 XrootdXeq: pcms02.5322:3023@c01-018-171.gridka.de disc 0:01:18 (send failure)

240625 10:15:26 time=1719303326.373500 func=_close                   level=WARN  logid=9e0b8254-32ca-11ef-b77a-a4bf0179624c unit=fst@st-096-ii201a6a.cern.ch:1095 tid=00007f715aa69640 source=XrdFstOfsFile:2173             tident=pcms02.5322:3023@c01-018-171.gridka.de sec=      uid=103031 gid=1399 name=nobody geo="" info="deleting on close" fn=/eos/cms/tier0/store/unmerged/data/Run2024F/EGamma0/RAW-RECO/EGMJME-PromptReco-v1/000/382/300/00001/f6163d72-cf0d-45b6-943e-edbb0398388e.root fstpath=/data68/0006ff9b/11160a1f0 reason="client disconnect"
240625 10:15:26 time=1719303326.373622 func=_close                   level=INFO  logid=9e0b8254-32ca-11ef-b77a-a4bf0179624c unit=fst@st-096-ii201a6a.cern.ch:1095 tid=00007f715aa69640 source=XrdFstOfsFile:2263             tident=pcms02.5322:3023@c01-018-171.gridka.de sec=      uid=103031 gid=1399 name=nobody geo="" msg="done close" rc=-1 errc=5

The file got cleaned up because the connection from the client broke down without sending the close. From the logs, the timeline goes like this:

I guess we need to figure out how to deal with these situations.