dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

T0 Production Job Paused: StageOut error #11998

Closed LinaresToine closed 3 months ago

LinaresToine commented 6 months ago

Impact of the bug T0 Production Agent

Describe the bug From the logs of the job, which I put in /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024D/FileOpenError, it seems that a job is failing during the stageOut, and the problem seems to be tied to the file root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root. This file is not created properly in the previous step of the job. From what @jeyserma helped us understand, the job that generates this file is being marked as successful, when in reality it should be moved to paused.

How to reproduce it Steps to reproduce the behavior: Will update this portion shortly

Additional context and error message

Local Stage Out Catalog to be used is trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/T0_CH_CERN/PhEDEx/storage.xml?protocol=eos
Trivial File Catalog has been loaded:
    lfn-to-pfn: protocol=eos path-match-re=/+store/t0streamer/(.*) result=root://eoscms.cern.ch//eos/cms/store/t0streamer/$1?eos.app=cmst0
    lfn-to-pfn: protocol=eos path-match-re=/+store/(.*) result=root://eoscms.cern.ch//eos/cms/tier0/store/$1?eos.app=cmst0
    lfn-to-pfn: protocol=xrootd path-match-re=/+store/(.*) result=root://xrootd-cms.infn.it//store/$1?eos.app=cmst0
    pfn-to-lfn: protocol=eos path-match-re=root://eoscms.cern.ch//eos/cms/tier0/store/(.*) result=/store/$1
    pfn-to-lfn: protocol=srmv2 path-match-re=.*\?SFN=(.*) result=$1 chain=direct
There are 0 fallback stage out definitions.

2024-05-21 21:33:55,243:INFO:StageOutMgr:==== Stageout configuration finish ====
2024-05-21 21:33:55,243:INFO:StageOut:Beginning report processing for step logArch1
2024-05-21 21:33:55,243:ERROR:StageOut:Cannot find report for step logArch1 in space /srv/job/WMTaskSpace/logArch1
2024-05-21 21:33:55,244:INFO:StageOut:Beginning report processing for step cmsRun1
2024-05-21 21:33:55,247:INFO:StageOutMgr:==>Working on file: /store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root
2024-05-21 21:33:55,247:INFO:StageOutMgr:===> Attempting Local Stage Out.
2024-05-21 21:33:55,247:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root
PFN: root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root?eos.app=cmst0

2024-05-21 21:33:55,247:INFO:StageOutImpl:Creating output directory...
2024-05-21 21:33:55,253:INFO:StageOutImpl:Running the stage out...
2024-05-22 04:01:53,721:ERROR:StageOutImpl:Attempt 0 to stage out failed.
Automatically retrying in 300 secs
 Error details:
<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: 
    ClassName : None
    ModuleName : WMCore.Storage.StageOutError
    MethodName : __init__
    ClassInstance : None
    FileName : /srv/job/WMCore.zip/WMCore/Storage/StageOutError.py
    LineNumber : 32
    ErrorNr : 0
    Command : LOCAL_SIZE=`stat -c%s "/srv/job/WMTaskSpace/cmsRun1/SKIMStreamLogError.root"`
echo "Local File Size is: $LOCAL_SIZE"
env XRD_WRITERECOVERY=0 xrdcp --force --nopbar --cksum adler32:fbad12d5  "/srv/job/WMTaskSpace/cmsRun1/SKIMStreamLogError.root"  "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root?eos.app=cmst0" 
REMOTE_SIZE=`xrdfs 'eoscms.cern.ch' stat '/eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root' | grep Size | sed -r 's/.*Size:[ ]*([0-9]+).*/\1/'`
echo "Remote File Size is: $REMOTE_SIZE"
echo "Local File Checksum is: fbad12d5"
REMOTE_XS=`xrdfs 'eoscms.cern.ch' query checksum '/eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root' | grep -i adler32 | sed -r 's/.*[adler|ADLER]32[ ]*([0-9a-fA-F]{8}).*/\1/'`
echo "Remote File Checksum is: $REMOTE_XS"
if [ $REMOTE_SIZE ] && [ $REMOTE_XS ] && [ $LOCAL_SIZE == $REMOTE_SIZE ] && [ 'fbad12d5' == $REMOTE_XS ]; then exit 0; else echo "ERROR: Size or Checksum Mismatch between local and SE"; xrdfs 'eoscms.cern.ch' rm /eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root ; exit 60311 ; fi
    ExitCode : 60311
    ErrorCode : 60311
    ErrorType : GeneralStageOutFailure

Traceback: 
  File "/srv/job/WMCore.zip/WMCore/Storage/StageOutImpl.py", line 87, in executeCommand
    exitCode, output = runCommandWithOutput(command)

  File "/srv/job/WMCore.zip/WMCore/Storage/Execute.py", line 53, in runCommandWithOutput
    sout, serr = child.communicate()

  File "/cvmfs/cms.cern.ch/cc8_amd64_gcc9/external/python3/3.8.2-555576dae8d03b340ea3079ddb8abb0d/lib/python3.8/subprocess.py", line 1024, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)

  File "/cvmfs/cms.cern.ch/cc8_amd64_gcc9/external/python3/3.8.2-555576dae8d03b340ea3079ddb8abb0d/lib/python3.8/subprocess.py", line 1866, in _communicate
    ready = selector.select(timeout)

  File "/cvmfs/cms.cern.ch/cc8_amd64_gcc9/external/python3/3.8.2-555576dae8d03b340ea3079ddb8abb0d/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)

  File "/srv/job/WMCore.zip/WMCore/Algorithms/Alarm.py", line 39, in alarmHandler
    raise Alarm

<@---------- WMException End ----------@>

2024-05-22 04:06:53,823:INFO:StageOutImpl:Running the stage out...
2024-05-22 04:18:48,112:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 8332834
Remote File Size is: 8332834
Local File Checksum is: fbad12d5
Remote File Checksum is: fbad12d5

@jeyserma @germanfgv @amaltaro

jeyserma commented 5 months ago

Just want to add that this is the first time we have seen this, so it must be quite rare and due to some transient issue. We cannot reproduce it.

amaltaro commented 5 months ago

@jeyserma @LinaresToine the log provided under /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024D/FileOpenError belongs to a Merge job, which fails to read one of the input files and returns:

CMSSW Return code: 8020

Does the log snippet above belong to the merge job, or is it actually from the processing job that produced the unmerged root file? If the latter, can you please share those logs as well?

LinaresToine commented 5 months ago

Hello @amaltaro, thank you for your response. The job in charge of generating the file is a PromptReco job. I put the tarball with the logs in /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024D/FileOpenError/InputJob3173771/Job_2937895.tar.bz2

amaltaro commented 5 months ago

Thank you, Antonio. The following log records from wmagentJob.log look strange to me:

2024-05-21 21:33:55,247:INFO:StageOutImpl:Creating output directory...
2024-05-21 21:33:55,253:INFO:StageOutImpl:Running the stage out...
2024-05-22 04:01:53,721:ERROR:StageOutImpl:Attempt 0 to stage out failed.
Automatically retrying in 300 secs

followed by:

2024-05-22 04:06:53,823:INFO:StageOutImpl:Running the stage out...
2024-05-22 04:18:48,112:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 8332834
Remote File Size is: 8332834
Local File Checksum is: fbad12d5
Remote File Checksum is: fbad12d5

Observations are:

If I am reading the code correctly, this is the high-level layer where the stage out is set up: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/Steps/Executors/StageOut.py#L195-L215

which then calls https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/StageOutMgr.py#L235-L267

and finally it goes here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/StageOutImpl.py#L169
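
To make the flow above easier to follow, here is a much-simplified sketch of how I read the retry behaviour in StageOutImpl. The function and constant names are mine; only the log messages are taken from the job log above, so treat this as an illustration rather than the actual code:

```python
import time

# Assumed values: the log above shows "Automatically retrying in 300 secs";
# the number of attempts is a guess for illustration purposes.
NUM_RETRIES = 3
RETRY_PAUSE = 300


def stage_out_with_retries(run_command, command):
    """Sketch of the retry loop (simplified).

    run_command stands in for StageOutImpl.executeCommand(), which raises an
    exception when the command exits non-zero or when the Alarm timeout fires.
    """
    last_error = None
    for attempt in range(NUM_RETRIES):
        print("Running the stage out...")
        try:
            run_command(command)   # may raise on timeout or non-zero exit code
            return                 # success: no further attempts, job continues
        except Exception as ex:
            last_error = ex
            print("Attempt %d to stage out failed." % attempt)
            print("Automatically retrying in %d secs" % RETRY_PAUSE)
            time.sleep(RETRY_PAUSE)
    raise RuntimeError("All stage out attempts failed") from last_error
```

Note how a later successful attempt completely masks an earlier failed one, which matches the job log: attempt 0 timed out after several hours, the retry reported matching size and checksum, and the job was marked successful.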

jeyserma commented 5 months ago

Hi @amaltaro, we have two more occurrences of this problem; they happened more or less at the same time (same run). The following files are missing, leading to a paused PromptReco job:

/eos/cms/tier0/store/unmerged/data/Run2024F/EGamma1/ALCARECO/HcalCalIsoTrkProducerFilter-PromptReco-v1/000/382/300/00000/85ba8d74-a4e6-4e3e-8d2a-b2d530c8e721.root
/eos/cms/tier0/store/unmerged/data/Run2024F/EGamma0/RAW-RECO/EGMJME-PromptReco-v1/000/382/300/00001/f6163d72-cf0d-45b6-943e-edbb0398388e.root

I have put the logs of the jobs that should have produced these missing files here: /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024F/missingFiles

From these logs I don't see the same behavior that you noticed previously.

One thing to note is that all three occurrences are physics skims or AlcaReco skims. Maybe that can give us a hint to understanding the problem.

Thanks, Jan

LinaresToine commented 5 months ago

Hi @amaltaro

Thank you for the analysis provided in your previous comment. After following your pointers, I am considering making an intervention in https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/StageOutImpl.py#L228 so that the job is moved to the paused state (a rough sketch of what I mean is at the end of this comment). Would you know how this can be done?

There are two problems:

I think that marking the job as paused can help us debug future occurrences.

It is also strange to me that the issue affects only a few jobs; if that is the case, I believe it is possibly not a WMCore issue.
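
For the record, a purely hypothetical sketch of the kind of intervention I have in mind around StageOutImpl.py#L228: raise an error carrying a dedicated exit code, so that the error handling could be configured to pause the job rather than mark it successful. The stand-in class, the exit-code value 60312, and the function name are all my assumptions, and I do not know yet whether a new exit code alone is enough to move the job to paused:

```python
# Stand-in for WMCore.Storage.StageOutError.StageOutError; the real constructor
# signature may differ, this only illustrates the idea.
class StageOutError(Exception):
    def __init__(self, message, **data):
        super().__init__(message)
        self.data = data


# Made-up exit code for "stage out looks inconsistent"; the actual value would
# need to be agreed with the agent's error-handling configuration.
EX_STAGEOUT_SUSPICIOUS = 60312


def flag_for_pause(attempts, last_output):
    """Raise an error with a distinct exit code instead of a generic failure."""
    raise StageOutError(
        "Stage out looks inconsistent after %d attempts" % attempts,
        ErrorCode=EX_STAGEOUT_SUSPICIOUS,
        ExitCode=EX_STAGEOUT_SUSPICIOUS,
        CommandOutput=last_output,
    )
```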

amaltaro commented 4 months ago

@LinaresToine Antonio, I have briefly mentioned it to @anpicci today: I created this debug patch, https://github.com/dmwm/WMCore/pull/12035, because I would like to test the signal handling/alarm configuration in the stage out package.

Could you please start a very small T0 replay, with this patch applied, such that we can try to reproduce this problem? Note that I expect all jobs to fail, as I am intentionally hitting the Alarm exception during the stage out process.
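
For anyone following along: the Alarm exception in the traceback above comes from a SIGALRM-based watchdog around the stage-out command (the WMCore.Algorithms.Alarm handler), and the debug patch deliberately exercises that path. A minimal standalone illustration of the mechanism, not the actual WMCore code, looks roughly like this:

```python
import signal
import subprocess


class Alarm(Exception):
    """Raised when the watchdog timer fires, as in WMCore.Algorithms.Alarm."""


def alarm_handler(signum, frame):
    raise Alarm


def run_with_timeout(command, timeout_seconds):
    """Run a shell command, aborting with Alarm if it exceeds the timeout."""
    signal.signal(signal.SIGALRM, alarm_handler)
    signal.alarm(timeout_seconds)                 # arm the watchdog
    try:
        child = subprocess.Popen(command, shell=True,
                                 stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = child.communicate()      # Alarm can be raised from here
        return child.returncode, stdout, stderr
    finally:
        signal.alarm(0)                           # disarm the watchdog


if __name__ == "__main__":
    print(run_with_timeout("echo hello", 5))
```

When the alarm fires, the communicate() call in Execute.runCommandWithOutput is interrupted (that is the traceback shown above), the attempt is counted as failed, and a retry is scheduled.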

germanfgv commented 4 months ago

We checked with the CERN IT folks, and they found this:

240625 10:13:26 3416068 XrdLink: Unable to send to pcms02.5322:3023@c01-018-171.gridka.de; broken pipe
240625 10:13:26 3416068 XrootdXeq: pcms02.5322:3023@c01-018-171.gridka.de disc 0:01:18 (send failure)

240625 10:15:26 time=1719303326.373500 func=_close                   level=WARN  logid=9e0b8254-32ca-11ef-b77a-a4bf0179624c unit=fst@st-096-ii201a6a.cern.ch:1095 tid=00007f715aa69640 source=XrdFstOfsFile:2173             tident=pcms02.5322:3023@c01-018-171.gridka.de sec=      uid=103031 gid=1399 name=nobody geo="" info="deleting on close" fn=/eos/cms/tier0/store/unmerged/data/Run2024F/EGamma0/RAW-RECO/EGMJME-PromptReco-v1/000/382/300/00001/f6163d72-cf0d-45b6-943e-edbb0398388e.root fstpath=/data68/0006ff9b/11160a1f0 reason="client disconnect"
240625 10:15:26 time=1719303326.373622 func=_close                   level=INFO  logid=9e0b8254-32ca-11ef-b77a-a4bf0179624c unit=fst@st-096-ii201a6a.cern.ch:1095 tid=00007f715aa69640 source=XrdFstOfsFile:2263             tident=pcms02.5322:3023@c01-018-171.gridka.de sec=      uid=103031 gid=1399 name=nobody geo="" msg="done close" rc=-1 errc=5

The file got cleaned up because the connection from the client broke down without sending the close. From the logs, the timeline goes like this:

I guess we need to figure out how to deal with these situations.

LinaresToine commented 4 months ago

Hello @amaltaro

The replay that was deployed has all the jobs failing as expected. However, the corresponding logging message from https://github.com/dmwm/WMCore/pull/12035/files does not appear in the logs.

I have put a tarball of one job in /afs/cern.ch/user/c/cmst0/public/PausedJobs/stageoutReplay/vocms015.cern.ch-551-3-log.tar.gz

I will keep looking into this.

amaltaro commented 4 months ago

@LinaresToine Antonio, if I recall correctly, there was a mistake in the replay, and that is the actual reason why the debug error message does not appear in the logs. Please let me know whether that is right, and share any logs that you might have on this.

germanfgv commented 4 months ago

@amaltaro following a suggestion from IT, I looked into the code handling the xrdcp command. The command is constructed here: https://github.com/dmwm/WMCore/blob/df77c5c6ceb54507daf066e06ae4949e08627f7f/src/python/WMCore/Storage/Backends/XRDCPImpl.py#L151

I printed out a sample command from one of our jobs:

LOCAL_SIZE=`stat -c%s "/srv/job/WMTaskSpace/cmsRun1/write_L1ScoutingSelection_L1SCOUT.root"`
echo "Local File Size is: $LOCAL_SIZE"
xrdcp --force --nopbar --cksum adler32:2d8dc345  "/srv/job/WMTaskSpace/cmsRun1/write_L1ScoutingSelection_L1SCOUT.root"  "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelection/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root?eos.app=cmst0" 
REMOTE_SIZE=`xrdfs 'eoscms.cern.ch' stat '/eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelection/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root' | grep Size | sed -r 's/.*Size:[ ]*([0-9]+).*/\1/'`
echo "Remote File Size is: $REMOTE_SIZE"
echo "Local File Checksum is: 2d8dc345"
REMOTE_XS=`xrdfs 'eoscms.cern.ch' query checksum '/eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelection/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root' | grep -i adler32 | sed -r 's/.*[adler|ADLER]32[ ]*([0-9a-fA-F]{8}).*/\1/'`
echo "Remote File Checksum is: $REMOTE_XS"
if [ $REMOTE_SIZE ] && [ $REMOTE_XS ] && [ $LOCAL_SIZE == $REMOTE_SIZE ] && [ '2d8dc345' == $REMOTE_XS ]; then exit 0; else echo "ERROR: Size or Checksum Mismatch between local and SE"; xrdfs 'eoscms.cern.ch' rm /eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelection/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root ; exit 60311 ; fi

As you can see, the code never checks the actual exit code of the xrdcp command. It assumes that, if the remote size and checksum are correct, then the copy was successful. This is probably the source of the issue: the xrdcp command is likely reporting the error, but we are ignoring it.

germanfgv commented 4 months ago

> As you can see, the code never checks the actual exit code of the xrdcp command. It assumes that, if the remote size and checksum are correct, then the copy was successful. This is probably the source of the issue: the xrdcp command is likely reporting the error, but we are ignoring it.

I propose we store the exit code of the command and add an extra check in the final conditional.
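
To make the proposal concrete, here is a minimal sketch of how the command template could be extended. The XRDCP_EXIT variable name, the error message wording, and the way the template is parametrised below are my assumptions, modeled on the command printed above rather than taken from an actual patch:

```python
# Capture xrdcp's own exit status (XRDCP_EXIT, an assumed name) and require it
# to be zero in the final test, in addition to the existing size/checksum check.
copy_command_template = """
LOCAL_SIZE=`stat -c%s "{local_pfn}"`
echo "Local File Size is: $LOCAL_SIZE"
xrdcp --force --nopbar --cksum adler32:{checksum} "{local_pfn}" "{remote_pfn}"
XRDCP_EXIT=$?
echo "xrdcp exit code: $XRDCP_EXIT"
REMOTE_SIZE=`xrdfs '{host}' stat '{remote_lfn}' | grep Size | sed -r 's/.*Size:[ ]*([0-9]+).*/\\1/'`
echo "Remote File Size is: $REMOTE_SIZE"
echo "Local File Checksum is: {checksum}"
REMOTE_XS=`xrdfs '{host}' query checksum '{remote_lfn}' | grep -i adler32 | sed -r 's/.*[adler|ADLER]32[ ]*([0-9a-fA-F]{{8}}).*/\\1/'`
echo "Remote File Checksum is: $REMOTE_XS"
if [ "$XRDCP_EXIT" == "0" ] && [ $REMOTE_SIZE ] && [ $REMOTE_XS ] && [ $LOCAL_SIZE == $REMOTE_SIZE ] && [ '{checksum}' == $REMOTE_XS ]; then
    exit 0
else
    echo "ERROR: xrdcp failure or Size/Checksum mismatch between local and SE"
    xrdfs '{host}' rm '{remote_lfn}'
    exit 60311
fi
"""

# Example rendering with placeholder values (not a real job):
print(copy_command_template.format(
    local_pfn="/srv/job/WMTaskSpace/cmsRun1/output.root",
    remote_pfn="root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/example.root?eos.app=cmst0",
    remote_lfn="/eos/cms/tier0/store/unmerged/example.root",
    host="eoscms.cern.ch",
    checksum="2d8dc345",
))
```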

LinaresToine commented 3 months ago

Hello @amaltaro @anpicci. As discussed in this week's meeting, I have used the new patchComponent.sh script developed in https://github.com/dmwm/WMCore/pull/12074 by @todor-ivanov to successfully patch the production agent with Andrea's fix in https://github.com/dmwm/WMCore/pull/12058. Thanks again.