Just want to add that this is the first time we have seen this, so it must be quite rare and likely due to some transient issue. We cannot reproduce it.
@jeyserma @LinaresToine the log provided under /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024D/FileOpenError
belongs to a Merge job, which fails to read one of the input files and returns:
CMSSW Return code: 8020
Does the log snippet above belong to the merge job? Or is it actually from the processing job that produced the unmerged root file? If the latter, can you please share those logs as well?
Hello @amaltaro
Thank you for your response. The job in charge of generating the file is a PromptReco job. I put the tarball with the logs in /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024D/FileOpenError/InputJob3173771/Job_2937895.tar.bz2
Thank you, Antonio. The following log records from wmagentJob.log look strange to me:
2024-05-21 21:33:55,247:INFO:StageOutImpl:Creating output directory...
2024-05-21 21:33:55,253:INFO:StageOutImpl:Running the stage out...
2024-05-22 04:01:53,721:ERROR:StageOutImpl:Attempt 0 to stage out failed.
Automatically retrying in 300 secs
followed by:
2024-05-22 04:06:53,823:INFO:StageOutImpl:Running the stage out...
2024-05-22 04:18:48,112:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 8332834
Remote File Size is: 8332834
Local File Checksum is: fbad12d5
Remote File Checksum is: fbad12d5
Observations are:
If I am reading the code correctly, this is the high-level layer where the stage out is set up: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/Steps/Executors/StageOut.py#L195-L215
which then calls https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/StageOutMgr.py#L235-L267
and finally it goes here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/StageOutImpl.py#L169
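For context, the retry behaviour seen in the log above ("Attempt 0 to stage out failed ... retrying in 300 secs") can be sketched roughly as follows. This is only an illustration of the loop in StageOutImpl, with a generic subprocess call standing in for the actual stage-out plugin:

```python
import subprocess
import time

def stage_out_with_retries(command, num_retries=3, retry_pause=300):
    """Rough sketch of the StageOutImpl retry loop: every failed attempt
    is logged and retried after a fixed pause, and only when all attempts
    fail is the error propagated to the layers above."""
    last_error = None
    for attempt in range(num_retries):
        try:
            subprocess.run(command, shell=True, check=True)
            return
        except subprocess.CalledProcessError as exc:
            last_error = exc
            print("Attempt %d to stage out failed.\n"
                  "Automatically retrying in %d secs" % (attempt, retry_pause))
            if attempt < num_retries - 1:
                time.sleep(retry_pause)
    raise RuntimeError("Stage out failed after %d attempts" % num_retries) from last_error
```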
Hi @amaltaro we have two more occurrences of this problem, they happened more or less at the same time (same run). The following files are missing, leading to a paused PromptReco job:
/eos/cms/tier0/store/unmerged/data/Run2024F/EGamma1/ALCARECO/HcalCalIsoTrkProducerFilter-PromptReco-v1/000/382/300/00000/85ba8d74-a4e6-4e3e-8d2a-b2d530c8e721.root
/eos/cms/tier0/store/unmerged/data/Run2024F/EGamma0/RAW-RECO/EGMJME-PromptReco-v1/000/382/300/00001/f6163d72-cf0d-45b6-943e-edbb0398388e.root
I have put the logs of the jobs that should have produced these missing files here:
/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024F/missingFiles
In these logs I don't see the same behavior that you noticed previously.
One thing to notice is that the 3 occurrences are all physics skims or AlcaReco skims. Maybe that can give us a hint to understand the problem.
Thanks, Jan
Hi @amaltaro
Thank you for the analysis provided in your previous comment. Following your comments, I thought of making an intervention in https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/StageOutImpl.py#L228 so that the job is moved to the paused state. Would you know how this can be done?
There are two problems:
1. I think that marking the job as paused can help us debug future occurrences.
2. It is also strange to me that the issue affects only a few jobs; possibly not a WMCore issue in that case, I believe.
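For illustration, a minimal sketch of the kind of intervention meant above. The exception name and exit code are hypothetical, and how the agent would actually map such a failure to the paused state is exactly the open question:

```python
class StageOutValidationError(Exception):
    """Hypothetical exception carrying a dedicated exit code, so that the
    agent-side error handling could be configured to pause the job instead
    of letting the step be reported as successful."""

    def __init__(self, message, exit_code=60311):
        super().__init__(message)
        self.exit_code = exit_code


def validate_remote_copy(local_size, remote_size, local_cksum, remote_cksum):
    """Raise a distinct, traceable error when the remote file does not
    match the local one, rather than silently accepting the transfer."""
    if remote_size != local_size or remote_cksum.lower() != local_cksum.lower():
        raise StageOutValidationError(
            "Size or checksum mismatch between local and SE: "
            "local=%s/%s remote=%s/%s" % (local_size, local_cksum,
                                          remote_size, remote_cksum))
```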
@LinaresToine Antonio, I have briefly mentioned it to @anpicci today, but I created this debug patch: https://github.com/dmwm/WMCore/pull/12035
because I would like to test the signal handling/alarm configuration in the stage out package.
Could you please start a very small T0 replay, with this patch applied, such that we can try to reproduce this problem? Note that I expect all jobs to fail, as I am intentionally hitting the Alarm exception during the stage out process.
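For reference, the alarm mechanism being exercised can be sketched in plain Python as below. This is only an illustration of a SIGALRM-based timeout around a stage-out command, not the actual code in PR #12035, and the timeout value is made up:

```python
import signal
import subprocess


class Alarm(Exception):
    """Raised when the stage-out command exceeds its allowed wall time."""


def _alarm_handler(signum, frame):
    raise Alarm("stage out command timed out")


def run_with_alarm(command, timeout_secs=4 * 3600):
    """Run a shell command and abort with an Alarm exception if it does
    not finish within timeout_secs (illustrative timeout value)."""
    signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(timeout_secs)
    try:
        return subprocess.run(command, shell=True, check=True)
    finally:
        signal.alarm(0)  # always cancel any pending alarm
```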
After checking with CERN IT folks, they found this:
240625 10:13:26 3416068 XrdLink: Unable to send to pcms02.5322:3023@c01-018-171.gridka.de; broken pipe
240625 10:13:26 3416068 XrootdXeq: pcms02.5322:3023@c01-018-171.gridka.de disc 0:01:18 (send failure)
240625 10:15:26 time=1719303326.373500 func=_close level=WARN logid=9e0b8254-32ca-11ef-b77a-a4bf0179624c unit=fst@st-096-ii201a6a.cern.ch:1095 tid=00007f715aa69640 source=XrdFstOfsFile:2173 tident=pcms02.5322:3023@c01-018-171.gridka.de sec= uid=103031 gid=1399 name=nobody geo="" info="deleting on close" fn=/eos/cms/tier0/store/unmerged/data/Run2024F/EGamma0/RAW-RECO/EGMJME-PromptReco-v1/000/382/300/00001/f6163d72-cf0d-45b6-943e-edbb0398388e.root fstpath=/data68/0006ff9b/11160a1f0 reason="client disconnect"
240625 10:15:26 time=1719303326.373622 func=_close level=INFO logid=9e0b8254-32ca-11ef-b77a-a4bf0179624c unit=fst@st-096-ii201a6a.cern.ch:1095 tid=00007f715aa69640 source=XrdFstOfsFile:2263 tident=pcms02.5322:3023@c01-018-171.gridka.de sec= uid=103031 gid=1399 name=nobody geo="" msg="done close" rc=-1 errc=5
File got cleaned because the connection from the client broke down without sending the close. From the logs, the timeline goes like this:
10:12:08,415: StageOut copy command
10:13:48,787: We checked the checksum a second later
10:15:26: The file was deleted a minute and a half later
I guess we need to figure out how to deal with these situations.
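One idea, purely as a sketch, would be to re-check that the remote file still exists some time after the copy and checksum verification, given that the server-side "deleting on close" cleanup only happened about a minute and a half later. The delay value and the use of `xrdfs stat` here are assumptions, not an agreed-upon fix:

```python
import subprocess
import time


def remote_file_still_exists(xrootd_host, remote_path, delay_secs=180):
    """Sketch of a delayed existence check: wait long enough for a possible
    server-side 'deleting on close' cleanup, then stat the file again
    before the stage out is declared successful."""
    time.sleep(delay_secs)
    result = subprocess.run(["xrdfs", xrootd_host, "stat", remote_path],
                            capture_output=True, text=True)
    return result.returncode == 0
```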
Hello @amaltaro
The replay that was deployed has all the jobs failing as expected. However, the corresponding logging message added in https://github.com/dmwm/WMCore/pull/12035/files does not show up in the job logs.
I have put a tarball of one job in /afs/cern.ch/user/c/cmst0/public/PausedJobs/stageoutReplay/vocms015.cern.ch-551-3-log.tar.gz
I will keep looking into this.
@LinaresToine Antonio, if I recall it correctly, there was a mistake in the replay and that is the actual reason why the debug error message does not appear in the logs. Please let me know whether that is correct, and share any logs that you might have on this.
@amaltaro following a suggestion from IT, I looked into the code handling the xrdcp command. The command is constructed here: https://github.com/dmwm/WMCore/blob/df77c5c6ceb54507daf066e06ae4949e08627f7f/src/python/WMCore/Storage/Backends/XRDCPImpl.py#L151
I printed out a sample command from one of our jobs:
LOCAL_SIZE=`stat -c%s "/srv/job/WMTaskSpace/cmsRun1/write_L1ScoutingSelection_L1SCOUT.root"`
echo "Local File Size is: $LOCAL_SIZE"
xrdcp --force --nopbar --cksum adler32:2d8dc345 "/srv/job/WMTaskSpace/cmsRun1/write_L1ScoutingSelection_L1SCOUT.root" "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelection/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root?eos.app=cmst0"
REMOTE_SIZE=`xrdfs 'eoscms.cern.ch' stat '/eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelection/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root' | grep Size | sed -r 's/.*Size:[ ]*([0-9]+).*/\1/'`
echo "Remote File Size is: $REMOTE_SIZE"
echo "Local File Checksum is: 2d8dc345"
REMOTE_XS=`xrdfs 'eoscms.cern.ch' query checksum '/eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelection/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root' | grep -i adler32 | sed -r 's/.*[adler|ADLER]32[ ]*([0-9a-fA-F]{8}).*/\1/'`
echo "Remote File Checksum is: $REMOTE_XS"
if [ $REMOTE_SIZE ] && [ $REMOTE_XS ] && [ $LOCAL_SIZE == $REMOTE_SIZE ] && [ '2d8dc345' == $REMOTE_XS ]; then exit 0; else echo "ERROR: Size or Checksum Mismatch between local and SE"; xrdfs 'eoscms.cern.ch' rm /eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelection/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root ; exit 60311 ; fi
As you can see, the code never checks the actual exit code of the xrdcp command. It assumes that, if the remote size and checksum are correct, then the copy was successful. This is probably the source of the issue, as the xrdcp command is likely reporting the error, but we are ignoring it.
I propose we store the exit code of the command and add an extra check on it in the final conditional.
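A minimal sketch of that proposal, assuming the script is still assembled as a shell string inside XRDCPImpl; the function name and variables here are illustrative, and this is not necessarily how the eventual fix was implemented:

```python
def build_copy_command(source_pfn, target_pfn, adler32):
    """Sketch: capture the exit code of xrdcp itself and require it to be
    zero before even running the size/checksum comparison, instead of
    relying on size and checksum alone."""
    return (
        'xrdcp --force --nopbar --cksum adler32:{cks} "{src}" "{dst}"\n'
        'XRDCP_EXIT=$?\n'
        'echo "xrdcp exit code is: $XRDCP_EXIT"\n'
        'if [ $XRDCP_EXIT -ne 0 ]; then\n'
        '  echo "ERROR: xrdcp failed with exit code $XRDCP_EXIT"\n'
        '  exit $XRDCP_EXIT\n'
        'fi\n'
        '# ... the existing LOCAL_SIZE/REMOTE_SIZE/checksum comparison follows here ...\n'
    ).format(cks=adler32, src=source_pfn, dst=target_pfn)
```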
Hello @amaltaro @anpicci. As discussed in this week's meeting, I have used the new patchComponent.sh feature, developed in https://github.com/dmwm/WMCore/pull/12074 by @todor-ivanov, to successfully patch the production agent with Andrea's fix from https://github.com/dmwm/WMCore/pull/12058. Thanks again
Impact of the bug
T0 Production Agent

Describe the bug
From the logs of the job, which I put in /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024D/FileOpenError, it seems that a job is failing during the stageOut, and the problem seems to be tied to the file root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024D/SpecialZeroBias31/RAW-RECO/LogError-PromptReco-v1/000/380/932/00000/4b8b9111-14d0-48b9-b4c5-37a48f9f706d.root. This file is not being created properly in the previous step of the job. From what @jeyserma helped us understand, there is a job that generates this file and is being marked as successful, when in reality it should be moved to paused.

How to reproduce it
Steps to reproduce the behavior: Will update this portion shortly
Additional context and error message
@jeyserma @germanfgv @amaltaro