Open germanfgv opened 3 months ago
@ccaffy @amadio could you please describe your hypothesis about the cause of this issue?
Here's what we observed on EOS side:
A first trial to copy the file to EOS failed and xrdcp retried internally via its write recovery mechanism. The retry mechanism did not close the first trial and managed to succesfully copy the file. xrdcp therefore considered the transfer as a success and closed all the connections. The first connection was therefore closed but as it failed to transfer the file properly, the file got deleted.
Disabling the write recovery on xrdcp by setting the environment variable XRD_WRITERECOVERY=0 will fix the issue.
@ccaffy could you provide logs were you see this happening? It would be good EOS logs along with WMAgent logs to get the full picture.
Sure, for the file /eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root:
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data29/00073003/118c38d44&ruid=103031&rgid=1399&td=cmsplt01.5642:8236@b9g20p8382&host=st-096-ii2019bd.cern.ch&lid=1048850&fid=4710436164&fsid=33377&ots=1724871893&otms=106&cts=1724871994&ctms=473&nrc=1295&nwc=1255&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=1315175702&wb_min=261398&wb_max=1048576&wb_sigma=22211.51&sfwdb=878632239104&sbwdb=25165824&sxlfwdb=878632239104&sxlbwdb=25165824&nfwds=1296&nbwds=1&nxlfwds=1296&nxlbwds=1&usage=98.52&iot=101366.961&idt=1501.294&lrt=0.000&lrvt=0.000&lwt=87318.619&ot=7.571&ct=12539.477&rt=5213.35&rvt=0.00&wt=4691.41&osize=0&csize=1357118742&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data11/00073003/118c38d44&ruid=103031&rgid=1399&td=1.213123:1617@st-096-ii2019bd&host=st-096-100gb-ip306-28341.cern.ch&lid=1048850&fid=4710436164&fsid=35398&ots=1724871893&otms=112&cts=1724871994&ctms=472&nrc=1295&nwc=1255&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=878632239104&sbwdb=25165824&sxlfwdb=878632239104&sxlbwdb=25165824&nfwds=1296&nbwds=1&nxlfwds=1296&nxlbwds=1&usage=15.63&iot=101359.914&idt=85518.695&lrt=0.000&lrvt=0.000&lwt=8995.287&ot=0.558&ct=6845.374&rt=6362.97&rvt=0.00&wt=8501.17&osize=0&csize=1357118742&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data11/00073003/118c38d44&ruid=103031&rgid=1399&td=3.213123:2336@st-096-ii2019bd&host=st-096-100gb-ip306-28341.cern.ch&lid=1048850&fid=4710436164&fsid=35398&ots=1724871771&otms=690&cts=1724872036&ctms=999&nrc=0&nwc=49&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=0.12&iot=265308.750&idt=264987.750&lrt=0.000&lrvt=0.000&lwt=319.672&ot=0.617&ct=0.724&rt=0.00&rvt=0.00&wt=298.37&osize=0&csize=0&delete_on_close=1&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data29/00073003/118c38d44&ruid=103031&rgid=1399&td=cmsplt01.5642:2031@b9g20p8382&host=st-096-ii2019bd.cern.ch&lid=1048850&fid=4710436164&fsid=33377&ots=1724871771&otms=689&cts=1724872036&ctms=999&nrc=0&nwc=49&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=51380224&wb_min=1048576&wb_max=1048576&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=99.94&iot=265310.062&idt=161.072&lrt=0.000&lrvt=0.000&lwt=114506.528&ot=2.208&ct=150640.255&rt=0.00&rvt=0.00&wt=189.50&osize=0&csize=0&delete_on_close=1&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
What is interesting for you is the ots
(open timestamp), cts
(close timestamp), td
(connection identifier), idt
(idle time which is the time in ms where nothing happened) and delete_on_close
which if equal to 1, the file will be deleted.
for completeness sake, this is the wmagentJob.log
extract for the stage out of this file:
2024-08-28 19:02:51,338:INFO:StageOutMgr:==>Working on file: /store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root
2024-08-28 19:02:51,339:INFO:StageOutMgr:===> Attempting 1 Stage Outs
2024-08-28 19:02:51,340:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root
PFN: root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root?eos.app=cmst0
2024-08-28 19:02:51,341:INFO:StageOutImpl:Creating output directory...
2024-08-28 19:02:51,353:INFO:StageOutImpl:Running the stage out...
2024-08-28 19:06:34,882:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 1357118742
xrdcp exit code: 0
Remote File Size is: 1357118742
Local File Checksum is: d9b2f991
Remote File Checksum is: d9b2f991
stderr:
2024-08-28 19:06:34,883:INFO:StageOutMgr:attempting stageOut
2024-08-28 19:06:34,883:INFO:StageOutMgr:===> Stage Out Successful: {'LFN': '/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root', 'PFN': 'root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2
024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root?eos.app=cmst0', 'PNN': 'T0_CH_CERN_Disk', 'StageOutCommand': 'xrdcp', 'Checksums': {'adler32': 'd9b2f991', 'cksum': '2524514726'}, 'StageOutReport': []}
We are currently running a replay with XRD_WRITERECOVERY=0
:
https://monit-grafana.cern.ch/d/t_jr45h7k/cms-tier0-replayid-monitoring?orgId=11&refresh=1m&var-Bin=5m&var-ReplayID=240829104828&var-JobType=All&var-WorkflowType=All
So far, no issues
We are currently running a replay with
XRD_WRITERECOVERY=0
This is the PR disabling write recovery on the replay: https://github.com/germanfgv/WMCore/pull/16
That's great news, thanks !
@germanfgv do you know against which storage endpoints this issue is happening? By default, write-recovery should be disabled at least at T0_CH_CERN_Disk and T2_CH_CERN, see: https://gitlab.cern.ch/SITECONF/T0_CH_CERN/-/blob/master/JobConfig/site-local-config.xml#L14 https://gitlab.cern.ch/SITECONF/T2_CH_CERN/-/blob/master/JobConfig/site-local-config.xml?ref_type=heads#L16
however, we are missing it for T2_CH_CERN_P5 and T2_CH_CERN_HLT. It might be worth it reaching out to the SST to update those site-local-config files.
We are writing to T0 Disk, so it seems this is not working properly. I've checked the final xrdcp
command and it does not include the proper env variable. This is an example of the cp command that we usually generate:
xrdcp --force --nopbar --cksum adler32:2d8dc345 "/srv/job/WMTaskSpace/cmsRun1/write_L1ScoutingSelection_L1SCOUT.root" "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelecti
on/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root?eos.app=cmst0"
This is an example of the command in the running replay:
env XRD_WRITERECOVERY=0 xrdcp --force --nopbar "/srv/job/WMTaskSpace/logArch1/logArchive.tar.gz" "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/logs/prod/2024/8/29/PromptReco_Run382686_EGamma0_Tier0_REPLAY_2024_ID240829104828_v29104828/Reco/0000/0/a96807d0-7d81-4221-8229-9288d4e1a1d3-0-0-logArchive.tar.gz?eos.app=cmst0"
Impact of the bug T0 WMAgent, Probably WMAgent too
Describe the bug This is a resurgence of #11998. We see jobs that successfully stageout their output to EOS, but the files are deleted shortly after. Capturing the exit code of
xrdcp
with #12058 did not help as the commandis successful returning 0.How to reproduce it This seems to happen randomly, however, in the last 24h, T0 production has experience 31 occurrences, including during replays.
Additional context and error message Log of the successful stage out of a later deleted file: