dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0
46 stars 107 forks source link

Files deleted by EOS after successful stageOut at T0 #12087

Open germanfgv opened 2 months ago

germanfgv commented 2 months ago

Impact of the bug T0 WMAgent, Probably WMAgent too

Describe the bug This is a resurgence of #11998. We see jobs that successfully stageout their output to EOS, but the files are deleted shortly after. Capturing the exit code of xrdcp with #12058 did not help as the commandis successful returning 0.

How to reproduce it This seems to happen randomly, however, in the last 24h, T0 production has experience 31 occurrences, including during replays.

Additional context and error message Log of the successful stage out of a later deleted file:

2024-08-28 07:39:29,312:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/data/Run2024G/TestEnablesEcalHcal/RAW/Express-v1/000/385/057/00000/448389f8-043b-4352-ab47-9c34e6d404f7.root
PFN: root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024G/TestEnablesEcalHcal/RAW/Express-v1/000/385/057/00000/448389f8-043b-4352-ab47-9c34e6d404f7.root?eos.app=cmst0

2024-08-28 07:39:29,312:INFO:StageOutImpl:Creating output directory...
2024-08-28 07:39:29,313:INFO:StageOutImpl:Running the stage out...
2024-08-28 07:41:27,626:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 48638573
xrdcp exit code: 0
Remote File Size is: 48638573
Local File Checksum is: fd4eb3b5
Remote File Checksum is: fd4eb3b5

stderr: 
germanfgv commented 2 months ago

@ccaffy @amadio could you please describe your hypothesis about the cause of this issue?

ccaffy commented 2 months ago

Here's what we observed on EOS side:

A first trial to copy the file to EOS failed and xrdcp retried internally via its write recovery mechanism. The retry mechanism did not close the first trial and managed to succesfully copy the file. xrdcp therefore considered the transfer as a success and closed all the connections. The first connection was therefore closed but as it failed to transfer the file properly, the file got deleted.

Disabling the write recovery on xrdcp by setting the environment variable XRD_WRITERECOVERY=0 will fix the issue.

germanfgv commented 2 months ago

@ccaffy could you provide logs were you see this happening? It would be good EOS logs along with WMAgent logs to get the full picture.

ccaffy commented 2 months ago

Sure, for the file /eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root:

log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data29/00073003/118c38d44&ruid=103031&rgid=1399&td=cmsplt01.5642:8236@b9g20p8382&host=st-096-ii2019bd.cern.ch&lid=1048850&fid=4710436164&fsid=33377&ots=1724871893&otms=106&cts=1724871994&ctms=473&nrc=1295&nwc=1255&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=1315175702&wb_min=261398&wb_max=1048576&wb_sigma=22211.51&sfwdb=878632239104&sbwdb=25165824&sxlfwdb=878632239104&sxlbwdb=25165824&nfwds=1296&nbwds=1&nxlfwds=1296&nxlbwds=1&usage=98.52&iot=101366.961&idt=1501.294&lrt=0.000&lrvt=0.000&lwt=87318.619&ot=7.571&ct=12539.477&rt=5213.35&rvt=0.00&wt=4691.41&osize=0&csize=1357118742&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data11/00073003/118c38d44&ruid=103031&rgid=1399&td=1.213123:1617@st-096-ii2019bd&host=st-096-100gb-ip306-28341.cern.ch&lid=1048850&fid=4710436164&fsid=35398&ots=1724871893&otms=112&cts=1724871994&ctms=472&nrc=1295&nwc=1255&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=878632239104&sbwdb=25165824&sxlfwdb=878632239104&sxlbwdb=25165824&nfwds=1296&nbwds=1&nxlfwds=1296&nxlbwds=1&usage=15.63&iot=101359.914&idt=85518.695&lrt=0.000&lrvt=0.000&lwt=8995.287&ot=0.558&ct=6845.374&rt=6362.97&rvt=0.00&wt=8501.17&osize=0&csize=1357118742&delete_on_close=0&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data11/00073003/118c38d44&ruid=103031&rgid=1399&td=3.213123:2336@st-096-ii2019bd&host=st-096-100gb-ip306-28341.cern.ch&lid=1048850&fid=4710436164&fsid=35398&ots=1724871771&otms=690&cts=1724872036&ctms=999&nrc=0&nwc=49&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=0&wb_min=0&wb_max=0&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=0.12&iot=265308.750&idt=264987.750&lrt=0.000&lrvt=0.000&lwt=319.672&ot=0.617&ct=0.724&rt=0.00&rvt=0.00&wt=298.37&osize=0&csize=0&delete_on_close=1&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0
log=1fd2c41e-6570-11ef-9007-a4bf0179624c&path=/eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root&fstpath=/data29/00073003/118c38d44&ruid=103031&rgid=1399&td=cmsplt01.5642:2031@b9g20p8382&host=st-096-ii2019bd.cern.ch&lid=1048850&fid=4710436164&fsid=33377&ots=1724871771&otms=689&cts=1724872036&ctms=999&nrc=0&nwc=49&rb=0&rb_min=0&rb_max=0&rb_sigma=0.00&rv_op=0&rvb_min=0&rvb_max=0&rvb_sum=0&rvb_sigma=0.00&rs_op=0&rsb_min=0&rsb_max=0&rsb_sum=0&rsb_sigma=0.00&rc_min=0&rc_max=0&rc_sum=0&rc_sigma=0.00&wb=51380224&wb_min=1048576&wb_max=1048576&wb_sigma=0.00&sfwdb=0&sbwdb=0&sxlfwdb=0&sxlbwdb=0&nfwds=0&nbwds=0&nxlfwds=0&nxlbwds=0&usage=99.94&iot=265310.062&idt=161.072&lrt=0.000&lrvt=0.000&lwt=114506.528&ot=2.208&ct=150640.255&rt=0.00&rvt=0.00&wt=189.50&osize=0&csize=0&delete_on_close=1&prio_c=2&prio_l=4&prio_d=1&forced_bw=0&ms_sleep=0&ior_err=0&iow_err=0&sec.prot=gsi&sec.name=cmst0&sec.host=b9g20p8382.cern.ch&sec.vorg=&sec.grps=&sec.role=&sec.info=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0&sec.app=cmst0

What is interesting for you is the ots (open timestamp), cts (close timestamp), td (connection identifier), idt (idle time which is the time in ms where nothing happened) and delete_on_close which if equal to 1, the file will be deleted.

germanfgv commented 2 months ago

for completeness sake, this is the wmagentJob.log extract for the stage out of this file:

2024-08-28 19:02:51,338:INFO:StageOutMgr:==>Working on file: /store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root
2024-08-28 19:02:51,339:INFO:StageOutMgr:===> Attempting 1 Stage Outs
2024-08-28 19:02:51,340:INFO:StageOutMgr:LFN to PFN match made:
LFN: /store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root
PFN: root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root?eos.app=cmst0

2024-08-28 19:02:51,341:INFO:StageOutImpl:Creating output directory...
2024-08-28 19:02:51,353:INFO:StageOutImpl:Running the stage out...
2024-08-28 19:06:34,882:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Local File Size is: 1357118742
xrdcp exit code: 0
Remote File Size is: 1357118742
Local File Checksum is: d9b2f991
Remote File Checksum is: d9b2f991

stderr: 
2024-08-28 19:06:34,883:INFO:StageOutMgr:attempting stageOut
2024-08-28 19:06:34,883:INFO:StageOutMgr:===> Stage Out Successful: {'LFN': '/store/unmerged/data/Run2024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root', 'PFN': 'root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2
024G/ParkingSingleMuon6/AOD/PromptReco-v1/000/384/981/00000/dbfe6a13-e382-4362-aa92-86152760503d.root?eos.app=cmst0', 'PNN': 'T0_CH_CERN_Disk', 'StageOutCommand': 'xrdcp', 'Checksums': {'adler32': 'd9b2f991', 'cksum': '2524514726'}, 'StageOutReport': []}
germanfgv commented 2 months ago

We are currently running a replay with XRD_WRITERECOVERY=0: https://monit-grafana.cern.ch/d/t_jr45h7k/cms-tier0-replayid-monitoring?orgId=11&refresh=1m&var-Bin=5m&var-ReplayID=240829104828&var-JobType=All&var-WorkflowType=All

So far, no issues

germanfgv commented 2 months ago

We are currently running a replay with XRD_WRITERECOVERY=0

This is the PR disabling write recovery on the replay: https://github.com/germanfgv/WMCore/pull/16

ccaffy commented 2 months ago

That's great news, thanks !

amaltaro commented 2 months ago

@germanfgv do you know against which storage endpoints this issue is happening? By default, write-recovery should be disabled at least at T0_CH_CERN_Disk and T2_CH_CERN, see: https://gitlab.cern.ch/SITECONF/T0_CH_CERN/-/blob/master/JobConfig/site-local-config.xml#L14 https://gitlab.cern.ch/SITECONF/T2_CH_CERN/-/blob/master/JobConfig/site-local-config.xml?ref_type=heads#L16

however, we are missing it for T2_CH_CERN_P5 and T2_CH_CERN_HLT. It might be worth it reaching out to the SST to update those site-local-config files.

germanfgv commented 2 months ago

We are writing to T0 Disk, so it seems this is not working properly. I've checked the final xrdcp command and it does not include the proper env variable. This is an example of the cp command that we usually generate:

xrdcp --force --nopbar --cksum adler32:2d8dc345  "/srv/job/WMTaskSpace/cmsRun1/write_L1ScoutingSelection_L1SCOUT.root"  "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/Run2024F/L1ScoutingSelecti
on/L1SCOUT/v1/000/383/219/00000/16e6fec3-c3fa-422e-a474-f3f9724f4279.root?eos.app=cmst0" 

This is an example of the command in the running replay:

env XRD_WRITERECOVERY=0 xrdcp --force --nopbar  "/srv/job/WMTaskSpace/logArch1/logArchive.tar.gz"  "root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/logs/prod/2024/8/29/PromptReco_Run382686_EGamma0_Tier0_REPLAY_2024_ID240829104828_v29104828/Reco/0000/0/a96807d0-7d81-4221-8229-9288d4e1a1d3-0-0-logArchive.tar.gz?eos.app=cmst0" 
amadio commented 2 months ago

FYI https://its.cern.ch/jira/browse/EOS-6195