FYI: @amaltaro , @haozturk
@vkuznet thanks for creating this issue. @haozturk FYI
Here is my suggestion on how to debug it:
Another option is to use the relational database, using your preferred SQL client (sqlplus, SQLDeveloper, etc). If you use sqlplus, you can type this command in the agent:
$manage db-prompt wmagent
and I think the table that we want to look at is dbsbuffer_file.
Here is a very simple procedure to check DBS vs Rucio files:

First, define the block name (replacing # with %23 in the block name), e.g.:

blk=/Muon/Run2022D-ZMu-10Dec2022-v1/RAW-RECO%23a63aafe6-480d-45d8-9d83-668e6172c271

Then query DBS, using the jq tool to parse the JSON output and extract the LFN names:

scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/files?block_name=$blk" | jq '.[].logical_file_name'

For Rucio, use the following client script (saved here as rucio_curl):
#!/bin/bash
# obtain a Rucio auth token from x509 credentials and call the given Rucio URL
url=$1
account=das
cert=/etc/proxy/proxy
ckey=/etc/proxy/proxy
if [ ! -f $ckey ]; then
    ckey=$HOME/.globus/userkey.pem
    cert=$HOME/.globus/usercert.pem
fi
aurl=https://cms-rucio-auth.cern.ch/auth/x509
if [ -n "$(echo $url | grep rucio-int)" ]; then
    aurl=https://cms-rucio-auth-int.cern.ch/auth/x509
fi
opt="-s -L -k --key $ckey --cert $cert"
token=$(curl $opt -v -H "X-Rucio-Account: $account" $aurl 2>&1 | grep "X-Rucio-Auth-Token:" | sed -e "s,< X-Rucio-Auth-Token: ,,g")
curl $opt -H "X-Rucio-Auth-Token: $token" -H "X-Rucio-Account: $account" "$url"
and here is how I run it:
rucio_curl "http://cms-rucio.cern.ch/replicas/cms/Muon/Run2022D-ZMu-10Dec2022-v1/RAW-RECO%23a63aafe6-480d-45d8-9d83-668e6172c271" | jq '.name'
So, if you run both and count the number of files, you will see the difference, e.g.:
scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/files?block_name=$blk" | jq '.[].logical_file_name' | sort -u > /tmp/dbs.blocks
rucio_curl "http://cms-rucio.cern.ch/replicas/cms/Muon/Run2022D-ZMu-10Dec2022-v1/RAW-RECO%23a63aafe6-480d-45d8-9d83-668e6172c271" | jq '.name' | sort -u > /tmp/rucio.blocks
diff -u /tmp/dbs.blocks /tmp/rucio.blocks
--- /tmp/dbs.blocks	2023-04-18 09:18:34.000000000 -0400
+++ /tmp/rucio.blocks	2023-04-18 09:18:26.000000000 -0400
@@ -1,4 +1,3 @@
-"/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root"
 "/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/0866682d-44b9-4f3a-8b5b-4b4baab4c137.root"
 "/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/0c1860fd-a94a-4ec7-843b-453c3a4bec3e.root"
 "/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/10f22a56-6424-4559-95d0-c505ddefc7b0.root"
The missing LFN with hash `025d55a6-8e48-4695-9087-80bf9681229b` does not exist in Rucio.
I communicated with @ericvaandering, who gave me the clue. According to Rucio, the file which is not present in Rucio does not have an associated block (i.e. it is not attached to the block).
# here is an LFN which is attached to the CMS block, a Rucio dataset
rucio list-parent-dids cms:/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/0866682d-44b9-4f3a-8b5b-4b4baab4c137.root
+------------------------------------------------------------------------------------+--------------+
| SCOPE:NAME                                                                         | [DID TYPE]   |
|------------------------------------------------------------------------------------+--------------|
| cms:/Muon/Run2022D-ZMu-10Dec2022-v1/RAW-RECO#a63aafe6-480d-45d8-9d83-668e6172c271  | DATASET      |
+------------------------------------------------------------------------------------+--------------+
# while here is an LFN which has no attachment
rucio list-parent-dids cms:/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root
Then I checked the status of these LFNs in DBS and found the same confirmation, i.e. the LFN /store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/0866682d-44b9-4f3a-8b5b-4b4baab4c137.root has valid status, while the LFN /store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root is not valid.
My conclusion is that invalidated files do not appear in Rucio.
I further investigated the Unified code base and found a particular flaw in their function usage, which depends on an internal cache. Please see full details over here: https://gitlab.cern.ch/CMSToolsIntegration/operations/-/issues/41#note_6636304. In summary, it seems to me that Unified stores DBS results in a cache used in multiple places, and by the time it finally compares the DBS vs Rucio file lists, some files may have been invalidated by the underlying workflow or data-ops, leading to a file mismatch.
I'll be happy to check another block when P&R provides one, but at the moment I do not see any issues with WMCore/agent/RucioInjector.
@vkuznet Valentin, regardless of a stale cache or not in Unified, didn't you confirm that there actually is a file in Rucio which has not been attached to its parent block? If this is true, then there must be a flaw as well, either in:
BTW, here is our Rucio implementation for adding a replica to Rucio server plus attaching it to a block: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Rucio/Rucio.py#L381
We might have to chase this file/block in the agent logs to see if any error/exception was raised when dealing with it in RucioInjector.
Unless "invalidation" (who/where does that) is not deleting the Rucio DID but only detaching it.
Alan, how can I find which agent was used for an archived workflow? I looked up the one provided in the description and found it here: https://cmsweb.cern.ch/reqmgr2/fetch?rid=request-pdmvserv_Run2022D_Muon_10Dec2022_221209_230856_3545 but I can't find it in wmstats using this WF name.
Meanwhile, according to the WMCore RucioInjector code base https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Rucio/Rucio.py#L404-L414 and using logs from one of the CERN agents (vocms0255), I do see a few errors like this:
2023-02-28 17:28:25,353:140522088707840:ERROR:Rucio:Failed to add replicas for: [{'name': '/store/mc/RunIISummer20UL16RECO/WplusH_HTo2LongLivedTo2G2Q_MH-125_MFF-7_ctau-20cm_TuneCP5_fixedLifetime_13TeV-powheg-pythia8/AODSIM/106X_mcRun2_asymptotic_v13-v2/2550000/A9995941-D344-2447-AEC2-A4B0FF2F1810.root', 'scope': 'cms', 'bytes': 1744683405, 'state': 'A', 'adler32': '37f4804d'}] and block: /WplusH_HTo2LongLivedTo2G2Q_MH-125_MFF-7_ctau-20cm_TuneCP5_fixedLifetime_13TeV-powheg-pythia8/RunIISummer20UL16RECO-106X_mcRun2_asymptotic_v13-v2/AODSIM#05edac39-5ba2-4d35-b29f-00ccd0422a1a. Error: An unknown exception occurred
which suggests that RucioInjector can fail and the reason is unknown (or not propagated well from the Rucio client). That said, we may perform an inverse look-up and check whether P&R has a file mismatch in this block /WplusH_HTo2LongLivedTo2G2Q_MH-125_MFF-7_ctau-20cm_TuneCP5_fixedLifetime_13TeV-powheg-pythia8/RunIISummer20UL16RECO-106X_mcRun2_asymptotic_v13-v2/AODSIM#05edac39-5ba2-4d35-b29f-00ccd0422a1a or any other block we can extract from the agent logs.
@amaltaro , another suggestion: by inspecting the reqmgr2 LogDB I found that for a given WF we do have pretty extensive JSON meta-data, but it lacks one particularly important element: the name of the agent which processed this WF. Should we open an issue for that, to make debugging easier?
The only way to know which agents worked on a given request is through wmstats; at the same time, wmstats only contains data for active workflows. The workflow you are looking at has already been archived (normal-archived), so we cannot easily know which agents processed it. I think Hasan mentioned many workflows in the same situation, so I'd suggest picking one or two that have run in the past weeks (and that are not yet archived).
This message that you pasted, "Error: An unknown exception occurred", suggests that no error details were returned from the Rucio server. Still, WMAgent is stateful and it will keep retrying to inject that replica and attach it in the coming component cycles.
Regarding the reqmgr2 LogDB, I feel like the way it's used is sub-optimal, and from time to time the database gets slower and queries might time out. I believe it's also meant to be a temporary database only for active workflows. We might have to revisit its usage and how data is organized in it before expanding it with further use-cases. But I agree that the agent information is very much needed!
@amaltaro , while awaiting active workflows from P&R, I inspected the list of the ones they provided in their ticket and found none of them available in wmstats; either they do not exist or they are completed. That said, I also inspected the WMCore Rucio.py code base and would like to provide general feedback about this issue:
The createReplicas function in Rucio.py does not return a concrete status, only success or failure, and only raises an exception. Therefore, it is not feasible to trace down what is wrong with the Rucio injection. I would rather suggest modifying this function to return a JSON record, instead of a traceback, to the upstream code, such that the error record can be propagated to the WF JSON (FWJR) and represent the actual error, which would ease debugging of issues like this one. A concrete example is the DBS server, which returns a full error along with its error code: the error includes the function, the database error, and auxiliary information. Here we need a similar concept. To make it workable, createReplicas would have to return such an error, including the Rucio error info, the actual traceback, and the input parameters.

Thanks Valentin. Then we need to wait for P&R to provide a workflow that has this issue and which is still active.
Regarding the Rucio.py implementation, we need to check what is actually returned from the server - through Rucio client - to see which information we can pass upstream.
Another option would be to actually separate the "rucio replica injection" from the "rucio replica attachment", making them two different methods such that we can properly see what the outcome of each is.
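For illustration, here is a minimal sketch of what such a split could look like using the standard rucio.client API (add_replicas and attach_dids); the function names and the structure of the returned record are placeholders, not current WMCore code:

from rucio.client import Client
from rucio.common.exception import RucioException

client = Client(account="wma_prod")  # account name is an assumption; relies on a local rucio.cfg

def insertReplicas(rse, files):
    """Step 1: register file replicas at the given RSE; return a structured outcome."""
    try:
        client.add_replicas(rse=rse, files=files)
        return {"status": "ok", "api": "add_replicas", "rse": rse}
    except RucioException as exc:
        return {"status": "error", "api": "add_replicas", "rse": rse,
                "input": files, "reason": str(exc)}

def attachFiles(blockName, files):
    """Step 2: attach the already-registered files to their block (Rucio dataset)."""
    dids = [{"scope": "cms", "name": item["name"]} for item in files]
    try:
        client.attach_dids(scope="cms", name=blockName, dids=dids)
        return {"status": "ok", "api": "attach_dids", "block": blockName}
    except RucioException as exc:
        return {"status": "error", "api": "attach_dids", "block": blockName,
                "input": dids, "reason": str(exc)}

With two calls and two structured records, a failure in the attachment step would no longer be indistinguishable from a failure in the injection step.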
Note that createReplicas is only used by RucioInjector, given that it is the only component registering DIDs in Rucio.
Since in P&R ticket there is restriction, I am pasting what we have from other ticket.
Valentin checked this workflow pdmvserv_task_BTV-Run3Summer22EEGS-00009__v1_T_221223_125301_7115 as completed in the FNAL submit8 agent.
An example of a file that is in DBS but not in rucio: /store/mc/Run3Summer22EEMiniAODv3/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v1/80000/aede32b8-d95a-44f0-aa06-2f21dbda371b.root
The corresponding block is: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
Alan did check the workflow which produced the MINIAODSIM above, which was: pdmvserv_task_BTV-Run3Summer22EEGS-00009__v1_T_221223_125301_7115
Looking into the output MINIAODSIM dataset, here is what one of the WMAgent utility scripts says:
cmst1@vocms0255:/data/srv/wmagent/current $ python3 apps/wmagentpy3/bin/adhoc-scripts/checkDsetFileCount.py /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM
...
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: *** Dataset: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: Rucio file count : 1974
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: DBS file count : 2016
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: - valid files : 2006
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: - invalid files : 10
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: Blocks in Rucio but not in DBS: set()
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: Blocks in DBS but not in Rucio: set()
2023-04-25 21:11:27,899:WARNING:checkDsetFileCount: Block with file mismatch: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
2023-04-25 21:11:27,899:WARNING:checkDsetFileCount: Rucio: 0 DBS: 41
2023-04-25 21:11:27,899:WARNING:checkDsetFileCount: Block with file mismatch: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#956bfed7-3d2b-401d-b8f7-f13a86176ec5
2023-04-25 21:11:27,899:WARNING:checkDsetFileCount: Rucio: 7 DBS: 8
so one of the blocks is said not to have any files in Rucio, while DBS has 41 files, confirmed with: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filesummaries?block_name=/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM%232ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
IF I grep the submit8 RucioInjector logs, I can find that this block was reported as being successfully injected in Rucio back in January (AND it reports the very same 41 files):
[cmsdataops@cmsgwms-submit8 current]$ grep 'MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf' install/wmagentpy3/DBS3Upload/ComponentLog
2023-01-08 07:10:07,090:140090107418368:INFO:DBSUploadPoller:Queueing block for insertion: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
2023-01-08 07:10:07,110:140090107418368:INFO:DBSUploadPoller:About to call insert block for: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
[cmsdataops@cmsgwms-submit8 current]$
[cmsdataops@cmsgwms-submit8 current]$ grep 'MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf' install/wmagentpy3/RucioInjector/ComponentLog
2023-01-06 07:11:17,939:140090104010496:INFO:RucioInjectorPoller:Block /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf inserted into Rucio
2023-01-06 07:11:40,684:140090104010496:INFO:RucioInjectorPoller:Successfully inserted 41 files on block /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
2023-01-06 07:12:17,958:140090104010496:INFO:RucioInjectorPoller:Block rule created for block: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf, at: T2_US_Florida, with rule id: d6326e6f122540b9b8f92e07b5912b03
2023-01-08 07:11:53,813:140090104010496:INFO:RucioInjectorPoller:Closing block: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
Alan still needs to look into the Rucio server to see if it actually knows any of the DIDs (files) in this block. To be done next.
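As a rough illustration of that check (not necessarily the tool that will be used), the Rucio Python client can list the files attached to the block DID and the replicas known for each:

from rucio.client import Client

client = Client()  # relies on a local rucio.cfg for server and credentials
block = "/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf"

# files currently attached to the block DID
attached = [f["name"] for f in client.list_files(scope="cms", name=block)]
print("files attached to block:", len(attached))

# replicas known to Rucio for the block contents
for rep in client.list_replicas([{"scope": "cms", "name": block}]):
    print(rep["name"], list(rep["rses"]))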
From Alan:
FYI. I asked for feedback in the #DM Ops Mattermost channel and Jhonatan seemed to have found all those files in Rucio: https://mattermost.web.cern.ch/cms-o-and-c/pl/4mg45jykbbn8jdbqxydn64f58y
I still have follow up questions and we still need to understand what that means and why that happened. Anyhow, feel free to follow this up over there as well.
@z4027163 thank you for updating this ticket! Sorry that I had to miss the DM meeting; is there anything that we in WM are expected to do next? Or what is the next step to properly understand what happened here?
@amaltaro They are aware of the issue; we couldn't figure out the exact problem in the meeting. Eric/Diago will check the ticket for more details. We are waiting for their feedback; maybe there will be follow-ups to do from the WMCore side after that.
I think we said we'd monitor the ticket, not that there was anything I could look into. Jhonathan's posts in MM show that these file DIDs are getting created in Rucio, right? And they are being attached to the block DIDs. So the mystery is how they are being unattached.
@ericvaandering Ah yes, I misunderstood. I thought you guys would be the right people to check it. Who should we ask to look into it? I had the impression it was outside wmcore's reach.
Well, I did just take a look at the Rucio code to see WHEN files are detached from blocks. It looks like it can happen if a file is declared bad AND it is the last copy of that file: https://github.com/rucio/rucio/blob/e595093ef688422a9a3d8823570d7826369ab22e/lib/rucio/core/rule.py#L1932
@ivmfnal would it be possible to search the consistency checking logs for some of these files (is there a sample list somewhere?) to see if this is what's happening?
@ericvaandering , I certainly do not know the details of Rucio, but looking at the code I find the following block suspicious:
def update_rules_for_lost_replica(scope, name, rse_id, nowait=False, session=None, logger=logging.log):
    ...
    for dts in datasets:
        logger(logging.INFO, 'File %s:%s bad at site %s is completely lost from dataset %s:%s. Will be marked as LOST and detached', scope, name, rse, dts['scope'], dts['name'])
        rucio.core.did.detach_dids(scope=dts['scope'], name=dts['name'], dids=[{'scope': scope, 'name': name}], session=session)
Here we have two parameters: name, which is passed to the API and which I think is the LFN name, and dts['name'], which I would assume is the name of the dataset/block. So, is it possible that an LFN belongs to different datasets/blocks and can be discarded from the wrong dataset/block? Or, is it possible that we pass an LFN from a CMS dataset but it is present in multiple CMS blocks, and therefore, if it is discarded from a single CMS block, it will be LOST for the entire dataset?
Well, in the Rucio model, block:lfn (to use CMS terms) is a many-to-many relationship, not one-to-many. However, we still use (but don't enforce) 1:many.
However, if a replica is lost and it's the only replica, that means the file is lost forever and the right thing to do IS to remove it from all the blocks (Rucio datasets) it belongs to.
Based on the workflow that we have been investigating, in this comment: https://github.com/dmwm/WMCore/issues/11556#issuecomment-1531485494
I can list all the 41 files in that CMS block with the following command:
amaltaro@lxplus797:~ $ rucio list-content-history "cms:/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf"
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| SCOPE:NAME | [DID TYPE] |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------|
| cms:/store/mc/Run3Summer22EEMiniAODv3/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v1/80000/9d1b582f-0825-4c77-a468-6c898f8505fe.root | FILE |
| cms:/store/mc/Run3Summer22EEMiniAODv3/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v1/80000/28eb066c-da6c-4b40-a46c-671ab356ebee.root | FILE |
and the second block - with 8 files in DBS but only 7 in Rucio - is even more interesting. The list-content-history API returns only 1 file, while list-content returns the remaining 7 files, e.g.:
amaltaro@lxplus797:~ $ rucio list-content-history "cms:/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#956bfed7-3d2b-401d-b8f7-f13a86176ec5"
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| SCOPE:NAME | [DID TYPE] |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------|
| cms:/store/mc/Run3Summer22EEMiniAODv3/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v1/80001/ae8b90c7-64c7-4331-b18d-be6e1133fa73.root | FILE |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
given that we can list files by querying the CMS block, I believe it is not a matter of files being left detached from the CMS block. If I recall correctly, using the Python client APIs we can even collect further details for these files; last time I checked, they were deleted 4 or 5 days after injection into the Rucio server.
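For reference, here is a minimal sketch of that kind of look-up with the Python client APIs; the exact fields returned by list_content_history may vary between Rucio versions, hence the defensive get() calls:

from rucio.client.didclient import DIDClient

didClient = DIDClient()  # relies on a local rucio.cfg
block = "/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#956bfed7-3d2b-401d-b8f7-f13a86176ec5"

# names still attached vs everything that was ever attached
current = {f["name"] for f in didClient.list_content(scope="cms", name=block)}
for item in didClient.list_content_history(scope="cms", name=block):
    status = "attached" if item["name"] in current else "detached"
    # timestamp fields are version dependent, hence the defensive get()
    print(item["name"], status, item.get("created_at"), item.get("deleted_at"))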
IF we cannot answer what/who deleted those files back then, then I'd suggest considering this issue at our earliest convenience: https://github.com/dmwm/CMSRucio/issues/495. Otherwise we won't be able to debug and identify which system is deleting these files.
@ericvaandering yes, there are files with "missing" lists for each run:
[root@consistency-rucio-consistency-6cc7b4c985-kx5d4 consistency-dump]# ls -l *_M_action.list
-rw-r--r--. 1 root root 0 Mar 18 14:27 T0_CH_CERN_Disk_2023_03_18_14_00_M_action.list
-rw-r--r--. 1 root root 0 Mar 25 14:30 T0_CH_CERN_Disk_2023_03_25_14_00_M_action.list
-rw-r--r--. 1 root root 0 Mar 31 23:15 T0_CH_CERN_Disk_2023_03_31_22_48_M_action.list
-rw-r--r--. 1 root root 0 Apr 13 17:09 T0_CH_CERN_Disk_2023_04_13_16_29_M_action.list
-rw-r--r--. 1 root root 0 Apr 20 17:10 T0_CH_CERN_Disk_2023_04_20_16_29_M_action.list
-rw-r--r--. 1 root root 0 Apr 27 16:57 T0_CH_CERN_Disk_2023_04_27_16_29_M_action.list
-rw-r--r--. 1 root root 0 Mar 22 11:53 T1_DE_KIT_Disk_2023_03_18_13_32_M_action.list
-rw-r--r--. 1 root root 3213200 Mar 26 11:14 T1_DE_KIT_Disk_2023_03_25_13_33_M_action.list
-rw-r--r--. 1 root root 0 Mar 24 15:18 T1_ES_PIC_Disk_2023_03_24_01_57_M_action.list
-rw-r--r--. 1 root root 0 Apr 2 17:33 T1_ES_PIC_Disk_2023_04_01_15_15_M_action.list
-rw-r--r--. 1 root root 0 Apr 10 04:13 T1_ES_PIC_Disk_2023_04_09_15_49_M_action.list
-rw-r--r--. 1 root root 0 Apr 17 04:37 T1_ES_PIC_Disk_2023_04_16_15_49_M_action.list
-rw-r--r--. 1 root root 0 May 1 02:01 T1_ES_PIC_Disk_2023_04_30_15_49_M_action.list
-rw-r--r--. 1 root root 0 Mar 20 12:57 T1_FR_CCIN2P3_Disk_2023_03_18_14_25_M_action.list
-rw-r--r--. 1 root root 0 Apr 2 04:15 T1_FR_CCIN2P3_Disk_2023_03_30_19_36_M_action.list
-rw-r--r--. 1 root root 0 Mar 24 04:23 T1_FR_CCIN2P3_Tape_2023_03_24_03_24_M_action.list
-rw-r--r--. 1 root root 0 Mar 28 19:16 T1_FR_CCIN2P3_Tape_2023_03_28_18_39_M_action.list
-rw-r--r--. 1 root root 0 Apr 4 19:31 T1_FR_CCIN2P3_Tape_2023_04_04_18_39_M_action.list
...
Let me propose a hypothesis. It would be good to have a more recent example where we can check to prove or disprove parts of this hypothesis.
Now, if this is what happens, then I would say everything is "correct", except that DBS has to be updated too to deal with the loss of a file.
However, IF step 3/4 is incorrect and the file is really OK, then that is a problem.
@ericvaandering , if I use your logic, then I doubt step 6 is correct, since it is not up to Rucio to decide whether a file belongs to a block. I think step 6 requires Rucio to consult DBS on whether the file is no longer valid, and only then may it detach it from the block. I think we are facing an issue of primary authority here. In my view, DBS is the ultimate source of information about the dataset/block/file relationship, and if such a relationship exists, all other tools should follow it; otherwise we will end up with inconsistencies. This also aligns with your final statement that DBS needs to be updated at some point, and if it is, Rucio should always consult DBS about a file's association to a dataset/block.
@ericvaandering , related to your hypothesis, I have a question: will a suspended rule interfere with that logic as follows?
I agree with @vkuznet that Rucio should not remove the file from a dataset just because the file has no replicas, but it does: https://github.com/rucio/rucio/blob/e595093ef688422a9a3d8823570d7826369ab22e/lib/rucio/core/rule.py#L1932
Hello. We (@amarique1 and I) were discussing this issue and worked out the following scenario a couple of days ago, where we concluded that suspended rules seem to be the culprit. To test it, we asked @amaltaro to deploy an agent injecting data into a CMS block. Once the block was created and the file was attached, I forced that rule into the suspended state [1]. The workflow was still running and new files were attached, but in this case the rule locking the data into the rse_expression was not updated [2]. This seems to be expected, given that the only way to unsuspend a rule is to manually update it to stuck, after which the daemon will pick up the rule and update it with the new files that must be locked.
[1]
rucio --auth-host https://cms-rucio-auth-int.cern.ch --host http://cms-rucio-int.cern.ch rule-info 33e1c1d3330b4d23b6bde943a0cae2e8
Id: 33e1c1d3330b4d23b6bde943a0cae2e8
Account: wma_test
Scope: cms
Name: /VBF_HHTo2G2Tau_CV_1_C2V_1_C3_0_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL16MiniAODAPVv2-HIG_RunIISummer20UL16MiniAODAPVv2_02457_0_SC_6Steps_PU_issue_11349_v5-v11/MINIAODSIM#fee48763-26a9-4f4e-a449-4637da31b41f
RSE Expression: T1_US_FNAL_Disk
Copies: 1
State: SUSPENDED
Locks OK/REPLICATING/STUCK: 1/0/0
[2]
rucio --auth-host https://cms-rucio-auth-int.cern.ch --host http://cms-rucio-int.cern.ch list-files cms:/VBF_HHTo2G2Tau_CV_1_C2V_1_C3_0_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL16MiniAODAPVv2-HIG_RunIISummer20UL16MiniAODAPVv2_02457_0_SC_6Steps_PU_issue_11349_v5-v11/MINIAODSIM#fee48763-26a9-4f4e-a449-4637da31b41f | tail -n 2
Total files : 3
Total size : 5.354 GB
I'm not the expert, but if I'm not wrong, Rucio will move a rule from stuck to suspended automatically if the rule stays in the stuck state for more than 2 weeks.
In the end, it seems that if new files are attached to a CMS dataset/block and a suspended rule is locking that CMS dataset/block, the locks are not effective, the rule is not satisfied, and those new files can potentially be deleted by CC or the reaper. We will probably only know who deleted them if https://github.com/dmwm/CMSRucio/issues/495 is solved.
Adding a child rule locking the data where the workflow is running could solve this issue, but other ideas are appreciated.
@jhonatanamado thank you for testing it and updating this ticket.
Adding a child rule locking the data where the workflow is running could solve this issue, but other ideas are appreciated.
I am not sure what you mean by "child rule" here. Can you please elaborate? Note that the default rule created by WMAgent is made on a CMS block basis, locking blocks being produced at the origin site/RSE. Sometimes workflows will be configured to have a CMS dataset (container) rule as well; those are created for the whole dataset and the RSE expression is pretty generic (basically any Disk T1/T2).
It would be great if you could clarify how long it can take for a rule to be set to SUSPENDED as well (minimum/maximum time period). This is one of the pieces that I have not managed to make sense of yet.
Hi @amaltaro ,
It would be great if you could clarify how long it can take for a rule to be set to SUSPENDED as well (minimum/maximum time period). This is one of the pieces that I have not managed to make sense of yet.
Well, here is where Rucio updates the rule to SUSPENDED:
if rule.stuck_at < (datetime.utcnow() - timedelta(days=14)):
    rule.state = RuleState.SUSPENDED
    insert_rule_history(rule=rule, recent=True, longterm=False, session=session)
    logger(logging.INFO, 'Replication rule %s has been SUSPENDED', rule_id)
    return
and it will keep that state unless someone/something updates the rule to stuck/replicating. And here is where no new DIDs will be attached to a suspended rule:
rules = session.query(models.ReplicationRule).filter(
    or_(*rule_clauses),
    models.ReplicationRule.state != RuleState.SUSPENDED,
    models.ReplicationRule.state != RuleState.WAITING_APPROVAL,
    models.ReplicationRule.state != RuleState.INJECT).with_for_update(nowait=True).all()
But if wma_prod is creating rules at the CMS block level where the workflow is running, then it is something else that is detaching the files. We will continue investigating.
But if wma_prod is creating rules at the CMS block level where the workflow is running, then it is something else that is detaching the files. We will continue investigating.
@jhonatanamado Jhonatan, but didn't you prove that, if files get added to a Rucio dataset that has a SUSPENDED rule, those files will not be considered by the rule (in any of the 3 states: ok/stuck/replicating)? Thus leaving such files eligible for deletion by one of the Rucio daemons.
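For illustration, here is a hedged sketch of how one could flag such blocks with the standard Rucio Python client; treating a block whose only rules are SUSPENDED as unprotected is a simplification of the lock semantics discussed above:

from rucio.client import Client

client = Client()  # relies on a local rucio.cfg

def unprotectedFiles(block):
    """Return the files of a block whose rules are all SUSPENDED (if any)."""
    states = {rule["state"] for rule in client.list_did_rules(scope="cms", name=block)}
    if states and states <= {"SUSPENDED"}:
        # no healthy rule left: attached files are candidates for deletion
        return [f["name"] for f in client.list_files(scope="cms", name=block)]
    return []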
In the relevant email thread, @ericvaandering asked for a recent example. The most recent I could find is from April:
/EGamma/Run2018D-UL2018_MiniAODv2_GT36-v3/MINIAOD
/store/data/Run2018D/EGamma/MINIAOD/UL2018_MiniAODv2_GT36-v3/2550001/668DA1AD-8A6F-044E-91D1-02312875AC4D.root
/store/data/Run2018D/EGamma/MINIAOD/UL2018_MiniAODv2_GT36-v3/2550001/2137D153-48A6-D245-BF72-78183B05A1E6.root
/store/data/Run2018D/EGamma/MINIAOD/UL2018_MiniAODv2_GT36-v3/2550001/A286FE68-F229-044F-ACD4-FAD7C8ACA2C1.root
Hope this is recent enough. If new issues pop up, I'll share them with you.
hi @amaltaro .
but didn't you prove that, if files get added to a Rucio dataset that has a SUSPENDED rule, those files will not be considered by the rule (in any of the 3 states: ok/stuck/replicating)? Thus leaving such files eligible for deletion by one of the Rucio daemons.
Yes, indeed we proved that if files get added to a Rucio dataset or Rucio container that has a SUSPENDED rule, those files are eligible for deletion by one of the Rucio daemons or CC. But I did not consider that production is creating Rucio dataset rules at the site where the workflow is running. So those files are attached to a Rucio dataset rule whose rse_expression is the same place where the workflow is running; no transfers are needed, so in principle the rule will be in state OK all the time.

Now, I don't know what Rucio does for unique replica files that need to be transferred, i.e. will a unique file be detached from the content if it has been replicated several times and the transfers fail repeatedly for a long time, leaving suspended rules?
Thanks @haozturk. Indeed those files gave me enough clues of what happened (at least for those particular files).
Rucio has an active rule for that Rucio dataset:
ID ACCOUNT SCOPE:NAME STATE[OK/REPL/STUCK] RSE_EXPRESSION COPIES EXPIRES (UTC) CREATED (UTC)
-------------------------------- --------- ------------------------------------------------------------------------------------------ ---------------------- ---------------- -------- --------------- -------------------
81fb482053ca4ae298eb2ab285dfc480 wma_prod cms:/EGamma/Run2018D-UL2018_MiniAODv2_GT36-v3/MINIAOD#b842335e-27ba-41c6-bd37-f0234bbf31fc OK[99/0/0] T2_HU_Budapest 1 2023-04-04 19:20:56
The files were declared as missing by CC at T2_HU_Budapest on 2023_04_08_08_06 (here is the link). In general terms, if a file is declared as missing, CC will then declare it as "bad" to Rucio, and Rucio will re-replicate it.

If the file is unique (which seems to be the case here, because the file was generated at Budapest), it will be detached from its datasets and containers because it cannot be restored. Now the question is: why were those files declared as missing? @ivmfnal @ericvaandering Any ideas?
Thanks for this, this is illuminating (I think).
As far as I can tell they were declared missing because ... they were missing.
I've gone through the dark lists at the site after April 8 and these files never show up. If the CC erroneously declared the files missing, but they were not missing, you'd expect them to appear on the dark list some time later.
So I would say https://github.com/dmwm/WMCore/issues/11556#issuecomment-1532147169 seems like the likely explanation (or at least fits all the data I am aware of).
Thanks @ericvaandering . I went through your suggestion and indeed you are correct. The file was missing in the first place. At least that's what FTS shows.
So let's recap for this particular file.
/store/data/Run2018D/EGamma/MINIAOD/UL2018_MiniAODv2_GT36-v3/2550001/2137D153-48A6-D245-BF72-78183B05A1E6.root
From content_hist we can get when the file was created and deleted (detached).
{'scope': 'cms', 'name': '/store/data/Run2018D/EGamma/MINIAOD/UL2018_MiniAODv2_GT36-v3/2550001/2137D153-48A6-D245-BF72-78183B05A1E6.root',
'type': 'FILE', 'bytes': 2152115154, 'adler32': 'a7e91744', 'md5': None, 'deleted_at': '2023-04-08 08:21:51', 'created_at': '2023-04-05 03:45:01',
'updated_at': '2023-04-05 03:45:13'}
We can agree that the file was detached because CC marked it as missing at Budapest.
Checking the historical behavior of transfers for that particular file (request_hist table), transfers started to fail with a "File not found" error from T2_HU_Budapest to T1_US_FNAL_Disk just a couple of minutes after the file was created. The transfer started at 'started_at': '2023-04-05 03:46:16' and returned "File not found" after a couple of seconds.
{'id': 'bb16fc7d6f1b4c84b3f0c30d6ad3dca2', 'request_type': 'T', 'scope': 'cms', 'name': '/store/data/Run2018D/EGamma/MINIAOD/UL2018_MiniAODv2_GT36-v3/2550001/2137D153-48A6-D245-BF72-78183B05A1E6.root',
'did_type': 'F', 'dest_rse_id': '087ee3383b9d45f6b31814af07b2c56d', 'source_rse_id': '9f13e775f8104689956f75b7695bf586',
'attributes': '{"activity": "Data rebalancing", "source_replica_expression": null, "lifetime": null, "ds_scope": "cms", "ds_name": "/EGamma/Run2018D-UL2018_MiniAODv2_GT36-v3/MINIAOD#b842335e-27ba-41c6-bd37-f0234bbf31fc", "bytes": 2152115154, "md5": null, "adler32": "a7e91744", "priority": 3, "allow_tape_source": true}',
'state': 'F', 'external_id': '61da3224-d364-11ed-9990-3cecef436a36', 'external_host': 'https://cmsfts3.fnal.gov:8446', 'retry_count': 0,
'err_msg': 'RequestErrMsg.TRANSFER_FAILED:SOURCE [2] Result HTTP 404 : File not found after 1 attempts', 'previous_attempt_id': None,
'rule_id': '24ba7b177e4c4b87b6f382d152913ec3', 'activity': 'Data rebalancing', 'bytes': 2152115154, 'md5': None, 'adler32': 'a7e91744',
'dest_url': 'davs://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/data/Run2018D/EGamma/MINIAOD/UL2018_MiniAODv2_GT36-v3/2550001/2137D153-48A6-D245-BF72-78183B05A1E6.root',
'submitted_at': '2023-04-05 03:46:01', 'started_at': '2023-04-05 03:46:16', 'transferred_at': '2023-04-05 03:46:17', 'estimated_at': None, 'submitter_id': None,
'estimated_started_at': None, 'estimated_transferred_at': None, 'staging_started_at': None, 'staging_finished_at': None,
'account': 'transfer_ops', 'requested_at': None, 'priority': None, 'updated_at': '2023-04-05 03:46:48', 'created_at': '2023-04-05 03:45:13', 'transfertool': None}
What DataManagement can deploy/monitor, with the help of ElasticSearch, is a search for recent entries with "File not found" errors, filtered on whether the file was created during the last 24h. This will at least give some fresh hints on what's going on.
Finding these earlier would be good, but unless you can use that to figure out why they are disappearing, it's not going to fundamentally change the situation.
@amaltaro Kevin and I were talking about this yesterday and I think what we ultimately need are:
I think that lost-file detachment is a separate issue from needing to re-generate, if possible, the lost file. I do not think it is necessary, or even sensible, to alter file metadata (including file-dataset membership information) based on replica availability. Just because the last replica known to Rucio is gone does not actually mean the file is lost. What if it is being recovered from a tape backup, or is being recreated, or is being transferred by FTS? I think that file metadata should be kept separate from replica availability.
As for regenerating a file, I think it should be fairly easy to get Rucio to generate a list of files (perhaps from a dataset or a container) which do not have any good replicas. Then these replicas can be re-generated or restored, if possible, and declared to Rucio. If re-generation is not possible, then steps 2.a and 2.b will need to be taken.
So here is what I am proposing, assuming Rucio does not detach lost files. For each lost file:

If it is possible to recreate a replica:
a. recreate the replica
b. declare the replica to Rucio

If recreating a replica is not possible:
c. detach the file from Rucio if needed
d. invalidate the file in DBS
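A minimal sketch of that procedure, assuming Rucio no longer auto-detaches lost files; the three callables are hypothetical hooks into WM and DBS tooling, not existing APIs:

from rucio.client import Client

client = Client()  # relies on a local rucio.cfg

def handleLostFiles(block, canRecreate, recreateReplica, invalidateInDBS):
    """Apply the proposed recipe to every file of a block that has no replicas.

    canRecreate, recreateReplica and invalidateInDBS are hypothetical callables
    to be provided by WM and DBS tooling; they are not existing APIs."""
    for fileInfo in client.list_files(scope="cms", name=block):
        did = {"scope": "cms", "name": fileInfo["name"]}
        # does Rucio know about any replica of this file, anywhere?
        hasReplica = any(rep["rses"] for rep in client.list_replicas([did]))
        if hasReplica:
            continue
        if canRecreate(did["name"]):
            recreateReplica(did["name"])  # steps a/b: recreate and declare to Rucio
        else:
            client.detach_dids(scope="cms", name=block, dids=[did])  # step c
            invalidateInDBS(did["name"])  # step d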
Yes, detachment and lost replicas are separate issues. Except something is not declared LOST if there is a copy on tape. Nor can a file be truly recreated (size/checksum). The physics content can be recreated by the WM system, not the actual file. That's what I meant.
What you are proposing isn't so different from what I would propose initially, except for the decision point, which does not exist (see above). Eventually, if the steps are known, there is no reason not to automate this. And my assumption (a good one, I think) is that if a file is LOST, there is no other course of action but to detach it and declare it invalid.
The whole reason we have this mismatch is because Rucio thinks it is the "truth" of dataset contents and CMS has assumed DBS is the truth.
(The reason I said we might recreate the content of the file and not "will" is because if it's MC, we probably don't care enough to go through the hassle. And if it's RAW, we cannot.)
+1 to @ericvaandering , i.e. we need CMS Rucio code which will contact DBS, make the Rucio decision based on DBS info, and update DBS if necessary. This will require proper certificates and rules.
Yes, detachment and lost replicas are separate issues. Except something is not declared LOST if there is a copy on tape. Nor can a file be truly recreated (size/checksum). The physics content can be recreated by the WM system, not the actual file. That's what I meant.
I was talking hypothetically, to illustrate the fact that just because there are no more replicas known to Rucio does not always mean there is no replica somewhere unknown to Rucio, or that one cannot be recreated.
My point is that the decision about detaching the file and invalidating it in DBS should not be made by Rucio.
My view of this is that it’s a micro service since it may need information and actions spanning Rucio, DBS, and WMAgent.
@ericvaandering , I like the idea of a micro-service, but it is unclear what the request flow is. For example:
- a new MicroService (MS) will be a proxy between DBS/Rucio/WMAgent
- WMAgent will place requests through the MS
- the MS will talk to DBS and Rucio to place requests on behalf of WMAgent
But if that is the case, it is unclear why we need the MS, since WMAgent can talk to both DBS and Rucio, and Rucio can talk to DBS. The point is that we should clearly lay out an architectural diagram with all the requests. For instance, it seems to me that Rucio must talk to DBS before making the decision to delete a replica; why do we need an MS for that? I suggest that we draw all the requests on a board and then decide if we really need an MS (it may be overkill for this request flow). At the moment I think the missing part is in Rucio only, since it never takes DBS into account, and it could be solved simply by a wrapper (or even a decorator function) around the appropriate delete API, i.e. the delete API would need to consult DBS before deleting the replica.
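To illustrate what such a wrapper could look like (purely hypothetical code, and, as Eric notes just below, an explicit Rucio-to-DBS channel is a non-starter, so this only sketches the idea), using the DBS validFileOnly flag:

import functools
import requests

DBS_URL = "https://cmsweb.cern.ch/dbs/prod/global/DBSReader"

def dbsIsValid(lfn, cert=None):
    """Ask DBS whether the LFN is still valid; validFileOnly=1 returns only valid files.
    A grid certificate/key pair must be passed via cert for real use."""
    resp = requests.get(DBS_URL + "/files",
                        params={"logical_file_name": lfn, "validFileOnly": 1},
                        cert=cert)
    return bool(resp.json())

def consultDBSFirst(deleteApi):
    """Hypothetical decorator guarding a Rucio delete/detach call behind a DBS check."""
    @functools.wraps(deleteApi)
    def wrapper(scope, name, *args, **kwargs):
        if dbsIsValid(name):
            raise RuntimeError("refusing to act on %s: still valid in DBS" % name)
        return deleteApi(scope, name, *args, **kwargs)
    return wrapper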
We’re not going to be able to build an explicit communication channel into Rucio to connect to DBS. That’s a non-starter.
What we can probably ask for is a configuration option to not detach a replica if the last copy is lost. In that case, a message should be emitted from Rucio to an MQ. Something, then, should listen for that message, gather CMS-specific information from the various CMS services, and decide on a course of action. It then may have to take action with several different CMS services to restore consistency.
Eric
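For concreteness, here is a rough sketch of such a listener, assuming the messages land on an ActiveMQ topic as JSON; the broker endpoint, topic name, credentials, and message schema are all placeholders:

import json
import time

import stomp  # STOMP client commonly used with CERN ActiveMQ brokers

class LostReplicaListener(stomp.ConnectionListener):
    """React to hypothetical 'lost replica' messages emitted by Rucio."""
    def on_message(self, frame):  # stomp.py >= 8 passes a single Frame object
        msg = json.loads(frame.body)
        # here one would gather CMS-specific info from DBS/WMAgent and decide
        print("lost replica reported:", msg.get("name"), "at", msg.get("rse"))

conn = stomp.Connection([("cms-mb.cern.ch", 61313)])  # broker host/port are placeholders
conn.set_listener("", LostReplicaListener())
conn.connect("user", "pass", wait=True)  # a real setup would use proper credentials
conn.subscribe(destination="/topic/cms.rucio.events", id=1, ack="auto")  # placeholder topic
while True:
    time.sleep(60)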
Alan gave me some pointers and I would like to give an update on my findings:
- LFN: /store/mc/RunIISummer20UL18RECO/G1Jet_LHEGpT-50To100_PhotonInBarrel_Mjj-300_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_upgrade2018_realistic_v11_L1v1-v2/2820004/1D29A34F-EB43-D94D-840F-845115A852A6.root
- block: /G1Jet_LHEGpT-50To100_PhotonInBarrel_Mjj-300_TuneCP5_13TeV-amcatnlo-pythia8/RunIISummer20UL18RECO-106X_upgrade2018_realistic_v11_L1v1-v2/AODSIM#c56456c6-2aac-41e3-b7fc-aa3a569ac531
- workflow: cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766
- agents: cmsgwms-submit3.fnal.gov,cmsgwms-submit4.fnal.gov,cmsgwms-submit5.fnal.gov,cmsgwms-submit6.fnal.gov,cmsgwms-submit7.fnal.gov,cmsgwms-submit8.fnal.gov,vocms0254.cern.ch,vocms0255.cern.ch,vocms0256.cern.ch,vocms0281.cern.ch,vocms0282.cern.ch,vocms0283.cern.ch

I looked at cmsgwms-submit5 and its log area /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir, in particular:
/data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766
[cmsdataops@cmsgwms-submit5 current]$ cat find_entries.sh
#!/bin/bash
# search all job log archives under $idir for a given pattern
idir=$1
pat=$2
files=`find $idir -name "Job*.tar.bz2"`
for fname in $files;
do
    results=`bzgrep "$pat" $fname`
    if [ -n "$results" ]; then
        echo "Pattern \"$pat\" found in $fname"
    fi
done
The pattern "stage out failed" can be found in the following logs:
[cmsdataops@cmsgwms-submit5 current]$ ./find_entries.sh $idir "$pat"
Pattern "stage out failed" found in /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_3711/Job_3711941.tar.bz2
Pattern "stage out failed" found in /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_3711/Job_3711681.tar.bz2
Pattern "stage out failed" found in /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_3711/Job_3711751.tar.bz2
Pattern "stage out failed" found in /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_3711/Job_3711993.tar.bz2
Pattern "stage out failed" found in /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_3711/Job_3711515.tar.bz2
Pattern "stage out failed" found in /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_3711/Job_3711466.tar.bz2
Pattern "stage out failed" found in /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_3711/Job_3711422.tar.bz2
Pattern "stage out failed" found in /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_3711/Job_3711292.tar.bz2
Now, if we look into an individual log, e.g.
bzcat /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_3711/Job_3711422.tar.bz2 | less
we can see the following failures:
2023-04-19 05:43:42,959:ERROR:StageOutImpl:Attempt 0 to stage out failed.
Automatically retrying in 300 secs
Error details:
<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: Command exited non-zero, ExitCode:151
Output: stdout: Local File Size is: 8393858
Remote File Size is: 0
Local File Checksum is: c3e0a807
Remote File Checksum is:
ERROR: Size or Checksum Mismatch between local and SE
stderr: Run: [ERROR] Operation expired: (destination) [ERROR] Server responded with an error: [3013] No checksum available for this file.
ClassName : None
ModuleName : WMCore.Storage.StageOutError
MethodName : __init__
ClassInstance : None
FileName : /srv/job/WMCore.zip/WMCore/Storage/StageOutError.py
LineNumber : 32
ErrorNr : 0
Command : LOCAL_SIZE=`stat -c%s "/srv/job/WMTaskSpace/cmsRun5/AODSIMoutput.root"`
... Command exited non-zero, ExitCode:151 Output: stdout: Local File Size is: 8393858 Remote File Size is: 0 Local File Checksum is: c3e0a807 Remote File Checksum is: ERROR: Size or Checksum Mismatch between local and SE
stderr: Run: [ERROR] Operation expired: (destination) [ERROR] Server responded with an error: [3013] No checksum available for this file.
ERROR:root:Attempt 0 to stage out failed.
...
echo "Local File Size is: $LOCAL_SIZE"
xrdcp --force --nopbar -p --cksum adler32:c3e0a807 "/srv/job/WMTaskSpace/cmsRun5/AODSIMoutput.root" "root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/unmerged/RunIISummer20UL18RECO/G1Jet_LHEGpT-50To100_PhotonInBarrel_Mjj-300_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_upgrade2018_realistic_v11_L1v1-v2/50568/172D2C31-CA56-8949-8FD4-50BCA1D033F5.root"
REMOTE_SIZE=xrdfs 'cmsdcadisk.fnal.gov' stat '/dcache/uscmsdisk/store/unmerged/RunIISummer20UL18RECO/G1Jet_LHEGpT-50To100_PhotonInBarrel_Mjj-300_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_upgrade2018_realistic_v11_L1v1-v2/50568/172D2C31-CA56-8949-8FD4-50BCA1D033F5.root' | grep Size | sed -r 's/.*Size:[ ]*([0-9]+).*/\1/'
echo "Remote File Size is: $REMOTE_SIZE"
echo "Local File Checksum is: c3e0a807"
REMOTE_XS=xrdfs 'cmsdcadisk.fnal.gov' query checksum '/dcache/uscmsdisk/store/unmerged/RunIISummer20UL18RECO/G1Jet_LHEGpT -50To100_PhotonInBarrel_Mjj-300_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_upgrade2018_realistic_v11_L1v1-v2/50568/172D2C3 1-CA56-8949-8FD4-50BCA1D033F5.root' | grep -i adler32 | sed -r 's/.*[adler|ADLER]32[ ]*([0-9a-fA-F]{8}).*/\1/'
echo "Remote File Checksum is: $REMOTE_XS"
if [ $REMOTE_SIZE ] && [ $REMOTE_XS ] && [ $LOCAL_SIZE == $REMOTE_SIZE ] && [ 'c3e0a807' == $REMOTE_XS ]; then exit 0; els
e echo "ERROR: Size or Checksum Mismatch between local and SE"; xrdfs 'cmsdcadisk.fnal.gov' rm /dcache/uscmsdisk/store/unm
erged/RunIISummer20UL18RECO/G1Jet_LHEGpT-50To100_PhotonInBarrel_Mjj-300_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_upgrade
2018_realistic_v11_L1v1-v2/50568/172D2C31-CA56-8949-8FD4-50BCA1D033F5.root ; exit 60311 ; fi
ExitCode : 151
ErrorCode : 60311
ErrorType : GeneralStageOutFailure
Since we have different agents involved, I need more information about what we should look for in the log files. The more precise a regexp you can provide, the easier and faster we can find this needle. Maybe we should look for a specific code or message. Please provide any valuable input you can give.
@vkuznet Valentin, Andress provided a list of files in the #wmagent-dev slack channel. I am trying to keep debugging data lost in Rome, so here is one of those use cases:
"name": "/store/mc/Run3Winter23Digi/SinglePion_E-50_Eta-0to3-pythia8-gun/GEN-SIM-RAW/NoPUGTv4_126X_mcRun3_2023_forPU65_v4-v2/60004/5c301e42-561c-4489-a2ba-235d00ff0058.root",
"dest_rse": "T1_UK_RAL_Disk",
"src_rse": "T2_IT_Rome",
"err_msg": "RequestErrMsg.TRANSFER_FAILED:SOURCE [2] Result HTTP 404 : File not found after 1 attempts",
"tx_requested_at": "2023-05-20 22:08:14",
"tx_created_at": "2023-05-20 22:08:14",
"tx_submitted_at": "2023-05-20 22:08:15",
"tx_started_at": "2023-05-20 22:08:17",
"tx_transferred_at": "2023-05-20 22:08:18",
"tx_updated_at": "2023-05-20 22:08:58",
"content_updated_at": "2023-05-18 04:17:24",
"content_created_at": "2023-05-18 04:17:22"
Going from the file to its dataset and then to its workflow, here is the running-closed workflow writing to that dataset: cmsunified_task_HCA-Run3Winter23GS-00001__v1_T_230517_182344_1871
which is running from a single agent:
agent url: cmsgwms-submit6.fnal.gov
If we scan the job logs in submit6 for that file id in the wmagentJob.log log file, using this script [^], it spits out the following log:
MATCH! LogArchive /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_HCA-Run3Winter23GS-00001__v1_T_230517_182344_1871/JobCluster_2706/Job_2706067.tar.bz2
Which indeed has something VERY suspicious in the log, which I also paste it here:
2023-05-18 06:03:50,553:INFO:StageOutImpl:Running the stage out...
2023-05-18 06:04:04,404:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Thu May 18 06:03:50 CEST 2023
Copying 3227887890 bytes file:///srv/job/WMTaskSpace/cmsRun1/Merged.root => davs://cmsrm-xrootd01.roma1.infn.it:2880/mc/Run3Winter23Digi/SinglePion_E-50_Eta-0to3-pythia8-gun/GEN-SIM-RAW/NoPUGTv4_126X_mcRun3_2023_forPU65_v4-v2/60004/5c301e42-561c-4489-a2ba-235d00ff0058.root
gfal-copy exit status: 0
stderr: /srv/startup_environment.sh: line 8: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 15: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 36: EUID: readonly variable
/srv/startup_environment.sh: line 205: PPID: readonly variable
/srv/startup_environment.sh: line 213: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 230: UID: readonly variable
gfal-copy error: 256 (Unknown error 256) - TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 500 : Unexpected server error: 500 (destination)
[^] https://github.com/amaltaro/ProductionTools/blob/master/untarLogArchive.py
which can be executed like:
python3 untarLogArchive.py /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_HCA-Run3Winter23GS-00001__v1_T_230517_182344_1871 5c301e42-561c-4489-a2ba-235d00ff0058.root
Alan, I confirm that I also see such errors:
gfal-copy error: 5 (Input/output error) - DESTINATION OVERWRITE Result HTTP 500 : Unexpected server error: 500 after 1 attempts
/srv/startup_environment.sh: line 9: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 16: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 35: EUID: readonly variable
/srv/startup_environment.sh: line 156: PPID: readonly variable
/srv/startup_environment.sh: line 163: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 181: UID: readonly variable
gfal-rm error: 5 (Input/output error) - Result HTTP 500 : Unexpected server error: 500 after 1 attempts
in the agents for the cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_294 workflow. It is a very common error which happens in many logs across all agents, for instance on vocms0283:/data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_294/Job_294596.tar.bz2
I scanned a few agents, and I do see HTTP 500 errors in all of them. I copied a few log files to this location /afs/cern.ch/user/v/valya/public/WMCore if anyone would like to look, but I observed many stage-out failures (not only related to the HTTP 500 error) and they are quite frequent. The errors related to the 500 message come from gfal calls. Based on this observation, it may be worth re-thinking how we handle such errors for easy access to them. For instance, we can introduce code which will send error messages to MONIT via AMQ brokers, such that later someone can view them in ES/OpenSearch. Such an error message may contain the name of the block, the LFN, the agent, its log, the error message, etc. Then the look-up of errors can be simplified by looking into ES/OpenSearch, and moreover we can provide a corresponding dashboard for easy navigation.
Impact of the bug
Identify the issue with stuck P&R workflows and allow them to progress.

Describe the bug
As described in PnR#41, we have a certain number of workflows which have a file mismatch between DBS and Rucio and never get announced for months. The file mismatch usually happens when there are output files in DBS that are not recorded by Rucio; here is one example, for WF pdmvserv_Run2022D_Muon_10Dec2022_221209_230856_3545. A recent list of workflows having this issue can be found here.

How to reproduce it
Most likely we need to debug the agent logs and their local databases to identify the issue with stuck workflows.

Expected behavior
We should have a file match between DBS and Rucio to allow the WF to progress.

Additional context and error message