dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

gfal-copy failing but returning status code 0 - leaving a bunch of file mismatch in the system #11556

Closed vkuznet closed 1 year ago

vkuznet commented 1 year ago

Impact of the bug: identify the issue with stuck PnR workflows and allow them to progress.

Describe the bug: As described in PnR#41, we have a certain number of workflows with a file mismatch between DBS and Rucio, and they never get announced for months. The file mismatch usually happens when there are output files in DBS that are not recorded by Rucio. Here is one example:

file: /store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root

for the WF pdmvserv_Run2022D_Muon_10Dec2022_221209_230856_3545. A recent list of workflows having this issue can be found here.

How to reproduce it: Most likely we need to debug the agent logs and their local databases to identify the issue with the stuck workflows.

Expected behavior: We should have a file match between DBS and Rucio to allow the WF to progress.


vkuznet commented 1 year ago

FYI: @amaltaro , @haozturk

amaltaro commented 1 year ago

@vkuznet thanks for creating this issue. @haozturk FYI

Here is my suggestion on how to debug it:

  1. pick a workflow reported by Unified (if possible, zoom in on the relevant missing block/file)
  2. using wmstats (or wmstatsserver), check which agents worked on that workflow
  3. connect to that/those agents
  4. grep for the file name in the relevant component log: if the file is missing in DBS, check the DBS3Upload component log; if it's missing in Rucio, check the RucioInjector component log
  5. note that we do not record the filename in RucioInjector, only the block and how many files are/were attached to it
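
For step 4, a minimal sketch of the kind of grep, assuming the default agent install layout (the LFN and block below are the examples from this ticket):

# file missing in DBS: check the DBS3Upload component log
grep '025d55a6-8e48-4695-9087-80bf9681229b' install/wmagentpy3/DBS3Upload/ComponentLog
# file missing in Rucio: note RucioInjector only logs at block level
grep 'RAW-RECO#a63aafe6-480d-45d8-9d83-668e6172c271' install/wmagentpy3/RucioInjector/ComponentLog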

Another option is to use the relational database, using your preferred SQL client (sqlplus, SQLDeveloper, etc). If you use sqlplus, you can type this command in the agent:

$manage db-prompt wmagent

and I think the table that we want to look at is dbsbuffer_file.
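
For illustration, a sketch of the kind of query one could run at that prompt (assuming db-prompt passes stdin through to sqlplus; the lfn/status/in_phedex column names are my assumption about the dbsbuffer_file schema, not verified):

$manage db-prompt wmagent <<'EOF'
-- look up the problematic LFN in the agent-local DBS buffer
SELECT lfn, status, in_phedex
  FROM dbsbuffer_file
 WHERE lfn LIKE '%025d55a6-8e48-4695-9087-80bf9681229b%';
EOF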

vkuznet commented 1 year ago

Here is a very simple procedure to check DBS vs Rucio files:

  1. obtain the (CMS) block in question (note that I escape # as %23 in the block name), e.g.
    blk=/Muon/Run2022D-ZMu-10Dec2022-v1/RAW-RECO%23a63aafe6-480d-45d8-9d83-668e6172c271
  2. run the DBS files API to obtain the list of files; here I use the jq tool to parse the JSON output and extract LFN names:
    scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/files?block_name=$blk" | jq '.[].logical_file_name'
  3. run the Rucio client to obtain the list of files; here is my rucio_curl script:
    
#!/bin/bash
url=$1
account=das
cert=/etc/proxy/proxy
ckey=/etc/proxy/proxy
if [ ! -f $ckey ]; then
    ckey=$HOME/.globus/userkey.pem
    cert=$HOME/.globus/usercert.pem
fi
aurl=https://cms-rucio-auth.cern.ch/auth/x509
if [ -n "`echo $url | grep rucio-int`" ]; then
    aurl=https://cms-rucio-auth-int.cern.ch/auth/x509
fi

opt="-s -L -k --key $HOME/.globus/userkey.pem --cert $HOME/.globus/usercert.pem"
token=`curl $opt -v -H "X-Rucio-Account: $account" $aurl 2>&1 | grep "X-Rucio-Auth-Token:" | sed -e "s,< X-Rucio-Auth-Token: ,,g"`
curl $opt -H "X-Rucio-Auth-Token: $token" -H "X-Rucio-Account: $account" "$url"

and here is how I run it:

rucio_curl "http://cms-rucio.cern.ch/replicas/cms/Muon/Run2022D-ZMu-10Dec2022-v1/RAW-RECO%23a63aafe6-480d-45d8-9d83-668e6172c271" | jq '.name'

So, if you run both and count the number of files, you will see the difference, e.g.

create dbs output

scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/files?block_name=$blk" | jq '.[].logical_file_name' | sort -u > /tmp/dbs.blocks

create rucio output

rucio_curl "http://cms-rucio.cern.ch/replicas/cms/Muon/Run2022D-ZMu-10Dec2022-v1/RAW-RECO%23a63aafe6-480d-45d8-9d83-668e6172c271" | jq '.name' | sort -u > /tmp/rucio.blocks

perform comparison:

diff -u /tmp/dbs.blocks /tmp/rucio.blocks
--- /tmp/dbs.blocks     2023-04-18 09:18:34.000000000 -0400
+++ /tmp/rucio.blocks   2023-04-18 09:18:26.000000000 -0400
@@ -1,4 +1,3 @@
-"/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root"
 "/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/0866682d-44b9-4f3a-8b5b-4b4baab4c137.root"
 "/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/0c1860fd-a94a-4ec7-843b-453c3a4bec3e.root"
 "/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/10f22a56-6424-4559-95d0-c505ddefc7b0.root"


The missing LFN with hash `025d55a6-8e48-4695-9087-80bf9681229b` does not exist in Rucio.

vkuznet commented 1 year ago

I communicated with @ericvaandering, who gave me the clue. According to Rucio, the file which is not present in Rucio has no associated block (i.e. it is not attached to any block).

# here is an LFN which is attached to the CMS block (Rucio dataset)
rucio list-parent-dids cms:/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/0866682d-44b9-4f3a-8b5b-4b4baab4c137.root
+------------------------------------------------------------------------------------+--------------+
| SCOPE:NAME                                                                         | [DID TYPE]   |
|------------------------------------------------------------------------------------+--------------|
| cms:/Muon/Run2022D-ZMu-10Dec2022-v1/RAW-RECO#a63aafe6-480d-45d8-9d83-668e6172c271  | DATASET      |
+------------------------------------------------------------------------------------+--------------+

# while here is an LFN which has no attachment (no parent DID is returned)
rucio list-parent-dids cms:/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root

Then I checked the LFN status in DBS and found the same confirmation: the LFN /store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/0866682d-44b9-4f3a-8b5b-4b4baab4c137.root has valid status, while the LFN /store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root is invalid.

My conclusion is that invalidated files do not appear in Rucio.
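
The validity flag can be checked directly via the DBS files API with detail=true, which returns is_file_valid (1 = valid, 0 = invalid); a sketch using the same scurl helper:

lfn=/store/data/Run2022D/Muon/RAW-RECO/ZMu-10Dec2022-v1/2550000/025d55a6-8e48-4695-9087-80bf9681229b.root
scurl "https://cmsweb.cern.ch:8443/dbs/prod/global/DBSReader/files?logical_file_name=$lfn&detail=true" | jq '.[].is_file_valid'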

vkuznet commented 1 year ago

I further investigated the Unified code base and found a particular flaw in their function usage, which depends on an internal cache. Please see full details here: https://gitlab.cern.ch/CMSToolsIntegration/operations/-/issues/41#note_6636304. In summary, it seems to me that Unified stores DBS results in a cache used in multiple places, and by the time it finally compares DBS vs Rucio file lists, some files may have been invalidated by the underlying workflow or data-ops, leading to a file mismatch.

I'll be happy to check another block when PnR provides one, but at the moment I do not see any issues with the WMCore agent's RucioInjector.

amaltaro commented 1 year ago

@vkuznet Valentin, regardless of a stale cache or not in Unified, didn't you confirm that there actually is a file in Rucio which has not been attached to its parent block? If this is true, then there must be a flaw as well either in:

BTW, here is our Rucio implementation for adding a replica to Rucio server plus attaching it to a block: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Rucio/Rucio.py#L381

We might have to chase this file/block in the agent logs to see if any error/exception was raised when dealing with it in RucioInjector.

ericvaandering commented 1 year ago

Unless "invalidation" (who/where does that) is not deleting the Rucio DID but only detaching it.

vkuznet commented 1 year ago

Alan, how can I find which agent was used for an archived workflow? I looked up the one provided in the description and found it here: https://cmsweb.cern.ch/reqmgr2/fetch?rid=request-pdmvserv_Run2022D_Muon_10Dec2022_221209_230856_3545 but I can't find it in wmstats using this WF name.

vkuznet commented 1 year ago

Meanwhile, according to the WMCore RucioInjector code base https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Rucio/Rucio.py#L404-L414 and using logs from one of the CERN agents (vocms0255), I do see a few errors like this:

2023-02-28 17:28:25,353:140522088707840:ERROR:Rucio:Failed to add replicas for: [{'name': '/store/mc/RunIISummer20UL16RECO/WplusH_HTo2LongLivedTo2G2Q_MH-125_MFF-7_ctau-20cm_TuneCP5_fixedLifetime_13TeV-powheg-pythia8/AODSIM/106X_mcRun2_asymptotic_v13-v2/2550000/A9995941-D344-2447-AEC2-A4B0FF2F1810.root', 'scope': 'cms', 'bytes': 1744683405, 'state': 'A', 'adler32': '37f4804d'}] and block: /WplusH_HTo2LongLivedTo2G2Q_MH-125_MFF-7_ctau-20cm_TuneCP5_fixedLifetime_13TeV-powheg-pythia8/RunIISummer20UL16RECO-106X_mcRun2_asymptotic_v13-v2/AODSIM#05edac39-5ba2-4d35-b29f-00ccd0422a1a. Error: An unknown exception occurred

which suggests that RucioInjector can fail and the reason is unknown (or not propagated well from the Rucio client). That said, we may perform an inverse look-up and check whether PnR has a file mismatch in this block /WplusH_HTo2LongLivedTo2G2Q_MH-125_MFF-7_ctau-20cm_TuneCP5_fixedLifetime_13TeV-powheg-pythia8/RunIISummer20UL16RECO-106X_mcRun2_asymptotic_v13-v2/AODSIM#05edac39-5ba2-4d35-b29f-00ccd0422a1a or in any other block we can extract from the agent logs.

vkuznet commented 1 year ago

@amaltaro , another suggestion: by inspecting the reqmgr2 logDB I found that for a given WF we do have pretty extensive JSON meta-data, but it lacks one particularly important element: the name of the agent which processed the WF. Should we open an issue for that to make debugging easier?

amaltaro commented 1 year ago

The only way to know which agents worked on a given request is through wmstats; at the same time, wmstats only contains data for active workflows. The workflow you are looking at has already been archived (normal-archived), so we cannot easily know which agents processed it. I think Hasan mentioned many workflows in the same situation, so I'd suggest picking one or two that have run in the past weeks (and that are not yet archived).

This message that you pasted, "Error: An unknown exception occurred", suggests that no error details were returned from the Rucio server. Still, WMAgent is stateful and it will keep retrying to inject that replica+attachment in the coming component cycles.

Regarding reqmgr2 LogDB, I feel like the way it's used is sub-optimal, and from time to time the database gets slower and queries might time out. I believe it's also meant to be a temporary database only for active workflows. We might have to revisit its usage and how data is organized in it before expanding it with further use-cases. But I agree that the agent information is very much needed!

vkuznet commented 1 year ago

@amaltaro , while awaiting active workflows from PnR, I inspected the ones they listed in their ticket and found none available in wmstats; either they do not exist or they are completed. That said, I also inspected the WMCore Rucio.py code base and would like to provide general feedback about this issue:

amaltaro commented 1 year ago

Thanks Valentin. Then we need to wait for P&R to provide a workflow that has this issue and which is still active.

Regarding the Rucio.py implementation, we need to check what is actually returned from the server - through Rucio client - to see which information we can pass upstream.

Another option would be to actually separate the "rucio replica injection" from the "rucio replica attachment", making them two different methods such that we can properly see what the outcome of each is.

Note that createReplicas is only used by RucioInjector, given that it is the only component registering DIDs in Rucio.

z4027163 commented 1 year ago

Since the P&R ticket has restricted access, I am pasting here what we have from the other ticket.

Valentin checked this workflow pdmvserv_task_BTV-Run3Summer22EEGS-00009__v1_T_221223_125301_7115, which completed in the FNAL submit8 agent.

An example of a file that is in DBS but not in rucio: /store/mc/Run3Summer22EEMiniAODv3/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v1/80000/aede32b8-d95a-44f0-aa06-2f21dbda371b.root

The corresponding block is: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf

z4027163 commented 1 year ago

Alan did check the workflow which produced the MINIAODSIM above, which was: pdmvserv_task_BTV-Run3Summer22EEGS-00009__v1_T_221223_125301_7115

Looking into the output MINIAODSIM dataset, here is what one of the WMAgent utility tools says:

cmst1@vocms0255:/data/srv/wmagent/current $ python3 apps/wmagentpy3/bin/adhoc-scripts/checkDsetFileCount.py /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM
...
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: *** Dataset: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: Rucio file count : 1974
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: DBS file count   : 2016
2023-04-25 21:11:27,899:INFO:checkDsetFileCount:  - valid files   : 2006
2023-04-25 21:11:27,899:INFO:checkDsetFileCount:  - invalid files : 10
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: Blocks in Rucio but not in DBS: set()
2023-04-25 21:11:27,899:INFO:checkDsetFileCount: Blocks in DBS but not in Rucio: set()
2023-04-25 21:11:27,899:WARNING:checkDsetFileCount: Block with file mismatch: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
2023-04-25 21:11:27,899:WARNING:checkDsetFileCount:     Rucio: 0        DBS: 41
2023-04-25 21:11:27,899:WARNING:checkDsetFileCount: Block with file mismatch: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#956bfed7-3d2b-401d-b8f7-f13a86176ec5
2023-04-25 21:11:27,899:WARNING:checkDsetFileCount:     Rucio: 7        DBS: 8

so one of the blocks is said not to have any files in Rucio, while DBS has 41 files, confirmed with: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filesummaries?block_name=/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM%232ac6771a-ae1e-4666-9c3d-f2e3d8daccbf

If I grep the submit8 RucioInjector logs, I can find that this block was reported as being successfully injected into Rucio back in January (and it reports the very same 41 files):

[cmsdataops@cmsgwms-submit8 current]$ grep 'MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf' install/wmagentpy3/DBS3Upload/ComponentLog 
2023-01-08 07:10:07,090:140090107418368:INFO:DBSUploadPoller:Queueing block for insertion: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
2023-01-08 07:10:07,110:140090107418368:INFO:DBSUploadPoller:About to call insert block for: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
[cmsdataops@cmsgwms-submit8 current]$ 
[cmsdataops@cmsgwms-submit8 current]$ grep 'MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf' install/wmagentpy3/RucioInjector/ComponentLog 
2023-01-06 07:11:17,939:140090104010496:INFO:RucioInjectorPoller:Block /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf inserted into Rucio
2023-01-06 07:11:40,684:140090104010496:INFO:RucioInjectorPoller:Successfully inserted 41 files on block /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf
2023-01-06 07:12:17,958:140090104010496:INFO:RucioInjectorPoller:Block rule created for block: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf, at: T2_US_Florida, with rule id: d6326e6f122540b9b8f92e07b5912b03
2023-01-08 07:11:53,813:140090104010496:INFO:RucioInjectorPoller:Closing block: /QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf

Alan still needs to look into the Rucio server to see if it actually knows any of the DIDs (files) in this block. To be done next.
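
(A quick sketch of that check with the Rucio CLI: rucio stat prints DID metadata if Rucio knows the file at all, and rucio list-file-replicas lists any replicas it tracks.)

lfn=/store/mc/Run3Summer22EEMiniAODv3/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v1/80000/aede32b8-d95a-44f0-aa06-2f21dbda371b.root
rucio stat "cms:$lfn"
rucio list-file-replicas "cms:$lfn"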

z4027163 commented 1 year ago

From Alan:

FYI. I asked for feedback in the #DM Ops Mattermost channel and Jhonatan seemed to have found all those files in Rucio: https://mattermost.web.cern.ch/cms-o-and-c/pl/4mg45jykbbn8jdbqxydn64f58y

I still have follow up questions and we still need to understand what that means and why that happened. Anyhow, feel free to follow this up over there as well.

amaltaro commented 1 year ago

@z4027163 thank you for updating this ticket! Sorry that I had to miss the DM meeting; is there anything that we in WM are expected to do next? Or what is the next step to properly understand what happened here?

z4027163 commented 1 year ago

@amaltaro They are aware of the issue; we couldn't figure out the exact cause in the meeting. Eric/Diago will check the ticket for more details. We are waiting for their feedback; maybe there will be follow-ups to do from the WMCore side after that.

ericvaandering commented 1 year ago

I think we said we'd monitor the ticket, not that there was anything I could look into. Jhonathan's posts in MM show that these file DIDs are getting created in Rucio, right? And they are being attached to the block DIDs. So the mystery is how they are becoming unattached.

z4027163 commented 1 year ago

@ericvaandering Ah yes, I misunderstood. I thought you would be the right people to check it. Who should we ask to look into it? I had the impression it was outside WMCore's reach.

ericvaandering commented 1 year ago

Well, I did just take a look at the Rucio code to see WHEN files are detached from blocks. It looks like it can happen if a file is declared bad AND it is the last copy of that file: https://github.com/rucio/rucio/blob/e595093ef688422a9a3d8823570d7826369ab22e/lib/rucio/core/rule.py#L1932

@ivmfnal would it be possible to search the consistency checking logs for some of these files (is there a sample list somewhere?) to see if this is what's happening?

vkuznet commented 1 year ago

@ericvaandering , I certainly do not know the details of Rucio, but looking at the code I find the following block suspicious:

def update_rules_for_lost_replica(scope, name, rse_id, nowait=False, session=None, logger=logging.log):
...
    for dts in datasets:
        logger(logging.INFO, 'File %s:%s bad at site %s is completely lost from dataset %s:%s. Will be marked as LOST and detached', scope, name, rse, dts['scope'], dts['name'])
        rucio.core.did.detach_dids(scope=dts['scope'], name=dts['name'], dids=[{'scope': scope, 'name': name}], session=session)

Here we have two parameters: name, which is passed to the API and which I think is the LFN, and dts['name'], which I would assume is the name of the dataset/block. So, is it possible that an LFN belongs to different datasets/blocks and can be discarded from the wrong dataset/block? Or, is it possible that we pass an LFN from a CMS dataset which is present in multiple CMS blocks, and therefore if it is discarded from a single CMS block it will be LOST for the entire dataset?

ericvaandering commented 1 year ago

Well, in the Rucio model block:lfn (to use CMS terms) is a many-to-many relationship, not one-to-many. However, we still use (but don't enforce) 1:many.

However, if a replica is lost and it's the only replica, that means the file is lost forever and the right thing to do IS to remove it from all the blocks (Rucio datasets) it belongs to.

amaltaro commented 1 year ago

Based on the workflow that we have been investigating, in this comment: https://github.com/dmwm/WMCore/issues/11556#issuecomment-1531485494

I can list all the 41 files in that CMS block with the following command:

amaltaro@lxplus797:~ $ rucio list-content-history "cms:/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#2ac6771a-ae1e-4666-9c3d-f2e3d8daccbf"
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| SCOPE:NAME                                                                                                                                                                                     | [DID TYPE]   |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------|
| cms:/store/mc/Run3Summer22EEMiniAODv3/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v1/80000/9d1b582f-0825-4c77-a468-6c898f8505fe.root | FILE         |
| cms:/store/mc/Run3Summer22EEMiniAODv3/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v1/80000/28eb066c-da6c-4b40-a46c-671ab356ebee.root | FILE         |

and the second block - with 8 files in DBS but only 7 in Rucio - is even more interesting. The list-content-history API returns only 1 file, while list-content returns the remaining 7 files, e.g.:

amaltaro@lxplus797:~ $ rucio list-content-history "cms:/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v1/MINIAODSIM#956bfed7-3d2b-401d-b8f7-f13a86176ec5"
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| SCOPE:NAME                                                                                                                                                                                     | [DID TYPE]   |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------|
| cms:/store/mc/Run3Summer22EEMiniAODv3/QCD_PT-600to800_MuEnrichedPt5_TuneCP5_13p6TeV_pythia8/MINIAODSIM/124X_mcRun3_2022_realistic_postEE_v1-v1/80001/ae8b90c7-64c7-4331-b18d-be6e1133fa73.root | FILE         |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+

given that we can list files by querying the CMS block, I believe it is not a matter of leaving files detached from the CMS block. If I recall correctly, using the python client APIs we can even collect further details for these files. Last time I checked, they were deleted 4 or 5 days post injection into Rucio server.

If we cannot answer what/who deleted those files back then, then I'd suggest considering this issue: https://github.com/dmwm/CMSRucio/issues/495 at our earliest convenience; otherwise we won't be able to debug and identify which system is deleting these files.

ivmfnal commented 1 year ago

@ericvaandering yes, there are "missing" action-list files for each run:

[root@consistency-rucio-consistency-6cc7b4c985-kx5d4 consistency-dump]# ls -l *_M_action.list
-rw-r--r--. 1 root root       0 Mar 18 14:27 T0_CH_CERN_Disk_2023_03_18_14_00_M_action.list
-rw-r--r--. 1 root root       0 Mar 25 14:30 T0_CH_CERN_Disk_2023_03_25_14_00_M_action.list
-rw-r--r--. 1 root root       0 Mar 31 23:15 T0_CH_CERN_Disk_2023_03_31_22_48_M_action.list
-rw-r--r--. 1 root root       0 Apr 13 17:09 T0_CH_CERN_Disk_2023_04_13_16_29_M_action.list
-rw-r--r--. 1 root root       0 Apr 20 17:10 T0_CH_CERN_Disk_2023_04_20_16_29_M_action.list
-rw-r--r--. 1 root root       0 Apr 27 16:57 T0_CH_CERN_Disk_2023_04_27_16_29_M_action.list
-rw-r--r--. 1 root root       0 Mar 22 11:53 T1_DE_KIT_Disk_2023_03_18_13_32_M_action.list
-rw-r--r--. 1 root root 3213200 Mar 26 11:14 T1_DE_KIT_Disk_2023_03_25_13_33_M_action.list
-rw-r--r--. 1 root root       0 Mar 24 15:18 T1_ES_PIC_Disk_2023_03_24_01_57_M_action.list
-rw-r--r--. 1 root root       0 Apr  2 17:33 T1_ES_PIC_Disk_2023_04_01_15_15_M_action.list
-rw-r--r--. 1 root root       0 Apr 10 04:13 T1_ES_PIC_Disk_2023_04_09_15_49_M_action.list
-rw-r--r--. 1 root root       0 Apr 17 04:37 T1_ES_PIC_Disk_2023_04_16_15_49_M_action.list
-rw-r--r--. 1 root root       0 May  1 02:01 T1_ES_PIC_Disk_2023_04_30_15_49_M_action.list
-rw-r--r--. 1 root root       0 Mar 20 12:57 T1_FR_CCIN2P3_Disk_2023_03_18_14_25_M_action.list
-rw-r--r--. 1 root root       0 Apr  2 04:15 T1_FR_CCIN2P3_Disk_2023_03_30_19_36_M_action.list
-rw-r--r--. 1 root root       0 Mar 24 04:23 T1_FR_CCIN2P3_Tape_2023_03_24_03_24_M_action.list
-rw-r--r--. 1 root root       0 Mar 28 19:16 T1_FR_CCIN2P3_Tape_2023_03_28_18_39_M_action.list
-rw-r--r--. 1 root root       0 Apr  4 19:31 T1_FR_CCIN2P3_Tape_2023_04_04_18_39_M_action.list
...
ericvaandering commented 1 year ago

Let me propose a hypothesis. It would be good to have a more recent example where we can check to prove or disprove parts of this hypothesis.

  1. The workflow runs and an output file for a block is produced.
  2. Both DBSInjector and RucioInjector run and the file is attached to the block
  3. Before the file in question can be replicated to another site, something happens to it and it is lost by the original site.
  4. Some time later (under a week, generally) the Rucio consistency checking comes along and notices the file as missing
  5. The replica is marked bad in Rucio and since it is the last replica, the "necromancer" marks it as LOST.
  6. The Necromancer then detaches the file from the block
  7. Now we have an inconsistency between Rucio and DBS

Now if this is what happens then I would say everything is "correct" except that DBS has to be updated too to deal with the loss of a file.

However, IF step 3/4 is incorrect and the file is really OK, then that is a problem.

vkuznet commented 1 year ago

@ericvaandering , if I follow your logic then I doubt step 6 is correct, since it is not up to Rucio to decide whether a file belongs to a block. I think step 6 requires Rucio to consult DBS on whether the file is no longer valid, and only then may it detach the file from the block. I think the issue we are facing here is one of primary authority. In my view, DBS is the ultimate source of information about the dataset/block/file relationship, and if such a relationship exists, all other tools should follow it; otherwise we will end up with inconsistencies. This also aligns with your final statement that DBS needs to be updated at some point, and if it is, Rucio should always consult DBS about file association to a dataset/block.

jhonatanamado commented 1 year ago

@ericvaandering , related to your hypothesis, I have a question: will a suspended rule interfere with that logic, as follows?

  1. The workflow runs and an output file for a block is produced.
  2. Both DBSInjector and RucioInjector run and the file is attached to the block.
  3. Content of the block is updated in Rucio.
  4. The block is being "protected" by a SUSPENDED rule. (Will the SUSPENDED rule know about the new files that were attached?) If not ->
  5. Some time later (under a week, generally) the Rucio consistency checking comes along, notices those files, and they are marked for deletion.

ivmfnal commented 1 year ago

I agree with @vkuznet that Rucio should not remove the file from a dataset just because the file has no replicas, but it does: https://github.com/rucio/rucio/blob/e595093ef688422a9a3d8823570d7826369ab22e/lib/rucio/core/rule.py#L1932

jhonatanamado commented 1 year ago

Hello. We (@amarique1 and I) were discussing this issue and elaborated the following scenario a couple of days ago, where we concluded that suspended rules seem to be the culprit. To test it, we asked @amaltaro to deploy an agent injecting data into a CMS block. Once the block was created and the file was attached, I forced that rule to go to the SUSPENDED state [1]. The WF was still running and new files were attached, but in this case the rule locking the data into the rse_expression was not updated [2]. This seems to be expected, given that the only way to unsuspend a rule is to manually update it to STUCK; the daemon will then retrieve the rule and update it with the new files that must be locked. [1]

rucio --auth-host https://cms-rucio-auth-int.cern.ch --host http://cms-rucio-int.cern.ch rule-info 33e1c1d3330b4d23b6bde943a0cae2e8
Id:                         33e1c1d3330b4d23b6bde943a0cae2e8
Account:                    wma_test
Scope:                      cms
Name:                       /VBF_HHTo2G2Tau_CV_1_C2V_1_C3_0_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL16MiniAODAPVv2-HIG_RunIISummer20UL16MiniAODAPVv2_02457_0_SC_6Steps_PU_issue_11349_v5-v11/MINIAODSIM#fee48763-26a9-4f4e-a449-4637da31b41f
RSE Expression:             T1_US_FNAL_Disk
Copies:                     1
State:                      SUSPENDED
Locks OK/REPLICATING/STUCK: 1/0/0

[2]

rucio --auth-host https://cms-rucio-auth-int.cern.ch --host http://cms-rucio-int.cern.ch list-files cms:/VBF_HHTo2G2Tau_CV_1_C2V_1_C3_0_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL16MiniAODAPVv2-HIG_RunIISummer20UL16MiniAODAPVv2_02457_0_SC_6Steps_PU_issue_11349_v5-v11/MINIAODSIM#fee48763-26a9-4f4e-a449-4637da31b41f | tail -n 2
Total files : 3
Total size : 5.354 GB

I'm not the expert, but if I'm not wrong, Rucio will move a rule from STUCK to SUSPENDED automatically if it stays in the STUCK state for more than 2 weeks.

In the end, it seems that if new files are attached to a CMS dataset/block and a suspended rule is locking that CMS dataset/block, the locks are not effective, the rule is not satisfied, and those new files can potentially be deleted by CC or the reaper. Probably we will only know who deleted them if https://github.com/dmwm/CMSRucio/issues/495 is solved.

Adding a child rule locking the data at the site where the WF is running could solve this issue, but other ideas are appreciated (see the sketch below).
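
A rough sketch of what such an extra rule could look like with the Rucio CLI, reusing the block and RSE from the test above (whether these are the right semantics is part of the open question):

# hypothetical "child" rule pinning the block at the site where the WF runs
rucio add-rule --account wma_test \
    cms:/VBF_HHTo2G2Tau_CV_1_C2V_1_C3_0_TuneCP5_13TeV-powheg-pythia8/RunIISummer20UL16MiniAODAPVv2-HIG_RunIISummer20UL16MiniAODAPVv2_02457_0_SC_6Steps_PU_issue_11349_v5-v11/MINIAODSIM#fee48763-26a9-4f4e-a449-4637da31b41f \
    1 T1_US_FNAL_Disk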

amaltaro commented 1 year ago

@jhonatanamado thank you for testing it and updating this ticket.

Adding a child rule locking the data at the site where the WF is running could solve this issue, but other ideas are appreciated.

I am not sure what you mean by "child rule" here. Can you please elaborate? Note that the default rule created by WMAgent is made on a CMS block basis, locking blocks being produced at the origin site/RSE. Sometimes workflows will be configured to have a CMS dataset (container) rule as well. Those are created for the whole dataset, and the RSE expression is pretty generic (basically any T1/T2 Disk).

It would be great if you could clarify how long it can take to set a rule to SUSPENDED as well (minimum/maximum time period). This is one of the pieces that I have not managed to make sense of yet.

jhonatanamado commented 1 year ago

Hi @amaltaro ,

It would be great if you could clarify how long it can take to set a rule to SUSPENDED as well (minimum/maximum time period). This is one of the pieces that I have not managed to make sense of yet.

Well, here is where Rucio updates the rule to SUSPENDED:

        if rule.stuck_at < (datetime.utcnow() - timedelta(days=14)):
            rule.state = RuleState.SUSPENDED
            insert_rule_history(rule=rule, recent=True, longterm=False, session=session)
            logger(logging.INFO, 'Replication rule %s has been SUSPENDED', rule_id)
            return

and it will stay in that state unless someone/something updates the rule to stuck/replicating. And here is where no new DIDs will be attached to a suspended rule:

                rules = session.query(models.ReplicationRule).filter(
                    or_(*rule_clauses),
                    models.ReplicationRule.state != RuleState.SUSPENDED,
                    models.ReplicationRule.state != RuleState.WAITING_APPROVAL,
                    models.ReplicationRule.state != RuleState.INJECT).with_for_update(nowait=True).all()
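
For reference, the manual "update to stuck" mentioned above is presumably done with the rule-update CLI, e.g.:

# flip a SUSPENDED rule back to STUCK so the daemon re-evaluates it
rucio update-rule --stuck 33e1c1d3330b4d23b6bde943a0cae2e8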

But if wma_prod is creating rules at the CMS block level at the site where the WF is running, then it is something else that is detaching the files. We will continue investigating.

amaltaro commented 1 year ago

But if wma_prod is creating rules at the CMS block level at the site where the WF is running, then it is something else that is detaching the files. We will continue investigating.

@jhonatanamado Jhonatan, but didn't you prove that, if files get added to a Rucio dataset that has a SUSPENDED rule, those files will not be considered by the rule (in any of the 3 states: ok/stuck/replicating), thus leaving such files eligible for deletion by one of the Rucio daemons?

haozturk commented 1 year ago

In the relevant email thread, @ericvaandering asked for a recent example. The most recent I could find is from April:

jhonatanamado commented 1 year ago

hi @amaltaro .

but didn't you prove that, if files get added to a Rucio dataset that has a SUSPENDED rule, those files will not be considered by the rule (in any of the 3 states: ok/stuck/replicating), thus leaving such files eligible for deletion by one of the Rucio daemons?

Yes, indeed we proved that if files get added to a Rucio dataset or Rucio container that has a SUSPENDED rule, those files are eligible for deletion by one of the Rucio daemons or CC. But I did not consider that production is creating Rucio dataset rules at the site where the WF is running. So those files are attached to a Rucio dataset rule whose rse_expression is the same place where the WF is running; there is no need to transfer files. So the Rucio dataset rule should in principle stay in state OK the whole time.

Now, I don't know what Rucio does for unique-replica files that need to be transferred, i.e. will a unique file be detached from the content if it has been replicated several times and the transfers fail repeatedly for a long time, leaving suspended rules?

jhonatanamado commented 1 year ago

Thanks @haozturk. Indeed those files gave me enough clues about what happened (at least for those particular files).

Rucio has an active rule for that rucio-dataset

ID                                ACCOUNT    SCOPE:NAME                                                                                  STATE[OK/REPL/STUCK]    RSE_EXPRESSION    COPIES    EXPIRES (UTC)    CREATED (UTC)
--------------------------------  ---------  ------------------------------------------------------------------------------------------  ----------------------  ----------------  --------  ---------------  -------------------
81fb482053ca4ae298eb2ab285dfc480  wma_prod   cms:/EGamma/Run2018D-UL2018_MiniAODv2_GT36-v3/MINIAOD#b842335e-27ba-41c6-bd37-f0234bbf31fc  OK[99/0/0]              T2_HU_Budapest    1                          2023-04-04 19:20:56

The files were declared as missing by CC at T2_HU_Budapest on 2023_04_08_08_06 (here is the link). In general terms, if a file is declared as missing, CC will then declare it as "bad" to Rucio, and Rucio will re-replicate it.

If the file is unique (which seems to be the case here, because the file was generated at Budapest), it will be detached from its datasets/containers because it cannot be restored. Now the question is: why were those files declared as missing? @ivmfnal @ericvaandering Any ideas?

ericvaandering commented 1 year ago

Thanks for this, this is illuminating (I think).

As far as I can tell they were declared missing because ... they were missing.

I've gone through the dark lists at the site after April 8 and these files never show up. If the CC erroneously declared the files missing, but they were not missing, you'd expect them to appear on the dark list some time later.

So I would say https://github.com/dmwm/WMCore/issues/11556#issuecomment-1532147169 seems like the likely explanation (or at least fits all the data I am aware of).

jhonatanamado commented 1 year ago

Thanks @ericvaandering . I went through your suggestion and indeed you are correct. The file was missing in the first place. At least that's what FTS shows.

So let's recap for this particular file.

/store/data/Run2018D/EGamma/MINIAOD/UL2018_MiniAODv2_GT36-v3/2550001/2137D153-48A6-D245-BF72-78183B05A1E6.root

From content_hist we can get when the file was created and deleted (detached).

{'scope': 'cms', 'name': '/store/data/Run2018D/EGamma/MINIAOD/UL2018_MiniAODv2_GT36-v3/2550001/2137D153-48A6-D245-BF72-78183B05A1E6.root', 
'type': 'FILE', 'bytes': 2152115154, 'adler32': 'a7e91744', 'md5': None, 'deleted_at': '2023-04-08 08:21:51', 'created_at': '2023-04-05 03:45:01', 
'updated_at': '2023-04-05 03:45:13'}

We can agree that the file was detached because CC marked it as missing at Budapest.

Checking the historical behavior of transfers for that particular file (request_hist table), transfers started to fail with a File not found error from T2_HU_Budapest to T1_US_FNAL_Disk just a couple of minutes after the file was created. The transfer started at 'started_at': '2023-04-05 03:46:16' and returned File not found after a couple of seconds.

{'id': 'bb16fc7d6f1b4c84b3f0c30d6ad3dca2', 'request_type': 'T', 'scope': 'cms', 'name': '/store/data/Run2018D/EGamma/MINIAOD/UL2018_MiniAODv2_GT36-v3/2550001/2137D153-48A6-D245-BF72-78183B05A1E6.root', 
'did_type': 'F', 'dest_rse_id': '087ee3383b9d45f6b31814af07b2c56d', 'source_rse_id': '9f13e775f8104689956f75b7695bf586', 
'attributes': '{"activity": "Data rebalancing", "source_replica_expression": null, "lifetime": null, "ds_scope": "cms", "ds_name": "/EGamma/Run2018D-UL2018_MiniAODv2_GT36-v3/MINIAOD#b842335e-27ba-41c6-bd37-f0234bbf31fc", "bytes": 2152115154, "md5": null, "adler32": "a7e91744", "priority": 3, "allow_tape_source": true}', 
'state': 'F', 'external_id': '61da3224-d364-11ed-9990-3cecef436a36', 'external_host': 'https://cmsfts3.fnal.gov:8446', 'retry_count': 0, 
'err_msg': 'RequestErrMsg.TRANSFER_FAILED:SOURCE [2] Result HTTP 404 : File not found  after 1 attempts', 'previous_attempt_id': None, 
'rule_id': '24ba7b177e4c4b87b6f382d152913ec3', 'activity': 'Data rebalancing', 'bytes': 2152115154, 'md5': None, 'adler32': 'a7e91744', 
'dest_url': 'davs://cmsdcadisk.fnal.gov:2880/dcache/uscmsdisk/store/data/Run2018D/EGamma/MINIAOD/UL2018_MiniAODv2_GT36-v3/2550001/2137D153-48A6-D245-BF72-78183B05A1E6.root', 
'submitted_at': '2023-04-05 03:46:01', 'started_at': '2023-04-05 03:46:16', 'transferred_at': '2023-04-05 03:46:17', 'estimated_at': None, 'submitter_id': None, 
'estimated_started_at': None, 'estimated_transferred_at': None, 'staging_started_at': None, 'staging_finished_at': None, 
'account': 'transfer_ops', 'requested_at': None, 'priority': None, 'updated_at': '2023-04-05 03:46:48', 'created_at': '2023-04-05 03:45:13', 'transfertool': None}

What DataManagement can deploy/monitor with the help of ElasticSearch is a search for recent entries with File not found errors, filtered on files created during the last 24h. This would at least give some fresh hints on what's going on.

ericvaandering commented 1 year ago

Finding these earlier would be good, but unless you can use that to figure out why they are disappearing, it's not going to fundamentally change the situation.

@amaltaro Kevin and I were talking about this yesterday, and I think what we ultimately need is:

  1. Rucio generates a message on these detachments (or, if we change that behavior, when replicas are LOST).
     a. It probably does this already, but we need to verify.
  2. Then we either need to alert Ops to this (1st step) or, alternatively, take all the actions in CMS to recover from it:
     a. Detach the file from Rucio if needed.
     b. Invalidate the file in DBS.
     c. If it's desired (e.g. real data processing), get WMAgent to reprocess the data needed to reproduce the LOST file.

ivmfnal commented 1 year ago

I think that lost-file detachment is a separate issue from needing to re-generate, if possible, the lost file. I do not think it is necessary or even sensible to alter file metadata (including file dataset membership information) based on replica availability. Just because the last replica known to Rucio is gone does not actually mean the file is lost. What if it is being recovered from tape backup, or is being recreated, or is being transferred by FTS? I think that file metadata should be separated from replica availability.

As for regenerating a file, I think it should be fairly easy to get Rucio to generate a list of files (perhaps from a dataset or a container) which do not have any good replicas. Then these replicas can be re-generated or restored, if possible, and declared to Rucio. If re-generation is not possible, then steps 2.a and 2.b will need to be taken.

So here is what I am proposing, assuming Rucio does not detach lost files:

  1. Periodically, generate the list of lost files.
  2. For each lost file, if it is possible to recreate a replica:
     a. recreate the replica
     b. declare the replica to Rucio

     If recreating a replica is not possible:
     c. detach the file from Rucio if needed
     d. invalidate the file in DBS

ericvaandering commented 1 year ago

Yes, detachment and lost replicas are separate issues. Except something is not declared LOST if there is a copy on tape. Nor can a file be truly recreated (size/checksum). The physics content can be recreated by the WM system, not the actual file. That's what I meant.

What you are proposing isn't so different from what I would propose initially, except for the decision point, which does not exist (see above). Eventually, if the steps are known, there is no reason not to automate this. And my assumption (a good one, I think) is that if a file is LOST, there is no other course of action but to detach it and declare it invalid.

The whole reason we have this mismatch is because Rucio thinks it is the "truth" of dataset contents and CMS has assumed DBS is the truth.

(The reason I said we might recreate the content of the file and not "will" is because if it's MC, we probably don't care enough to go through the hassle. And if it's RAW, we cannot.)

vkuznet commented 1 year ago

+1 to @ericvaandering , i.e. we need CMS Rucio code which will contact DBS, make Rucio decisions based on DBS info, and update DBS if necessary. This will require proper certificates and rules.

ivmfnal commented 1 year ago

Yes, detachment and lost replicas are separate issues. Except something is not declared LOST if there is a copy on tape. Nor can a file be truly recreated (size/checksum). The physics content can be recreated by the WM system, not the actual file. That's what I meant.

I was talking hypothetically to illustrate the fact that just because there are no more replicas known to Rucio does not always mean there is no replica somewhere unknown to Rucio, or that one cannot be recreated.

My point is that the decision about detaching the file and invalidating it in DBS should not be made by Rucio.

ericvaandering commented 1 year ago

My view of this is that it’s a micro service since it may need information and actions spanning Rucio, DBS, and WMAgent.


vkuznet commented 1 year ago

@ericvaandering , I like the idea of a micro-service, but it is unclear what the request flow is. For example:

  1. the new MicroService (MS) will be a proxy between DBS/Rucio/WMAgent
  2. WMAgent will place requests through the MS
  3. the MS will talk to DBS and Rucio to place requests on behalf of WMAgent

But if that is the case, it is unclear why we need the MS, since WMAgent can talk to both DBS and Rucio, and Rucio can talk to DBS. The point is that we should clearly provide an architectural diagram with all the requests. For instance, it seems to me that Rucio must talk to DBS before making a decision to delete a replica; why do we need an MS for that? I suggest that we draw all the requests on a board and then decide if we really need an MS (as it may be overkill for the request flow). At the moment I think the missing part is in Rucio only, since it never takes DBS into account, and it could be simply solved by a wrapper (or even a decorator function) around the appropriate delete API, i.e. the delete API would consult DBS and then delete the replica.

ericvaandering commented 1 year ago

We’re not going to be able to build an explicit communication channel into Rucio to connect to DBS. That’s a non-starter.

What we can probably ask for is a configuration option to not detach a replica if the last copy is lost. In that case, a message should be emitted from Rucio to an MQ. Something should then listen for that message, gather CMS-specific information from the various CMS services, and decide on a course of action. It may then have to take action with several different CMS services to restore consistency.


vkuznet commented 1 year ago

Alan gave me some pointers and I would like to update you on my findings:

stderr: Run: [ERROR] Operation expired: (destination) [ERROR] Server responded with an error: [3013] No checksum available for this file.

    ClassName : None
    ModuleName : WMCore.Storage.StageOutError
    MethodName : __init__
    ClassInstance : None
    FileName : /srv/job/WMCore.zip/WMCore/Storage/StageOutError.py
    LineNumber : 32
    ErrorNr : 0
    Command : LOCAL_SIZE=`stat -c%s "/srv/job/WMTaskSpace/cmsRun5/AODSIMoutput.root"`

...
Command exited non-zero, ExitCode:151
Output: stdout:
Local File Size is: 8393858
Remote File Size is: 0
Local File Checksum is: c3e0a807
Remote File Checksum is:
ERROR: Size or Checksum Mismatch between local and SE

stderr: Run: [ERROR] Operation expired: (destination) [ERROR] Server responded with an error: [3013] No checksum available for this file.

ERROR:root:Attempt 0 to stage out failed.
...
echo "Local File Size is: $LOCAL_SIZE"
xrdcp --force --nopbar -p --cksum adler32:c3e0a807 "/srv/job/WMTaskSpace/cmsRun5/AODSIMoutput.root" "root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/unmerged/RunIISummer20UL18RECO/G1Jet_LHEGpT-50To100_PhotonInBarrel_Mjj-300_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_upgrade2018_realistic_v11_L1v1-v2/50568/172D2C31-CA56-8949-8FD4-50BCA1D033F5.root"
REMOTE_SIZE=`xrdfs 'cmsdcadisk.fnal.gov' stat '/dcache/uscmsdisk/store/unmerged/RunIISummer20UL18RECO/G1Jet_LHEGpT-50To100_PhotonInBarrel_Mjj-300_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_upgrade2018_realistic_v11_L1v1-v2/50568/172D2C31-CA56-8949-8FD4-50BCA1D033F5.root' | grep Size | sed -r 's/.*Size:[ ]*([0-9]+).*/\1/'`
echo "Remote File Size is: $REMOTE_SIZE"
echo "Local File Checksum is: c3e0a807"
REMOTE_XS=`xrdfs 'cmsdcadisk.fnal.gov' query checksum '/dcache/uscmsdisk/store/unmerged/RunIISummer20UL18RECO/G1Jet_LHEGpT-50To100_PhotonInBarrel_Mjj-300_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_upgrade2018_realistic_v11_L1v1-v2/50568/172D2C31-CA56-8949-8FD4-50BCA1D033F5.root' | grep -i adler32 | sed -r 's/.*[adler|ADLER]32[ ]*([0-9a-fA-F]{8}).*/\1/'`
echo "Remote File Checksum is: $REMOTE_XS"
if [ $REMOTE_SIZE ] && [ $REMOTE_XS ] && [ $LOCAL_SIZE == $REMOTE_SIZE ] && [ 'c3e0a807' == $REMOTE_XS ]; then
    exit 0
else
    echo "ERROR: Size or Checksum Mismatch between local and SE"
    xrdfs 'cmsdcadisk.fnal.gov' rm /dcache/uscmsdisk/store/unmerged/RunIISummer20UL18RECO/G1Jet_LHEGpT-50To100_PhotonInBarrel_Mjj-300_TuneCP5_13TeV-amcatnlo-pythia8/AODSIM/106X_upgrade2018_realistic_v11_L1v1-v2/50568/172D2C31-CA56-8949-8FD4-50BCA1D033F5.root
    exit 60311
fi

ExitCode : 151
ErrorCode : 60311
ErrorType : GeneralStageOutFailure



Since we have different agents involved, I need more information on what we should look for in the log files. The more precise a regexp you can provide, the easier and faster we can find this needle. Maybe we should look for a specific code or message. Please provide any valuable input you can give.
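
In the meantime, here is the kind of scan I have in mind, as a rough sketch (assuming the standard JobArchiver logDir layout; the regexp is only a guess at the signature we are after):

# scan archived job logs for silent gfal failures (sketch; adjust paths/patterns)
for tarball in install/wmagentpy3/JobArchiver/logDir/*/*/JobCluster_*/Job_*.tar.bz2; do
    if tar -xjf "$tarball" -O 2>/dev/null | grep -qE 'gfal-(copy|rm) error|HTTP 500 : Unexpected server error'; then
        echo "MATCH: $tarball"
    fi
done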
amaltaro commented 1 year ago

@vkuznet Valentin, Andress provided a list of files in the #wmagent-dev slack channel. I am continuing to debug the data lost at Rome, so here is one of those use cases:

    "name": "/store/mc/Run3Winter23Digi/SinglePion_E-50_Eta-0to3-pythia8-gun/GEN-SIM-RAW/NoPUGTv4_126X_mcRun3_2023_forPU65_v4-v2/60004/5c301e42-561c-4489-a2ba-235d00ff0058.root",
    "dest_rse": "T1_UK_RAL_Disk",
    "src_rse": "T2_IT_Rome",
    "err_msg": "RequestErrMsg.TRANSFER_FAILED:SOURCE [2] Result HTTP 404 : File not found  after 1 attempts",
    "tx_requested_at": "2023-05-20 22:08:14",
    "tx_created_at": "2023-05-20 22:08:14",
    "tx_submitted_at": "2023-05-20 22:08:15",
    "tx_started_at": "2023-05-20 22:08:17",
    "tx_transferred_at": "2023-05-20 22:08:18",
    "tx_updated_at": "2023-05-20 22:08:58",
    "content_updated_at": "2023-05-18 04:17:24",
    "content_created_at": "2023-05-18 04:17:22"

Mapping the file to its dataset and then to the workflow, here is the running-closed workflow writing to that dataset: cmsunified_task_HCA-Run3Winter23GS-00001__v1_T_230517_182344_1871

which is running from a single agent:

agent url: cmsgwms-submit6.fnal.gov

If we scan the job logs in submit6 for that file id in the wmagentJob.log log file - using this script [^] - it spits out the following log:

  MATCH! LogArchive /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_HCA-Run3Winter23GS-00001__v1_T_230517_182344_1871/JobCluster_2706/Job_2706067.tar.bz2

Which indeed has something VERY suspicious in the log, which I also paste here:

2023-05-18 06:03:50,553:INFO:StageOutImpl:Running the stage out...
2023-05-18 06:04:04,404:INFO:StageOutImpl:Command exited with status: 0
Output message: stdout: Thu May 18 06:03:50 CEST 2023
Copying 3227887890 bytes file:///srv/job/WMTaskSpace/cmsRun1/Merged.root => davs://cmsrm-xrootd01.roma1.infn.it:2880/mc/Run3Winter23Digi/SinglePion_E-50_Eta-0to3-pythia8-gun/GEN-SIM-RAW/NoPUGTv4_126X_mcRun3_2023_forPU65_v4-v2/60004/5c301e42-561c-4489-a2ba-235d00ff0058.root
gfal-copy exit status: 0

stderr: /srv/startup_environment.sh: line 8: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 15: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 36: EUID: readonly variable
/srv/startup_environment.sh: line 205: PPID: readonly variable
/srv/startup_environment.sh: line 213: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 230: UID: readonly variable
gfal-copy error: 256 (Unknown error 256) - TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 500 : Unexpected server error: 500  (destination)

[^] https://github.com/amaltaro/ProductionTools/blob/master/untarLogArchive.py

which can be executed like:

python3 untarLogArchive.py /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_HCA-Run3Winter23GS-00001__v1_T_230517_182344_1871 5c301e42-561c-4489-a2ba-235d00ff0058.root

vkuznet commented 1 year ago

Alan, I confirm that I see such errors too:

gfal-copy error: 5 (Input/output error) - DESTINATION OVERWRITE   Result HTTP 500 : Unexpected server error: 500  after 1 attempts
/srv/startup_environment.sh: line 9: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 16: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 35: EUID: readonly variable
/srv/startup_environment.sh: line 156: PPID: readonly variable
/srv/startup_environment.sh: line 163: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 181: UID: readonly variable
gfal-rm error: 5 (Input/output error) - Result HTTP 500 : Unexpected server error: 500  after 1 attempts

in agents for the cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_294 workflow. It is a very common error which appears in many logs across all agents. For instance, on vocms0283:/data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766/JobCluster_294/Job_294596.tar.bz2

vkuznet commented 1 year ago

I scanned a few agents and I see HTTP 500 errors in all of them. I copied a few log files to this location /afs/cern.ch/user/v/valya/public/WMCore if anyone would like to look. I observed many stage-out failures (not only those related to the HTTP 500 error) and they are quite frequent; the errors related to the 500 message come from gfal calls. Based on this observation, it may be worth re-thinking how we handle such errors so they are easy to access. For instance, we could introduce code which sends error messages to MONIT via AMQ brokers, such that someone can later view them in ES/OpenSearch. Such an error message may contain the block name, LFN, agent, its log, the error message, etc. Then the look-up of errors would be simplified by searching ES/OpenSearch, and moreover we could provide a corresponding dashboard for easy navigation.