new: 2024-10-24 21:18:28 UTC
assignment-approved: 2024-10-24 21:18:35 UTC
assigned: 2024-10-24 22:02:27 UTC
staging: 2024-10-24 22:04:03 UTC
aborted: 2024-11-05 16:20:00 UTC
new: 2024-10-24 21:22:36 UTC
assignment-approved: 2024-10-24 21:22:42 UTC
assigned: 2024-10-24 22:03:10 UTC
staging: 2024-10-24 22:04:06 UTC
aborted: 2024-11-05 14:02:36 UTC
Now Keep in mind that all these wfs require the same input blocks i.e
and that
wf1 was rejected 4 mins after injection, before it was even assigned, and therefore it never went to staging state and thus no ms-transferor entry or rule creation in rucio.
Now Nearly, 9 hours later wf2 is injected and shortly (~mins) after wf3 is also injected
The follwing rules are created for wf2 by ms-transferor
transferor - 1st Create (This aligns with rucio block rules history I pasted above)
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-2024-10-24 22:04:02,376:INFO:MSTransferor: Creating rule for workflow pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241024_211828_4633 with 2 DIDs in container /JetMET0/Run2024D-v1/RAW, RSEs: T1_US_FNAL_Disk, grouping: ALL
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log:2024-10-24 22:04:02,484:INFO:MSTransferor: Rules successful created for /JetMET0/Run2024D-v1/RAW : ['3353ed719bf446d38ccb739ba1076642', '6595ca5e42204e4c913e549e0ba24205']
Now, MS-transferor processes wf3, but the rules are already created. So no new rules are created and MS-transferors keeps the same rule ids
transferor - 2nd Try Create
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-2024-10-24 22:04:05,676:INFO:MSTransferor: Creating rule for workflow pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_Prompt_GTValidation_RelVal_2024D_241024_212236_6565 with 2 DIDs in container /JetMET0/Run2024D-v1/RAW, RSEs: T1_US_FNAL_Disk, grouping: ALL
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log:2024-10-24 22:04:05,740:WARNING:Rucio: Resolving duplicate rules and separating every DID in a new rule...
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-2024-10-24 22:04:05,790:WARNING:Rucio: Found duplicate rule for account: wmcore_transferor
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-, rseExp: T1_US_FNAL_Disk
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-dids: [{'scope': 'cms', 'name': '/JetMET0/Run2024D-v1/RAW#9d364486-0ad1-4207-812f-29f23249e197'}, {'scope': 'cms', 'name': '/JetMET0/Run2024D-v1/RAW#4a41434d-4d35-4c12-b6c3-037e172642d8'}]
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-2024-10-24 22:04:05,849:WARNING:Rucio: Found duplicate rule for account: wmcore_transferor
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-, rseExp: T1_US_FNAL_Disk
Everything goes smoothly, wf2 and wf3 are staging, BUT
MS-Rulecleanor picks up rejected wf1 sees that it has rucio rules and it deletes them.
The rest two deletes were for these 2 wfs,
pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241022_092626_4354
pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241022_101534_7605
They also share the same input Blocks as above 3 wfs but difference is that they were placed at CERN.
How to reproduce it
Steps to reproduce the behavior:
submit 1 request requiring a certain block of data, reject it and resubmit another request requiring the same blocks in the timeframe where the oriignal rucio rules are not deleted.
Expected behavior
A clear and concise description of what you expected to happen.
MSRuleCleaner to not delete rules being used by active workflows.
Additional context and error message
Add any other context about the problem here, like error message and/or traceback. You might want to use triple back ticks to properly format it.
https://its.cern.ch/jira/browse/CMSPROD-227
Impact of the bug Inform which systems get affected by this bug. Which agent(s)? Which central service(s)? MSRuleCleaner, MSTransferor, ReqMgr2
Describe the bug
There are three wfs concerning this situation in order of injection to RegMgr2.
wf1 - pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241024_121809_606
wf2 - pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241024_211828_4633
wf3 - pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_Prompt_GTValidation_RelVal_2024D_241024_212236_6565
Now Keep in mind that all these wfs require the same input blocks i.e
and that wf1 was rejected 4 mins after injection, before it was even assigned, and therefore it never went to staging state and thus no ms-transferor entry or rule creation in rucio.
Now Nearly, 9 hours later wf2 is injected and shortly (~mins) after wf3 is also injected
The follwing rules are created for wf2 by ms-transferor
transferor - 1st Create (This aligns with rucio block rules history I pasted above)
Now, MS-transferor processes wf3, but the rules are already created. So no new rules are created and MS-transferors keeps the same rule ids
transferor - 2nd Try Create
Everything goes smoothly, wf2 and wf3 are staging, BUT MS-Rulecleanor picks up rejected wf1 sees that it has rucio rules and it deletes them.
rulecleaner - 1st Delete
The rest two deletes were for these 2 wfs, pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241022_092626_4354 pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241022_101534_7605 They also share the same input Blocks as above 3 wfs but difference is that they were placed at CERN.
rulecleaner - to Delete again 1 second later
rulecleaner - Delete again 3 second later again
Conclusion, wf1's rejection lead to rules deletion for wf1 and wf2 because ** MS-Rulecleanor does not check if the rule is being used by any other workflow or not. I went through the code and found that it only checks the current wfs state, https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSRuleCleaner/MSRuleCleaner.py#L323
How to reproduce it Steps to reproduce the behavior: submit 1 request requiring a certain block of data, reject it and resubmit another request requiring the same blocks in the timeframe where the oriignal rucio rules are not deleted.
Expected behavior A clear and concise description of what you expected to happen. MSRuleCleaner to not delete rules being used by active workflows.
Additional context and error message Add any other context about the problem here, like error message and/or traceback. You might want to use triple back ticks to properly format it. https://its.cern.ch/jira/browse/CMSPROD-227