dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0
46 stars 107 forks source link

MSRuleCleaner deleting rules for rejected wfs that are being used by other workflows. #12164

Open hassan11196 opened 1 week ago

hassan11196 commented 1 week ago

Impact of the bug Inform which systems get affected by this bug. Which agent(s)? Which central service(s)? MSRuleCleaner, MSTransferor, ReqMgr2

Describe the bug

There are three wfs concerning this situation in order of injection to RegMgr2.

wf1 - pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241024_121809_606

new: 2024-10-24 12:18:09 UTC 
assignment-approved: 2024-10-24 12:18:14 UTC
rejected: 2024-10-24 12:22:05 UTC 
rejected-archived: 2024-10-26 16:56:46 UTC

wf2 - pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241024_211828_4633

new: 2024-10-24 21:18:28 UTC 
assignment-approved: 2024-10-24 21:18:35 UTC 
assigned: 2024-10-24 22:02:27 UTC 
staging: 2024-10-24 22:04:03 UTC 
aborted: 2024-11-05 16:20:00 UTC

wf3 - pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_Prompt_GTValidation_RelVal_2024D_241024_212236_6565

new: 2024-10-24 21:22:36 UTC
assignment-approved: 2024-10-24 21:22:42 UTC 
assigned: 2024-10-24 22:03:10 UTC 
staging: 2024-10-24 22:04:06 UTC 
aborted: 2024-11-05 14:02:36 UTC 

Now Keep in mind that all these wfs require the same input blocks i.e 

[{'scope': 'cms', 'name': '/JetMET0/Run2024D-v1/RAW#9d364486-0ad1-4207-812f-29f23249e197'},
{'scope': 'cms', 'name': '/JetMET0/Run2024D-v1/RAW#4a41434d-4d35-4c12-b6c3-037e172642d8'}]

and that wf1 was rejected 4 mins after injection, before it was even assigned, and therefore it never went to staging state and thus no ms-transferor entry or rule creation in rucio.

Now Nearly, 9 hours later wf2 is injected and shortly (~mins) after wf3 is also injected

The follwing rules are created for wf2 by ms-transferor 

['3353ed719bf446d38ccb739ba1076642', '6595ca5e42204e4c913e549e0ba24205']

transferor - 1st Create (This aligns with rucio block rules history I pasted above)

reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-2024-10-24 22:04:02,376:INFO:MSTransferor: Creating rule for workflow pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241024_211828_4633 with 2 DIDs in container /JetMET0/Run2024D-v1/RAW, RSEs: T1_US_FNAL_Disk, grouping: ALL
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log:2024-10-24 22:04:02,484:INFO:MSTransferor: Rules successful created for /JetMET0/Run2024D-v1/RAW : ['3353ed719bf446d38ccb739ba1076642', '6595ca5e42204e4c913e549e0ba24205']

Now, MS-transferor processes wf3, but the rules are already created. So no new rules are created and MS-transferors keeps the same rule ids

transferor - 2nd Try Create 

reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-2024-10-24 22:04:05,676:INFO:MSTransferor: Creating rule for workflow pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_Prompt_GTValidation_RelVal_2024D_241024_212236_6565 with 2 DIDs in container /JetMET0/Run2024D-v1/RAW, RSEs: T1_US_FNAL_Disk, grouping: ALL
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log:2024-10-24 22:04:05,740:WARNING:Rucio: Resolving duplicate rules and separating every DID in a new rule...
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-2024-10-24 22:04:05,790:WARNING:Rucio: Found duplicate rule for account: wmcore_transferor
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-, rseExp: T1_US_FNAL_Disk
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-dids: [{'scope': 'cms', 'name': '/JetMET0/Run2024D-v1/RAW#9d364486-0ad1-4207-812f-29f23249e197'}, {'scope': 'cms', 'name': '/JetMET0/Run2024D-v1/RAW#4a41434d-4d35-4c12-b6c3-037e172642d8'}]
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-2024-10-24 22:04:05,849:WARNING:Rucio: Found duplicate rule for account: wmcore_transferor
reqmgr2ms-20241024-ms-transferor-574c4c9f57-drtk2.log-, rseExp: T1_US_FNAL_Disk 

Everything goes smoothly, wf2 and wf3 are staging, BUT MS-Rulecleanor picks up rejected wf1 sees that it has rucio rules and it deletes them.

rulecleaner - 1st Delete

reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log-2024-10-24 22:14:27,830:DEBUG:MSRuleCleaner: Dispatching workflow: pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241024_121809_606 
reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log-2024-10-24 22:14:28,189:INFO:MSRuleCleaner: plineMSTrBlock: Deleting ruleId: 6595ca5e42204e4c913e549e0ba24205 
reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log-2024-10-24 22:14:28,217:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "DELETE /rules/6595ca5e42204e4c913e549e0ba24205 HTTP/1.1" 200 0
reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log:2024-10-24 22:14:28,217:INFO:MSRuleCleaner: plineMSTrBlock: Deleting ruleId: 3353ed719bf446d38ccb739ba1076642 
reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log-2024-10-24 22:14:28,236:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "DELETE /rules/3353ed719bf446d38ccb739ba1076642 HTTP/1.1" 200 0

The rest two deletes were for these 2 wfs,  pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241022_092626_4354 pdmvserv_RVCMSSW_14_0_18RunJetMET02024D__2024D_ReReco_GTValidation_RelVal_2024D_241022_101534_7605 They also share the same input Blocks as above 3 wfs but difference is that they were placed at CERN.

rulecleaner - to Delete again 1 second later

reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log:2024-10-24 22:14:29,027:INFO:MSRuleCleaner: plineMSTrBlock: Deleting ruleId: 3353ed719bf446d38ccb739ba1076642 

rulecleaner - Delete again 3 second later again

reqmgr2ms-20241024-ms-rulecleaner-fc79648b4-qtrf9.log:2024-10-24 22:14:32,810:INFO:MSRuleCleaner: plineMSTrBlock: Deleting ruleId: 3353ed719bf446d38ccb739ba1076642  

Conclusion, wf1's rejection lead to rules deletion for wf1 and wf2 because ** MS-Rulecleanor does not check if the rule is being used by any other workflow or not. I went through the code and found that it only checks the current wfs state, https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSRuleCleaner/MSRuleCleaner.py#L323

How to reproduce it Steps to reproduce the behavior: submit 1 request requiring a certain block of data, reject it and resubmit another request requiring the same blocks in the timeframe where the oriignal rucio rules are not deleted.

Expected behavior A clear and concise description of what you expected to happen. MSRuleCleaner to not delete rules being used by active workflows.

Additional context and error message Add any other context about the problem here, like error message and/or traceback. You might want to use triple back ticks to properly format it. https://its.cern.ch/jira/browse/CMSPROD-227

hassan11196 commented 1 week ago

I will follow up with a pull requst for this issue.