dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Run WMCore central services validation for HG2207 #11197

Closed todor-ivanov closed 2 years ago

todor-ivanov commented 2 years ago

Impact of the new feature
WMCore central services

Is your feature request related to a problem? Please describe.
Monthly task

Describe the solution you'd like
Validate central services in cmsweb-testbed (well, it might have to be in one of our VMs due to the current Rucio setup) and provide the final feedback by the June deadline specified by the CMSWEB team.

It also includes the creation of the service release notes and the validation check-list twiki.

Describe alternatives you've considered
None

Additional context
None

todor-ivanov commented 2 years ago

The validation workflows have been injected.

todor-ivanov commented 2 years ago

And we've run into a more global issue - both CERN and FNAL have run out of space:

 2022-06-30 19:19:59,219:INFO:MSTransferor: List of out-of-space RSEs dropped for 'tivanov_ReReco_RunBlockWhite_HG2207_Val_220630_171417_6069' is: {'T1_US_FNAL_Disk', 'T2_CH_CERN'}

todor-ivanov commented 2 years ago

In the end I had to edit the MSTransferor code in place to skip the quota check and let the service create the rules itself. Those datasets are really small and would not affect the system in any way. 6 of the 12 workflows stuck in assigned are already moving. But there are another 5 which are impossible to rescue because their data has been deleted from Rucio. Here is the error [1] and here is the content history of the datasets [2]. So I am aborting/rejecting those 5 workflows now.

[1]

2022-07-05 01:51:21,362:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /RelValQCD_Pt_600_800_14/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/RelValQCD_Pt_600_800_14/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Mon, 04 Jul 2022 23:51:21 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2022-07-05 01:51:21,363:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /RelValTTbar_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/RelValTTbar_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Mon, 04 Jul 2022 23:51:21 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2022-07-05 01:51:21,369:ERROR:PycurlRucio: Failure in getBlocksAndSizeRucio function for container /NoBPTX/Run2016F-23Sep2016-v1/DQMIO. Response: {'url': 'http://cms-rucio.cern.ch/dids/cms/dids/search?type=dataset&long=True&name=/NoBPTX/Run2016F-23Sep2016-v1/DQMIO%23%2A', 'data': '', 'headers': 'HTTP/1.1 200 OK\r\nDate: Mon, 04 Jul 2022 23:51:21 GMT\r\nContent-Type: application/x-json-stream\r\nContent-Length: 0\r\nConnection: keep-alive\r\nAccess-Control-Allow-Origin: None\r\nAccess-Control-Allow-Headers: None\r\nAccess-Control-Allow-Methods: *\r\nAccess-Control-Allow-Credentials: true\r\nCache-Control: post-check=0, pre-check=0\r\nPragma: no-cache\r\nX-Rucio-Host: cms-rucio.cern.ch\r\n\r\n'}
2022-07-05 01:51:21,449:WARNING:RequestInfo: Removing workflow that failed processing in MSTransferor: tivanov_DQMHarvesting_HG2207_Val_220630_171451_6004
2022-07-05 01:51:21,449:WARNING:RequestInfo: Removing workflow that failed processing in MSTransferor: tivanov_TaskChain_PUMCRecyc_HG2207_Val_220630_171515_4708
2022-07-05 01:51:21,449:WARNING:RequestInfo: Removing workflow that failed processing in MSTransferor: tivanov_DQMHarvesting_MultiRun_HG2207_Val_220630_171453_7939
2022-07-05 01:51:21,449:WARNING:RequestInfo: Removing workflow that failed processing in MSTransferor: tivanov_DQMHarvesting_LumiMask_HG2207_Val_220630_171452_6718
2022-07-05 01:51:21,449:WARNING:RequestInfo: Removing workflow that failed processing in MSTransferor: tivanov_SC_ReDigi_Harvest_Prod_HG2207_Val_220630_171503_9802
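Each of the three failures in the log above is an HTTP 200 reply with `Content-Length: 0`, meaning Rucio answered successfully but returned no blocks for the container. Below is a minimal sketch of how such a reply could be told apart from a transport error, assuming the raw header string has the shape shown in the log. The helper names `parse_headers` and `is_empty_success` are hypothetical and not part of WMCore.

```python
from typing import Dict, Tuple


def parse_headers(raw: str) -> Tuple[int, Dict[str, str]]:
    """Split a raw HTTP header blob into (status code, lower-cased header dict)."""
    lines = [line for line in raw.split("\r\n") if line]
    status = int(lines[0].split()[1])  # "HTTP/1.1 200 OK" -> 200
    headers = {}
    for line in lines[1:]:
        key, _, value = line.partition(":")
        headers[key.strip().lower()] = value.strip()
    return status, headers


def is_empty_success(raw: str) -> bool:
    """True when the server replied 200 OK but sent an empty body."""
    status, headers = parse_headers(raw)
    return status == 200 and headers.get("content-length") == "0"


# Header string abridged from the log above (hypothetical helper input).
sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/x-json-stream\r\n"
    "Content-Length: 0\r\n"
    "X-Rucio-Host: cms-rucio.cern.ch\r\n\r\n"
)
print(is_empty_success(sample))
```

A check like this only says that the search came back empty; the reason (here, deleted blocks) still has to be looked up separately.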

[2]

In [4]: list(rcl.list_content_history('cms', '/RelValTTbar_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM'))
Out[4]: 
[{'scope': 'cms',
  'name': '/RelValTTbar_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM#2ad9217c-002d-4725-809b-b54458f2a88f',
  'type': 'DATASET',
  'bytes': None,
  'adler32': None,
  'md5': None,
  'deleted_at': datetime.datetime(2022, 6, 15, 17, 41, 55),
  'created_at': datetime.datetime(2020, 11, 2, 15, 8, 45),
  'updated_at': datetime.datetime(2020, 11, 2, 15, 8, 53)}]
...
In [5]: list(rcl.list_content_history('cms', '/RelValQCD_Pt_600_800_14/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM'))
Out[5]: 
[{'scope': 'cms',
  'name': '/RelValQCD_Pt_600_800_14/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM#e08c7263-fd35-49e0-b1e8-aff38192e55e',
  'type': 'DATASET',
  'bytes': None,
  'adler32': None,
  'md5': None,
  'deleted_at': datetime.datetime(2022, 6, 15, 17, 41, 55),
  'created_at': datetime.datetime(2020, 11, 2, 17, 14, 6),
  'updated_at': datetime.datetime(2020, 11, 2, 17, 14, 23)}]
...
In [8]: list(rcl.list_content_history('cms', '/NoBPTX/Run2016F-23Sep2016-v1/DQMIO'))
Out[8]: 
[{'scope': 'cms',
  'name': '/NoBPTX/Run2016F-23Sep2016-v1/DQMIO#c386453a-c100-11e6-80ad-001e67abf094',
  'type': 'DATASET',
  'bytes': None,
  'adler32': None,
  'md5': None,
  'deleted_at': datetime.datetime(2022, 6, 29, 15, 16, 49),
  'created_at': datetime.datetime(2020, 9, 19, 6, 55, 1),
  'updated_at': datetime.datetime(2020, 9, 19, 7, 0, 39)},
 {'scope': 'cms',
  'name': '/NoBPTX/Run2016F-23Sep2016-v1/DQMIO#c3d28f34-c106-11e6-9206-001e67abefa8',
  'type': 'DATASET',
  'bytes': None,
  'adler32': None,
  'md5': None,
  'deleted_at': datetime.datetime(2022, 6, 28, 16, 59, 1),
  'created_at': datetime.datetime(2020, 9, 19, 6, 59, 48),
  'updated_at': datetime.datetime(2020, 9, 19, 7, 0, 39)}]
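The same check can be done programmatically. This is a small sketch, assuming records shaped like the `list_content_history` output above; `deleted_blocks` is a hypothetical helper name, and the sample data is copied from the /NoBPTX/Run2016F-23Sep2016-v1/DQMIO output with unused fields left out.

```python
import datetime


def deleted_blocks(history):
    """Map block name -> deletion time for every record that has a 'deleted_at' value."""
    return {
        rec["name"]: rec["deleted_at"]
        for rec in history
        if rec.get("deleted_at") is not None
    }


# Two records copied from the content-history output above (unused fields omitted).
history = [
    {"name": "/NoBPTX/Run2016F-23Sep2016-v1/DQMIO#c386453a-c100-11e6-80ad-001e67abf094",
     "deleted_at": datetime.datetime(2022, 6, 29, 15, 16, 49)},
    {"name": "/NoBPTX/Run2016F-23Sep2016-v1/DQMIO#c3d28f34-c106-11e6-9206-001e67abefa8",
     "deleted_at": datetime.datetime(2022, 6, 28, 16, 59, 1)},
]

# Print the deleted blocks, oldest deletion first.
for name, when in sorted(deleted_blocks(history).items(), key=lambda kv: kv[1]):
    print(when.isoformat(), name)
```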

todor-ivanov commented 2 years ago

We have yet another workflow, tivanov_DQMHarvest_RunWhitelist_HG2207_Val_220630_171416_4188, doomed to failure because of corrupted Rucio data:

2022-07-05 16:31:03,300:140554330814208:ERROR:WorkQueue:tivanov_DQMHarvest_RunWhitelist_HG2207_Val_220630_171416_4188, ['/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO']: 
failed to retrieve data from DBS/Rucio in LQ: 
<@========== WMException Start ==========@>
Exception Class: WMRucioDIDNotFoundException
Message: Data identifier not found in Rucio: /ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO#bf288630-b881-4b66-ba8d-ed592f846d01. Error: Data identifier not found.
Details: Data identifier 'cms:/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO#bf288630-b881-4b66-ba8d-ed592f846d01' not found
        ClassName : None
        ModuleName : WMCore.Services.Rucio.Rucio
        MethodName : isContainer
        ClassInstance : None
        FileName : /data/srv/wmagent/v2.0.4.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.4.patch1/lib/python3.8/site-packages/WMCore/Services/Rucio/Rucio.py
        LineNumber : 831
        ErrorNr : 0

Traceback: 
  File "/data/srv/wmagent/v2.0.4.patch1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.0.4.patch1/lib/python3.8/site-packages/WMCore/Services/Rucio/Rucio.py", line 828, in isContainer
    response = self.cli.get_did(scope=scope, name=didName)

  File "/data/srv/wmagent/v2.0.4.patch1/sw/slc7_amd64_gcc630/external/py3-rucio-clients/1.25.5-comp2/lib/python3.8/site-packages/rucio/client/didclient.py", line 428, in get_did
    raise exc_cls(exc_msg)

<@---------- WMException End ----------@>

There are blocks deleted from Rucio for this ZeroBias dataset. Here is what DBS tells us about the blocks in it:

$ dasgoclient --query="block dataset=/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO"
/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO#0d25a8ce-ea25-422c-bf86-a37174e7f6a3
/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO#2152dc25-2b31-4cee-b254-fbd6063cdba2
/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO#651cff8a-d0cc-4717-8d6e-1d949df8e4b2
/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO#8c936f17-811c-4c7b-9cdf-c6fc6351944c
/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO#bf288630-b881-4b66-ba8d-ed592f846d01
/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO#d60cb49e-4320-47ef-8860-4706a532f185

And here is what Rucio knows about its contents:

In [8]: list(rcl.list_content('cms', '/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO'))
Out[8]: 
[{'scope': 'cms',
  'name': '/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO#0d25a8ce-ea25-422c-bf86-a37174e7f6a3',
  'type': 'DATASET',
  'bytes': None,
  'adler32': None,
  'md5': None},
 {'scope': 'cms',
  'name': '/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO#2152dc25-2b31-4cee-b254-fbd6063cdba2',
  'type': 'DATASET',
  'bytes': None,
  'adler32': None,
  'md5': None},
 {'scope': 'cms',
  'name': '/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO#8c936f17-811c-4c7b-9cdf-c6fc6351944c',
  'type': 'DATASET',
  'bytes': None,
  'adler32': None,
  'md5': None}]
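The comparison between the DBS and Rucio listings above boils down to a set difference. Here is a minimal sketch using the block names from the two outputs; `missing_in_rucio` is a hypothetical helper, not a WMCore or Rucio function.

```python
def missing_in_rucio(dbs_blocks, rucio_content):
    """Return block names that DBS lists but Rucio's content listing lacks."""
    rucio_names = {rec["name"] for rec in rucio_content}
    return sorted(set(dbs_blocks) - rucio_names)


DATASET = "/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO"

# Block UUIDs from the dasgoclient output above.
dbs_blocks = [f"{DATASET}#{uuid}" for uuid in (
    "0d25a8ce-ea25-422c-bf86-a37174e7f6a3",
    "2152dc25-2b31-4cee-b254-fbd6063cdba2",
    "651cff8a-d0cc-4717-8d6e-1d949df8e4b2",
    "8c936f17-811c-4c7b-9cdf-c6fc6351944c",
    "bf288630-b881-4b66-ba8d-ed592f846d01",
    "d60cb49e-4320-47ef-8860-4706a532f185",
)]

# Names from the list_content output above (other fields omitted).
rucio_content = [{"name": f"{DATASET}#{uuid}"} for uuid in (
    "0d25a8ce-ea25-422c-bf86-a37174e7f6a3",
    "2152dc25-2b31-4cee-b254-fbd6063cdba2",
    "8c936f17-811c-4c7b-9cdf-c6fc6351944c",
)]

for block in missing_in_rucio(dbs_blocks, rucio_content):
    print(block)
```

With the data above, three of the six DBS blocks are absent from Rucio, including the bf288630-b881-4b66-ba8d-ed592f846d01 block named in the exception.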

And the block that is failing this workflow was indeed deleted a few days ago:

In [7]: list(rcl.list_content_history('cms', '/ZeroBias/Run2016B-UL16_ver2_forHarvestOnly-v1/DQMIO#bf288630-b881-4b66-ba8d-ed592f846d01'))
Out[7]: 
[{'scope': 'cms',
  'name': '/store/data/Run2016B/ZeroBias/DQMIO/UL16_ver2_forHarvestOnly-v1/240000/0DA9F16D-D19B-EC4C-967C-3C2D57F2A801.root',
  'type': 'FILE',
  'bytes': 25973862,
  'adler32': '442588bd',
  'md5': None,
  'deleted_at': datetime.datetime(2022, 7, 3, 2, 51, 39),
  'created_at': datetime.datetime(2020, 9, 27, 21, 47, 11),
  'updated_at': datetime.datetime(2020, 9, 27, 22, 9, 41)}]

I have no choice but to abort this one as well.

FYI @amaltaro

This validation, due to the current system load and the lack of space at both CERN and FNAL, is progressing really slowly.

todor-ivanov commented 2 years ago

Yet another problematic workflow: tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007. This one is sitting in running-open because of GWQ elements not acquired by the agent. In GlobalWorkQueue, 13 work elements have been created for this workflow [1], but the agent initially acquired only 8 of them and never retried the rest [2].

@amaltaro I may need your input on this one...

[1] https://cmsweb-testbed.cern.ch/couchdb/workqueue/_design/WorkQueue/_rewrite/elementsInfo?request=tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007

| 1f27f08efc141dd724469222ea61f68e | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#69950825-2c8e-4b68-9f8e-435dc665dee8 | Available |   | 600000 | 1 | testbed-vocms0192 | 0% | 0% | 07/05/22 16:54:27 | 07/05/22 16:54:27 | T1_US_FNAL
| 2571cc5f71177add78db61c120157212 | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#a953ffe8-de4a-49f5-937d-a737ffa277c0 | Available |   | 600000 | 1 | testbed-vocms0192 | 0% | 0% | 07/05/22 16:54:25 | 07/05/22 16:54:25 | T1_US_FNAL
| 934f69aac3328faedbd711ee2ee187d6 | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#1486aad7-f45d-4423-aa7a-ab2ca3d939a9 | Available |   | 600000 | 1 | testbed-vocms0192 | 0% | 0% | 07/05/22 16:54:25 | 07/05/22 16:54:25 | T1_US_FNAL
| acdc516ceb1c073204993c6aff24535e | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#0798f47e-ba33-4975-b955-70a436faa590 | Available |   | 600000 | 1 | testbed-vocms0192 | 0% | 0% | 07/05/22 16:54:25 | 07/05/22 16:54:25 | T1_US_FNAL
| ef58e4de806a5a0d2d129f601b00f11d | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#8c45da82-821d-49e9-9e3c-0263df3e098a | Available |   | 600000 | 1 | testbed-vocms0192 | 0% | 0% | 07/05/22 16:54:25 | 07/05/22 16:54:25 | T1_US_FNAL
| 0f26cf99b6bcf12ec82f9579964991ee | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#1b826dbc-e6ea-4939-bbf2-ca3be5fede67 | Running | vocms0192.cern.ch | 600000 | 1 | testbed-vocms0192 | 100% | 100% | 07/05/22 16:54:25 | 07/05/22 17:17:20 | T1_US_FNAL
| 5c0e32d3e32e8d5917f74cc32478e03c | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#d2b6a894-7083-4a12-a1e3-324121aa60c3 | Running | vocms0192.cern.ch | 600000 | 1 | testbed-vocms0192 | 100% | 100% | 07/05/22 16:54:25 | 07/05/22 17:20:29 | T2_CH_CERN,T1_US_FNAL,T2_CH_CERN_HLT
| 819a7800e88e5bf25b15f21a96fd509f | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#914e4c5a-dda0-4479-bb24-a8203952ddb0 | Running | vocms0192.cern.ch | 600000 | 1 | testbed-vocms0192 | 100% | 100% | 07/05/22 16:54:25 | 07/05/22 17:20:29 | T2_CH_CERN,T1_US_FNAL,T2_CH_CERN_HLT
| a6e06533b5a77f0de9ad5d97277e79c2 | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#be382968-82b7-4e65-b73f-57aa0b83d0e4 | Running | vocms0192.cern.ch | 600000 | 1 | testbed-vocms0192 | 100% | 100% | 07/05/22 16:54:24 | 07/05/22 17:26:46 | T2_CH_CERN,T1_US_FNAL,T2_CH_CERN_HLT
| d0b48b795547f717d819389e6d1d6057 | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#85ca4ca0-c0ad-4a96-838a-5f3e9207579c | Running | vocms0192.cern.ch | 600000 | 1 | testbed-vocms0192 | 100% | 100% | 07/05/22 16:54:25 | 07/05/22 17:20:29 | T2_CH_CERN,T1_US_FNAL,T2_CH_CERN_HLT
| d45c845992a7fb43a1946ffdacf44092 | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#fb0e7e14-05ed-4f28-b89c-51fa14b76ebf | Running | vocms0192.cern.ch | 600000 | 1 | testbed-vocms0192 | 100% | 100% | 07/05/22 16:54:28 | 07/05/22 17:26:46 | T2_CH_CERN,T1_US_FNAL,T2_CH_CERN_HLT
| e13e10178dc8946ea73356197cac0c49 | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#9d954f95-3be7-4fff-81eb-d0567aceac1c | Running | vocms0192.cern.ch | 600000 | 1 | testbed-vocms0192 | 100% | 100% | 07/05/22 16:54:24 | 07/05/22 17:26:46 | T2_CH_CERN,T1_US_FNAL,T2_CH_CERN_HLT
| f0cecdb91c4348ed6a9a268f659df13a | tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 |   | /NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#a9c70800-f936-4c6c-a3a8-c6fa4c9421d0 | Running | vocms0192.cern.ch | 600000 | 1 | testbed-vocms0192 | 100% | 100% | 07/05/22 16:54:25 | 07/05/22 17:36:11 | T1_US_FNAL

[2]

cmst1@vocms0192:/data/srv/wmagent/current $ grep tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007 install/wmagentpy3/WorkQueueManager/ComponentLog

...
2022-07-05 15:55:46,615:140554339206912:INFO:WorkQueueBackend:Accepting workflow: tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007, with prio: 600000, element id: a6e06533b5a77f0de9ad5d97277e79c2, for site: T1_US_FNAL
2022-07-05 15:55:46,615:140554339206912:INFO:WorkQueueBackend:Accepting workflow: tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007, with prio: 600000, element id: e13e10178dc8946ea73356197cac0c49, for site: T1_US_FNAL
2022-07-05 15:55:46,615:140554339206912:INFO:WorkQueueBackend:Accepting workflow: tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007, with prio: 600000, element id: 0f26cf99b6bcf12ec82f9579964991ee, for site: T1_US_FNAL
2022-07-05 15:55:46,615:140554339206912:INFO:WorkQueueBackend:Accepting workflow: tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007, with prio: 600000, element id: 819a7800e88e5bf25b15f21a96fd509f, for site: T1_US_FNAL
2022-07-05 15:55:46,615:140554339206912:INFO:WorkQueueBackend:Accepting workflow: tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007, with prio: 600000, element id: f0cecdb91c4348ed6a9a268f659df13a, for site: T1_US_FNAL
2022-07-05 15:55:46,615:140554339206912:INFO:WorkQueueBackend:Accepting workflow: tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007, with prio: 600000, element id: d0b48b795547f717d819389e6d1d6057, for site: T1_US_FNAL
2022-07-05 15:55:46,615:140554339206912:INFO:WorkQueueBackend:Accepting workflow: tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007, with prio: 600000, element id: 5c0e32d3e32e8d5917f74cc32478e03c, for site: T1_US_FNAL
2022-07-05 15:55:46,615:140554339206912:INFO:WorkQueueBackend:Accepting workflow: tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007, with prio: 600000, element id: d45c845992a7fb43a1946ffdacf44092, for site: T2_CH_CERN
2022-07-05 15:55:46,725:140554339206912:INFO:WorkQueue:    8 elements for: tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007
...
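To find exactly which elements were left behind, the element ids in the agent log can be compared with the ids in the GlobalWorkQueue listing. This is a rough sketch, assuming the "Accepting workflow" log format shown above; `accepted_elements` is a hypothetical helper, and only an excerpt of the data is included to keep it short.

```python
import re

# Matches the workflow name and the 32-character element id on one log line.
ACCEPTED = re.compile(r"Accepting workflow: ([^,\s]+), .*element id: ([0-9a-f]{32})")


def accepted_elements(log_text, workflow):
    """Collect element ids the agent log reports as accepted for one workflow."""
    return {
        match.group(2)
        for match in ACCEPTED.finditer(log_text)
        if match.group(1) == workflow
    }


WORKFLOW = "tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007"

# Two of the eight "Accepting workflow" lines from the ComponentLog excerpt above.
log_text = """\
2022-07-05 15:55:46,615:140554339206912:INFO:WorkQueueBackend:Accepting workflow: tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007, with prio: 600000, element id: a6e06533b5a77f0de9ad5d97277e79c2, for site: T1_US_FNAL
2022-07-05 15:55:46,615:140554339206912:INFO:WorkQueueBackend:Accepting workflow: tivanov_ReReco_Parents_HG2207_Val_220630_171456_1007, with prio: 600000, element id: e13e10178dc8946ea73356197cac0c49, for site: T1_US_FNAL
"""

# Three of the thirteen element ids from the GlobalWorkQueue listing [1].
gwq_ids = {
    "1f27f08efc141dd724469222ea61f68e",  # Available
    "a6e06533b5a77f0de9ad5d97277e79c2",  # Running
    "e13e10178dc8946ea73356197cac0c49",  # Running
}

not_acquired = gwq_ids - accepted_elements(log_text, WORKFLOW)
print(sorted(not_acquired))
```

Run against the full listing and the full log, the same difference should single out the five elements still shown as Available in [1].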

todor-ivanov commented 2 years ago

I did my best to validate this release, but given all the problems we had during this validation, namely the combination of an overloaded system and the many templates with corrupted or missing input data in Rucio, some checks could not be performed. Here is the final result:

https://twiki.cern.ch/twiki/bin/view/CMS/WMAgentEndtoEndValidationHG2207