dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Run WMCore central services validation for HG2201 #10947

Closed amaltaro closed 2 years ago

amaltaro commented 2 years ago

Impact of the new feature
WMCore central services

Is your feature request related to a problem? Please describe.
Monthly task

Describe the solution you'd like
Validate central services in cmsweb-testbed (well, it might have to be in one of our VMs due to the current Rucio setup) and provide the final feedback by the January deadline specified by the CMSWEB team.

It also includes the creation of the service release notes and the validation check-list twiki.

Describe alternatives you've considered
none

Additional context
none

todor-ivanov commented 2 years ago

This validation is again not a trivial one, since we need to take into account the migration of an external service that is happening in parallel: the DBSReader migration to the Go-based version.

Logging the process/steps for the current validation here:

1. First round of validation injections was with the combination:

The result was a series of errors while iterating through the aggregated results returned by some of the APIs: [1]. As a result, the local work queue fails to complete the negotiation process for 3 work queue elements fetched from the Global Queue and gets stuck [2]. This error is addressed in the new dbs-client. We therefore need to upgrade the wmagents first to a version that includes the new dbs-client, and only then migrate the DBSReader from the Python-based version to the Go-based one. For that to happen we need to test the reverse combination: new dbs-client + old DBSReader (Python-based), because this is the setup we will have in the production environment for a while.

Currently, following an explicit request to the CMSWeb team, we have the two DBSReader versions deployed in testbed, pointing to the same testbed database and reachable under two different URLs, which are routed to the proper backend through the FE redirection rules:

https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReaderPython -> pointing to the Python version
https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader -> pointing to the Go version

(a configuration change is required at the agent in order to have it pointed to the correct one)

2. Second set of validation injections:

3. Third set of validation injections:

[1]

2022-01-11 21:04:53,976:140149993768704:INFO:WorkQueue:Splitting /tivanov_ReReco_RunBlockWhite_HG2201_Val_220111_171856_8340/DataProcessing with policy Block params = {'DatasetBlock': {'name': 'Block', 'args': {}}, 'MonteCarlo': {'name': 'MonteCarlo', 'args': {}}, 'Dataset': {'name': 'Dataset', 'args': {}}, 'Block': {'name': 'Block', 'args': {}}, 'ResubmitBlock': {'name': 'ResubmitBlock', 'args': {}}}
2022-01-11 21:04:54,044:140149993768704:ERROR:WorkQueue:Exception splitting wqe 2561e9a9df281b67cc6afe38e46dd226 for tivanov_ReReco_RunBlockWhite_HG2201_Val_220111_171856_8340: 'int' object is not iterable
Traceback (most recent call last):
  File "/data/srv/wmagent/v1.5.4.patch4/sw/slc7_amd64_gcc630/cms/wmagentpy3/1.5.4.patch4/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1164, in processInboundWork
    work, rejectedWork, badWork = self._splitWork(inbound['WMSpec'], data=inbound['Inputs'],
  File "/data/srv/wmagent/v1.5.4.patch4/sw/slc7_amd64_gcc630/cms/wmagentpy3/1.5.4.patch4/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1108, in _splitWork
    units, rejectedWork, badWork = policy(spec, topLevelTask, data, mask, continuous=continuous)
  File "/data/srv/wmagent/v1.5.4.patch4/sw/slc7_amd64_gcc630/cms/wmagentpy3/1.5.4.patch4/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/StartPolicyInterface.py", line 160, in __call__
    self.split()
  File "/data/srv/wmagent/v1.5.4.patch4/sw/slc7_amd64_gcc630/cms/wmagentpy3/1.5.4.patch4/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/Block.py", line 35, in split
    for block in self.validBlocks(self.initialTask, dbs):
  File "/data/srv/wmagent/v1.5.4.patch4/sw/slc7_amd64_gcc630/cms/wmagentpy3/1.5.4.patch4/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/Block.py", line 138, in validBlocks
    runLumis = dbs.listRunLumis(block=block['block'])
  File "/data/srv/wmagent/v1.5.4.patch4/sw/slc7_amd64_gcc630/cms/wmagentpy3/1.5.4.patch4/lib/python3.8/site-packages/WMCore/Services/DBS/DBS3Reader.py", line 241, in listRunLumis
    for runNumber in x["run_num"]:
TypeError: 'int' object is not iterable
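
The traceback above shows client code iterating over `x["run_num"]` when the value received was a plain `int`. As a minimal sketch of how such a loop can tolerate both shapes, the snippet below normalizes the field before iterating. This is not the actual fix shipped in the new dbs-client; the helper names `iter_run_numbers` and `collect_runs` and the sample records are hypothetical, and the assumption that the other backend returns a list is inferred only from the code path in the traceback.

```python
def iter_run_numbers(record):
    """Yield run numbers from a record whose "run_num" may be an int or a list.

    Hypothetical helper: the record layout is assumed from the traceback,
    where x["run_num"] was iterated but turned out to be a single int.
    """
    value = record.get("run_num")
    if value is None:
        return
    if isinstance(value, int):
        # Single run returned as a scalar: wrap it instead of iterating.
        yield value
    else:
        # Aggregated result: a list (or other iterable) of run numbers.
        yield from value


def collect_runs(records):
    """Build a {run_number: None} mapping from a list of records."""
    runs = {}
    for record in records:
        for run_number in iter_run_numbers(record):
            runs[run_number] = None
    return runs


if __name__ == "__main__":
    # Made-up records illustrating the two shapes side by side.
    sample = [{"run_num": [1, 2]}, {"run_num": 3}]
    print(collect_runs(sample))
```

Normalizing at the point of consumption keeps the caller working against either backend during a transition period, at the cost of hiding the difference in response format rather than fixing it at the source.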

[2]

2022-01-12 15:38:52,858:140149993768704:WARNING:WorkQueue:Not pulling more work. Still replicating 3 previous units, ids:
['08dea44bf003d468fdd520df1e6ec09d', '2561e9a9df281b67cc6afe38e46dd226', '6956de749df41c01f69e89a7f38f147e']

todor-ivanov commented 2 years ago

So far submission number 2 is ongoing, with the usual delays due to data location complications, because I also had to point the agent's workqueueManager to Rucio production, which, as usual, happened with some delay.

I am not going to wait for all the workflows to complete. Instead, once all of them get into running-closed (meaning no more DBSReader references are expected), I will change the agent configuration yet again to point to DBSReaderPython and will inject the 3rd portion.

todor-ivanov commented 2 years ago

Injection number 3 is done now.

todor-ivanov commented 2 years ago

The validation was done and the deployment was successful this Tuesday.