Closed amaltaro closed 2 years ago
This validation is again not trivial, since we need to take into account the migration of an external service that is happening in parallel: the DBSReader migration to the Go-based version.
Logging the process/steps for the current validation here:
1. First round of validation injections was with the combination:
The result was a series of errors while trying to iterate through the aggregated results returned by some of the APIs: [1]. As a result, the local work queue fails to complete the negotiation process for 3 work queue elements fetched from the Global Queue and gets stuck [2]. This error is addressed in the new dbs-client. Therefore we first need to upgrade the WMAgents to a version that includes the new dbs-client, and only then migrate the DBSReader from the Python-based version to the Go-based one. For that to happen we need to test the reverse combination: new dbs-client + old DBSReader (Python-based), because this is the setup we will end up with in the production environment for a while.
Currently, upon an explicit request to the CMSWeb team, we have the two DBSReader versions deployed in testbed, both pointing to the same testbed database and reachable under two different URLs, which are routed to the proper backend through the FE redirection rules:
https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReaderPython -> pointing to the Python version
https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader -> pointing to the Go version
(a configuration change is required at the agent in order to have it pointed to the correct one)
2. Second set of validation injections:
3. Third set of validation injections:
[1]
2022-01-11 21:04:53,976:140149993768704:INFO:WorkQueue:Splitting /tivanov_ReReco_RunBlockWhite_HG2201_Val_220111_171856_8340/DataProcessing with policy Block params = {'DatasetBlock': {'name': 'Block', 'args': {}}, 'MonteCarlo': {'name': 'MonteCarlo', 'args': {}}, 'Dataset': {'name': 'Dataset', 'args': {}}, 'Block': {'name': 'Block', 'args': {}}, 'ResubmitBlock': {'name': 'ResubmitBlock', 'args': {}}}
2022-01-11 21:04:54,044:140149993768704:ERROR:WorkQueue:Exception splitting wqe 2561e9a9df281b67cc6afe38e46dd226 for tivanov_ReReco_RunBlockWhite_HG2201_Val_220111_171856_8340: 'int' object is not iterable
Traceback (most recent call last):
File "/data/srv/wmagent/v1.5.4.patch4/sw/slc7_amd64_gcc630/cms/wmagentpy3/1.5.4.patch4/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1164, in processInboundWork
work, rejectedWork, badWork = self._splitWork(inbound['WMSpec'], data=inbound['Inputs'],
File "/data/srv/wmagent/v1.5.4.patch4/sw/slc7_amd64_gcc630/cms/wmagentpy3/1.5.4.patch4/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1108, in _splitWork
units, rejectedWork, badWork = policy(spec, topLevelTask, data, mask, continuous=continuous)
File "/data/srv/wmagent/v1.5.4.patch4/sw/slc7_amd64_gcc630/cms/wmagentpy3/1.5.4.patch4/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/StartPolicyInterface.py", line 160, in __call__
self.split()
File "/data/srv/wmagent/v1.5.4.patch4/sw/slc7_amd64_gcc630/cms/wmagentpy3/1.5.4.patch4/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/Block.py", line 35, in split
for block in self.validBlocks(self.initialTask, dbs):
File "/data/srv/wmagent/v1.5.4.patch4/sw/slc7_amd64_gcc630/cms/wmagentpy3/1.5.4.patch4/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/Block.py", line 138, in validBlocks
runLumis = dbs.listRunLumis(block=block['block'])
File "/data/srv/wmagent/v1.5.4.patch4/sw/slc7_amd64_gcc630/cms/wmagentpy3/1.5.4.patch4/lib/python3.8/site-packages/WMCore/Services/DBS/DBS3Reader.py", line 241, in listRunLumis
for runNumber in x["run_num"]:
TypeError: 'int' object is not iterable
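The failure mode in the traceback above can be reproduced in isolation. The following is a minimal sketch, not the actual fix shipped in the new dbs-client: it only assumes that `run_num` may arrive either as a list of run numbers or as a single integer, depending on which DBSReader backend answers. The helper names `as_run_list` and `collect_runs`, and the sample records, are hypothetical and exist only for illustration.

```python
def as_run_list(run_num):
    """Return run_num as a list, accepting a single int or an iterable of ints."""
    if isinstance(run_num, int):
        # A bare integer is what triggers "'int' object is not iterable"
        # when the caller loops over the value directly.
        return [run_num]
    return list(run_num)


def collect_runs(records):
    """Collect the unique run numbers from records that carry a 'run_num' key."""
    runs = set()
    for rec in records:
        runs.update(as_run_list(rec["run_num"]))
    return sorted(runs)


# Hypothetical sample records in the two shapes (not real DBS output).
list_style = [{"run_num": [1, 2]}, {"run_num": [2, 3]}]
int_style = [{"run_num": 1}, {"run_num": 3}]

print(collect_runs(list_style))
print(collect_runs(int_style))
```

A plain `for runNumber in x["run_num"]` loop works only for the first shape, which matches the `TypeError` raised in `listRunLumis`; normalizing the value before iterating handles both shapes.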
[2]
2022-01-12 15:38:52,858:140149993768704:WARNING:WorkQueue:Not pulling more work. Still replicating 3 previous units, ids:
['08dea44bf003d468fdd520df1e6ec09d', '2561e9a9df281b67cc6afe38e46dd226', '6956de749df41c01f69e89a7f38f147e']
So far submission number 2 is ongoing, with the usual delays regarding data location complications, because I also had to point the agent's workqueueManager to Rucio production, which as usual happened with some delay.
I am not going to wait for all the workflows to complete. Instead, once all of them get into running-closed (meaning no more DBSReader references are expected), I will change the agent configuration yet again to point to DBSReaderPython and will inject the 3rd portion.
Injection number 3 is done now.
The validation was done and the deployment was successful this Tuesday.
Impact of the new feature: WMCore central services
Is your feature request related to a problem? Please describe: Monthly task
Describe the solution you'd like: Validate central services in cmsweb-testbed (well, it might have to be in one of our VMs due to the current Rucio setup) and provide the final feedback by the January deadline specified by the CMSWEB team.
It also includes the creation of the service release notes and the validation check-list twiki.
Describe alternatives you've considered: none
Additional context: none