Closed todor-ivanov closed 1 year ago
@todor-ivanov Todor, we have WMCore 2.2.0rc7 deployed in testbed, and I would suggest starting those tests today so that we can target a production deployment for next Tuesday, April 11.
I wouldn't be surprised if some workflows don't go past the assigned status, and I think we will have to inject the pileup configurations into testbed. A dump of the test/testbed templates was provided in this PR and can be used to populate testbed MSPileup:
https://github.com/dmwm/WMCore/pull/11521
Please let me know if you need help with any of this.
With this release we have two major services deployment activities;
All the databases have just been copied, as mentioned in https://github.com/dmwm/WMCore/issues/11534#issuecomment-1499505070, and all testbed services have been redeployed. I am about to inject the validation workflows now. FYI: @amaltaro
Well, due to the badly deployed secrets, this deployment just failed. I'll repeat it tomorrow morning.
Indeed, there was nothing running under the dmwm namespace. I went ahead and redeployed WMCore secrets and services with tag 2.2.0rc7.
Unfortunately, after a few minutes, I see that all services using MongoDB are still failing to run. The error message is in [1]. @todor-ivanov could it be that you forgot to update the new MongoDB credentials in the services_config repository?
For the record, I've been using the following documentation for that (already advertised in the past weeks/months): https://github.com/dmwm/WMCore/wiki/WMCore-services-upgrade-in-kubernetes
[1]
2023-04-06 23:07:07,295:ERROR:MongoDB: Could not connect to MongoDB server: ['cms-mongo-preprod-node-0.cern.ch:32001', 'cms-mongo-preprod-node-1.cern.ch:32002', 'cms-mongo-preprod-node-2.cern.ch:32003']. Due to unknown reason: Authentication failed., full error: {'ok': 0.0, 'errmsg': 'Authentication failed.', 'code': 18, 'codeName': 'AuthenticationFailed', '$clusterTime': {'clusterTime': Timestamp(1680822426, 1), 'signature': {'hash': b'\xb7\xc8/\xbb\xf4\xc3\xfb\x817\xef\x8f\x8cC\x82\xb9:\xb8 \xb0\x1e', 'keyId': 7218993718450192385}}, 'operationTime': Timestamp(1680822426, 1)}
Hi @amaltaro, thanks for acting on that. It was failing quite badly for me: all the services were failing due to missing config files. I have to figure out what is going on with my setup environment.
As for the changed secrets for the services using MongoDB, that is indeed correct. We do have new passwords, but AFAIK we are not supposed to propagate the encrypted secrets files ourselves. We should ask someone from the CMSWeb team to do so. I am not sure I even have the correct credentials for doing it. I've never done this before, nor have I ever seen a full set of instructions on how to do so.
@arooshap can you help us in this?
@todor-ivanov I am making the changes now.
So let me put some more details on how exactly those deployments are failing for me (regardless of where I try, be it the preprod or cmsweb-test clusters): [1]. This simply tells me it cannot import anything from the secrets files. And indeed, listing the mounted volume at /etc/secrets/ gives me a zero-sized file for ReqMgr2Secrets.py:
_reqmgr2@reqmgr2-c6ccbd585-g59nb:/data$ ls -la /etc/secrets/..data/
total 8
drwxr-sr-x. 2 root 2000 80 Apr 7 12:37 .
drwxrwsrwt. 3 root 2000 120 Apr 7 12:37 ..
-rw-r--r--. 1 root 2000 0 Apr 7 12:37 ReqMgr2Secrets.py
-rw-r--r--. 1 root 2000 6803 Apr 7 12:37 config.py
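The zero-sized file in the listing above is the kind of problem that can be caught before a service even tries to import its secrets. The following is a minimal sketch, not part of WMCore or CMSKubernetes: the helper name find_empty_files is hypothetical, and it only assumes that the secrets are mounted as regular files (or symlinks to them) in a single directory such as /etc/secrets.

```python
import os


def find_empty_files(directory):
    """Return the sorted names of files in `directory` whose size is zero.

    Hypothetical helper for illustration only. Symlinks are followed, which
    matters for Kubernetes secret mounts where each file is a symlink into
    the ..data directory.
    """
    empty = []
    for entry in os.scandir(directory):
        # is_file() and stat() both follow symlinks by default
        if entry.is_file() and entry.stat().st_size == 0:
            empty.append(entry.name)
    return sorted(empty)
```

Called inside the pod as find_empty_files("/etc/secrets"), this would be expected to report ReqMgr2Secrets.py for the listing shown above.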
I did follow all the steps from https://github.com/dmwm/WMCore/wiki/WMCore-services-upgrade-in-kubernetes#deploying-wmcore-secrets very carefully, especially the one about deploying secrets.
@arooshap any idea on what may cause all this?
[1]
_reqmgr2@reqmgr2-c6ccbd585-g59nb:/data$ /bin/bash /data/run.sh
ln: failed to create symbolic link '/data/srv/current/apps/reqmgr2/data': File exists
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/WMCore/Configuration.py", line 603, in loadConfigurationFile
modRef = imp.load_module(cfgBaseName, modPath[0],
File "/usr/local/lib/python3.8/imp.py", line 234, in load_module
return load_source(name, filename, file)
File "/usr/local/lib/python3.8/imp.py", line 171, in load_source
module = _load(spec)
File "<frozen importlib._bootstrap>", line 702, in _load
File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 843, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/data/srv/current/config/reqmgr2/config.py", line 36, in <module>
from ReqMgr2Secrets import USER_AMQ, PASS_AMQ, AMQ_TOPIC
ImportError: cannot import name 'USER_AMQ' from 'ReqMgr2Secrets' (/etc/secrets/ReqMgr2Secrets.py)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/wmc-httpd", line 3, in <module>
main()
File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Main.py", line 558, in main
cfg = loadConfigurationFile(args[0])
File "/usr/local/lib/python3.8/site-packages/WMCore/Configuration.py", line 611, in loadConfigurationFile
raise RuntimeError(msg)
RuntimeError: Unable to load Configuration File:
/data/srv/current/config/reqmgr2/config.py
Due to error:
cannot import name 'USER_AMQ' from 'ReqMgr2Secrets' (/etc/secrets/ReqMgr2Secrets.py)
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/WMCore/Configuration.py", line 603, in loadConfigurationFile
modRef = imp.load_module(cfgBaseName, modPath[0],
File "/usr/local/lib/python3.8/imp.py", line 234, in load_module
return load_source(name, filename, file)
File "/usr/local/lib/python3.8/imp.py", line 171, in load_source
module = _load(spec)
File "<frozen importlib._bootstrap>", line 702, in _load
File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 843, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/data/srv/current/config/reqmgr2/config.py", line 36, in <module>
from ReqMgr2Secrets import USER_AMQ, PASS_AMQ, AMQ_TOPIC
ImportError: cannot import name 'USER_AMQ' from 'ReqMgr2Secrets' (/etc/secrets/ReqMgr2Secrets.py)
And here is the exact error from the cmsweb/scripts/deploy-secrets.sh file:
./scripts/deploy-secrets.sh: line 128: /afs/cern.ch/user/<username>/bin/sops: No such file or directory
And this is strange: why is this script assuming that I am working from my $HOME as root? https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb/scripts/deploy-secrets.sh#L128
@arooshap any ideas?
p.s. In contrast to the above this one works:
$ ./scripts/decrypt-secrets.sh dmwm /afs/cern.ch/user/<username>/WMCoreDev.d/deploy/deploymentK8/services_config/reqmgr2ms-output/ReqMgr2MSSecrets.py.encrypted
Namespace: dmwm
File to be decrypted: /afs/cern.ch/user/<username>/WMCoreDev.d/deploy/deploymentK8/services_config/reqmgr2ms-output/ReqMgr2MSSecrets.py.encrypted
/tmp/<username>/sops
Key file: /tmp/<username>/sops/dmwm-keys.txt
total 3
-rw-r--r--. 1 <username> zh 1336 Apr 7 14:26 ReqMgr2MSSecrets.py.encrypted
-rw-r--r--. 1 <username> zh 149 Apr 7 14:50 ReqMgr2MSSecrets.py
@todor-ivanov yes, that's strange. The script can definitely be modified.
The difference between the two files is that deploy-secrets.sh runs this command: $HOME/bin/sops -d $fname > $secretdir/$(basename $fname .encrypted), while decrypt-secrets.sh runs this one: sops -d $encrypted_file > $DIR/$(basename $encrypted_file .encrypted). So I think that if you just use sops -d $fname > $secretdir/$(basename $fname .encrypted) instead of $HOME/bin/sops -d $fname > $secretdir/$(basename $fname .encrypted), it should work.
And I have updated the secrets_config repository to include the new changes in mongoDB credentials for the preprod branch.
@todor-ivanov regarding your first question, the issue was that the secrets were not configured correctly. That's why you got that error; the file should not be zero-sized.
Can you please try again and make sure that the secrets you applied are correct? To validate that, you can also use this command:
kubectl get secret reqmgr2-secrets -n dmwm -o go-template='{{range $k,$v := .data}}{{printf "%s: " $k}}{{if not $v}}{{$v}}{{else}}{{$v | base64decode}}{{end}}{{"\n"}}{{end}}'
Once you have updated the secrets, make sure you restart the deployment using kubectl rollout restart deployment <deployment-name> -n dmwm
I hope this helps.
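The same validation can also be scripted. The sketch below is only an illustration under stated assumptions: it takes the JSON text that kubectl get secret reqmgr2-secrets -n dmwm -o json would print (where the data field maps file names to base64-encoded contents), and the helper names decode_secret and empty_keys are hypothetical, not part of any CMSWeb tooling.

```python
import base64
import json


def decode_secret(secret_json):
    """Decode the base64 `data` values of a Kubernetes Secret given as JSON text.

    Hypothetical helper: returns a dict mapping each key to its decoded text.
    """
    doc = json.loads(secret_json)
    decoded = {}
    for key, value in (doc.get("data") or {}).items():
        raw = base64.b64decode(value or "")
        decoded[key] = raw.decode("utf-8", errors="replace")
    return decoded


def empty_keys(secret_json):
    """Return the sorted keys whose decoded content is empty or whitespace only."""
    return sorted(k for k, v in decode_secret(secret_json).items() if not v.strip())
```

A non-empty result from empty_keys would point at the same symptom as the zero-sized ReqMgr2Secrets.py file seen in the pod.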
Hi @arooshap
I've already found the bug in this script and will open a PR to fix it soon. The command you cite is indeed where everything breaks, but the command itself is not to blame. The real problem is the test that checks whether sops exists, which is completely broken, because:
echo "$(command -v sops)"
/usr/bin/sops
This means the command is actually present on the machine.
But regardless of the result of that test, the rest of the script simply assumes that the user has the sops binary installed at $HOME/bin/sops.
So it is not supposed to work at all... I simply do not know how it works for others. Maybe they executed the script under different conditions in the past (e.g. a previous setup of the lxplus 8 machines that lacked the executable altogether), and it happened to install the sops executable under their $HOME path... I do not know... But starting from scratch in the current setup, where the lxplus cluster has /usr/bin/sops present, this script cannot work.
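The lookup order described above (use whatever is on the PATH first, and only then fall back to a private copy under $HOME/bin) can be sketched as follows. This is an illustration in Python, not the actual shell fix from the PR; the function name resolve_binary and its parameters are hypothetical.

```python
import os
import shutil


def resolve_binary(name, fallback_dir, search_path=None):
    """Locate an executable, preferring the PATH over a private fallback directory.

    Hypothetical helper: `search_path` defaults to the PATH environment
    variable; `fallback_dir` stands in for a location such as $HOME/bin.
    Returns the full path, or None if the executable is found in neither place.
    """
    found = shutil.which(name, path=search_path)
    if found:
        return found
    candidate = os.path.join(fallback_dir, name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None
```

With this order, a machine that already provides /usr/bin/sops would never be asked for $HOME/bin/sops, which is the situation that broke the deployment here.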
Here is the fix @arooshap : https://github.com/dmwm/CMSKubernetes/pull/1348
And now everything works like a charm.
And here is the next obstacle we have stumbled on this month: [1]. And a few more errors related to the one above: [2]
I will look into it further later today...
[1]
2023-04-08 06:04:50,431:ERROR:MSTransferor: Unknown exception updating caches. Error: url=https://cmsweb-testbed.cern.ch:8443/ms-pileup/data/pileup, code=500, reason=Internal Server Error, headers={'Date': 'Sat, 08 Apr 2023 06:04:50 GMT', 'Server': 'CherryPy/18.8.0', 'Set-Cookie': 'cms-auth=add2faba041c0d0d1985c050eea6d81f6dcd7abe93cdf5260c529a29a7883d06f0f5163fcec9958c;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT', 'Content-Type': 'text/html;charset=utf-8', 'X-Rest-Status': '403', 'X-Error-Http': '500', 'X-Error-Id': '9df94e0f3bdf373d0f722c452f7b25d0', 'X-Error-Detail': 'Execution error', 'X-Rest-Time': '4367.828 us', 'Content-Length': '745', 'CMS-Server-Time': 'D=35122 t=1680933890394052', 'Connection': 'close'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n <title>500 Internal Server Error</title>\n <style type="text/css">\n #powered_by {\n margin-top: 20px;\n border-top: 2px solid black;\n font-style: italic;\n }\n\n #traceback {\n color: red;\n }\n </style>\n</head>\n <body>\n <h2>500 Internal Server Error</h2>\n <p>Execution error</p>\n <pre id="traceback"></pre>\n <div id="powered_by">\n <span>\n Powered by <a href="http://www.cherrypy.dev">CherryPy 18.8.0</a>\n </span>\n </div>\n </body>\n</html>\n'
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSTransferor/MSTransferor.py", line 173, in execute
self.updateCaches()
File "/usr/local/lib/python3.8/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.8/site-packages/retry/api.py", line 73, in retry_decorator
return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
File "/usr/local/lib/python3.8/site-packages/retry/api.py", line 33, in __retry_internal
return f()
File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSTransferor/MSTransferor.py", line 131, in updateCaches
self.pileupDocs = getPileupDocs(self.msConfig['mspileupUrl'], self.pileupQuery)
File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/Tools/Common.py", line 119, in getPileupDocs
data = mgr.getdata(mspileupUrl, queryDict, headers, verb='POST',
File "/usr/local/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 363, in getdata
_, data = self.request(url=url, params=params, headers=headers, verb=verb,
File "/usr/local/lib/python3.8/site-packages/Utils/PortForward.py", line 67, in portMangle
return callFunc(callObj, newUrl, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 353, in request
raise exc
http.client.HTTPException: url=https://cmsweb-testbed.cern.ch:8443/ms-pileup/data/pileup, code=500, reason=Internal Server Error, headers={'Date': 'Sat, 08 Apr 2023 06:04:50 GMT', 'Server': 'CherryPy/18.8.0', 'Set-Cookie': 'cms-auth=add2faba041c0d0d1985c050eea6d81f6dcd7abe93cdf5260c529a29a7883d06f0f5163fcec9958c;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT', 'Content-Type': 'text/html;charset=utf-8', 'X-Rest-Status': '403', 'X-Error-Http': '500', 'X-Error-Id': '9df94e0f3bdf373d0f722c452f7b25d0', 'X-Error-Detail': 'Execution error', 'X-Rest-Time': '4367.828 us', 'Content-Length': '745', 'CMS-Server-Time': 'D=35122 t=1680933890394052', 'Connection': 'close'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n <title>500 Internal Server Error</title>\n <style type="text/css">\n #powered_by {\n margin-top: 20px;\n border-top: 2px solid black;\n font-style: italic;\n }\n\n #traceback {\n color: red;\n }\n </style>\n</head>\n <body>\n <h2>500 Internal Server Error</h2>\n <p>Execution error</p>\n <pre id="traceback"></pre>\n <div id="powered_by">\n <span>\n Powered by <a href="http://www.cherrypy.dev">CherryPy 18.8.0</a>\n </span>\n </div>\n </body>\n</html>\n'
[2]
2023-04-08 06:04:44,067:INFO:MSTransferor: Updating all local caches...
2023-04-08 06:04:44,128:WARNING:api: url=https://cmsweb-testbed.cern.ch:8443/ms-pileup/data/pileup, code=500, reason=Internal Server Error, headers={'Date': 'Sat, 08 Apr 2023 06:04:44 GMT', 'Server': 'CherryPy/18.8.0', 'Set-Cookie': 'cms-auth=ce8b8a4d7d8d6a61bd8133baadcf899ab8bb74d1f93d9ed0ef5fec7dc9d4a93b866122facce8a3eb;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT', 'Content-Type': 'text/html;charset=utf-8', 'X-Rest-Status': '403', 'X-Error-Http': '500', 'X-Error-Id': '694585c5d1978da0a8b6f422f6cfa890', 'X-Error-Detail': 'Execution error', 'X-Rest-Time': '8267.641 us', 'Content-Length': '745', 'CMS-Server-Time': 'D=40314 t=1680933884085914', 'Connection': 'close'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n <title>500 Internal Server Error</title>\n <style type="text/css">\n #powered_by {\n margin-top: 20px;\n border-top: 2px solid black;\n font-style: italic;\n }\n\n #traceback {\n color: red;\n }\n </style>\n</head>\n <body>\n <h2>500 Internal Server Error</h2>\n <p>Execution error</p>\n <pre id="traceback"></pre>\n <div id="powered_by">\n <span>\n Powered by <a href="http://www.cherrypy.dev">CherryPy 18.8.0</a>\n </span>\n </div>\n </body>\n</html>\n', retrying in 2 seconds...
2023-04-08 06:04:46,130:INFO:MSTransferor: Updating RSE/PNN quota and usage
@todor-ivanov I have just patched the ms-pileup pods in place with https://github.com/dmwm/WMCore/pull/11539 and things should be working well now.
Thanks @amaltaro
And here is the final validation document: https://twiki.cern.ch/twiki/bin/view/CMS/WMCore220Validation
I am closing this issue now
FYI: @amaltaro
Here are the release notes, relative to the current version in production, for the WMCore 2.2.0 cycle: https://github.com/dmwm/WMCore/releases/tag/2.2.0.2
Deployment is planned for tomorrow afternoon, CERN time.
Impact of the new feature WMCore central services
Is your feature request related to a problem? Please describe. Monthly task
Describe the solution you'd like Validate central services in cmsweb-testbed and provide the final feedback by the March deadline, specified by the CMSWEB team.
It also includes the creation of the service release notes and the validation check-list twiki.
Describe alternatives you've considered None
Additional context None