dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

WMCore v2.2.0 Validation #11523

Closed todor-ivanov closed 1 year ago

todor-ivanov commented 1 year ago

Impact of the new feature
WMCore central services

Is your feature request related to a problem? Please describe.
Monthly task

Describe the solution you'd like
Validate central services in cmsweb-testbed and provide the final feedback by the March deadline, specified by the CMSWEB team.

It also includes the creation of the service release notes and the validation check-list twiki.

Describe alternatives you've considered
None

Additional context
None

amaltaro commented 1 year ago

@todor-ivanov Todor, we have WMCore 2.2.0rc7 deployed in testbed, and I would suggest starting those tests today so that we can target a production deployment for next Tuesday, April 11.

I wouldn't be surprised if some workflows don't go past the assigned status and I think we will have to inject the pileup configurations into testbed. A dump of the test/testbed templates was provided in this PR and it can be used to populate testbed MSPileup: https://github.com/dmwm/WMCore/pull/11521

Please let me know if you need help with any of this.

todor-ivanov commented 1 year ago

With this release we have two major services deployment activities;

All the databases have just been copied, as mentioned in https://github.com/dmwm/WMCore/issues/11534#issuecomment-1499505070, and all testbed services have been redeployed. I am about to inject the validation workflows now. FYI: @amaltaro

todor-ivanov commented 1 year ago

Well, due to the badly deployed secrets, this deployment just failed. I'll repeat it tomorrow morning.

amaltaro commented 1 year ago

Indeed there was nothing running under the dmwm namespace. I went ahead and redeployed WMCore secrets and services with tag 2.2.0rc7.

Unfortunately, after a few minutes, I see that all services using MongoDB are still failing to run. The error message is in [1]. @todor-ivanov could it be that you forgot to update the new MongoDB credentials in the services_config repository?

For the record, I've been using the following documentation for that (already advertised in the past weeks/months): https://github.com/dmwm/WMCore/wiki/WMCore-services-upgrade-in-kubernetes

[1]

2023-04-06 23:07:07,295:ERROR:MongoDB: Could not connect to MongoDB server: ['cms-mongo-preprod-node-0.cern.ch:32001', 'cms-mongo-preprod-node-1.cern.ch:32002', 'cms-mongo-preprod-node-2.cern.ch:32003']. Due to unknown reason: Authentication failed., full error: {'ok': 0.0, 'errmsg': 'Authentication failed.', 'code': 18, 'codeName': 'AuthenticationFailed', '$clusterTime': {'clusterTime': Timestamp(1680822426, 1), 'signature': {'hash': b'\xb7\xc8/\xbb\xf4\xc3\xfb\x817\xef\x8f\x8cC\x82\xb9:\xb8 \xb0\x1e', 'keyId': 7218993718450192385}}, 'operationTime': Timestamp(1680822426, 1)}

todor-ivanov commented 1 year ago

Hi @amaltaro, thanks for acting on that. It was failing quite badly for me: all the services were failing due to missing config files. I have to figure out what is going on with my setup environment.

As for the changed secrets for the services using MongoDB, that is indeed correct. We do have new passwords, but AFAIK we are not supposed to propagate the encrypted secrets files ourselves. We should ask someone from the CMSWeb team to do so. I am not sure I even have the correct credentials for doing it. I've never done this before, nor have I ever seen a full set of instructions on how to do so.

@arooshap can you help us with this?

arooshap commented 1 year ago

@todor-ivanov I am making the changes now.

todor-ivanov commented 1 year ago

So let me give some more details on how exactly those deployments are failing for me (regardless of where I try, be it the preprod or the cmsweb-test cluster): [1]. This simply tells me it cannot import anything from the secrets files. And indeed, listing the volume mounted at /etc/secrets/ shows a zero-sized file for ReqMgr2Secrets.py:

_reqmgr2@reqmgr2-c6ccbd585-g59nb:/data$ ls -la /etc/secrets/..data/
total 8
drwxr-sr-x. 2 root 2000   80 Apr  7 12:37 .
drwxrwsrwt. 3 root 2000  120 Apr  7 12:37 ..
-rw-r--r--. 1 root 2000    0 Apr  7 12:37 ReqMgr2Secrets.py
-rw-r--r--. 1 root 2000 6803 Apr  7 12:37 config.py
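
The failure mode shown in this listing (a mounted but empty ReqMgr2Secrets.py) could be caught before the service starts. The following is only a minimal sketch, not part of WMCore or CMSKubernetes: the helper name `missing_secret_names` is hypothetical, and it assumes the secrets file contains plain top-level assignments such as `USER_AMQ = "..."`.

```python
import ast
import os


def missing_secret_names(path, required):
    """Return the names from `required` that the secrets file at `path` does not define."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        # An empty or absent file defines nothing, so every import from it would fail.
        return sorted(required)
    with open(path, encoding="utf-8") as handle:
        tree = ast.parse(handle.read(), filename=path)
    defined = set()
    for node in tree.body:
        # Only simple top-level assignments (NAME = value) are considered here.
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    defined.add(target.id)
    return sorted(set(required) - defined)
```

Calling `missing_secret_names("/etc/secrets/ReqMgr2Secrets.py", ["USER_AMQ", "PASS_AMQ", "AMQ_TOPIC"])` inside the pod would list the names that config.py cannot import; the file is parsed rather than executed, so no credentials are evaluated or printed.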

I did follow all the steps from https://github.com/dmwm/WMCore/wiki/WMCore-services-upgrade-in-kubernetes#deploying-wmcore-secrets very carefully, especially the one about deploying secrets.

@arooshap any idea on what may cause all this?

[1]

_reqmgr2@reqmgr2-c6ccbd585-g59nb:/data$ /bin/bash /data/run.sh
ln: failed to create symbolic link '/data/srv/current/apps/reqmgr2/data': File exists
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/Configuration.py", line 603, in loadConfigurationFile
    modRef = imp.load_module(cfgBaseName, modPath[0],
  File "/usr/local/lib/python3.8/imp.py", line 234, in load_module
    return load_source(name, filename, file)
  File "/usr/local/lib/python3.8/imp.py", line 171, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 702, in _load
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/data/srv/current/config/reqmgr2/config.py", line 36, in <module>
    from ReqMgr2Secrets import USER_AMQ, PASS_AMQ, AMQ_TOPIC
ImportError: cannot import name 'USER_AMQ' from 'ReqMgr2Secrets' (/etc/secrets/ReqMgr2Secrets.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/wmc-httpd", line 3, in <module>
    main()
  File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Main.py", line 558, in main
    cfg = loadConfigurationFile(args[0])
  File "/usr/local/lib/python3.8/site-packages/WMCore/Configuration.py", line 611, in loadConfigurationFile
    raise RuntimeError(msg)
RuntimeError: Unable to load Configuration File:
/data/srv/current/config/reqmgr2/config.py
Due to error:
cannot import name 'USER_AMQ' from 'ReqMgr2Secrets' (/etc/secrets/ReqMgr2Secrets.py)Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/Configuration.py", line 603, in loadConfigurationFile
    modRef = imp.load_module(cfgBaseName, modPath[0],
  File "/usr/local/lib/python3.8/imp.py", line 234, in load_module
    return load_source(name, filename, file)
  File "/usr/local/lib/python3.8/imp.py", line 171, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 702, in _load
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/data/srv/current/config/reqmgr2/config.py", line 36, in <module>
    from ReqMgr2Secrets import USER_AMQ, PASS_AMQ, AMQ_TOPIC
ImportError: cannot import name 'USER_AMQ' from 'ReqMgr2Secrets' (/etc/secrets/ReqMgr2Secrets.py)

todor-ivanov commented 1 year ago

And here is the exact error from the `cmsweb/scripts/deploy-secrets.sh` file:

./scripts/deploy-secrets.sh: line 128: /afs/cern.ch/user/<username>/bin/sops: No such file or directory

And this is strange: why does this script assume that I am working with my $HOME as the root path? https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb/scripts/deploy-secrets.sh#L128

@arooshap any ideas?

P.S. In contrast to the above, this one works:

$ ./scripts/decrypt-secrets.sh dmwm /afs/cern.ch/user/<username>/WMCoreDev.d/deploy/deploymentK8/services_config/reqmgr2ms-output/ReqMgr2MSSecrets.py.encrypted 
Namespace: dmwm
File to be decrypted: /afs/cern.ch/user/<username>/WMCoreDev.d/deploy/deploymentK8/services_config/reqmgr2ms-output/ReqMgr2MSSecrets.py.encrypted
/tmp/<username>/sops
Key file: /tmp/<username>/sops/dmwm-keys.txt
total 3
-rw-r--r--. 1 <username> zh 1336 Apr  7 14:26 ReqMgr2MSSecrets.py.encrypted
-rw-r--r--. 1 <username> zh  149 Apr  7 14:50 ReqMgr2MSSecrets.py

arooshap commented 1 year ago

@todor-ivanov yes, that's strange. The script can definitely be modified.

The difference between the two files is that deploy-secrets.sh runs this command: $HOME/bin/sops -d $fname > $secretdir/$(basename $fname .encrypted), whereas decrypt-secrets.sh runs this one: sops -d $encrypted_file > $DIR/$(basename $encrypted_file .encrypted). So I think that if you simply use sops -d $fname > $secretdir/$(basename $fname .encrypted) instead of $HOME/bin/sops -d $fname > $secretdir/$(basename $fname .encrypted), it should work.

And I have updated the services_config repository to include the new MongoDB credentials for the preprod branch.
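
The difference comes down to how the sops executable is located: one script hardcodes a per-user path, the other relies on PATH. A minimal sketch of a lookup that tolerates both layouts is given below. It is an illustration only, not the fix applied to deploy-secrets.sh; the function name `find_sops` and its parameters are hypothetical, and a POSIX system is assumed.

```python
import os
import shutil


def find_sops(search_path=None, home=None):
    """Return a path to the sops executable, or None if it cannot be found."""
    # Prefer whatever is on PATH (for example /usr/bin/sops).
    # search_path=None makes shutil.which use the PATH environment variable.
    on_path = shutil.which("sops", path=search_path)
    if on_path:
        return on_path
    # Fall back to the per-user location, $HOME/bin/sops.
    home = home or os.path.expanduser("~")
    candidate = os.path.join(home, "bin", "sops")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None
```

Resolving the executable once and failing loudly when the result is None avoids the silent outcome seen earlier, where a failed decryption still left a zero-sized secrets file behind.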

arooshap commented 1 year ago

@todor-ivanov regarding your first question, the issue was that the secrets were not configured correctly. That's why you got that error. It should not be a zero-sized file.

Can you please try it again and make sure that the secrets you applied are correct? In order to validate that, you can also use this command: kubectl get secret reqmgr2-secrets -n dmwm -o go-template='{{range $k,$v := .data}}{{printf "%s: " $k}}{{if not $v}}{{$v}}{{else}}{{$v | base64decode}}{{end}}{{"\n"}}{{end}}'. And once you have updated the secrets, make sure you restart the deployment using kubectl rollout restart deployment <deployment-name> -n dmwm.
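
The same check can be scripted. The sketch below assumes the secret has been dumped as JSON (for instance with kubectl get secret reqmgr2-secrets -n dmwm -o json) and passed in as a string; the function names `decode_secret` and `empty_entries` are hypothetical. It decodes the base64 `data` map and reports entries whose content is empty, which is the symptom seen with ReqMgr2Secrets.py.

```python
import base64
import json


def decode_secret(secret_json):
    """Decode the base64 `data` map of a Kubernetes Secret given as a JSON string."""
    data = json.loads(secret_json).get("data") or {}
    # A missing or empty value decodes to empty bytes.
    return {name: base64.b64decode(value or "") for name, value in data.items()}


def empty_entries(secret_json):
    """Return the names of secret entries whose decoded content is empty."""
    decoded = decode_secret(secret_json)
    return sorted(name for name, content in decoded.items() if not content)
```

Reporting only the entry names, rather than the decoded values, keeps credentials out of terminals and logs while still showing which file was deployed empty.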

I hope this helps.

todor-ivanov commented 1 year ago

Hi @arooshap

I've already found the bug in this script. I am about to make a PR for fixing it soon. The command you cite is indeed the culprit that breaks everything. But it is not the command that is to blame here but rather:

So it is not supposed to work at all... I simply do not know how it even works for others. Maybe they executed the script under different conditions in the past (e.g. a previous setup of the lxplus 8 machines, which did not have the executable installed at all), and it happened to install the sops executable for them under their $HOME path... I do not know... But starting from scratch in the current situation, where the lxplus cluster already has /usr/bin/sops present, this script cannot possibly work.

todor-ivanov commented 1 year ago

Here is the fix @arooshap : https://github.com/dmwm/CMSKubernetes/pull/1348

And now everything works like a charm.

todor-ivanov commented 1 year ago

And here is the next obstacle we have stumbled on this month: [1]. And a few more errors related to it: [2].

I will look into it further later today.

[1]

2023-04-08 06:04:50,431:ERROR:MSTransferor: Unknown exception updating caches. Error: url=https://cmsweb-testbed.cern.ch:8443/ms-pileup/data/pileup, code=500, reason=Internal Server Error, headers={'Date': 'Sat, 08 Apr 2023 06:04:50 GMT', 'Server': 'CherryPy/18.8.0', 'Set-Cookie': 'cms-auth=add2faba041c0d0d1985c050eea6d81f6dcd7abe93cdf5260c529a29a7883d06f0f5163fcec9958c;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT', 'Content-Type': 'text/html;charset=utf-8', 'X-Rest-Status': '403', 'X-Error-Http': '500', 'X-Error-Id': '9df94e0f3bdf373d0f722c452f7b25d0', 'X-Error-Detail': 'Execution error', 'X-Rest-Time': '4367.828 us', 'Content-Length': '745', 'CMS-Server-Time': 'D=35122 t=1680933890394052', 'Connection': 'close'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n    <title>500 Internal Server Error</title>\n    <style type="text/css">\n    #powered_by {\n        margin-top: 20px;\n        border-top: 2px solid black;\n        font-style: italic;\n    }\n\n    #traceback {\n        color: red;\n    }\n    </style>\n</head>\n    <body>\n        <h2>500 Internal Server Error</h2>\n        <p>Execution error</p>\n        <pre id="traceback"></pre>\n    <div id="powered_by">\n      <span>\n        Powered by <a href="http://www.cherrypy.dev">CherryPy 18.8.0</a>\n      </span>\n    </div>\n    </body>\n</html>\n'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSTransferor/MSTransferor.py", line 173, in execute
    self.updateCaches()
  File "/usr/local/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.8/site-packages/retry/api.py", line 73, in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
  File "/usr/local/lib/python3.8/site-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/MSTransferor/MSTransferor.py", line 131, in updateCaches
    self.pileupDocs = getPileupDocs(self.msConfig['mspileupUrl'], self.pileupQuery)
  File "/usr/local/lib/python3.8/site-packages/WMCore/MicroService/Tools/Common.py", line 119, in getPileupDocs
    data = mgr.getdata(mspileupUrl, queryDict, headers, verb='POST',
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 363, in getdata
    _, data = self.request(url=url, params=params, headers=headers, verb=verb,
  File "/usr/local/lib/python3.8/site-packages/Utils/PortForward.py", line 67, in portMangle
    return callFunc(callObj, newUrl, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/pycurl_manager.py", line 353, in request
    raise exc
http.client.HTTPException: url=https://cmsweb-testbed.cern.ch:8443/ms-pileup/data/pileup, code=500, reason=Internal Server Error, headers={'Date': 'Sat, 08 Apr 2023 06:04:50 GMT', 'Server': 'CherryPy/18.8.0', 'Set-Cookie': 'cms-auth=add2faba041c0d0d1985c050eea6d81f6dcd7abe93cdf5260c529a29a7883d06f0f5163fcec9958c;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT', 'Content-Type': 'text/html;charset=utf-8', 'X-Rest-Status': '403', 'X-Error-Http': '500', 'X-Error-Id': '9df94e0f3bdf373d0f722c452f7b25d0', 'X-Error-Detail': 'Execution error', 'X-Rest-Time': '4367.828 us', 'Content-Length': '745', 'CMS-Server-Time': 'D=35122 t=1680933890394052', 'Connection': 'close'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n    <title>500 Internal Server Error</title>\n    <style type="text/css">\n    #powered_by {\n        margin-top: 20px;\n        border-top: 2px solid black;\n        font-style: italic;\n    }\n\n    #traceback {\n        color: red;\n    }\n    </style>\n</head>\n    <body>\n        <h2>500 Internal Server Error</h2>\n        <p>Execution error</p>\n        <pre id="traceback"></pre>\n    <div id="powered_by">\n      <span>\n        Powered by <a href="http://www.cherrypy.dev">CherryPy 18.8.0</a>\n      </span>\n    </div>\n    </body>\n</html>\n'

[2]

2023-04-08 06:04:44,067:INFO:MSTransferor: Updating all local caches...
2023-04-08 06:04:44,128:WARNING:api: url=https://cmsweb-testbed.cern.ch:8443/ms-pileup/data/pileup, code=500, reason=Internal Server Error, headers={'Date': 'Sat, 08 Apr 2023 06:04:44 GMT', 'Server': 'CherryPy/18.8.0', 'Set-Cookie': 'cms-auth=ce8b8a4d7d8d6a61bd8133baadcf899ab8bb74d1f93d9ed0ef5fec7dc9d4a93b866122facce8a3eb;path=/;secure;httponly;expires=Thu, 01-Jan-1970 00:00:01 GMT', 'Content-Type': 'text/html;charset=utf-8', 'X-Rest-Status': '403', 'X-Error-Http': '500', 'X-Error-Id': '694585c5d1978da0a8b6f422f6cfa890', 'X-Error-Detail': 'Execution error', 'X-Rest-Time': '8267.641 us', 'Content-Length': '745', 'CMS-Server-Time': 'D=40314 t=1680933884085914', 'Connection': 'close'}, result=b'<!DOCTYPE html PUBLIC\n"-//W3C//DTD XHTML 1.0 Transitional//EN"\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html>\n<head>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>\n    <title>500 Internal Server Error</title>\n    <style type="text/css">\n    #powered_by {\n        margin-top: 20px;\n        border-top: 2px solid black;\n        font-style: italic;\n    }\n\n    #traceback {\n        color: red;\n    }\n    </style>\n</head>\n    <body>\n        <h2>500 Internal Server Error</h2>\n        <p>Execution error</p>\n        <pre id="traceback"></pre>\n    <div id="powered_by">\n      <span>\n        Powered by <a href="http://www.cherrypy.dev">CherryPy 18.8.0</a>\n      </span>\n    </div>\n    </body>\n</html>\n', retrying in 2 seconds...
2023-04-08 06:04:46,130:INFO:MSTransferor: Updating RSE/PNN quota and usage

amaltaro commented 1 year ago

@todor-ivanov I have just patched the ms-pileup pods in place with: https://github.com/dmwm/WMCore/pull/11539

and things should be working well now.

todor-ivanov commented 1 year ago

Thanks @amaltaro

todor-ivanov commented 1 year ago

And here is the final validation document: https://twiki.cern.ch/twiki/bin/view/CMS/WMCore220Validation

I am closing this issue now.

FYI: @amaltaro

amaltaro commented 1 year ago

Here are the release notes for the WMCore 2.2.0 cycle, relative to the current version in production: https://github.com/dmwm/WMCore/releases/tag/2.2.0.2

Deployment is planned for tomorrow afternoon, CERN time.