dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Broken micro services after deployment in K8 #11378

Closed todor-ivanov closed 1 year ago

todor-ivanov commented 1 year ago

Impact of the bug All micro services

Describe the bug During the validation process for HG2212b with WMCore v2.1.5rc2 and MicroServices v1.1.5.rc2, we found that all the micro services fail to initialize with the following error [1]. It could be due to the recent splitting of the various micro services into separate packages, but this is just speculation at this stage.

How to reproduce it Just deploy the above tag and try to run the microservices in Kubernetes.

Expected behavior A clear and concise description of what you expected to happen.

Additional context and error message [1]

[29/Nov/2022:18:26:08]  MicroService REST configuration subset:
data.manager = 'WMCore.MicroService.MSManager.MSManager'
data.rucioAuthUrl = 'https://cms-rucio-auth.cern.ch'
data.reqmgr2Url = 'https://cmsweb-testbed.cern.ch/reqmgr2'
data.services = ['monitor']
data.object = 'WMCore.MicroService.Service.RestApiHub.RestApiHub'
data.interval = 600
data.rucioAccount = 'wmcore_transferor'
data.rucioUrl = 'http://cms-rucio.cern.ch'
data.couch_wmstats_db = 'wmstats'
data.couch_host = 'https://cmsweb-testbed.cern.ch/couchdb'
data.enableStatusTransition = True
data.verbose = True

ERROR initializing MicroService REST module.
Traceback (most recent call last):
  File "/data/srv/HG2212b/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.1.5.rc2/lib/python3.8/site-packages/WMCore/MicroService/Service/Data.py", line 57, in __init__
    module = importlib.import_module('.'.join(arr[:-1]))
  File "/data/srv/HG2212b/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'WMCore.MicroService.MSManager'

amaltaro commented 1 year ago

@todor-ivanov please make sure to highlight that this affects only the recently deployed services in TESTBED. Given that it's testbed, I don't think it deserves the "Highest priority" label. But yes, we need to fix it as soon as possible (<24h).

You might want to delete text provided by the GH template as well, even if you do not have anything else to replace it by (e.g. "Expected behavior").

amaltaro commented 1 year ago

I don't know which microservice the log snippet in your original post belongs to, but here is the problem: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/blob/preprod/reqmgr2ms-transferor/config-transferor.py#L67

In plain English: for some reason, the microservices are not using the correct configuration file (which is supposed to have some recent changes made by Erik). The configuration files seem to be okay in the preprod branch...

todor-ivanov commented 1 year ago

Hi @amaltaro, the "Highest priority" label was a minor blunder; I fixed it about a minute after I created the ticket.

but here is the problem:

Thanks for noticing that, actually. To me this line looks good in both cases, prod and preprod (at least from what I see in the services_config repository).

In both cases the correct MSManager (from its new location) has been exported:

data.manager = 'WMCore.MicroService.MSCore.MSManager.MSManager'

But indeed what is inside the Pod at K8 is something completely different:

data.manager = 'WMCore.MicroService.MSManager.MSManager'
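
To make the mismatch easier to check from inside a pod, here is a minimal sketch of how a dotted `data.manager` string could be validated before the service starts. It mirrors the `importlib.import_module('.'.join(arr[:-1]))` call visible in the traceback, but the helper names `load_object` and `check_config_path` are hypothetical and not part of WMCore.

```python
import importlib


def load_object(dotted_path):
    """Resolve a 'package.module.Name' string to the object it names."""
    arr = dotted_path.split('.')
    # Import everything up to the last component as a module,
    # then fetch the last component as an attribute of that module.
    module = importlib.import_module('.'.join(arr[:-1]))
    return getattr(module, arr[-1])


def check_config_path(dotted_path):
    """Return None if the path resolves, else a short error description."""
    try:
        load_object(dotted_path)
        return None
    except (ImportError, AttributeError, ValueError) as exc:
        # ModuleNotFoundError is a subclass of ImportError.
        return "{}: {}".format(type(exc).__name__, exc)


if __name__ == "__main__":
    # Both values are taken from this thread; the result depends on
    # which WMCore version is installed in the environment.
    for path in ("WMCore.MicroService.MSCore.MSManager.MSManager",
                 "WMCore.MicroService.MSManager.MSManager"):
        print(path, "->", check_config_path(path) or "OK")
```

Running this in the affected pod with the deployed configuration value should show whether the stale path or the new one is importable in that environment.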

@muhammadimranfarooqi can you please take a quick look at why the micro services are configured with old configuration files, and also merge the rest of the already prepared MRs for this deployment? I have pasted a short list in the GitLab request. Thank you in advance!

muhammadimranfarooqi commented 1 year ago

Hi @amaltaro @todor-ivanov

I tried to redeploy the configuration for those services, and all microservices still have errors.

muhammadimranfarooqi commented 1 year ago

The following error is common to all service logs:

    raise RuntimeError(
RuntimeError: You are linking against OpenSSL 1.0.2, which is no longer supported by the OpenSSL project. To use this version of cryptography you need to upgrade to a newer version of OpenSSL. For this version only you can also set the environment variable CRYPTOGRAPHY_ALLOW_OPENSSL_102 to allow OpenSSL 1.0.2.

todor-ivanov commented 1 year ago

Hi @muhammadimranfarooqi @amaltaro , I think I have found the problem, but the solution I can think of is not the best one.

Even though I cannot reproduce it in my VM, I am now fairly sure that this upgrade of the rucio-clients RPM package [1] also brings in a hard requirement that collides with the obsolete OpenSSL library, version 1.0.2. The chain is the following: rucio-clients v1.29.10 depends on cryptography v3.2.1, and cryptography uses OpenSSL as its backend. (This is also confirmed by the log messages a few lines above the error, where the backend library is loaded: [2].)

And here is what they say about using this ssl library version as a backend: [3].

As for why this fails with the RPM-based installation but not with the PyPI-based deployments: I think it is because in the former we link against the OS OpenSSL, and since that comes with the OS version of the base image, cc7 [4], we always end up with the obsolete version. Checking in another K8s pod, whose container is not breaking because it does not explicitly import rucio-clients, I can confirm that the library version in question is the one in use:

[_reqmgr2@reqmgr2-85568b67fd-9r7p9 data]$ openssl version 
OpenSSL 1.0.2k-fips  26 Jan 2017

So one way to proceed here would be:

export CRYPTOGRAPHY_ALLOW_OPENSSL_102=1

@muhammadimranfarooqi @arooshap, can you please give it a try manually in one of the containers?
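
For a quick manual test from Python rather than the shell, the same workaround can be sketched as below. This assumes the variable only needs to be present in the process environment before `cryptography` is first imported, which is what the traceback suggests, since the version check runs at module import. Note that `ssl.OPENSSL_VERSION_INFO` reports the OpenSSL that Python's own `ssl` module is linked against, which may differ from the one `cryptography` binds to. The helper names are hypothetical.

```python
import os
import ssl


def openssl_is_102(version_info=ssl.OPENSSL_VERSION_INFO):
    """True if the given OpenSSL version tuple is in the 1.0.2 series."""
    return tuple(version_info[:3]) == (1, 0, 2)


def allow_legacy_openssl(environ=os.environ):
    """Set CRYPTOGRAPHY_ALLOW_OPENSSL_102 unless it is already defined."""
    environ.setdefault("CRYPTOGRAPHY_ALLOW_OPENSSL_102", "1")
    return environ["CRYPTOGRAPHY_ALLOW_OPENSSL_102"]


if __name__ == "__main__":
    print("Python ssl module is linked against:", ssl.OPENSSL_VERSION)
    if openssl_is_102():
        # Must happen before anything imports cryptography
        # (for example, before importing rucio-clients).
        allow_legacy_openssl()
        print("CRYPTOGRAPHY_ALLOW_OPENSSL_102 set for this process")
```

Setting the variable in the process only helps that one interpreter; for the deployed services it would still need to go into the container environment, as in the `export` line above.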

[1] https://github.com/cms-sw/cmsdist/commit/bc19b202f907b4d791f02c813a49f0ab9fbb4db5

[2]

  File "/data/srv/HG2212b/sw/slc7_amd64_gcc630/external/py3-cryptography/3.2.1-comp2/lib/python3.8/site-packages/cryptography/hazmat/backends/openssl/__init__.py", line 7, in <module>
    from cryptography.hazmat.backends.openssl.backend import backend
  File "/data/srv/HG2212b/sw/slc7_amd64_gcc630/external/py3-cryptography/3.2.1-comp2/lib/python3.8/site-packages/cryptography/hazmat/backends/openssl/backend.py", line 117, in <module>
    from cryptography.hazmat.bindings.openssl import binding
  File "/data/srv/HG2212b/sw/slc7_amd64_gcc630/external/py3-cryptography/3.2.1-comp2/lib/python3.8/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 222, in <module>
    _verify_openssl_version(Binding.lib)
  File "/data/srv/HG2212b/sw/slc7_amd64_gcc630/external/py3-cryptography/3.2.1-comp2/lib/python3.8/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 182, in _verify_openssl_version
    raise RuntimeError(
RuntimeError: You are linking against OpenSSL 1.0.2, which is no longer supported by the OpenSSL project. To use this version of cryptography you need to upgrade to a newer version of OpenSSL. For this version only you can also set the environment variable CRYPTOGRAPHY_ALLOW_OPENSSL_102 to allow OpenSSL 1.0.2.

[3] https://cryptography.io/en/3.2/faq/#importing-cryptography-causes-a-runtimeerror-about-openssl-1-0-2

[4] https://github.com/dmwm/CMSKubernetes/blob/05922fd0b039d505f70d133f6eae0e20fb6fe651/docker/cmsweb/Dockerfile#L5

todor-ivanov commented 1 year ago

Hi @belforte @mapellidario ,

Tagging you here even though the issues discussed form a longer chain; the part you might be interested in, or affected by, is the obsolete OpenSSL version, which is fully explained in my previous comment [1]. This is supposed to be a known problem, but we just stumbled on a hard dependency we did not expect. I hope this helps you avoid it. If you are using a newer OpenSSL version or a Docker image with a newer OS, you shouldn't be affected, but just in case...

[1] https://github.com/dmwm/WMCore/issues/11378#issuecomment-1331938834

mapellidario commented 1 year ago

Thanks @todor-ivanov , in TW/Publisher we noticed around one week ago, "discussed" it here https://github.com/dmwm/CRABServer/issues/7475, implemented a temporary workaround and forgot about it :)

In crabserver rest we noticed when we migrated to py3 and put a workaround here

So, thanks for the heads up, while we simply thought it was our problem only :)

todor-ivanov commented 1 year ago

:+1:

todor-ivanov commented 1 year ago

And we just noticed that GlobalWorkQueue is suffering from the same issue as well. Here [1] is the fix for it.

[1] https://github.com/dmwm/deployment/pull/1223

todor-ivanov commented 1 year ago

This one obviously did not get closed automatically when the relevant PRs were merged in the deployment repository [1]. I am closing it with this comment.

[1] https://github.com/dmwm/deployment/pull/1222 https://github.com/dmwm/deployment/pull/1223