canonical / opensearch-operator

OpenSearch operator
Apache License 2.0

OpenSearch's CA rotation gets stuck in large deployments #500

Closed phvalguima closed 5 days ago

phvalguima commented 2 weeks ago

Our CI runs for CA rotation, especially the large-deployment ones, keep timing out, e.g.: https://github.com/canonical/opensearch-operator/actions/runs/11599401070/job/32297803436

The test test_rollout_new_ca eventually times out.

I can reproduce it manually. Juju status:

Model  Controller           Cloud/Region         Version  SLA          Timestamp
test   localhost-localhost  localhost/localhost  3.5.4    unsupported  10:40:53Z

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
opensearch-data                    waiting      1  opensearch                                 0  no       TLS not fully configured in related 'main-orchestrator'.
opensearch-failover                waiting      1  opensearch                                 1  no       TLS not fully configured in related 'main-orchestrator'.
opensearch-main                    active       3  opensearch                                 2  no       
self-signed-certificates           active       1  self-signed-certificates  latest/stable  155  no       

Unit                         Workload     Agent      Machine  Public address  Ports     Message
opensearch-data/0*           maintenance  idle       1        10.72.181.79    9200/tcp  Waiting for TLS to be fully configured...
opensearch-failover/0*       maintenance  idle       2        10.72.181.174   9200/tcp  Waiting for TLS to be fully configured...
opensearch-main/3            maintenance  executing  3        10.72.181.181   9200/tcp  (get-password) Waiting for TLS to be fully configured...
opensearch-main/4*           active       idle       4        10.72.181.111   9200/tcp  
opensearch-main/5            maintenance  idle       5        10.72.181.162   9200/tcp  Waiting for TLS to be fully configured...
self-signed-certificates/0*  active       idle       0        10.72.181.203             

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.72.181.203  juju-f60bff-0  ubuntu@22.04      Running
1        started  10.72.181.79   juju-f60bff-1  ubuntu@22.04      Running
2        started  10.72.181.174  juju-f60bff-2  ubuntu@22.04      Running
3        started  10.72.181.181  juju-f60bff-3  ubuntu@22.04      Running
4        started  10.72.181.111  juju-f60bff-4  ubuntu@22.04      Running
5        started  10.72.181.162  juju-f60bff-5  ubuntu@22.04      Running
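
For larger deployments it can help to filter the stuck units programmatically instead of reading the table by eye. The sketch below is a minimal, hypothetical helper (`stuck_units` and `fetch_status` are illustrative names, not part of the charm or of Juju). It assumes the layout of `juju status --format json`, where each unit carries a `workload-status` mapping with a `message` key.

```python
import json
import subprocess

WAITING_MESSAGE = "Waiting for TLS to be fully configured..."


def stuck_units(status: dict) -> list[str]:
    """Return the units whose workload message shows the TLS wait."""
    stuck = []
    for app in status.get("applications", {}).values():
        for name, unit in (app.get("units") or {}).items():
            message = unit.get("workload-status", {}).get("message", "")
            # Substring match, since the message may carry a prefix
            # such as "(get-password) ".
            if WAITING_MESSAGE in message:
                stuck.append(name)
    return sorted(stuck)


def fetch_status(model: str = "test") -> dict:
    """Read the model status through the juju CLI (needs a live controller)."""
    result = subprocess.run(
        ["juju", "status", "--model", model, "--format", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Hand-written sample mirroring the status above, trimmed to the relevant keys.
sample = {
    "applications": {
        "opensearch-data": {
            "units": {
                "opensearch-data/0": {
                    "workload-status": {"message": WAITING_MESSAGE}
                }
            }
        },
        "opensearch-main": {
            "units": {
                "opensearch-main/3": {
                    "workload-status": {
                        "message": "(get-password) " + WAITING_MESSAGE
                    }
                },
                "opensearch-main/4": {"workload-status": {"message": ""}},
                "opensearch-main/5": {
                    "workload-status": {"message": WAITING_MESSAGE}
                },
            }
        },
    }
}

print(stuck_units(sample))
```

Against a live model, `stuck_units(fetch_status("test"))` would give the same kind of list, provided the JSON layout matches the assumption above.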

juju show-unit for each of the main units:

The cause is that each unit perceives its neighbors as still being in the rotating state. This shows up when entering one of the nodes with juju debug-hooks opensearch-main/3 and executing the dispatch script:

root@juju-f60bff-3:/var/lib/juju/agents/unit-opensearch-main-3/charm# ./dispatch 
2024-11-11 10:39:05,637 DEBUG    ops 2.16.1 up and running.
2024-11-11 10:39:05,702 DEBUG    Re-emitting deferred event <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[485]>.
2024-11-11 10:39:05,735 DEBUG    unit.unit-http TLS certificate available.
2024-11-11 10:39:06,573 DEBUG    TLS CA rotation ongoing in unit <ops.model.Unit opensearch-main/4>, will not update tls certificates.
2024-11-11 10:39:06,598 DEBUG    Deferring <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[485]>.
2024-11-11 10:39:06,602 DEBUG    Re-emitting deferred event <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[486]>.
2024-11-11 10:39:06,605 DEBUG    unit.unit-transport TLS certificate available.
2024-11-11 10:39:07,363 DEBUG    TLS CA rotation ongoing in unit <ops.model.Unit opensearch-main/4>, will not update tls certificates.
2024-11-11 10:39:07,367 DEBUG    Deferring <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[486]>.
2024-11-11 10:39:07,371 DEBUG    Re-emitting deferred event <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[499]>.
2024-11-11 10:39:07,374 DEBUG    unit.unit-http TLS certificate available.
2024-11-11 10:39:08,056 DEBUG    TLS CA rotation ongoing in unit <ops.model.Unit opensearch-main/4>, will not update tls certificates.
2024-11-11 10:39:08,061 DEBUG    Deferring <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[499]>.
2024-11-11 10:39:08,065 DEBUG    Re-emitting deferred event <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[500]>.
2024-11-11 10:39:08,068 DEBUG    unit.unit-transport TLS certificate available.
2024-11-11 10:39:08,703 DEBUG    TLS CA rotation ongoing in unit <ops.model.Unit opensearch-main/4>, will not update tls certificates.
2024-11-11 10:39:08,707 DEBUG    Deferring <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[500]>.
2024-11-11 10:39:08,711 DEBUG    Emitting Juju event get_password_action.
2024-11-11 10:39:08,719 DEBUG    Executing command: openssl pkcs12 -in /var/snap/opensearch/current/etc/opensearch/certificates/ca.p12 -passin pass:xxx
2024-11-11 10:39:08,740 DEBUG    Executing command: openssl x509 -in /tmp/tmp3x85tfgt -noout -issuer
2024-11-11 10:39:08,751 DEBUG    Executing command: openssl pkcs12 -in /var/snap/opensearch/current/etc/opensearch/certificates/unit-transport.p12 -nodes -passin pass:xxx | openssl x509 -noout -issuer
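
The defer loop in this log is easier to see when summarised. As a rough sketch, the following collects the ids of the deferred events and counts which unit the rotation check names as the blocker. The regular expressions are assumptions derived only from the lines shown in this issue, not from a documented log format, and they expect each log record to sit on a single line (terminal-wrapped output would need to be joined first). `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches e.g. "Deferring <CertificateAvailableEvent via ...[485]>"
DEFERRING = re.compile(r"Deferring <(\w+) via [^\n]*?\[(\d+)\]>")
# Matches the unit named in the "rotation ongoing" debug message.
BLOCKED_BY = re.compile(
    r"TLS CA rotation ongoing in unit <ops\.model\.Unit ([\w\-/]+)>"
)


def summarize(log_text: str) -> tuple[list[str], Counter]:
    """Return (ids of deferred events, count of blocking units)."""
    deferred_ids = [m.group(2) for m in DEFERRING.finditer(log_text)]
    blockers = Counter(m.group(1) for m in BLOCKED_BY.finditer(log_text))
    return deferred_ids, blockers


# Four records copied from the dispatch run, one record per line.
SAMPLE = """\
2024-11-11 10:39:06,573 DEBUG    TLS CA rotation ongoing in unit <ops.model.Unit opensearch-main/4>, will not update tls certificates.
2024-11-11 10:39:06,598 DEBUG    Deferring <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[485]>.
2024-11-11 10:39:07,363 DEBUG    TLS CA rotation ongoing in unit <ops.model.Unit opensearch-main/4>, will not update tls certificates.
2024-11-11 10:39:07,367 DEBUG    Deferring <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[486]>.
"""

ids, blockers = summarize(SAMPLE)
print("deferred event ids:", ids)
print("blocking units:", dict(blockers))
```

Every deferral in the sample is attributed to the same unit, opensearch-main/4, which matches what the full log shows.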

Reproducer

sudo apt install -y python3-pip
sudo pip3 install tox poetry charmcraftcache

sudo snap install charmcraft --classic
sudo snap install juju --classic

sudo lxd init ## just removed the IPv6 and set storage to `dir`

juju bootstrap localhost
juju add-model test

git clone https://github.com/canonical/opensearch-operator
cd opensearch-operator
charmcraftcache pack

tox run -e integration -- 'tests/integration/tls/test_ca_rotation.py' --group='large' -m '' --model test 

Initial Conclusions

The debug-hooks run above shows that opensearch-main/3 still considers its leader unit to be executing the CA rotation:

2024-11-11 10:39:08,703 DEBUG TLS CA rotation ongoing in unit <ops.model.Unit opensearch-main/4>, will not update tls certificates.

However, show-unit shows that the tls_{renewing,renewed} marks are already gone from that unit. The main problem is therefore that opensearch-main/4 executes its routine to reset the CA rotation state too early. As a result, the check here reports "unit XX is still doing its rotation" even though that unit has in fact finished its rotation entirely.
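
To illustrate how an early reset could produce this symptom, here is a small, self-contained model. It is not the charm's code: `rotation_complete` is a hypothetical stand-in for the check, it assumes that a missing "renewed" marker is treated as "rotation still ongoing", and the key name `tls_renewed` simply follows the tls_{renewing,renewed} shorthand used above (the real databag keys may differ).

```python
def rotation_complete(peer_databags: dict[str, dict[str, str]]) -> bool:
    """Hypothetical check: every peer must still advertise the 'renewed' marker."""
    for unit, databag in peer_databags.items():
        if not databag.get("tls_renewed"):
            # Same wording as the debug message seen in the dispatch run.
            print(
                f"TLS CA rotation ongoing in unit {unit}, "
                "will not update tls certificates."
            )
            return False
    return True


# View from opensearch-main/3 after opensearch-main/4 has finished its
# rotation and already cleared its markers.
early_reset = {
    "opensearch-main/4": {},
    "opensearch-main/5": {"tls_renewed": "True"},
}

# Same cluster if opensearch-main/4 had kept its marker until all units were done.
marker_kept = {
    "opensearch-main/4": {"tls_renewed": "True"},
    "opensearch-main/5": {"tls_renewed": "True"},
}

print(rotation_complete(early_reset))
print(rotation_complete(marker_kept))
```

Under these assumptions the first call names opensearch-main/4 as still rotating, although it has finished, while the second call passes. A finished unit that clears its markers before its peers have observed them is indistinguishable from one that has not started.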

syncronize-issues-to-jira[bot] commented 2 weeks ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5932.

This message was autogenerated