The test test_rollout_new_ca eventually times out.
I can reproduce it manually. Juju status:
Model  Controller           Cloud/Region         Version  SLA          Timestamp
test   localhost-localhost  localhost/localhost  3.5.4    unsupported  10:40:53Z

App                       Version  Status   Scale  Charm                     Channel        Rev  Exposed  Message
opensearch-data                    waiting      1  opensearch                                 0  no       TLS not fully configured in related 'main-orchestrator'.
opensearch-failover                waiting      1  opensearch                                 1  no       TLS not fully configured in related 'main-orchestrator'.
opensearch-main                    active       3  opensearch                                 2  no
self-signed-certificates           active       1  self-signed-certificates  latest/stable  155  no

Unit                         Workload     Agent      Machine  Public address  Ports     Message
opensearch-data/0*           maintenance  idle       1        10.72.181.79    9200/tcp  Waiting for TLS to be fully configured...
opensearch-failover/0*       maintenance  idle       2        10.72.181.174   9200/tcp  Waiting for TLS to be fully configured...
opensearch-main/3            maintenance  executing  3        10.72.181.181   9200/tcp  (get-password) Waiting for TLS to be fully configured...
opensearch-main/4*           active       idle       4        10.72.181.111   9200/tcp
opensearch-main/5            maintenance  idle       5        10.72.181.162   9200/tcp  Waiting for TLS to be fully configured...
self-signed-certificates/0*  active       idle       0        10.72.181.203

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.72.181.203  juju-f60bff-0  ubuntu@22.04      Running
1        started  10.72.181.79   juju-f60bff-1  ubuntu@22.04      Running
2        started  10.72.181.174  juju-f60bff-2  ubuntu@22.04      Running
3        started  10.72.181.181  juju-f60bff-3  ubuntu@22.04      Running
4        started  10.72.181.111  juju-f60bff-4  ubuntu@22.04      Running
5        started  10.72.181.162  juju-f60bff-5  ubuntu@22.04      Running
The reason is that each unit perceives its neighbors as still rotating the CA. I can see this by entering one of the nodes with juju debug-hooks opensearch-main/3; executing the dispatch script there gives:
root@juju-f60bff-3:/var/lib/juju/agents/unit-opensearch-main-3/charm# ./dispatch
2024-11-11 10:39:05,637 DEBUG ops 2.16.1 up and running.
2024-11-11 10:39:05,702 DEBUG Re-emitting deferred event <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[485]>.
2024-11-11 10:39:05,735 DEBUG unit.unit-http TLS certificate available.
2024-11-11 10:39:06,573 DEBUG TLS CA rotation ongoing in unit <ops.model.Unit opensearch-main/4>, will not update tls certificates.
2024-11-11 10:39:06,598 DEBUG Deferring <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[485]>.
2024-11-11 10:39:06,602 DEBUG Re-emitting deferred event <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[486]>.
2024-11-11 10:39:06,605 DEBUG unit.unit-transport TLS certificate available.
2024-11-11 10:39:07,363 DEBUG TLS CA rotation ongoing in unit <ops.model.Unit opensearch-main/4>, will not update tls certificates.
2024-11-11 10:39:07,367 DEBUG Deferring <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[486]>.
2024-11-11 10:39:07,371 DEBUG Re-emitting deferred event <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[499]>.
2024-11-11 10:39:07,374 DEBUG unit.unit-http TLS certificate available.
2024-11-11 10:39:08,056 DEBUG TLS CA rotation ongoing in unit <ops.model.Unit opensearch-main/4>, will not update tls certificates.
2024-11-11 10:39:08,061 DEBUG Deferring <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[499]>.
2024-11-11 10:39:08,065 DEBUG Re-emitting deferred event <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[500]>.
2024-11-11 10:39:08,068 DEBUG unit.unit-transport TLS certificate available.
2024-11-11 10:39:08,703 DEBUG TLS CA rotation ongoing in unit <ops.model.Unit opensearch-main/4>, will not update tls certificates.
2024-11-11 10:39:08,707 DEBUG Deferring <CertificateAvailableEvent via OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]/on/certificate_available[500]>.
2024-11-11 10:39:08,711 DEBUG Emitting Juju event get_password_action.
2024-11-11 10:39:08,719 DEBUG Executing command: openssl pkcs12 -in /var/snap/opensearch/current/etc/opensearch/certificates/ca.p12 -passin pass:xxx
2024-11-11 10:39:08,740 DEBUG Executing command: openssl x509 -in /tmp/tmp3x85tfgt -noout -issuer
2024-11-11 10:39:08,751 DEBUG Executing command: openssl pkcs12 -in /var/snap/opensearch/current/etc/opensearch/certificates/unit-transport.p12 -nodes -passin pass:xxx | openssl x509 -noout -issuer
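To make the pattern in the dispatch output easier to spot on longer runs, a small helper along these lines could tally which unit is reported as still rotating and how often events get deferred. This is only a sketch: the function name summarize and both regular expressions are my own, written against the two log messages shown above, and it assumes each log record sits on a single line.

```python
import re
from collections import Counter

# Matches the message logged when a unit refuses to update its certificates.
ONGOING_RE = re.compile(r"TLS CA rotation ongoing in unit <ops\.model\.Unit ([^>]+)>")
# Matches the ops framework "Deferring <Event via ...[id]>" message.
DEFER_RE = re.compile(r"Deferring <(\w+) via .*\[(\d+)\]>")


def summarize(log_text):
    """Count blocking units and deferred event types in a dispatch log."""
    blocking_units = Counter()
    deferred_events = Counter()
    for line in log_text.splitlines():
        match = ONGOING_RE.search(line)
        if match:
            blocking_units[match.group(1)] += 1
        match = DEFER_RE.search(line)
        if match:
            deferred_events[match.group(1)] += 1
    return blocking_units, deferred_events


# Two records copied from the output above (unwrapped onto single lines).
SAMPLE = (
    "2024-11-11 10:39:06,573 DEBUG TLS CA rotation ongoing in unit "
    "<ops.model.Unit opensearch-main/4>, will not update tls certificates.\n"
    "2024-11-11 10:39:06,598 DEBUG Deferring <CertificateAvailableEvent via "
    "OpenSearchOperatorCharm/TLSCertificatesRequiresV3[certificates]"
    "/on/certificate_available[485]>.\n"
)

blocking, deferred = summarize(SAMPLE)
print(dict(blocking))   # {'opensearch-main/4': 1}
print(dict(deferred))   # {'CertificateAvailableEvent': 1}
```

If every deferral points at the same unit, as it does here with opensearch-main/4, that unit's rotation state is the one worth inspecting with juju show-unit.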
Reproducer
sudo apt install -y python3-pip
sudo pip3 install tox poetry charmcraftcache
sudo snap install charmcraft --classic
sudo snap install juju --classic
sudo lxd init ## just removed the IPv6 and set storage to `dir`
juju bootstrap localhost
juju add-model test
git clone https://github.com/canonical/opensearch-operator
cd opensearch-operator
charmcraftcache pack
tox run -e integration -- 'tests/integration/tls/test_ca_rotation.py' --group='large' -m '' --model test
Initial Conclusions
The debug-hooks run above shows that opensearch-main/3 still considers its leader unit to be executing the CA rotation:
2024-11-11 10:39:08,703 DEBUG TLS CA rotation ongoing in unit <ops.model.Unit opensearch-main/4>, will not update tls certificates.
However, the show-unit output shows that the tls_{renewing,renewed} marks are already gone from that unit. The main problem is therefore that opensearch-main/4 runs its "reset CA rotation state" routine too early. As a result, the check here reports "unit XX is still doing its rotation" even though that unit has in fact finished its rotation entirely.
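The suspected failure mode can be illustrated with a toy model. This is not the charm's code: the functions rotation_ongoing and reset_rotation_state are hypothetical, the flag names simply follow the tls_{renewing,renewed} marks mentioned above, and the rule "a peer counts as done only while it still advertises tls_renewed" is my assumption about how the check behaves.

```python
# Toy model of the suspected race; every name here is illustrative only.

def rotation_ongoing(peer_data):
    """Assumed check: a peer is 'done' only while it advertises tls_renewed."""
    return peer_data.get("tls_renewed") != "True"


def reset_rotation_state(unit_data):
    """Assumed reset routine: drop both marks from the unit's databag."""
    unit_data.pop("tls_renewing", None)
    unit_data.pop("tls_renewed", None)


# The leader (opensearch-main/4 in this report) starts rotating the CA.
leader = {"tls_renewing": "True"}

# It finishes: the renewing mark is replaced by the renewed mark.
leader.pop("tls_renewing")
leader["tls_renewed"] = "True"
seen_when_done = rotation_ongoing(leader)

# The leader resets its state before its peers have observed the renewed mark.
reset_rotation_state(leader)
seen_after_reset = rotation_ongoing(leader)

print(seen_when_done, seen_after_reset)  # False True
```

Under these assumptions, a peer that only looks at the leader's databag after the reset can no longer tell "finished and cleaned up" apart from "not finished yet", so it keeps deferring its certificate events.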
Our CI runs for CA rotation, especially those for large deployments, are continuously timing out, e.g.: https://github.com/canonical/opensearch-operator/actions/runs/11599401070/job/32297803436
juju show-unit for each of the main units: