canonical / opensearch-operator

OpenSearch operator
Apache License 2.0

Long delay on restart of units after each cert renewal following a CA renewal #417

Open Mehdi-Bendriss opened 2 weeks ago

Mehdi-Bendriss commented 2 weeks ago

After a successful CA renewal, 2 issues occur:

  1. After certificate renewal, the following errors occur for more than 4 minutes:
    
    unit-main-0: 15:14:59 ERROR unit.main/0.juju-log opensearch-peers:1: Cannot connect to the OpenSearch server...
    unit-main-0: 15:15:00 ERROR unit.main/0.juju-log opensearch-peers:1: [Errno 111] Connection refused
    unit-main-0: 15:15:03 ERROR unit.main/0.juju-log opensearch-peers:1: [Errno 111] Connection refused
    unit-main-0: 15:15:06 ERROR unit.main/0.juju-log opensearch-peers:1: [Errno 111] Connection refused
    unit-main-0: 15:15:09 ERROR unit.main/0.juju-log opensearch-peers:1: [Errno 111] Connection refused
    unit-main-0: 15:15:12 ERROR unit.main/0.juju-log opensearch-peers:1: [Errno 111] Connection refused
    unit-main-0: 15:15:15 DEBUG unit.main/0.juju-log opensearch-peers:1: Getting secret app:admin-password
    unit-main-0: 15:15:15 DEBUG unit.main/0.juju-log opensearch-peers:1: Starting new HTTPS connection (1): 10.122.32.198:9200
    unit-main-0: 15:15:15 DEBUG unit.main/0.juju-log opensearch-peers:1: Error when checking if host 10.122.32.198 is up: HTTP error self.response_code=None
    self.response_text="HTTPSConnectionPool(host='10.122.32.198', port=9200): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))"
    unit-main-0: 15:15:18 DEBUG unit.main/0.juju-log opensearch-peers:1: Getting secret app:admin-password
    unit-main-0: 15:15:18 DEBUG unit.main/0.juju-log opensearch-peers:1: Starting new HTTPS connection (1): 10.122.32.198:9200
    unit-main-0: 15:15:18 DEBUG unit.main/0.juju-log opensearch-peers:1: Error when checking if host 10.122.32.198 is up: HTTP error self.response_code=None
    self.response_text="HTTPSConnectionPool(host='10.122.32.198', port=9200): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)')))"
    unit-main-0: 15:15:21 DEBUG unit.main/0.juju-log opensearch-peers:1: Getting secret app:admin-password
    unit-main-0: 15:15:21 DEBUG unit.main/0.juju-log opensearch-peers:1: Starting new HTTPS connection (1): 10.122.32.198:9200
    unit-main-0: 15:15:21 DEBUG unit.main/0.juju-log opensearch-peers:1: Error when checking if host 10.122.32.198 is up: HTTP error self.response_code=None
    self.response_text="HTTPSConnectionPool(host='10.122.32.198', port=9200): Max r

....

    unit-main-0: 15:18:13 DEBUG unit.main/0.juju-log Executing command: openssl pkcs12 -export -in /tmp/tmp1uvvhlyo.cert -inkey /tmp/tmp2hqex7t8.pem -out /var/snap/opensearch/current/etc/opensearch/certificates/app-admin.p12 -name app-admin -passout pass:xxx
    unit-main-0: 15:18:13 ERROR unit.main/0.juju-log err: No cert in -in file '/tmp/tmp1uvvhlyo.cert' matches private key 4007AF64F67E0000:error:05800074:x509 certificate routines:X509_check_private_key:key values mismatch:../crypto/x509/x509_cmp.c:405: / out:
    unit-main-0: 15:18:13 ERROR unit.main/0.juju-log Error storing the TLS certificates for app-admin:
    unit-main-0: 15:18:13 INFO unit.main/0.juju-log TLS certificate for app-admin stored.



  2. An endless queue of deferred events builds up:
 <img width="533" alt="Screenshot 2024-08-27 at 21 07 52" src="https://github.com/user-attachments/assets/78340329-e90d-4d74-833e-c585bf11e034">

syncronize-issues-to-jira[bot] commented 2 weeks ago

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/DPE-5283.

This message was autogenerated

reneradoi commented 4 days ago

The issue comes from two separate root causes:

  1. When requesting new admin certificates (after the CA certificate has been updated), a new private key is generated as well. This leads to situations where the app-admin secret already contains the new private key, but not yet the new certificate (which can only be updated by the leader unit). This is addressed in https://github.com/canonical/opensearch-operator/pull/436.

  2. When processing the newly requested certificates (from 1.), the operator defers the CertificateAvailableEvent even after updating the certificate on disk and in the secret, whenever the old-ca has not yet been removed from the truststore. This deferral is unnecessary and causes almost endless deferral loops.
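
To illustrate the second root cause, here is a minimal simulation of the deferral decision. It is a sketch only, not the operator's code or the change made in the PR: `UnitState`, `should_defer`, `count_deferrals`, and the `defer_on_old_ca` flag are hypothetical names, and the old CA is assumed to stay in the truststore for the whole run.

```python
from dataclasses import dataclass


@dataclass
class UnitState:
    """Hypothetical stand-in for the state the handler inspects."""

    cert_stored: bool = False          # new certificate written to disk and secret
    old_ca_in_truststore: bool = True  # old CA not yet removed from the truststore


def should_defer(state: UnitState, defer_on_old_ca: bool) -> bool:
    """Return True if the certificate-available handler should re-queue itself."""
    if not state.cert_stored:
        # The certificate is not stored yet, so retrying later is legitimate.
        return True
    # With defer_on_old_ca=True this mimics the behaviour described above:
    # the event is deferred although its own work is already done.
    return defer_on_old_ca and state.old_ca_in_truststore


def count_deferrals(defer_on_old_ca: bool, max_runs: int = 50) -> int:
    """Run the handler repeatedly and count how often it defers."""
    state = UnitState()
    deferrals = 0
    for _ in range(max_runs):
        state.cert_stored = True  # each run stores the certificate
        if not should_defer(state, defer_on_old_ca):
            break
        deferrals += 1
    return deferrals


if __name__ == "__main__":
    print("deferring on old CA:", count_deferrals(defer_on_old_ca=True))
    print("not deferring on old CA:", count_deferrals(defer_on_old_ca=False))
```

In this model, deferring on the old CA keeps re-queuing the event until the run cap is reached, while dropping that condition lets the handler finish on its first run.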

reneradoi commented 2 days ago

Issues are addressed in https://github.com/canonical/opensearch-operator/pull/436.
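
For anyone debugging the first root cause locally, the following is a rough sketch of a pre-check that a certificate and private key belong together before they are exported. It is an assumption on my side, not what the operator does (the log above shows it calling `openssl pkcs12`): it uses Python's standard `ssl` module, and `cert_matches_key` is a hypothetical helper name. Note that it also returns `False` when either file cannot be parsed as PEM, so it does not distinguish a mismatch from a malformed file.

```python
import ssl


def cert_matches_key(cert_path: str, key_path: str) -> bool:
    """Return True if the PEM certificate and PEM private key load as a pair.

    ssl.SSLContext.load_cert_chain raises ssl.SSLError when the private key
    does not match the certificate, or when a file is not valid PEM.
    A missing file raises FileNotFoundError, which is left to propagate.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except ssl.SSLError:
        return False
    return True


if __name__ == "__main__":
    import sys

    # Usage: python check_pair.py <cert.pem> <key.pem>
    cert_file, key_file = sys.argv[1], sys.argv[2]
    print("match" if cert_matches_key(cert_file, key_file) else "mismatch or invalid PEM")
```

Running such a check before the export step would surface the "key values mismatch" condition early, instead of during the `openssl pkcs12 -export` call.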