canonical / tempo-k8s-operator

This charmed operator automates the operational procedures of running Grafana Tempo, an open-source tracing backend.
https://charmhub.io/tempo-k8s
Apache License 2.0
3 stars 3 forks source link

When related to a certificates operator, charm_tracing provides no way to wait for client charm to settle and form relations before TLS errors thrown #121

Closed shayancanonical closed 3 weeks ago

shayancanonical commented 1 month ago

Bug Description

When tempo-k8s is related to a certificates operator, the charm_tracing lib produces a flood of errors when the client application (mysql) is related to tempo-k8s but not the certificates operator. My understanding is that the trace_charm decorator does not check if the relation with the certificates operator exists before trying to export spans to tempo-k8s (it always calls and utilizes the server_cert kwarg in the trace_charm decorator)

To Reproduce

In a microk8s model: 1.git clone git@github.com:canonical/cos-lite-bundle.git

  1. tox -e render-edge
  2. juju deploy ./bundle.yaml --edge
  3. juju deploy -n 1 tempo-k8s --channel edge
  4. juju deploy -n 1 self-signed-certificates
  5. jhack imatrix fill
  6. juju offer tempo-k8s:tracing
  7. juju offer self-signed-certificates:certificates

In an lxd model:

  1. git clone git@github.com:canonical/mysql-operator.git
  2. git checkout 6817e0a
  3. tox -e build-production
  4. juju deploy -n 1 ./mysql_ubuntu-22.04-amd64.charm
  5. juju consume :admin/.tempo-k8s
  6. juju consume :admin/.self-signed-certificates
  7. juju relate mysql:tracing tempo-k8s:tracing

Environment

juju: 3.4.2 lxd: 5.21.1 LTS MicroK8s v1.27.13 revision 6744 cert_handler.py: v1.8

Relevant log output

When None is passed in as server_cert to the trace_charm decorator

unit-mysql-router-2: 12:55:55 ERROR unit.mysql-router/2.juju-log backend-database:43: Exception while exporting Span batch.
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/urllib3/connectionpool.py", line 714, in urlopen
    httplib_response = self._make_request(
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/urllib3/connectionpool.py", line 403, in _make_request
    self._validate_conn(conn)
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/urllib3/connectionpool.py", line 1053, in _validate_conn
    conn.connect()
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/urllib3/util/ssl_.py", line 453, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(sock, context, tls_in_tls)
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/urllib3/util/ssl_.py", line 495, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock)
  File "/usr/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/usr/lib/python3.10/ssl.py", line 1100, in _create
    self.do_handshake()
  File "/usr/lib/python3.10/ssl.py", line 1371, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1007)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/urllib3/connectionpool.py", line 798, in urlopen
    retries = retries.increment(
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.0.0.44', port=4318): Max retries exceeded with url: /v1/traces (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1007)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/opentelemetry/sdk/trace/export/__init__.py", line 368, in _export_batch
    self.span_exporter.export(self.spans_list[:idx])  # type: ignore
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py", line 145, in export
    resp = self._export(serialized_data)
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py", line 114, in _export
    return self._session.post(
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-2/charm/venv/requests/adapters.py", line 517, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='10.0.0.44', port=4318): Max retries exceeded with url: /v1/traces (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1007)')))

When related with tempo-k8s, but no relation with self-signed-certificates while providing a server_cert kwarg:

unit-mysql-router-1: 12:10:56 ERROR unit.mysql-router/1.juju-log backend-database:34: Exception while exporting Span batch.
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/opentelemetry/sdk/trace/export/__init__.py", line 368, in _export_batch
    self.span_exporter.export(self.spans_list[:idx])  # type: ignore
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py", line 145, in export
    resp = self._export(serialized_data)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py", line 114, in _export
    return self._session.post(
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/adapters.py", line 458, in send
    self.cert_verify(conn, request.url, verify, cert)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/adapters.py", line 261, in cert_verify
    raise OSError(
OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /var/run/tracing-ca.crt
unit-mysql-router-1: 12:11:01 ERROR unit.mysql-router/1.juju-log backend-database:34: Exception while exporting Span batch.
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/opentelemetry/sdk/trace/export/__init__.py", line 368, in _export_batch
    self.span_exporter.export(self.spans_list[:idx])  # type: ignore
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py", line 145, in export
    resp = self._export(serialized_data)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py", line 114, in _export
    return self._session.post(
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/adapters.py", line 458, in send
    self.cert_verify(conn, request.url, verify, cert)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/adapters.py", line 261, in cert_verify
    raise OSError(
OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /var/run/tracing-ca.crt
unit-mysql-router-1: 12:11:01 ERROR unit.mysql-router/1.juju-log backend-database:34: Exception while exporting Span batch.
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/opentelemetry/sdk/trace/export/__init__.py", line 368, in _export_batch
    self.span_exporter.export(self.spans_list[:idx])  # type: ignore
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py", line 145, in export
    resp = self._export(serialized_data)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/opentelemetry/exporter/otlp/proto/http/trace_exporter/__init__.py", line 114, in _export
    return self._session.post(
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/adapters.py", line 458, in send
    self.cert_verify(conn, request.url, verify, cert)
  File "/var/lib/juju/agents/unit-mysql-router-1/charm/venv/requests/adapters.py", line 261, in cert_verify
    raise OSError(
OSError: Could not find a suitable TLS CA certificate bundle, invalid path: /var/run/tracing-ca.crt

Additional context

No response

PietroPasotti commented 1 month ago

I added last week functionality to only pop that error once per event. But you're still going to get the error if you give charm_tracing a path to a cert but the path isn't valid, or if you're not passing a cert but tempo is expecting https communication.

Indeed, charm_tracing has no knowledge of the relations your charm has. All it knows is whether you're passing a cert_path to it or not. If yes, it attempts to use it to talk to an https endpoint and fails if either the file isn't there, or the server rejects the connection.

We're aware of a race due to the timing of the root span creation. Issue is, the charm sets up the cert after init (while handling the certificates-relation-something event), but charm_tracing does its thing BEFORE charm init, so it will always give an error the first time. It should go away with the next event though.

Are you seeing this event after the relation is healthy and the cert has been written to disk?

@mmkay we could consider making that error less verbose and just saying 'failed to send trace over https', and put the whole traceback in debug. WDYT?

shayancanonical commented 1 month ago

Can confirm that we are seeing this error trace when the relation with the certificates operator has not yet been formed and the relation is not healthy if it has been formed

Consideration: would it be too convoluted for the trace_charm decorator to have a healthcheck like kwarg (similar to how we do self.tracing.is_ready() before returning a tracing endpoint), and only attempt to export traces to tempo if the healthcheck passes? the TracingEndpointRequirer can expose a method that returns whether the tempo unit has HTTPS enabled, which can be used in the healthcheck property (on the client charm code) to confirm that the relation with the certificates operator exists and that the CA cert has been saved to disk. Furthermore, we can move the self.tracing.is_ready() check to the healthcheck property as well

PietroPasotti commented 1 month ago

we can silence the warnings, that's not the issue, the question is how to do that without hiding too many actual errors. If your tracing instrumentation isn't sending traces, you need to know. Maybe getting a couple of false positives every time you are setting it up is unavoidable.

It feels hard to generalize what that healthcheck should look like, and you can already disable charm tracing by returning None:

    from charms.tempo_k8s.v1.charm_tracing import trace_charm

    @trace_charm(
        tracing_endpoint="my_tracing_endpoint",
    )
    class MyCharm(CharmBase):

    ...
        @property
        def charm_tracing_ready():
            # custom healthcheck
            return tracing.is_ready() and self.certificates.is_ready() and self.has_written_cert_to_disk() and ...  # custom checks

        @property
        def my_tracing_endpoint(self) -> Optional[str]:
            if self.charm_tracing_ready:
                return self.tracing.otlp_http_endpoint()
            else:
                return None  # charm tracing disabled for the run

does this work for you? Or does it not solve the issue?

shayancanonical commented 1 month ago

@PietroPasotti you're right, i hadnt thought of returning None for the tracing endpoint if the charm is not ready to send traces (i.e. the certificates relation is not in place). however, how would the charm be able to tell if tempo-k8s is using a TLS cert from a cert operator? would it still make sense to have TracingEndpointRequirer provide this information?

PietroPasotti commented 3 weeks ago

I think we have a neat solution cooking, stay tuned https://github.com/canonical/tempo-k8s-operator/pull/135