canonical / traefik-k8s-operator

https://charmhub.io/traefik-k8s
Apache License 2.0

Traefik ends up in error state. #331

Closed Abuelodelanada closed 5 months ago

Abuelodelanada commented 6 months ago

Bug Description

Traefik ends up in error state when deploying COS-Lite bundle using the TLS overlay.

This issue seems to be related to https://github.com/canonical/traefik-k8s-operator/issues/330

To Reproduce

  1. Deploy cos-lite using the TLS and Offers overlays: juju deploy cos-lite --channel=edge --trust --overlay ./tls-overlay.yaml --overlay ./offers-overlay.yaml
  2. Once the deployment is ready, check that traefik is in error state:
    
    Model  Controller  Cloud/Region        Version  SLA          Timestamp
    cos    microk8s    microk8s/localhost  3.4.2    unsupported  12:17:40-03:00

    App           Version  Status   Scale  Charm                     Channel      Rev  Address         Exposed  Message
    alertmanager  0.27.0   active       1  alertmanager-k8s          latest/edge  109  10.152.183.146  no
    ca                     active       1  self-signed-certificates  latest/edge  135  10.152.183.30   no
    catalogue              active       1  catalogue-k8s             latest/edge   38  10.152.183.67   no
    grafana       9.5.3    active       1  grafana-k8s               latest/edge  111  10.152.183.117  no
    loki          2.9.5    active       1  loki-k8s                  latest/edge  135  10.152.183.142  no
    prometheus    2.50.1   active       1  prometheus-k8s            latest/edge  176  10.152.183.150  no
    traefik       v2.11.0  waiting      1  traefik-k8s               latest/edge  180  192.168.1.250   no       installing agent

    Unit            Workload  Agent  Address      Ports  Message
    alertmanager/0  active    idle   10.1.165.27
    ca/0            active    idle   10.1.165.26
    catalogue/0     active    idle   10.1.165.37
    grafana/0       active    idle   10.1.165.41
    loki/0          active    idle   10.1.165.32
    prometheus/0    active    idle   10.1.165.23
    traefik/0*      error     idle   10.1.165.48         hook failed: "certificates-relation-changed" for ca:certificates

3. After running `juju resolve --no-retry traefik/0` traefik ends up in `active` state.

    Model  Controller  Cloud/Region        Version  SLA          Timestamp
    cos    microk8s    microk8s/localhost  3.4.2    unsupported  12:23:22-03:00

    App           Version  Status  Scale  Charm                     Channel      Rev  Address         Exposed  Message
    alertmanager  0.27.0   active      1  alertmanager-k8s          latest/edge  109  10.152.183.146  no
    ca                     active      1  self-signed-certificates  latest/edge  135  10.152.183.30   no
    catalogue              active      1  catalogue-k8s             latest/edge   38  10.152.183.67   no
    grafana       9.5.3    active      1  grafana-k8s               latest/edge  111  10.152.183.117  no
    loki          2.9.5    active      1  loki-k8s                  latest/edge  135  10.152.183.142  no
    prometheus    2.50.1   active      1  prometheus-k8s            latest/edge  176  10.152.183.150  no
    traefik       v2.11.0  active      1  traefik-k8s               latest/edge  180  192.168.1.250   no

    Unit            Workload  Agent  Address      Ports  Message
    alertmanager/0  active    idle   10.1.165.27
    ca/0            active    idle   10.1.165.26
    catalogue/0     active    idle   10.1.165.37
    grafana/0       active    idle   10.1.165.41
    loki/0          active    idle   10.1.165.32
    prometheus/0    active    idle   10.1.165.23
    traefik/0*      active    idle   10.1.165.48


### Environment

- juju 3.4.2
- microk8s v1.29.2
- traefik Rev 180

### Relevant log output

```shell
unit-traefik-0: 12:17:14.599 DEBUG unit.traefik/0.juju-log certificates:27: Emitting custom event <CertChanged via TraefikIngressCharm/CertHandler[trfk-server-cert]/on/cert_changed[77]>.
unit-traefik-0: 12:17:14.604 ERROR unit.traefik/0.juju-log certificates:27: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 1076, in <module>
    main(TraefikIngressCharm, use_juju_for_storage=True)
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/main.py", line 544, in main
    manager.run()
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/main.py", line 520, in run
    self._emit()
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/main.py", line 509, in _emit
    _emit_charm_event(self.charm, self.dispatcher.event_name)
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/main.py", line 143, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-traefik-0/charm/lib/charms/tls_certificates_interface/v3/tls_certificates.py", line 1801, in _on_relation_changed
    self.on.certificate_available.emit(
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-traefik-0/charm/lib/charms/observability_libs/v0/cert_handler.py", line 308, in _on_certificate_available
    self.on.cert_changed.emit()  # pyright: ignore
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/framework.py", line 851, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-traefik-0/charm/venv/ops/framework.py", line 941, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-traefik-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 547, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "./src/charm.py", line 394, in _on_cert_changed
    self._update_cert_configs()
  File "/var/lib/juju/agents/unit-traefik-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 547, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "./src/charm.py", line 399, in _update_cert_configs
    cert, key, ca = self._get_certs()
  File "/var/lib/juju/agents/unit-traefik-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 547, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "./src/charm.py", line 405, in _get_certs
    raise TLSNotEnabledError()
TLSNotEnabledError
unit-traefik-0: 12:17:14.864 ERROR juju.worker.uniter.operation hook "certificates-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1
```

Additional context

No response

PietroPasotti commented 6 months ago

Hitting this too now.

Repro steps:

juju deploy traefik-k8s --channel edge traefik
juju deploy self-signed-certificates ssc
juju relate traefik:certificates ssc

jhack eval traefik/0 self._get_certs()
# None, None, None

PietroPasotti commented 6 months ago

I suspect traefik is looking up some data in secrets, but the remote end is publishing it via relation data, in the clear.

sed-i commented 6 months ago

I tried to reproduce this with what I currently have installed: Juju 3.4.2 on an 8-CPU, 16 GB multipass VM. 3/3 attempts went totally fine. This looks like either a code-ordering issue or a stable vs. edge difference in self-signed-certificates.

bundle: kubernetes
applications:
  prom:
    charm: prometheus-k8s
    channel: latest/edge
    revision: 182
    scale: 1
    trust: true
  ssc:
    charm: self-signed-certificates
    channel: latest/edge
    revision: 137
    scale: 1
  trfk:
    charm: traefik-k8s
    channel: latest/edge
    revision: 184
    scale: 1
    trust: true
relations:
- - prom:ingress
  - trfk:ingress-per-unit
- - ssc:certificates
  - trfk:certificates

I'm seeing

INFO unit.trfk/0.juju-log Creating CSR for 10.43.8.188 with DNS [] and IPs ['10.43.8.188']

and prometheus is reachable:

$ curl -k 10.43.8.188/rwdrop-prom-0/api/v1/targets
{"status":"success","data":{"activeTargets": ...}}

Also, traefik-k8s from commit 5a1a160 doesn't have TLSNotEnabledError anywhere.

sed-i commented 6 months ago

But end-to-end TLS seems to be broken - curl https works from trfk container but not from outside:

relations:
- - prom:ingress
  - trfk:ingress-per-unit
- - ssc:certificates
  - trfk:certificates
- - ssc:certificates
  - prom:certificates
- - ssc:send-ca-cert
  - trfk:receive-ca-cert

$ juju ssh --container traefik trfk/0 curl https://prom-0.prom-endpoints.rwdrop.svc.cluster.local:9090/api/v1/targets
{"status":"success","data":{"activeTargets":...}}

$ curl -v -L -k https://10.43.8.188/rwdrop-prom-0/api/v1/targets
*   Trying 10.43.8.188:443...
* Connected to 10.43.8.188 (10.43.8.188) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS header, Unknown (21):
* TLSv1.3 (IN), TLS alert, unrecognized name (624):
* error:0A000458:SSL routines::tlsv1 unrecognized name
* Closing connection 0
curl: (35) error:0A000458:SSL routines::tlsv1 unrecognized name

And trfk itself complains:

2024-05-01T04:31:13.051Z [traefik] time="2024-05-01T04:31:13Z" level=debug msg="Serving default certificate for request: \"\""
2024-05-01T04:31:13.051Z [traefik] time="2024-05-01T04:31:13Z" level=debug msg="http: TLS handshake error from 10.43.8.188:41134: tls: no certificates configured"

...because certs config and certs are missing from /etc/traefik. Seems like another potential code ordering problem.

PietroPasotti commented 6 months ago

... and they're not there because they're not being pushed, as _get_certs returns None, None, None. I think it's the same issue.

PietroPasotti commented 6 months ago

I tried again after redeploying ssc from stable and traefik from edge. The data is present in the certificates relation data, and some of it gets transferred into a secret for traefik-internal usage, but not all of it: traefik stores a private key but not the CA cert or the server cert.
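To illustrate why a partially populated store ends in a hook failure, here is a minimal sketch in plain Python. It is not the charm's code and not the fix that was shipped: `CertStore`, `get_certs`, and `on_cert_changed` are hypothetical stand-ins, and only the `TLSNotEnabledError` name and the `cert, key, ca` ordering come from the traceback above. It assumes the handler can treat incomplete TLS material as "not ready yet" instead of letting the exception propagate.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class TLSNotEnabledError(Exception):
    """Stand-in for the exception named in the traceback."""


@dataclass
class CertStore:
    """Hypothetical holder for the three pieces of TLS material."""

    private_key: Optional[str] = None
    server_cert: Optional[str] = None
    ca_cert: Optional[str] = None

    def is_complete(self) -> bool:
        # TLS is only usable when all three pieces are present.
        return all((self.private_key, self.server_cert, self.ca_cert))


def get_certs(store: CertStore) -> Tuple[str, str, str]:
    """Return (cert, key, ca), or raise if anything is missing."""
    if not store.is_complete():
        raise TLSNotEnabledError()
    return store.server_cert, store.private_key, store.ca_cert


def on_cert_changed(store: CertStore) -> str:
    """Hypothetical handler: report a status instead of crashing the hook."""
    try:
        get_certs(store)
    except TLSNotEnabledError:
        # Only part of the material has arrived; wait for the next event.
        return "waiting"
    return "active"
```

With only a private key stored, as described above, `on_cert_changed(CertStore(private_key="k"))` returns `"waiting"` rather than raising, which is the behaviour an uncaught `TLSNotEnabledError` prevents.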

PietroPasotti commented 6 months ago

This should be fixed by certhandler 1.7.

ca-scribner commented 5 months ago

This affects revisions below 186, including latest/stable at the time of writing (rev180), as reported by SolQA. I've confirmed that latest/stable has this issue and that latest/candidate (rev191) is not affected, so I think the best resolution here is to get candidate promoted to stable.
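A quick way to check a model against the revision threshold mentioned above might look like the following sketch. It assumes the status has already been loaded into a dict (for example from `juju status --format=json`) and that each application entry carries `charm-name` and `charm-rev` keys; those key names and the helper `affected_apps` are assumptions for illustration, so verify them against your own output.

```python
from typing import Any, Dict, List

# Per the comment above, revisions below 186 are affected.
FIRST_UNAFFECTED_REV = 186


def affected_apps(status: Dict[str, Any], charm_name: str = "traefik-k8s") -> List[str]:
    """Return application names running an affected revision of charm_name.

    Assumes status["applications"][app] has "charm-name" and "charm-rev" keys.
    """
    affected = []
    for app, info in status.get("applications", {}).items():
        if info.get("charm-name") != charm_name:
            continue
        rev = info.get("charm-rev")
        if isinstance(rev, int) and rev < FIRST_UNAFFECTED_REV:
            affected.append(app)
    return sorted(affected)
```

Applications listed by this helper would be candidates for a refresh to a revision at or above 186.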

ca-scribner commented 5 months ago

Resolved by the promotion of candidate (rev191) to stable.