canonical / traefik-k8s-operator

https://charmhub.io/traefik-k8s
Apache License 2.0

Issue with integration with self-signed-certificates since revision 174 on latest/stable #322

Closed Thanhphan1147 closed 5 months ago

Thanhphan1147 commented 5 months ago

Bug Description

Hi, thanks for your work on traefik!

I noticed that since revision 174 was promoted to latest/stable, integration with the self-signed-certificates charm has been broken.

Here is an extract of the error:

unit-traefik-public-0: 12:23:02 ERROR unit.traefik-public/0.juju-log certificates:1: Uncaught exception while in charm code:
Traceback (most recent call last):
  File "./src/charm.py", line 1032, in <module>
    main(TraefikIngressCharm, use_juju_for_storage=True)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/main.py", line 456, in main
    _emit_charm_event(charm, dispatcher.event_name)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/main.py", line 144, in _emit_charm_event
    event_to_emit.emit(*args, **kwargs)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/framework.py", line 865, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/framework.py", line 955, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/lib/charms/tls_certificates_interface/v2/tls_certificates.py", line 1824, in _on_relation_changed
    self.on.certificate_available.emit(
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/framework.py", line 865, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/framework.py", line 955, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/lib/charms/observability_libs/v0/cert_handler.py", line 305, in _on_certificate_available
    self.on.cert_changed.emit()  # pyright: ignore
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/framework.py", line 352, in emit
    framework._emit(event)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/framework.py", line 865, in _emit
    self._reemit(event_path)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/framework.py", line 955, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 532, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "./src/charm.py", line 378, in _on_cert_changed
    self._update_cert_configs()
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/lib/charms/tempo_k8s/v1/charm_tracing.py", line 532, in wrapped_function
    return callable(*args, **kwargs)  # type: ignore
  File "./src/charm.py", line 384, in _update_cert_configs
    self.traefik.update_cert_configuration(
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/src/traefik.py", line 154, in update_cert_configuration
    self.update_ca_certs()
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/src/traefik.py", line 195, in update_ca_certs
    self.restart()
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/src/traefik.py", line 578, in restart
    self._container.restart(self.service_name)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/model.py", line 2136, in restart
    self._pebble.restart_services(service_names)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/pebble.py", line 1962, in restart_services
    return self._services_action('restart', services, timeout, delay)
  File "/var/lib/juju/agents/unit-traefik-public-0/charm/venv/ops/pebble.py", line 1983, in _services_action
    raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "traefik" (cannot start service: exited quickly with code 1)
----- Logs from task 0 -----
2024-04-11T10:23:02Z INFO Service "traefik" has never been started.
----- Logs from task 1 -----
2024-04-11T10:23:02Z INFO Most recent service output:
    2024/04/11 10:23:02 command /bin/traefik error: field not found, node: [0]
2024-04-11T10:23:02Z ERROR cannot start service: exited quickly with code 1
-----
unit-traefik-public-0: 12:23:02 ERROR juju.worker.uniter.operation hook "certificates-relation-changed" (via hook dispatching script: dispatch) failed: exit status 1

It seems like there might be a syntax error somewhere in Traefik's YAML configuration.
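The line that matters in a Pebble ChangeError like the one above is the service's own output, which is easy to miss under the traceback. A minimal sketch for pulling it out of the error text follows. The function name `extract_command_errors` is made up for illustration, and it assumes the text is available as a plain string (for example from debug-log output, or from str() of the caught exception, which is not verified here).

```python
import re
from typing import List, Tuple

# Sample text copied from the log above.
CHANGE_ERROR_TEXT = '''cannot perform the following tasks:
- Start service "traefik" (cannot start service: exited quickly with code 1)
----- Logs from task 0 -----
2024-04-11T10:23:02Z INFO Service "traefik" has never been started.
----- Logs from task 1 -----
2024-04-11T10:23:02Z INFO Most recent service output:
    2024/04/11 10:23:02 command /bin/traefik error: field not found, node: [0]
2024-04-11T10:23:02Z ERROR cannot start service: exited quickly with code 1
-----'''

# Matches lines such as "... command /bin/traefik error: <message>".
_COMMAND_ERROR = re.compile(r"command (?P<binary>\S+) error: (?P<message>.+)$")


def extract_command_errors(text: str) -> List[Tuple[str, str]]:
    """Return (binary, message) pairs for every 'command ... error:' line."""
    found = []
    for line in text.splitlines():
        match = _COMMAND_ERROR.search(line)
        if match:
            found.append((match.group("binary"), match.group("message").strip()))
    return found


if __name__ == "__main__":
    for binary, message in extract_command_errors(CHANGE_ERROR_TEXT):
        print(f"{binary}: {message}")
```

Here the extracted message is the one reported by /bin/traefik itself, which points at the configuration Traefik was started with rather than at the charm's Python code.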

To Reproduce

  1. juju deploy traefik-k8s traefik --channel=latest/edge --revision=174
  2. juju deploy self-signed-certificates --channel=latest/edge --revision=127
  3. juju relate traefik:certificates self-signed-certificates

Environment

latest/stable, revision 174


Additional context

No response

PietroPasotti commented 5 months ago

thanks, I'll take a look

PietroPasotti commented 5 months ago

Looks like I can't reproduce on my localhost. [image]

Any additional information you can add? Platform? Juju version?

Thanhphan1147 commented 5 months ago

Yes, here is my environment

ubuntu@ubuntu:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:    22.04
Codename:   jammy

ubuntu@ubuntu:~$ sudo microk8s version
MicroK8s v1.28.7 revision 6532

ubuntu@ubuntu:~$ juju status --relations --storage
Model    Controller  Cloud/Region        Version  SLA          Timestamp
jenkins  controller  microk8s/localhost  3.1.7    unsupported  15:20:43+02:00

App                       Version  Status   Scale  Charm                     Channel        Rev  Address         Exposed  Message
self-signed-certificates           active       1  self-signed-certificates  latest/edge    127  10.152.183.92   no       
stable                             active       1  self-signed-certificates  latest/stable   72  10.152.183.233  no       
traefik-public            v2.11.0  waiting      1  traefik-k8s               latest/edge    174  10.9.2.103      no       installing agent

Unit                         Workload     Agent  Address       Ports  Message
self-signed-certificates/0*  active       idle   10.1.243.227         
stable/0*                    active       idle   10.1.243.207         
traefik-public/0*            maintenance  idle   10.1.243.215         restarting traefik...

Integration provider                   Requirer                     Interface         Type     Message
self-signed-certificates:certificates  traefik-public:certificates  tls-certificates  regular  
traefik-public:peers                   traefik-public:peers         traefik_peers     peer     

Storage Unit      Storage ID        Type        Pool        Mountpoint                              Size     Status    Message
traefik-public/0  configurations/0  filesystem  kubernetes  /var/lib/juju/storage/configurations/0  1.0 GiB  attached  Successfully provisioned volume pvc-7be1b3b7-c985-4d23-9e31-fe85077b42c1

ubuntu@ubuntu:~$ juju --version
3.1.7-genericlinux-amd64

Thanhphan1147 commented 5 months ago

juju status showed traefik restarting and I can see the error in debug-log:

ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "traefik" (cannot start service: exited quickly with code 1)
----- Logs from task 1 -----
2024-04-11T13:16:24Z INFO Most recent service output:
    (...)
    time="2024-04-11T13:16:22Z" level=debug msg="Creating middleware" entryPointName=diagnostics middlewareName=metrics-entrypoint middlewareType=Metrics
    time="2024-04-11T13:16:22Z" level=debug msg="Creating middleware" middlewareType=Metrics entryPointName=web middlewareName=metrics-entrypoint
    time="2024-04-11T13:16:22Z" level=debug msg="Creating middleware" middlewareType=Metrics entryPointName=websecure middlewareName=metrics-entrypoint
    time="2024-04-11T13:16:24Z" level=info msg="I have to go..."
    time="2024-04-11T13:16:24Z" level=info msg="Stopping server gracefully"
    time="2024-04-11T13:16:24Z" level=debug msg="Waiting 10s seconds before killing connections." entryPointName=websecure
    time="2024-04-11T13:16:24Z" level=debug msg="Waiting 10s seconds before killing connections." entryPointName=web
    time="2024-04-11T13:16:24Z" level=error msg="accept tcp [::]:443: use of closed network connection" entryPointName=websecure
    time="2024-04-11T13:16:24Z" level=error msg="close tcp [::]:443: use of closed network connection" entryPointName=websecure
    time="2024-04-11T13:16:24Z" level=debug msg="Entry point websecure closed" entryPointName=websecure
    time="2024-04-11T13:16:24Z" level=debug msg="Waiting 10s seconds before killing connections." entryPointName=diagnostics
    time="2024-04-11T13:16:24Z" level=error msg="accept tcp [::]:80: use of closed network connection" entryPointName=web
    time="2024-04-11T13:16:24Z" level=error msg="close tcp [::]:80: use of closed network connection" entryPointName=web
    time="2024-04-11T13:16:24Z" level=error msg="accept tcp [::]:8082: use of closed network connection" entryPointName=diagnostics
    time="2024-04-11T13:16:24Z" level=error msg="close tcp [::]:8082: use of closed network connection" entryPointName=diagnostics
    time="2024-04-11T13:16:24Z" level=debug msg="Entry point web closed" entryPointName=web
    time="2024-04-11T13:16:24Z" level=debug msg="Entry point diagnostics closed" entryPointName=diagnostics
    time="2024-04-11T13:16:24Z" level=info msg="Server stopped"
    time="2024-04-11T13:16:24Z" level=info msg="Shutting down"
    2024/04/11 13:16:24 command /bin/traefik error: field not found, node: [0]
2024-04-11T13:16:24Z ERROR cannot start service: exited quickly with code 1
-----

PietroPasotti commented 5 months ago

that /bin/traefik error: field not found, node: [0] does ring a bell

Thanhphan1147 commented 5 months ago

I tore down the environment and redeployed the two charms, and weirdly enough the error didn't show up this time...

facundofc commented 5 months ago

In case it helps, I'm pretty sure this introduced the issue.

PietroPasotti commented 5 months ago

It might be related, but as far as I can tell the syntax is correct. If you manage to reproduce it again, can you share juju ssh --container traefik traefik/0 cat /etc/traefik/traefik.yaml?

PietroPasotti commented 5 months ago

ah! this might be something. 'redirections' needs to be a dict, not a list (unsure about the yaml terminology)

should look like: [image]

looks like: [image]

still, weird that it only errors out sometimes?
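The two screenshots are not preserved here, so as a rough illustration of the dict-versus-list point, here is a minimal sketch. It assumes the static config has already been parsed into a Python dict (for instance with a YAML parser), and the function name `find_list_shaped_http` is hypothetical. The mapping shape follows Traefik's documented `entryPoints.<name>.http.redirections.entryPoint` layout.

```python
from typing import Any, Dict, List


def find_list_shaped_http(static_config: Dict[str, Any]) -> List[str]:
    """Return names of entry points whose 'http' value is a list, not a mapping."""
    bad = []
    for name, entry_point in static_config.get("entryPoints", {}).items():
        if isinstance(entry_point.get("http"), list):
            bad.append(name)
    return bad


# 'http' as a one-item list: 'redirections' sits under a "- " item.
list_shaped = {
    "entryPoints": {
        "web": {
            "address": ":80",
            "http": [
                {"redirections": {"entryPoint": {"scheme": "https", "to": "websecure"}}}
            ],
        },
        "websecure": {"address": ":443"},
    }
}

# 'http' as a mapping: 'redirections' is a plain key under 'http'.
mapping_shaped = {
    "entryPoints": {
        "web": {
            "address": ":80",
            "http": {
                "redirections": {"entryPoint": {"scheme": "https", "to": "websecure"}}
            },
        },
        "websecure": {"address": ":443"},
    }
}

if __name__ == "__main__":
    print(find_list_shaped_http(list_shaped))
    print(find_list_shaped_http(mapping_shaped))
```

In YAML terms, the list shape is the one where `redirections` appears after a leading `- ` under `http:`.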

Thanhphan1147 commented 5 months ago

I did teardown/deploy multiple times since and I couldn't reproduce it in the same environment. here is traefik.yaml when there's NO error:

ubuntu@ubuntu:~$ juju ssh --container traefik traefik-k8s/0 cat /etc/traefik/traefik.yaml
entryPoints:
  diagnostics:
    address: :8082
  web:
    address: :80
    http:
    - redirections:
        entryPoint:
          scheme: https
          to: websecure
  websecure:
    address: :443
log:
  level: DEBUG
metrics:
  prometheus:
    addRoutersLabels: true
    addServicesLabels: true
    entryPoint: diagnostics
ping:
  entryPoint: diagnostics
providers:
  file:
    directory: /opt/traefik/juju
    watch: true
ubuntu@ubuntu:~$

Thanhphan1147 commented 5 months ago

here's one of our CI runs that failed with the error: https://github.com/canonical/jenkins-k8s-operator/actions/runs/8639246022/job/23685277058?pr=140

PietroPasotti commented 5 months ago

ah yes, so in the revision that's giving trouble, the code is: [image]

which generates the wrong config

on main, that's been fixed already by @sed-i at 4d3ccc6160d0404601bfe995cdeb7b86d2d7f254

so I guess that's a matter of using edge, waiting until the fix reaches stable, or if it's a big issue for you, ask our magnanimous @simskij to speed up the release train.
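Neither the code in the screenshot nor the contents of the fixing commit are reproduced in this thread, so the following is only a sketch of the general idea: build the entry points so that `http` is a mapping. The helper `build_entry_points` and its parameter are hypothetical and are not the charm's actual code. The addresses are the ones shown in the traefik.yaml posted earlier.

```python
import json
from typing import Any, Dict


def build_entry_points(redirect_http_to_https: bool) -> Dict[str, Any]:
    """Build the 'entryPoints' section of a Traefik static configuration.

    Hypothetical helper for illustration; not the charm's actual code.
    """
    web: Dict[str, Any] = {"address": ":80"}
    if redirect_http_to_https:
        # 'http' is a mapping, so 'redirections' is one of its keys rather
        # than the first item of a list.
        web["http"] = {
            "redirections": {
                "entryPoint": {"scheme": "https", "to": "websecure"}
            }
        }
    return {
        "diagnostics": {"address": ":8082"},
        "web": web,
        "websecure": {"address": ":443"},
    }


if __name__ == "__main__":
    print(json.dumps({"entryPoints": build_entry_points(True)}, indent=2))
```

Building the structure as nested dicts and serializing it once makes the list-versus-mapping distinction visible in the code itself, before anything is written to /etc/traefik/traefik.yaml.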