Charm is in maintenance but does not recover

beliaev-maksim commented 3 months ago

Bug Description

I have the charm deployed, and sometimes it goes into an unrecoverable maintenance state. I assume there might be some issue with LEGO on our side, and it feels like the charm does not retry, which leads to my workload switching to K8s self-signed certs.

prod-cla-checker@enterprise-engineering-bastion-ps6:~$ juju status
Model             Controller    Cloud/Region              Version  SLA          Timestamp
prod-cla-checker  prodstack-is  k8s-prod-general/default  3.1.8    unsupported  11:54:29Z

App                       Version  Status   Scale  Charm                     Channel  Rev  Address       Exposed  Message
charmed-cla-checker                active       1  charmed-cla-checker       edge       1  10.87.244.2   no       
httprequest-lego-k8s               waiting      1  httprequest-lego-k8s      stable    40  10.87.26.217  no       waiting for units to settle down
nginx-ingress-integrator  24.2.0   active       1  nginx-ingress-integrator  stable    95  10.87.48.165  no       Ingress IP(s): 10.141.14.128

Unit                         Workload     Agent  Address          Ports  Message
charmed-cla-checker/0*       active       idle   192.168.100.249         
httprequest-lego-k8s/0*      maintenance  idle   192.168.102.41          
nginx-ingress-integrator/0*  active       idle   192.168.103.29          Ingress IP(s): 10.141.14.128

then I have to run

  2024-05-28 11:55:52  juju remove-unit httprequest-lego-k8s --num-units 1
  2024-05-28 11:56:09  juju add-unit httprequest-lego-k8s
  2024-05-28 11:56:13  juju status

to recover

can this be fixed?

ghislainbourgeois commented 3 months ago

Would you be able to provide the juju debug-log for this unit? One thing that could be happening: we run lego as a separate process and wait for it to complete, so maybe the timeout mechanism is broken.
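
For readers unfamiliar with the pattern being discussed, here is a minimal sketch of running an external process with a timeout so that a hung process cannot block the caller forever. It is an illustration only, not the charm's code: the function name `run_with_timeout` is hypothetical, the Python interpreter stands in for the lego binary, and the standard library's `subprocess` stands in for whatever mechanism the charm actually uses to start lego.

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_s: float) -> bool:
    """Run cmd and return True only if it exits with status 0 within timeout_s."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if the process has not finished after timeout_s seconds.
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Report failure instead of waiting forever, so the caller can retry later.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in commands (assumption): a real caller would invoke lego here.
    quick = [sys.executable, "-c", "print('done')"]
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(quick, timeout_s=5.0))
    print(run_with_timeout(slow, timeout_s=0.5))
```

If the timeout path were missing or never triggered, the caller would stay blocked, which is one way a unit could appear stuck in a single state.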

beliaev-maksim commented 3 months ago

@ghislainbourgeois ProdStack cannot extract per-app logs; they are empty.

I can just run juju debug-log but that is pretty much useless

ghislainbourgeois commented 3 months ago

@beliaev-maksim in the debug-log, can you see the events that the unit received? I am mostly interested in the history of events before it went into that state.

beliaev-maksim commented 3 months ago

In the meantime, let me update to the latest revision.

But if you could look in parallel into what could be happening, that would be great.

beliaev-maksim commented 3 months ago

debuglog.txt

@ghislainbourgeois if you can find something

gruyaume commented 3 months ago

I'm pretty sure this issue was fixed when we moved to using the collect status event handler. In other words, if you refresh the charm you should be good to go.

beliaev-maksim commented 3 months ago

@ghislainbourgeois @gruyaume now it is even worse. Now I see all the charms active, but there is no certificate

that is a UX disaster...

gruyaume commented 3 months ago

Yes, the charm won't show up as error/blocked if it did not provide a certificate for a request. We are planning to add a field in the status to mention the number of certificate requests fulfilled (see #154), but the charm status itself will remain Active, as the charm is functioning correctly.

Status     Message
Active     "1/3 certificate requests fulfilled"
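
As a rough illustration of how such a message could be built (a sketch only; the helper name `fulfilled_message` is hypothetical and not taken from the charm or from #154):

```python
def fulfilled_message(fulfilled: int, total: int) -> str:
    """Build a status message such as '1/3 certificate requests fulfilled'."""
    if fulfilled < 0 or fulfilled > total:
        raise ValueError("fulfilled must be between 0 and total")
    return f"{fulfilled}/{total} certificate requests fulfilled"


if __name__ == "__main__":
    print(fulfilled_message(1, 3))
```

With a message like this, an Active unit that has not delivered any certificate would at least show "0/1" instead of an empty message.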

beliaev-maksim commented 3 months ago

@gruyaume what could be done to make the charm re-request the certs?

I do not want to scale up/down every day

gruyaume commented 3 months ago

This is already done on update-status events: every 5 minutes (or whatever update-status interval is set on the model), the charm looks at the outstanding certificate requests and re-requests them.
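
The retry behaviour described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the charm's implementation: `CertificateRequest`, `retry_outstanding`, and the `issue` callback are hypothetical names, and `issue` is assumed to return `None` when the request fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CertificateRequest:
    csr: str
    certificate: Optional[str] = None  # None means the request is still outstanding


def retry_outstanding(
    requests: List[CertificateRequest],
    issue: Callable[[str], Optional[str]],
) -> int:
    """Retry every request that has no certificate yet.

    Returns the number of requests that are still outstanding afterwards.
    """
    for request in requests:
        if request.certificate is None:
            # issue() is assumed to return None on failure, which leaves
            # the request outstanding for the next periodic run.
            request.certificate = issue(request.csr)
    return sum(1 for request in requests if request.certificate is None)
```

Calling a function like this from the periodic update-status handler means a transient failure is retried on the next interval without any operator action.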

ghislainbourgeois commented 3 months ago

From what I investigated yesterday, the current version does not set the status to maintenance at all. So the previous issue should not reoccur.

I think we can definitely improve the logging and what we set in the status.

We also have some plans to get rid of the workload completely, making this charm K8s- or machine-agnostic. That will also help us get more control over the certificate request process.

gruyaume commented 3 months ago

I'm going to close this as the original issue was addressed. The charm status message item is tracked in issue #154.

beliaev-maksim commented 3 months ago

latest deployment

$ juju status
Model             Controller    Cloud/Region              Version  SLA          Timestamp
prod-cla-checker  prodstack-is  k8s-prod-general/default  3.1.8    unsupported  08:40:58Z

App                       Version  Status  Scale  Charm                     Channel  Rev  Address       Exposed  Message
charmed-cla-checker                active      1  charmed-cla-checker       edge       1  10.87.244.2   no       
httprequest-lego-k8s               active      1  httprequest-lego-k8s      stable    83  10.87.26.217  no       
nginx-ingress-integrator  24.2.0   active      1  nginx-ingress-integrator  stable    95  10.87.48.165  no       Ingress IP(s): 10.141.14.128

Unit                         Workload  Agent  Address          Ports  Message
charmed-cla-checker/0*       active    idle   192.168.100.249         
httprequest-lego-k8s/0*      active    idle   192.168.102.43          
nginx-ingress-integrator/0*  active    idle   192.168.103.29          Ingress IP(s): 10.141.14.128

gruyaume commented 3 months ago

Reopening based on feedback from @beliaev-maksim

gruyaume commented 3 months ago

@beliaev-maksim can you please include more information as to what the problem actually is? You mentioned having to scale the charm up/down, but that's a workaround to a problem. What is the problem?

Also, can you please provide the following information:

Debug Logs
Relation data between httprequest and the tls requirer (using jhack)

beliaev-maksim commented 3 months ago

@gruyaume my workload requires TLS on the connection. I use a combination of LEGO and nginx to do it.

In the juju status output, all the workloads look green and active. However, after some time we start seeing requests fail in production due to self-signed certificates.

I assume something gets corrupted on LEGO and my workload switches to Kubernetes self-signed certs.

To recover proper certs I have to scale down and up the LEGO charm. That resolves the issue immediately.

You can find the debug logs in the comment above: https://github.com/canonical/httprequest-lego-k8s-operator/issues/162#issuecomment-2139592403

I cannot use jhack. That is ProdStack, I do not have sudo access to install external tools

gruyaume commented 3 months ago

What is the workload that "switches to k8s self signed certs"? Could the issue be in that charm?

beliaev-maksim commented 3 months ago

@gruyaume I think it is nginx

@mthaddon any idea?

beliaev-maksim commented 2 months ago

looks like there were a bunch of TLS issues on nginx

https://github.com/canonical/nginx-ingress-integrator-operator/issues/137
https://github.com/canonical/nginx-ingress-integrator-operator/issues/138
https://github.com/canonical/nginx-ingress-integrator-operator/issues/140

beliaev-maksim commented 2 months ago

Closing the issue for now; I will reopen it if I observe certificate issues again.
