TwiN / gatus

⛑ Automated developer-oriented status page
https://gatus.io
Apache License 2.0

Caching of TLS Certificates #856

Open gat786 opened 2 months ago

gat786 commented 2 months ago

Describe the bug

If a certificate expiry check is configured and the certificate was renewed a few days ago, Gatus keeps the older certificate cached and continues to run checks against it, even though the endpoint it is checking already serves a newer certificate.

What do you see?

Gatus continues to use the previously cached certificate for its checks, which results in failed checks and false alerts.

What do you expect to see?

Gatus should fetch the certificate on every request and evaluate the conditions against it, instead of using a cached copy of the certificate.

List the steps that must be taken to reproduce this issue

  1. Configure a check with a [CERTIFICATE_EXPIRATION] condition

  2. Set the condition to [CERTIFICATE_EXPIRATION] > 240h (10 days)

  3. Configure certificate renewal to happen 15 days before expiry. For example, we use cert-manager, so we set the Ingress annotations like this:

    cert-manager.io/cluster-issuer: my-issuer # lets encrypt in my case
    cert-manager.io/renew-before: 360h

  4. Even if the certificate was renewed 5 days ago, Gatus still has the previous certificate cached, keeps running checks against it, and does not fetch the latest certificate from the endpoint.

  5. After a restart, the cache is cleared, so Gatus picks up the latest certificate and works as expected.

Version

5.11.0

Additional information

No response

TwiN commented 1 month ago

> Even if the certificate was renewed 5 days ago, Gatus still has the previous certificate cached, keeps running checks against it, and does not fetch the latest certificate from the endpoint.

This assumes that nothing went wrong, which is exactly what Gatus is supposed to help protect against.

What if somebody accidentally replaced the certificate with one that's 3 years old? The fact that the certificate is cached would hide this issue.

renevo commented 1 month ago

@TwiN we have hit this issue as well. We have auto-renewing certificates via Smallstep. The certificates were updated and the endpoints are serving the updated certificate, but Gatus continues to show the certificate is expiring from the previous certificate cycle. The only way to fix the alert is to restart the Gatus instance.

To be clear, it seems that Gatus is caching certificates, which causes short-lived TLS certs to alert and not clear without a restart of the Gatus instance.

Digging a bit more on our side: this specific service with the "stuck" or "cached" certificate might be related to our long server-side timeouts, which we need to support server-sent events. It is possible that the HTTP client in Gatus stays connected for long periods instead of making a new connection for every request. One potential fix would be to make sure the HTTP client in Gatus closes the connection after each request, rather than relying on the stdlib to close idle connections, which might never happen with frequent health checks.

gat786 commented 1 month ago

I don't want to impose something on you that you haven't planned for, but I'd like to understand what you mean when you say:

> This assumes that nothing went wrong, which is exactly what Gatus is supposed to help protect against.

What could have gone wrong? How is Gatus protecting me when it is using a cached version of a certificate that is no longer served by the ingress it is health-checking?

TwiN commented 1 month ago

> Gatus continues to show the certificate is expiring from the previous certificate cycle. The only way to fix the alert is to restart the Gatus instance.

~~@renevo I've never seen this happen, and I use Gatus myself to monitor a fairly diverse range of infrastructures 🤔 When you say short lived TLS certs, how "short lived" are we talking about? Months? Minutes?~~

> What could have gone wrong? How is Gatus protecting me when it is using a cached version of a certificate that is no longer served by the ingress it is health-checking?

~~I think it would be easier for me if I saw an implementation of what you're asking for 😆 More specifically, how would Gatus detect that the endpoint's certificate has changed if Gatus caches the certificate, without fetching the new certificate? So I understand that if nothing has changed, you can just use the cached certificate every time, but I imagine that to deal with the aforementioned situation, when the certificate is deemed invalid, you'd have to invalidate the cache entry and re-fetch the certificate to ensure that the invalidity wasn't caused by an outdated cache entry, which would imply that if the certificate truly is invalid, you'd basically have to do 2 requests every time (the 1st request would be a normal health check, which would fail because the cached certificate is deemed invalid, while the 2nd request would be to retrieve the certificate)?~~

~~I'm not sure if this makes sense; it's been a long week. What I'm trying to convey is that caching certificates would require handling a lot of edge cases (so increased complexity), and I'm not sure if the benefits outweigh the risk.~~

Ugh, it's been a long few weeks and I didn't read the issue properly. I thought you were requesting that certificates should be cached, but it seems you're saying that certificates are currently being cached and that is causing an issue.

I had no idea certificates were being cached, and I completely agree with you.

I think they shouldn't be cached, or at the very least, they shouldn't be cached for more than 24h.

I'm sorry about the misunderstanding.

gat786 commented 1 month ago

It's OK. I like your tool, so I will see if I can find the code that does this and try to assist you.

gat786 commented 1 month ago

FYI @renevo

Hey, I wasn't able to pinpoint the code that causes the caching, but I was able to verify that Gatus ~doesn't use same connections for tests~ uses a different connection for each test that it performs.

What I did was run an Nginx service with logging enabled so that I could see connection IDs.

The log format I used was:

log_format connection '$remote_addr - $remote_user [$time_local] '
                      '"$request" $status $body_bytes_sent '
                      '"$http_referer" "$http_user_agent" '
                      'conn=$connection conn_requests=$connection_requests '
                      'connection_time=$connection_time';

I saw that each time Gatus did a check, it created a new connection ID. I also sent a curl request to the same Nginx service that made 4 calls over a single connection, and verified that while curl reused the same connection for all of its calls, Gatus created a new connection each time.

See the logs below -

*************
192.168.65.1 - - [29/Sep/2024:09:00:22 +0000] "GET / HTTP/1.1" 200 615 "-" "Gatus/1.0" conn=70 conn_requests=1 connection_time=0.007
192.168.65.1 - - [29/Sep/2024:09:00:32 +0000] "GET / HTTP/1.1" 200 615 "-" "Gatus/1.0" conn=71 conn_requests=1 connection_time=0.006
192.168.65.1 - - [29/Sep/2024:09:00:40 +0000] "GET / HTTP/1.1" 200 615 "-" "curl/8.7.1" conn=72 conn_requests=1 connection_time=0.022
192.168.65.1 - - [29/Sep/2024:09:00:40 +0000] "GET / HTTP/1.1" 200 615 "-" "curl/8.7.1" conn=72 conn_requests=2 connection_time=0.024
192.168.65.1 - - [29/Sep/2024:09:00:40 +0000] "GET / HTTP/1.1" 200 615 "-" "curl/8.7.1" conn=72 conn_requests=3 connection_time=0.024
192.168.65.1 - - [29/Sep/2024:09:00:40 +0000] "GET / HTTP/1.1" 200 615 "-" "curl/8.7.1" conn=72 conn_requests=4 connection_time=0.025
192.168.65.1 - - [29/Sep/2024:09:00:43 +0000] "GET / HTTP/1.1" 200 615 "-" "Gatus/1.0" conn=73 conn_requests=1 connection_time=0.003
192.168.65.1 - - [29/Sep/2024:09:00:53 +0000] "GET / HTTP/1.1" 200 615 "-" "Gatus/1.0" conn=74 conn_requests=1 connection_time=0.002
**********

So the problem's not there.

gat786 commented 1 month ago

btw I was running gatus from main when I did these checks