Graylog2 / docker-compose

A set of Docker Compose files that allow you to quickly spin up a Graylog instance for testing or demo purposes.
Apache License 2.0

certificates expired for datanode after long downtime #63

Open snab opened 3 months ago

snab commented 3 months ago

Hi, I have a Docker setup of Graylog in my home lab. I played with it for some time, then switched it off. Before I restarted it the next time, the self-signed certificates of the datanode had expired. Now Graylog does not come up again. Here is a quote from the log:

INFO [OpensearchProcessImpl] [2024-03-09T13:14:24,051][WARN ][o.o.h.AbstractHttpServerTransport] [opensearch] caught exception while handling client http traffic, closing connection Netty4HttpChannel{localAddress=/172.18.0.2:9200, remoteAddress=/172.18.0.4:47170}
datanode_1 | 2024-03-09T13:14:24.053Z INFO [OpensearchProcessImpl] io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown

I think the mechanism that rotates the certs on its own only works while Graylog is up and running, but here it is not starting at all. It's a bit sad that this case was not considered.

My suggested fix would be to check the certs during startup and rotate them right then if necessary.

cheers, Snab

j3k0 commented 4 weeks ago

I might have this issue. How to solve it?

FingerlessGlov3s commented 3 weeks ago

I have the same problem, caused by the short certificate lifetimes. My MongoDB died, and by the time I came to fix it, I think the process that automatically renews the certificate had not run, and now I can't get the stack back up.

I can't find any documentation that tells you how to manually renew the certificate when using a built-in CA, only how to do the initial setup.

j3k0 commented 3 weeks ago

It has happened to me twice that I had to ditch the DB and start fresh (after a few days of trying...). Recovery from problems isn't easy.

So I made a mongodump to save inputs, searches and dashboards. Then restored them after a full reset. I recently found out you can also export those into a "content pack" (System -> Content Pack).

The logs are lost, but it's OK if that only happens from time to time.

FingerlessGlov3s commented 3 weeks ago

I finally managed to fix it. Well, with a workaround.

I connected to the datanode HTTPS API on port 9200 to check when the certificate had expired, then stopped the containers. I disabled NTP on the host and set the server's time to one day before the expiry. I started all the containers, and 3-4 minutes later I checked port 9200 again and could see that the certificate had been renewed. Luckily I didn't need to repeat this process with an interim date: the new certificate's validity range included today's date, since the old one had expired only 20 days earlier. I then stopped the containers and re-enabled NTP, which brought the date back to the current time. I started the containers again and it all worked.

Hopefully, if you get this problem again or someone else does, the above workaround will help. A proper fix would indeed be better though, such as a force-renew command or a way to temporarily disable the certificate check, so that Graylog can get in and renew the certificate via the APIs that it currently doesn't trust because the certificate's dates are no longer valid.

todvora commented 1 week ago

Hi everyone, I just wanted to let you know that I am working on a better way of renewing certificates, with your problems in mind. Thanks for your feedback and reports!

Best regards, Tomas