Open lemrouch opened 1 year ago
The man page for SSL_get_error() list a number of functions to which it applies, which includes SSL_shutdown(), and so I think ERR_clear_error() should also be called before the two calls of SSL_shutdown(). keepalived does not use any of the other functions listed.
However, I think the patch will not work as a solution; in fact I think we have a severe problem.
The SSL_get_error(3) man page states:
In addition to ssl and ret, SSL_get_error() inspects the current thread's OpenSSL error queue. Thus, SSL_get_error() must be used in the same thread that performed the TLS/SSL I/O operation, and no other OpenSSL function calls should appear in between. The current thread's error queue must be empty before the TLS/SSL I/O operation is attempted, or SSL_get_error() will not work reliably.
The last sentence says we need to ensure that the thread's error queue is empty before we call SSL_connect(), SSL_read(), SSL_write() or SSL_shutdown(), so either we must call ERR_clear_error() before any of the SSL functions, or after an error we need to ensure we read all the errors, which is a solution I prefer since we ought to process all the errors and not just throw them away.
The problem is the second sentence above "... and no other OpenSSL function calls should appear in between."
. Apart from it being unclear what between
refers to, the real problem is that keepalived can have SSL connection open simultaneously for each SSLCHECK. If an error is returned from any SSL call, SSL_get_error() will return errors not only for the particular connection, but also for errors that have occurred on any other SSL connection that are on the OpenSSL error queue. Calling ERR_clearerror() before all SSL function calls could mean that errors are discarded without being processed.
There are two possible solutions that I can see (but only 1 can work for keepalived):
The problem with option 2 is that (except for DBus) we do not use pthreads, and this would require a not insignificant architecture change for keepalived.
I will think on about this and see if I can work out the best solution. It might need reading the openSSL code to really understand how it works, but I would rather avoid that.
Describe the bug I'm upgrading our servers from Debian Bullseye to Debian Bookworm. Some of them act as load balancers using keepalived. Right now I have one Bullseye and one Bookworm with the same configuration checking the same services. Several of our services are running on HTTPS therefore I'm using SSL_CHECK. I can see that the Bookworm one occasionally fails SSL_CHECK for several seconds on one service while the Bullseye does not report any problem at all. I've seen there was RST packet sent from keepalived to the service when the check failed.
To Reproduce That's hard to tell. It's quite rare - not even once per hour with 1s loop delay with tens of real servers. I would say services with longer certificate chain are more likely to fail.
Expected behavior Don't fail SSL_CHECK when the service has no problems.
Keepalived version
Distro (please complete the following information):
Configuration file:
Notify and track scripts
System Log entries
Did keepalived coredump? no
Additional context I've reported Debian bug but the maintainer asked me to fill upstream issue first.
I was looking for possible reason and I've found https://github.com/openssl/openssl/issues/20365 https://github.com/pjsip/pjproject/issues/3632 https://stackoverflow.com/questions/18179128/how-to-manage-the-error-queue-in-openssl-ssl-get-error-and-err-get-error
They are all basically saying that you can have multiple SSL errors left in error queue and you are supposed to run ERR_geterror() before calling SSL* functions.
I was able to solve this issue with simple patch. What do you think about it?