Intermittent failure of "Sample: ssl_pthread_server, openssl client, TLS 1.2" (ssl-opt.sh)

mpg commented 1 week ago

Observed on:

Internal CI 31 Oct on 69868299a843d1925f4799de6f43b1b31a4e9357: all_u16-test_malloc_0_null-o-cli-841.log - all_u16-test_malloc_0_null-o-srv-841.log

Output from ssl-opt.sh:

[2024-10-31T01:39:38.710Z] Sample: ssl_fork_server, gnutls client, TLS 1.3 ........................ PASS
[2024-10-31T01:39:40.306Z] Sample: ssl_pthread_server, ssl_client2 ................................ PASS
[2024-10-31T01:39:42.627Z] Sample: ssl_client1 with ssl_pthread_server ............................ PASS
[2024-10-31T01:39:44.626Z] Sample: ssl_pthread_server, openssl client, TLS 1.2 .................... FAIL
[2024-10-31T01:39:44.626Z]   ! pattern 'error' MUST NOT be present in the Server output
[2024-10-31T01:39:44.626Z]   ! outputs saved to o-XXX-841.log
[2024-10-31T01:39:46.706Z] Sample: ssl_pthread_server, gnutls client, TLS 1.2 ..................... PASS
[2024-10-31T01:39:48.226Z] Sample: ssl_pthread_server, openssl client, TLS 1.3 .................... PASS
[2024-10-31T01:39:50.307Z] Sample: ssl_pthread_server, gnutls client, TLS 1.3 ..................... PASS

Problematic excerpt from server log:

  [ #140393558570752 ]  . Closing the connection...  [ #140393558570752 ]  failed: mbedtls_ssl_close_notify returned -0xffffffb0
  [ #140393558570752 ]  Last error was: -0x0050 - NET - Connection was reset by peer

mpg commented 1 week ago

Cc @gilles-peskine-arm as you introduced this test so you might have ideas.

gilles-peskine-arm commented 1 week ago

The test as a whole didn't take particularly long, so this looks like a race condition, not a timeout.

The error indicates that the server expected to close the connection, but the client had already closed it. I encountered several problems like this when I was writing those tests, but not with ssl_pthread_server.

I did encounter a race condition in ssl_pthread_server, but it was of a different kind: 3abca9510a7fb40688a2be115e41e93c15f3b38b to avoid some missing logs.

Note that although ssl_pthread_server is multithreaded, the test only connects a single client, so the only concurrency should be between the server and the client, there shouldn't be any non-determinism inside the server.

A tempting workaround would be to ignore errors (or at least this error) in mbedtls_ssl_close_notify. I'm always weary of silencing an error that I don't fully understand: this could be a genuine bug in closing connections. But we could accept that closing connections (which happens a lot in the real world!) is something we don't test enough, and let the current testing focus on the ability to establish a connection. Testing the ability to establish a connection was the driving goal of adding the test cases in sample.sh.

mpg commented 1 week ago

I think ignoring this particular error (but not other errors) in close_notify would make sense - and probably not just as a work-around, also as a real-world practice. From the RFC:

Both parties need not wait to receive a "close_notify" alert before closing their read side of the connection

If the other party sends its close_notify and immediately closes their (read side of the) connection, I think the symptom is going to be exactly this: when we try sending our close_notify we'll get NET - Connection was reset by peer.

Mbed-TLS / mbedtls

Intermittent failure of "Sample: ssl_pthread_server, openssl client, TLS 1.2" (ssl-opt.sh) #9761