eclipse-californium / californium

CoAP/DTLS Java Implementation
https://www.eclipse.org/californium/
Other
730 stars 367 forks source link

Handshake with [x.x.x.x:xxxx] failed after session was established! Alert Protocol Level: FATAL Description: HANDSHAKE_FAILURE #2200

Closed jvermillard closed 10 months ago

jvermillard commented 11 months ago

This issue arrives, sporadically, on some production servers using Cf 3.9.1 :

Handshake with [x.x.x.x:xxxx] failed after session was established! Alert Protocol Level: FATAL Description: HANDSHAKE_FAILURE

This doesn't make sense to me because if the session was established, then we had a successful handshake, but the reported alert is handshake failure.

It's difficult to capture since it's only some error in a lot of traffic, and the IP address of the faulty client appears to reconnect gracefully later.

boaks commented 11 months ago

ESTABLISHED is reached with the FINISH message of the other side. The server-side sends its FINISH as response (normal handshake). In order to keep prepared for receiving a retransmitted FINISH, the server stays in ESTABLISHED until it receives the first application data or a timeout occurs (accumulated timeout of assumed retransmissions). On timeout, the handshake is just completed. Not sure, what causes the ALERT, I would guess some records received from the client. At least, I don't see, where Cf fails a handshake after ESTABLISHED on it's own.

boaks commented 11 months ago

I found that log-message in the DTLSConnector lines 748 and 3039. Do you know the line number of your logging message?

boaks commented 11 months ago

I guess, adding the HandshakeException message to the log-message should provide some more hints to the root-cause. I will do that next week, before the 3.10. release ;-).

jvermillard commented 11 months ago

I found that log-message in the DTLSConnector lines 748 and 3039. Do you know the line number of your logging message?

I don't, I found only those two lines, I wonder why line 768 doesn't log the HandShakeException

BTW the DTLS client is some version of mbedtls

boaks commented 11 months ago

OK, I guess, add some more information to the log will brings this a step ahead.

boaks commented 11 months ago

I wonder why line 768 doesn't log the HandShakeException

It's always too less, or too much information or at the wrong place. Usually, If I tried to debug something, I added some more information to the log to see the cause and then I usually removed it again ;-).

jvermillard commented 11 months ago

https://github.com/eclipse-californium/californium/pull/2201

boaks commented 11 months ago

I found two places, where in my opinion a HANDSHAKE_FAILURE may be caused after the session is already ESTABLISHED:

unexpected message type

handshake message left

For the second I added some more details to see, what's left. See PR #2202

boaks commented 10 months ago

The release 3.10.0 is available.

boaks commented 10 months ago

Any update?

jvermillard commented 10 months ago

still waiting to reproduce

boaks commented 10 months ago

Thanks for the update. Looking forward, when the failure reoccurs.

jvermillard commented 10 months ago

Finally got a couple of:

org.eclipse.californium.scandium.dtls.HandshakeException: Received 'fatal alert/HANDSHAKE_FAILURE'
    at org.eclipse.californium.scandium.DTLSConnector.processAlertRecord(DTLSConnector.java:2503)
    at org.eclipse.californium.scandium.DTLSConnector.processRecord(DTLSConnector.java:2314)
    at org.eclipse.californium.scandium.DTLSConnector$17.run(DTLSConnector.java:2116)
    at org.eclipse.californium.elements.util.SerialExecutor$1.run(SerialExecutor.java:292)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
boaks commented 10 months ago

Hm, that looks more like the device don't like the handshake messages from Californium/Scandium.

Not sure, how to continue. What would help would be an ip-capture. In this case I agree, this will hardly be possible. Next would be investigation in the client's stack, to find the cases, where such "alert/HANDSHAKE_FAILURE" is triggered. But at least for now, I'm not able to spend time in such an investigation on an other stack.

We may add also some more infos, e.g. "negotiated cipher suite" and "current number of flight" , to narrow down the issue. Not sure, if that helps.

jvermillard commented 10 months ago

Let's close the issue since now it's done from a Cf point of view: the logs are clearer now: the client sent a fatal alert. It could be anything in the client. Thanks for the help.

boaks commented 10 months ago

You're welcome.

the client sent a fatal alert. It could be anything in the client.

Yes. Unfortunately, the client may be right. But without more details, that's hard to say, what's the root-cause of the failure.

boaks commented 10 months ago

Only for information:

I currently update the interop-tests to support mbedtls 3.5.1 (PR #2207).

Using the RepeatingTestRunner I've execute the interoptest 1000 times without any failure. I still have no idea of the root-cause of this issue.