Closed jvermillard closed 10 months ago
ESTABLISHED is reached with the FINISH message of the other side. The server-side sends its FINISH as response (normal handshake). In order to keep prepared for receiving a retransmitted FINISH, the server stays in ESTABLISHED until it receives the first application data or a timeout occurs (accumulated timeout of assumed retransmissions). On timeout, the handshake is just completed. Not sure, what causes the ALERT, I would guess some records received from the client. At least, I don't see, where Cf fails a handshake after ESTABLISHED on it's own.
I found that log-message in the DTLSConnector
lines 748 and 3039.
Do you know the line number of your logging message?
I guess, adding the HandshakeException message to the log-message should provide some more hints to the root-cause. I will do that next week, before the 3.10. release ;-).
I found that log-message in the DTLSConnector lines 748 and 3039. Do you know the line number of your logging message?
I don't, I found only those two lines, I wonder why line 768 doesn't log the HandShakeException
BTW the DTLS client is some version of mbedtls
OK, I guess, add some more information to the log will brings this a step ahead.
I wonder why line 768 doesn't log the HandShakeException
It's always too less, or too much information or at the wrong place. Usually, If I tried to debug something, I added some more information to the log to see the cause and then I usually removed it again ;-).
I found two places, where in my opinion a HANDSHAKE_FAILURE
may be caused after the session is already ESTABLISHED:
For the second I added some more details to see, what's left. See PR #2202
The release 3.10.0 is available.
Any update?
still waiting to reproduce
Thanks for the update. Looking forward, when the failure reoccurs.
Finally got a couple of:
org.eclipse.californium.scandium.dtls.HandshakeException: Received 'fatal alert/HANDSHAKE_FAILURE'
at org.eclipse.californium.scandium.DTLSConnector.processAlertRecord(DTLSConnector.java:2503)
at org.eclipse.californium.scandium.DTLSConnector.processRecord(DTLSConnector.java:2314)
at org.eclipse.californium.scandium.DTLSConnector$17.run(DTLSConnector.java:2116)
at org.eclipse.californium.elements.util.SerialExecutor$1.run(SerialExecutor.java:292)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Hm, that looks more like the device don't like the handshake messages from Californium/Scandium.
Not sure, how to continue. What would help would be an ip-capture. In this case I agree, this will hardly be possible. Next would be investigation in the client's stack, to find the cases, where such "alert/HANDSHAKE_FAILURE" is triggered. But at least for now, I'm not able to spend time in such an investigation on an other stack.
We may add also some more infos, e.g. "negotiated cipher suite" and "current number of flight" , to narrow down the issue. Not sure, if that helps.
Let's close the issue since now it's done from a Cf point of view: the logs are clearer now: the client sent a fatal alert. It could be anything in the client. Thanks for the help.
You're welcome.
the client sent a fatal alert. It could be anything in the client.
Yes. Unfortunately, the client may be right. But without more details, that's hard to say, what's the root-cause of the failure.
Only for information:
I currently update the interop-tests to support mbedtls 3.5.1 (PR #2207).
Using the RepeatingTestRunner
I've execute the interoptest 1000 times without any failure.
I still have no idea of the root-cause of this issue.
This issue arrives, sporadically, on some production servers using Cf 3.9.1 :
Handshake with [x.x.x.x:xxxx] failed after session was established! Alert Protocol Level: FATAL Description: HANDSHAKE_FAILURE
This doesn't make sense to me because if the session was established, then we had a successful handshake, but the reported alert is handshake failure.
It's difficult to capture since it's only some error in a lot of traffic, and the IP address of the faulty client appears to reconnect gracefully later.