eclipse-cyclonedds / cyclonedds

Eclipse Cyclone DDS project
https://projects.eclipse.org/projects/iot.cyclonedds

Communication freeze with secure connections #1895

Open kasars opened 10 months ago

kasars commented 10 months ago

Hello,

We have a distributed system with applications on multiple servers that communicate over encrypted DDS. We are experiencing connection issues when reconnecting after network resets, reboots, or suspending the computer. Without encryption there are no issues, but when we enable the encryption we get a debug log:

"fsm: handshake (lguid=... rguid=...) failed: (1) Timed out"

and after that the connection does not recover.

The connection breaks are rare and not very easy to trigger in normal use. We did, however, find a way to trigger them semi-reliably by rebooting one of the machines while the system is running.

Attached are two files: one where the connection fails after the reboot and another where the connection is OK after the reboot.

The network traffic is not exactly idle at the time of the reboot, and filtering the logs for lines containing "handshake" does show that we get "event=EVENT_TIMEOUT" early on in the handshake of the failing case.
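
The filtering described above can be sketched in a few lines of Python. This is a minimal sketch that assumes only that the trace is a plain-text file with one entry per line; the function name `handshake_lines` is hypothetical, and nothing else about the Cyclone DDS trace format is assumed.

```python
from typing import Iterable, List, Tuple


def handshake_lines(lines: Iterable[str]) -> List[Tuple[int, str, bool]]:
    """Return (line_number, text, has_timeout) for each line mentioning "handshake".

    Line numbers are 1-based. has_timeout is True when the line also
    contains the literal "event=EVENT_TIMEOUT" quoted in the report.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if "handshake" in line:
            text = line.rstrip("\n")
            hits.append((number, text, "event=EVENT_TIMEOUT" in text))
    return hits
```

Passing an open file object (for example `handshake_lines(open(path))`) works because files iterate line by line. Comparing the first timeout hit in the failing log with the same point in the working log is one way to narrow down where the two runs diverge.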

We are running CycloneDDS-CXX 0.10.2 and CycloneDDS 0.10.2 with a crash fix for CycloneDDS that we got from https://github.com/eclipse-cyclonedds/cyclonedds/issues/1548. We did also try the 0.10.4 tags, but that did not help.

How can we be of assistance in finding/fixing the issue?

Best Regards, Kåre Särs

2023-12-04-2_fail_filtered.log 2023-12-04-2_ok_filtered.log

eboasson commented 10 months ago

We did also try the 0.10.4 tags, but that did not help.

I don't think anything changed on the security side recently, so with:

Without encryption there are no issues

and

but when we enable the encryption we get a debug log

I would not expect there'd be a more recent version that addresses the issue. Unfortunately.

How can we be of assistance in finding/fixing the issue?

I think the first thing that needs doing is making sense of the log. You've already provided some useful info on that, but as always, the devil is in the details, so I think I (or perhaps @MarcelJordense, if he has some time) need to first dig through the logs to see if they already provide enough information to find the cause.

Then we take it from there.

MarcelJordense commented 10 months ago

I suspect that the problem is related to an asymmetrical disconnect. In this case myapp2 disconnects myapp1 because the lease of myapp1 expires, however myapp1 still sees myapp2 alive. When myapp2 sees myapp1 again it will start an authorization handshake with myapp1. However myapp1 was not aware of the disconnect and discards the handshake messages from myapp2. Currently this is an issue in the security implementation.
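
The asymmetric disconnect described above can be illustrated with a toy lease model. This is a rough sketch, not Cyclone DDS code: the function `simulate_pause`, the 100 ms announcement period, the pause starting at 1000 ms, and all lease values are hypothetical. It also assumes that a resumed process checks elapsed time against the lease before it handles newly received announcements.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    pub_dropped_sub: bool   # publisher saw the subscriber's lease expire
    sub_dropped_pub: bool   # subscriber saw the publisher's lease expire

    @property
    def asymmetric(self) -> bool:
        # Exactly one side noticed the disconnect.
        return self.pub_dropped_sub != self.sub_dropped_pub


def simulate_pause(sub_lease_ms: int, pub_lease_ms: int, pause_ms: int,
                   period_ms: int = 100, run_ms: int = 5000) -> Outcome:
    """Toy model: both sides announce liveness every period_ms.

    The subscriber is frozen for pause_ms starting at t = 1000 ms; while
    frozen it neither sends announcements nor evaluates leases. Each side
    drops the other once it has not heard from it for longer than the
    lease duration that the other side announced.
    """
    pause_start, pause_end = 1000, 1000 + pause_ms
    pub_last_heard_sub = 0   # publisher's view of the subscriber
    sub_last_heard_pub = 0   # subscriber's view of the publisher
    pub_dropped_sub = False
    sub_dropped_pub = False

    for now in range(period_ms, run_ms + 1, period_ms):
        sub_running = not (pause_start <= now < pause_end)

        # The publisher always runs, but only hears a running subscriber.
        if sub_running:
            pub_last_heard_sub = now
        if now - pub_last_heard_sub > sub_lease_ms:
            pub_dropped_sub = True

        # The subscriber only evaluates the publisher's lease while running.
        if sub_running:
            if now - sub_last_heard_pub > pub_lease_ms:
                sub_dropped_pub = True
            sub_last_heard_pub = now

    return Outcome(pub_dropped_sub, sub_dropped_pub)
```

With a hypothetical 1 s subscriber lease, a 10 s publisher lease, and a 2 s pause, `simulate_pause(1000, 10000, 2000)` reports that only the publisher dropped its peer, which is the situation in which one side restarts the handshake while the other side still considers the old session valid. With equal leases the model drops either both sides or neither.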

kasars commented 5 months ago

Hi,

Here's an update. In the attached tar-file, I have a simple test-application-pair that can reproduce the issue reliably. handshake_bug.tar.gz

I know the configuration of different lease times is wrong, but I have a hunch that the original problem is due to a race where one side manages to see a timeout and the other does not. (In our "real" setup they have the same lease duration.)

How to reproduce on Linux: start the two applications, pause the subscriber by pressing Ctrl-Z, and after 2 seconds let it continue by typing "fg" (any other means of pausing it should be fine). Then notice that the HelloWorldData message communication is lost. After a bit over a minute the publisher prints: "fsm: handshake (lguid=.... rguid=...) failed: (1) Timed out"
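
For scripted runs, the pause step can be done with signals instead of the shell's job control. The following is a minimal sketch for Linux; the helper name `pause_process` is hypothetical, and it uses SIGSTOP rather than the SIGTSTP that Ctrl-Z sends, on the assumption that any way of freezing the subscriber triggers the same behaviour.

```python
import os
import signal
import time


def pause_process(pid: int, seconds: float) -> None:
    """Freeze the process with SIGSTOP, wait, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the process, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)
```

Calling `pause_process(subscriber_pid, 2.0)` corresponds to the Ctrl-Z, wait 2 seconds, "fg" sequence, where `subscriber_pid` stands for the process ID of the running subscriber.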

I hope this can help in fixing or pointing out a workaround or a configuration error.

Br, Kåre

kasars commented 5 months ago

And one more update: a really rough attempt to work around the issue: CycloneDDS-lease-race-workaround-001.tar.gz. It is a tar.gz, as this comment thingie does not allow .diff files :)

guleonseon commented 3 months ago

I suspect that the problem is related to an asymmetrical disconnect. In this case myapp2 disconnects myapp1 because the lease of myapp1 expires, however myapp1 still sees myapp2 alive. When myapp2 sees myapp1 again it will start an authorization handshake with myapp1. However myapp1 was not aware of the disconnect and discards the handshake messages from myapp2. Currently this is an issue in the security implementation.

I have also encountered this problem recently, which has been bothering me for a long time. Is there any plan to fix this bug?

kasars commented 3 months ago

@guleonseon have you tried the above patch, and if so, does it work for you?

guleonseon commented 3 months ago

@kasars Yeah, I tried the patch above, but it didn't seem to help with the issue I encountered. Maybe the cause of my problem is different from yours, and the two just produce the same output: "fsm: handshake (lguid=... rguid=...) failed: (1) Timed out".

guleonseon commented 3 months ago

My test application is very simple: one publisher (A) and two subscribers (B, C) for topic HelloWorld, with encryption enabled, on three different devices. The publisher app A starts first, then B, and C last. B can receive A's messages normally, but C cannot. After a while (about 2 minutes), C prints output like: "fsm: handshake (lguid=... rguid=...) failed: (1) Timed out". But if I swap the start order of B and C, the result is reversed: C can receive A's messages normally, but B cannot. Sometimes both B and C can receive messages from A normally if I reboot, or unplug A's network cable and restore it within a few seconds. I'm not sure if it's a configuration issue on my side.

kasars commented 3 months ago

That sounds a bit like there might be a configuration issue or a firewall that blocks some port. The nodes get different ports depending on the number of other nodes.
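
As background for the firewall theory: the RTPS wire protocol used by DDS defines a well-known port mapping in which the unicast ports depend on a participant index, which is one way ports can differ from node to node. The sketch below applies that mapping with the specification's default parameters. The function name `rtps_ports` is hypothetical, and whether the nodes in this thread actually use these well-known ports depends on their Cyclone DDS configuration, which the thread does not show.

```python
from typing import Dict

# Default parameters of the RTPS well-known port mapping.
PB = 7400                        # port base
DG = 250                         # domain ID gain
PG = 2                           # participant ID gain
D0, D1, D2, D3 = 0, 10, 1, 11    # per-purpose offsets


def rtps_ports(domain_id: int, participant_id: int) -> Dict[str, int]:
    """Compute the well-known RTPS ports for a domain and participant index."""
    base = PB + DG * domain_id
    return {
        "discovery_multicast": base + D0,
        "discovery_unicast": base + D1 + PG * participant_id,
        "user_multicast": base + D2,
        "user_unicast": base + D3 + PG * participant_id,
    }
```

Under these assumptions the multicast ports are shared by all participants in a domain, while each additional participant index shifts both unicast ports up by 2, so a firewall rule that opens only the ports of index 0 would block a participant that ended up with a higher index.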

MarcelJordense commented 3 months ago

@guleonseon If the issue still exists, could you please provide the Cyclone log files?

guleonseon commented 3 months ago

@MarcelJordense Yeah, the issue still exists, and I found that it is related to redundant networking (Domain/General/RedundantNetworking). If I disable RedundantNetworking, it works well. But if I enable it, the issue recurs.