eProsima / Fast-DDS

Deadlock appears to be caused by write-data and discovery-entity threads using TCP transport on the writer side. #4203

Open chunyisong opened 6 months ago

chunyisong commented 6 months ago

Is there an already existing issue for this?

Expected behavior

Discovering new entities and writing data should not get stuck!

Current behavior

A deadlock appears to be caused by the write-data and discovery-entity threads when using the TCP transport on the writer side.

I wrote a simple test program, test-dds, to reproduce this bug.

To reproduce this bug, open two different consoles:

  1. In the first one, start the publisher: ./test-dds pub
  2. Edit DEFAULT_FASTRTPS_PROFILES.xml and change the TCPv4 listening port to 0 or another different port number.
  3. In the second one, start the subscriber: ./test-dds sub

The deadlock will then most likely occur. If nothing gets stuck, restart the second console. In the publisher console, the _currentMatchedPubs and _totalPubOkDatas log values do not change; in the subscriber console, the _currentMatchedSubs and _totalSubValidDatas log values also do not change.
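
For context, the TCP side of that profile corresponds roughly to the following programmatic setup (a minimal C++ sketch assuming the Fast DDS 2.x API; make_tcp_participant_qos is just an illustrative helper, and the real test-dds program and XML profile may differ):

```cpp
#include <cstdint>
#include <memory>

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/transport/TCPv4TransportDescriptor.h>

using eprosima::fastdds::dds::DomainParticipantQos;
using eprosima::fastdds::dds::PARTICIPANT_QOS_DEFAULT;
using eprosima::fastdds::rtps::TCPv4TransportDescriptor;

// Participant QoS that disables the builtin transports and uses TCPv4 only.
// The publisher keeps the port from the profile; the subscriber is started
// with port 0 (or any other free port) so the two processes do not clash on
// the same listening port.
DomainParticipantQos make_tcp_participant_qos(uint16_t listening_port)
{
    DomainParticipantQos qos = PARTICIPANT_QOS_DEFAULT;

    auto tcp = std::make_shared<TCPv4TransportDescriptor>();
    tcp->add_listener_port(listening_port);

    qos.transport().use_builtin_transports = false;
    qos.transport().user_transports.push_back(tcp);
    return qos;
}
```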

Additionally, other tests produce the same deadlock:

  1. Changing the reliability of the writers or readers.
  2. Using the discovery server (this is the actual deployment; initial peers are used here for simplicity).
  3. CMake options used to compile Fast DDS, such as FASTDDS_STATISTICS or STRICT_REALTIME.
  4. Fewer topics and data samples lead to a lower probability of deadlock (e.g. ./test-dds sub 100).
  5. Opening multiple sub consoles leads to a higher probability of deadlock.

Fast DDS version/commit

FastDDS v2.13.0/v2.13.1

Platform/Architecture

Ubuntu Focal 20.04 amd64

Transport layer

TCPv4

XML configuration file

DEFAULT_FASTRTPS_PROFILES.xml

Relevant log output

(screenshot of the relevant log output)

Network traffic capture

No response

Mario-DL commented 6 months ago

Hi @chunyisong

Thanks for the report. Could you please check whether the issue persists with the latest release, v2.13.1? Some improvements were made to the TCP transport; check the release notes.

chunyisong commented 6 months ago

Hi @Mario-DL

I tested test-dds with Fast DDS v2.13.1. Unfortunately, the deadlock reappeared! However, with this version it is more difficult to trigger. Simply starting more subscribers (200 readers per sub) and one publisher (200 writers), without killing writers, did not reproduce the issue (after about 30 trials of the simple test; maybe I was lucky). But the following steps are more likely to trigger the deadlock:

  1. Start the discovery server in one console (fast-discovery-server -i 0 -t 10.8.8.6 -q 17480); the clients point at this server roughly as sketched after this list.
  2. Edit DEFAULT_FASTRTPS_PROFILES.xml and change the TCPv4 listening port to 0.
  3. Start two subscribers (200 readers per sub) in two new consoles (./test-dds sub).
  4. Start one publisher (200 writers) in a new console (./test-dds pub).
  5. Wait 30 seconds, then kill the publisher process and restart it; the deadlock is then more likely to occur (if no deadlock appears, kill and restart the publisher once more), **and the write operations get stuck** as shown in the screenshot below. The publisher process was actually killed and restarted only once, so there are only 200 writer topics, but the subscribers still had 400 writers.

(screenshot: stuck write operations)
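
The client side of this setup points at the discovery server roughly as follows (a minimal C++ sketch assuming the Fast DDS 2.x API and the default GUID prefix for server id 0; the real test-dds program may configure this differently, e.g. through XML):

```cpp
#include <memory>

#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/attributes/ServerAttributes.h>
#include <fastdds/rtps/transport/TCPv4TransportDescriptor.h>
#include <fastrtps/utils/IPLocator.h>

using namespace eprosima::fastdds::dds;
using eprosima::fastdds::rtps::TCPv4TransportDescriptor;
using eprosima::fastrtps::rtps::IPLocator;
using eprosima::fastrtps::rtps::Locator_t;
using eprosima::fastrtps::rtps::RemoteServerAttributes;

int main()
{
    DomainParticipantQos qos = PARTICIPANT_QOS_DEFAULT;

    // TCPv4 transport only; the listening port is 0 in this test.
    auto tcp = std::make_shared<TCPv4TransportDescriptor>();
    tcp->add_listener_port(0);
    qos.transport().use_builtin_transports = false;
    qos.transport().user_transports.push_back(tcp);

    // Discovery-server CLIENT pointing at the server started with
    // "fast-discovery-server -i 0 -t 10.8.8.6 -q 17480".
    RemoteServerAttributes server;
    server.ReadguidPrefix("44.53.00.5f.45.50.52.4f.53.49.4d.41");  // default prefix for server id 0
    Locator_t server_locator;
    server_locator.kind = LOCATOR_KIND_TCPv4;
    IPLocator::setIPv4(server_locator, "10.8.8.6");
    IPLocator::setPhysicalPort(server_locator, 17480);
    IPLocator::setLogicalPort(server_locator, 17480);
    server.metatrafficUnicastLocatorList.push_back(server_locator);

    qos.wire_protocol().builtin.discovery_config.discoveryProtocol =
            eprosima::fastrtps::rtps::DiscoveryProtocol_t::CLIENT;
    qos.wire_protocol().builtin.discovery_config.m_DiscoveryServers.push_back(server);

    DomainParticipant* participant =
            DomainParticipantFactory::get_instance()->create_participant(0, qos);
    // ... create the 200 writers or readers here, as test-dds does ...
    (void)participant;
    return 0;
}
```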

Additionally, other issues appeared during the tests:

  1. Sometimes a "Matching unexisting participant from writer" error occurred (line 1062 in /workspace/fastdds/src/fastrtps/src/cpp/rtps/builtin/discovery/database/DiscoveryDataBase.cpp) after killing the publisher.
  2. Sometimes, or when a discovery-server error occurred, the server never drops the killed participant.
  3. After killing the publisher, the DataReaderListeners are almost never called to notify discovery or match info.

JesusPoderoso commented 5 months ago

Hi @chunyisong, thanks for your report! We will try to reproduce it in the following weeks and come back to you with some feedback.

chunyisong commented 5 months ago

Today I reviewed some issues related to TCP; some of them (closed, but they may still exist: #4099 #4026 #4033 #3621 #3496) may be about the same TCP deadlock.

JesusPoderoso commented 4 months ago

Hi @chunyisong, thanks for your patience. We've just released Fast DDS v2.14.0 with some TCP improvements and fixes (see release notes). I think that the TCPSendResources cleanup may have fixed your issue. Could you check if it persists, please?

chunyisong commented 3 months ago

Sorry for the late reply. I will take some time to test it as soon as possible.

chunyisong commented 2 months ago

@JesusPoderoso Sorry for the late test. I tested Fast DDS on master today using the same method, and the problem still exists. The first screenshot shows one subscriber process, one publisher process, and one discovery-server process, when the publisher process is killed and immediately restarted. After a while the subscriber did remove the matched writers, but the publisher did not, and strangely the publisher removed its connection to the discovery server!

(screenshot: pub/sub/discovery-server consoles after restarting the publisher)

The second screenshot shows the subscriber side being killed (with participant statistics added).

(screenshot: killing the subscriber side, with participant statistics)

Note: the XML uses RELIABLE reliability, no initial peers, and TCP.

chunyisong commented 2 months ago

The image below shows two subs on host A (220.8), and a discovery server and two pubs on host B (0.202).

  1. Start the 5 processes.
  2. After a minute, kill and immediately restart pub1/pub2/sub1/sub2 one by one, as quickly as you can.
  3. The left-most sub and the right-most pub are then actually already stuck, even though the logging thread still prints statistics messages. The sub matched no pub, and the pub matched all subs but published no data!
  4. After a minute, kill the sub and pub in the middle.
  5. Then try to kill the discovery server, but the server is stuck and cannot be killed with Ctrl-C!
  6. Then try to kill the stuck pub and sub, but they are also stuck and cannot be killed with Ctrl-C.

(screenshot: consoles of the five processes)

@JesusPoderoso Maybe PDP or EDP issues. Note: profile.xml should be modified to use the different hosts.

chunyisong commented 2 months ago

@JesusPoderoso Tested with Fast DDS 2.14.1 today, and the deadlock appeared easily. I tested on localhost with the discovery server as follows:

  1. Start the discovery server in a console: bin/fast-discovery-server -i 0 -t 127.0.0.1 -q 17480
  2. Start two pub processes with 200 topics in two consoles: ./test-dds pub
  3. Start two sub processes with the same 200 topics, in sequence as quickly as possible, in two consoles: ./test-dds sub
  4. Unfortunately I hit the deadlock in just the first two tests. Killing the discovery server (Ctrl-C) now gets stuck.
  5. If, luckily, all processes transfer data normally during a test, then kill the discovery server and restart it after one minute, and the deadlock appears. After a few minutes I killed the one pub whose published data count was unchanged, and the discovery-server process then ended automatically. The image below shows the deadlock of step 5: the right-most pub process's log shows that _totalPubOkDatas:11440323 is unchanged. The discovery server could be killed normally once I had first killed the stuck pub.

(screenshot: the deadlock of step 5)

JesusPoderoso commented 2 months ago

Hi @chunyisong, thanks for the reproducer! We are taking a look at it and will bring some feedback.

chunyisong commented 1 month ago

@JesusPoderoso According to the documentation of max_blocking_time, the writer should return with a timeout. But in fact, the writers always get stuck in the deadlock and never return, so this behavior is not in line with the design.
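
For reference, this is how the writer QoS is meant to be bounded (a minimal C++ sketch assuming the Fast DDS 2.x API; the 500 ms value is just an example):

```cpp
#include <fastdds/dds/publisher/qos/DataWriterQos.hpp>

using namespace eprosima::fastdds::dds;

// Reliable writer QoS: per the documentation, write() should return after at
// most max_blocking_time instead of blocking indefinitely.
DataWriterQos make_bounded_writer_qos()
{
    DataWriterQos qos = DATAWRITER_QOS_DEFAULT;
    qos.reliability().kind = RELIABLE_RELIABILITY_QOS;
    qos.reliability().max_blocking_time.seconds = 0;
    qos.reliability().max_blocking_time.nanosec = 500000000;  // 500 ms
    return qos;
}
// With the deadlock described above, writer->write(&sample) never returns at
// all, so this bound is not honored.
```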

wangzm-R commented 1 month ago

(screenshot: the deadlock stack)

Using 2.11.2, if the deadlock occurs, it cannot recover. Using 2.14.0, if a deadlock occurs and the subscriber is recreated, the deadlock clears after about 15 minutes.

chunyisong commented 1 month ago

> (screenshot: the deadlock stack)
>
> Using 2.11.2, if the deadlock occurs, it cannot recover. Using 2.14.0, if a deadlock occurs and the subscriber is recreated, the deadlock clears after about 15 minutes.

Hi @wangzm-R, did you try v2.14.1?

My test with v2.14.1 still cannot recover until the stuck reader/writer is killed (after that, the stuck discovery server recovers).

wangzm-R commented 1 month ago

> Hi @wangzm-R, did you try v2.14.1?
>
> My test with v2.14.1 still cannot recover until the stuck reader/writer is killed (after that, the stuck discovery server recovers).

The publisher is on Linux; one subscriber is on Linux and another subscriber is on Windows. When any subscriber device is powered down (so killing the subscriber process is not possible), the publisher thread blocks until that subscriber restarts.

When a subscriber device is powered down, the publisher thread blocks in write() after about 1 min 30 s.
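
In case it helps narrow this down: the TCP transport descriptor has keep-alive settings that are meant to detect a peer that disappears without closing its socket (such as a powered-down host). A minimal sketch assuming the Fast DDS 2.x C++ API; whether this actually prevents the blocked write() is unclear:

```cpp
#include <memory>

#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>
#include <fastdds/rtps/transport/TCPv4TransportDescriptor.h>

using eprosima::fastdds::dds::DomainParticipantQos;
using eprosima::fastdds::dds::PARTICIPANT_QOS_DEFAULT;
using eprosima::fastdds::rtps::TCPv4TransportDescriptor;

// TCPv4 transport with more aggressive keep-alive, so a peer that powers
// down (and therefore never closes its socket) is dropped sooner.
DomainParticipantQos make_keepalive_tcp_qos()
{
    DomainParticipantQos qos = PARTICIPANT_QOS_DEFAULT;

    auto tcp = std::make_shared<TCPv4TransportDescriptor>();
    tcp->add_listener_port(0);
    tcp->keep_alive_frequency_ms = 1000;  // send a keep-alive request every 1 s
    tcp->keep_alive_timeout_ms = 5000;    // drop the connection after 5 s without a response

    qos.transport().use_builtin_transports = false;
    qos.transport().user_transports.push_back(tcp);
    return qos;
}
```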

chunyisong commented 2 weeks ago

Today I tested Fast DDS v2.14.2 with one publisher connected to two subscribers, using 2000 topics, and the deadlock almost always appeared. I also tested the built-in LARGE_DATA mode and a custom large-data configuration (a sketch of enabling the builtin mode is included below). No attempt showed any sign of improvement.

Through testing, I suspect the problem occurs at the TCP EDP stage. @JesusPoderoso Has there been any progress on this issue?
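
For reference, the builtin LARGE_DATA mode mentioned above was presumably enabled with something like the following (a minimal sketch assuming the Fast DDS 2.12+ C++ API; the custom large-data variant used its own transport descriptors):

```cpp
#include <fastdds/dds/domain/qos/DomainParticipantQos.hpp>

using namespace eprosima::fastdds::dds;

// Participant QoS using the builtin LARGE_DATA transports
// (UDP multicast for discovery, TCP and shared memory for user data).
DomainParticipantQos make_large_data_qos()
{
    DomainParticipantQos qos = PARTICIPANT_QOS_DEFAULT;
    qos.setup_transports(eprosima::fastdds::rtps::BuiltinTransports::LARGE_DATA);
    return qos;
}
// The same mode can also be selected without recompiling by exporting
// FASTDDS_BUILTIN_TRANSPORTS=LARGE_DATA before starting the process.
```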
