RobotecAI / ros2-for-unity

High-performance ROS2 solution for Unity3D
Apache License 2.0
448 stars 58 forks source link

Calling Dispose() on Subscriber raises an Exception after few calls #24

Closed TimoKettunen closed 2 years ago

TimoKettunen commented 2 years ago

This issue #21 was fixed previously so that calling Dispose does not cause immediate crash. But same error still occurs if topics are subscribed/disposed in fast manner. I attached a C++ test program that publishes several topics and also publishes add/remove commands to tell what topics should be subscribed/disposed. Also attached a sample Unity code that receives those topics and subscribes/disposes topics according to received messages. This code causes the error RuntimeError: subscription's implementation is invalid, at /tmp/binarydeb/ros-foxy-rcl-1.1.11/src/rcl/subscription.c:462, at /tmp/binarydeb/ros-foxy-rcl-1.1.11/src/rcl/wait.c:324 to happen in few seconds.

dispose.zip

TimoKettunen commented 2 years ago

I found out I was using old plugins in test. This test does not cause the problem with latest plugins. We see the problem with new plugins in our application when it runs day or so. I must modify the C++ sender so that it resembles more how our application uses it. So wait until I come back with better test program.

TimoKettunen commented 2 years ago

When testing, I found out that Dispose has no impact with latest plugins. When viewing with ROS2 topic info, the topic remains subscribed after Disposing. This is another problem than title and it resembles the behavior of the plugins before they went open source. Does anyone know if this is known behavior still?

adamdbrw commented 2 years ago

Last time I investigated, all the rcl calls are going through properly. I would imagine there is a case of rcl API being non-symmetrical in some way (when higher layers are involved as well). If you have a reproducer, please submit. Otherwise, the task is to reproduce first and isolate both circumstances and implications (perhaps it is only the case of ros2 topic info not updating correctly?).

I believe the necessary fix might be in ros2 itself in the end. But it also might be possible to use some rclcpp-level api on r2fu / ros2cs side to circumvent / fix it.

TimoKettunen commented 2 years ago

New information: The Dispose works correctly with default rmw. We have been using cyclone rmw since it was recommended sometime in the past. Upgrading to latest ros from Ubuntu repo does not seem to help. Can it be that current cyclone rmw is broken?

TimoKettunen commented 2 years ago

While switching to default rmw fixed the subscription problem, it did not fix the original exception problem. We switched to latest ros packages and default rmw but we still get exception after a while. I have not been able to verify this problem with test program, it only appears with our app.

adamdbrw commented 2 years ago

@TimoKettunen thank you for researching the problem. Would you be able to take this issue on? This looks like there is a couple of issues based on your description, could you confirm?

  1. CycloneDDS seems to have some kind of asymmetrical behavior when used through rcl API
  2. There is some kind of synchronization issue or a missing check at least for a valid subscription implementation.
TimoKettunen commented 2 years ago

Yes, the failing Dispose seems to be separate issue. I will create separate github issue for that with the test program. What you mean by "Would you be able to take this issue on" ? Do you mean that would start to fix it? I have spent too much time already in order to reproduce it. If I get some hints how to setup debugging so that I would see more accurately what is the case, I can still investigate. But I am strongly considering switching to ROS2 support provided by Unity if this is not solved this week.

adamdbrw commented 2 years ago

I believe we can take this on our side on Thursday. Can't promise this will be solved by the end of the week though.

TimoKettunen commented 2 years ago

I can easily reproduce the exception problem with replay and our app. But I am not able to reproduce it with test app. If you have a hunch where the problem may be, please have a look tomorrow if you have time.

BTW, about the Dispose with cyclone rmw, after computer boot it worked OK. So no new issue with that.

TimoKettunen commented 2 years ago

test_sender.tar.gz Added updated test program that better reflects the application that causes the problem. I have not been able to reproduce the problem with this either but maybe the structure gives a hint about the conditions. There is one executable (topic_publisher) that publishes the topics that should be published and subscribed. Another executable (publisher_test) receives this as command to publish and Unity program receives this as command to subscribe. Topics are removed correspondingly with another command. In real application, the removal of subscription (calling Dispose) causes randomly the exception and all ros2 subscriptions stop working. This requires running app day or two. It is not resource exhaustion, at least not for memory or cpu. In real application these event do not happen so often as in test program but some removals may happen within same second. I hope this helps a bit to study the problem. I will help any way I can, this is a big problem for us.

pijaro commented 2 years ago

I did some extensive testing with a modified version of your dispose.zip example code (from first post) on foxy + cyclonedds/fastrtps and indeed, I was not able to reproduce the issue.

Did you encounter these issues on foxy, or both on foxy and galactic?

I will look into the updated test program and will continue diagnosing the issue.

TimoKettunen commented 2 years ago

We use only foxy.

TimoKettunen commented 2 years ago

I looked on ros2 sources about this exception. The wait.c:324 is in function rcl_wait_set_add_subscription. This is strange since the error happens always on disposing the subscription, not when adding new subscription. Last time I saw this, several minutes were passed from last added subscription. Then there was six removal of subscriptions, last one caused the exception.

TimoKettunen commented 2 years ago

Now we integrated new plugin in another app and got the same exception. So now I am certain that this is a bug in ros2cs and not in our app. In addition, we see now that the subscriptions are not always forgotten. So disposed subscriptions are still alive and send data to same call back than the new one.

adamdbrw commented 2 years ago

I checked before that the rcl_subscription_fini is properly called.

Quick question: is there any scenario in which you could be doing something of this kind (in pseudocode), considering threads that you are running:

Sub A = Node.CreateSubscription A = Node.CreateSubscription // Again A.Dispose()

It is also a good idea to share your threading model. We were unable to reproduce the issue so far.

However, I do have one theory which could explain what is happening, @pijaro will be checking it promptly. If it proves correct, you should have a fix by the end of the day (as a single dll to replace).

TimoKettunen commented 2 years ago

Every time a new subscription is created, old one is first disposed and then set to null. So, no, I do not think that the scenario happens. We have no threads of our own. The ROS2 callbacks are called with separate threads so data is protected by locks or ConcurrentQueue is used to pass messages to Unity thread.

adamdbrw commented 2 years ago

https://github.com/RobotecAI/ros2cs/blob/5b4d831ca99d169f816562ad6c4f6c61866ada08/src/ros2cs/ros2cs_core/WaitSet.cs#L85 lines 80-86 should be locked by the same mutex that is used in the Dispose() body of Subscription. We will try to reproduce using this hint.

TimoKettunen commented 2 years ago

I noticed that the node has method RemoveSubscription. We do not call it in any phase. It just removes subscription from Nodes HashSet of subscriptions. I did not find any code in ros2cs that removes the subscription. If we just Dispose the subscription, will this subscription remain in hash set forever? Can it cause any problem when node processes the items in hash set? Should we call the RemoveSubscription?

adamdbrw commented 2 years ago

Yes, indeed, you could either RemoveSubscription and Dispose it, or just Dispose it, but the second option leaves "zombie" subs that inflate the hash set (even though they are not processed). Thus it is preferable to go with the first option.

I think that we should also call subscriptions' Dispose in there, need to check it.

It is a bit challenging to make everything perfect because of different memory models (C# / C)

pijaro commented 2 years ago

@TimoKettunen I think I managed to reproduce the issue, could you please try to replace your ros2cs_core.dll with this one: ros2cs_core.zip and see if the problem still persist?

pijaro commented 2 years ago

The solution is based on @adamdbrw suggestion from this comment.

TimoKettunen commented 2 years ago

The new ros2cs_core is now in test run. I noted that it does not fix the problem of non-working Dispose. So the app gets messages from topics even though the subscriber of that topic has been disposed. As I noted earlier, I saw this my small test program but then it went away after reboot. But now it appears always in real app, regardless of reboots.

TimoKettunen commented 2 years ago

The patch has now been in test run in four setups and we have not seen any exceptions. This really seems to fix the problem. Good job! The problem of Dispose not actually removing subscriptions still persists but that is another problem, I will continue to investigate it and create new issue when I can reproduce it.

adamdbrw commented 2 years ago

@pijaro submit PR for review, we want it merged so we can close this.

TimoKettunen commented 2 years ago

Just to verify, test environments running over weekend and no problems detected. The other problem related to Dispose not working was also clarified. It was bug in app, yield in wrong place caused double subscription.

adamdbrw commented 2 years ago

@TimoKettunen Excellent news, we seem to have a clean situation now. Thank you for your help with fixing this one