Running execution takes lot of time compared to ROS1

RPRKLR commented 4 months ago

When using FlexBE with ROS1, the execution of a large behavior (with nested concurrency containers and numerous behaviors) starts within seconds, typically within 5 seconds. However, with ROS2 Iron and the Flexbe WebUI, the same behavior takes more than 5 minutes to start.

Steps to Reproduce:

Start a large behavior with nested concurrency containers in FlexBE using ROS2 Iron and the Flexbe WebUI.
Press the start execution button.
Observe that the WebUI indicates that it starts the behavior, but then jumps back to the starting screen, allowing the start execution button to be pressed again.
Wait for approximately 1 minute until logger messages, such as "service is not available," appear in the terminal.
Wait an additional 3-5 minutes for the states to start building, showing a large number of entries.
Once the state building finishes, press the Attach button in the Runtime control panel.
Expected Behavior:

The behavior execution should start within seconds, similar to the performance observed with ROS1.

Actual Behavior:

The behavior execution takes more than 5 minutes to start, with the WebUI incorrectly indicating that the behavior can be started again, and significant delays in logger messages and state building.

Picture of the high amount of building state entries

Screenshot from 2024-07-10 08-46-30

Environment:

ROS Version: ROS2 Iron FlexBE Engine Version: 3.0.4 WebUI Version: beta Operating System: Ubuntu 22.04

dcconner commented 4 months ago

This sounds similar to https://github.com/FlexBE/flexbe_app/issues/100 that you reported, so you are seeing the issue with both app and webui.

Clearly unacceptable, but I can't think of FlexBE reason for this.

I suspect we have issue with DDS discovery as it is bringing the messages online.

Can I ask you to do two things.

1) Try the DDS discovery server https://fast-dds.docs.eprosima.com/en/latest/fastdds/ros2/discovery_server/ros2_discovery_server.html

IIUC, this basically reverts us back to ROS 1 mode with a centralized roscore. This is probably not the way you want to run, but just to help us debug.

2) Try a different DDS implementation. https://docs.ros.org/en/iron/Installation/DDS-Implementations.html. Again, there may be good reasons for you not to switch your system over, but just to eliminate some issues during testing.

Questions:

0) What DDS setup are you using, and what is your computing and networking setup (wired vs. wireless, number of machines, local network in lab or part of larger building)?

1) with the image above, is that the point it hangs? Can you then capture an image showing the messages after the timeout is elapsed and it continues? I'd like to get a better understanding of exactly were it is hanging up.

2) While this delay is going on, can you try doing some other ros2 related DDS task like "ros2 topic echo" of some unrelated topic? (Maybe, but not necessarily, that would point to DDS issue).

dcconner commented 4 months ago

Separate from the timing issues, there is a UI issue

Observe that the WebUI indicates that it starts the behavior, but then jumps back to the starting screen, allowing the start execution button to be pressed again.

I am noting that and will look into that while I wait on feedback from networking test.

RPRKLR commented 3 months ago

We were running the standard DDS setup which comes with ROS2

Try the DDS discovery server

After setting up the DDS server, the startup time was the same, if not more. I have also noticed another issue, that flexbe was complaining in the terminal, the message was the following: [WARN] Topic test for state None not yet available. it does not do it for every topic just a few. I have also tried it with and without super client, because I was suspecting that maybe flexbe can not see the topics. Both with and without super client I got the same warning messages.

Try a different DDS implementation.

I'm yet to try out other type of DDS, I will try to set up a test for you as soon as I have the time.

What DDS setup are you using, and what is your computing and networking setup (wired vs. wireless, number of machines, local network in lab or part of larger building)?

The default Fastrtps, network is wireless, maximum 4 machines, in a lab.

with the image above, is that the point it hangs? Can you then capture an image showing the messages after the timeout is elapsed and it continues? I'd like to get a better understanding of exactly were it is hanging up.

It does not actually hang for a long time, when starting the behavior, it takes around 40 seconds until I get some kind of a terminal message, and then it is not actually freezing, I am constantly getting the messages for every state which is being created. It just takes a lot of time, until it finishes the building of the behavior.

While this delay is going on, can you try doing some other ros2 related DDS task like "ros2 topic echo" of some unrelated topic? (Maybe, but not necessarily, that would point to DDS issue).

I tried to do topic list and echo, and they are working while it is building the behavior.

I have one question too, when it finishes the building of the behavior/state is it normal if I am getting a message like 1450 entries and 1 collision? What does the collision mean?

dcconner commented 3 months ago

last question first. The system uses a "state_id" assigned by hashing the path to state in the behavior. There is a simple method for resolving collisions, and I keep track of the number of collisions that are processed as a debugging message. So the message you are referring to just say it processed 1450 states and only had to resolve one collision. If it fails to resolve the collision it throws an error, so the message you are seeing is perfectly normal. I'll consider rewording this message to make that clearer.

dcconner commented 3 months ago

At this point, I've always logged the messages to console as each state is processed. With 1450 states, the IO does add a non-trival amount of processing time. I'll look into making this optional, or just eliminate. But that's still on the order of seconds of savings, not minutes. With regard to the "Service not available" timeouts. Normally, I write states that use 0 wait timeout in "on_start", and then have a wait for connection in on_enter to avoid these start up delays. If you are establishing service/action connections in on_start and assigning a wait time there it could account for some delay.

RPRKLR commented 3 months ago

I am also having another issue while using FastDDS Discovery server, I have a topic with these QoS settings:

self._qos = QoSProfile(depth = 1,
                               history = QoSHistoryPolicy.KEEP_LAST,
                               reliability = QoSReliabilityPolicy.RELIABLE,
                               durability = QoSDurabilityPolicy.TRANSIENT_LOCAL)

I can echo the topic and I am getting back the value, but when I create a state which is subscribing to the topic, I am getting a warning message which says: Topic /test_topic/test for state None not yet available. I can see the topic, and I can echo it, and both the subscriber and the publisher are having the same QoS settings.

RPRKLR commented 3 months ago

I have made some discoveries since my last comment, the issue is that if I am using the FastDDS Discovery Server, I am creating a topic type and name check with the help of

self._node.get_topic_names_and_types()

But it only returns the flexbe related topics like /flexbe/hearthbeat and /flexbe/log and the /rosout. I also tried to get the topics with the super client, but it still does not work. For now I don't know if it is just an issue while using flexbe or is it actually a ros2 problem.

dcconner commented 3 months ago

What is your network configuration? One local machine, or are you running get_topic_names_and_types on a different computer?

RPRKLR commented 3 months ago

What is your network configuration? One local machine, or are you running get_topic_names_and_types on a different computer?

First I tried to run the onboard on the robot and ran the behavior on another computer, and then I tried to run both the onboard and the behavior on the robot, I got the same results. I communicate with the robot with WiFi, and I am using a SUPER_CLIENT, so I can list and echo the topics. I have opened an issue and got a temporary solution, which is to spin the node one more time before calling the function. With the basic node it works but I have not had the chance yet, to try it out with flexbe, that is what I am going to do now.

RPRKLR commented 3 months ago

I have tried to run the flexbe code by spinning one more time before running the function. It breaks the flexbe, and it deadlocked after the first state. Good news is that I read the FastDDS documentation one more time, and implemented the SUPER_CLIENT in a different way, and now I don't have to spin the node one more time inside the flexbe code, on other nodes where I don't use flexbe, I have to spin it. After setting up the SUPER_CLIENT the startup of the big flexbe behavior is faster, now it is 2 mins and 20 seconds. It is faster but still not fast enough. This was tested with behavior engine version 3.0. Today I am planning to also test it with the version 4.0

RPRKLR commented 3 months ago

After updating the flexbe behavior engine to 4.0.0, the big behavior startup time decreased a lot, good job on the new release. Now if I start the onboard engine, and then start the ros2 run flexbe_widget be_launcher -b BigBehavior, it starts in 1 min and 12 seconds without using the FastDDS Discovery Server. With the Discovery Server it starts in 48 seconds. It's still a lot compared to ROS1 startup time, but it is a great improvement. Good job!

dcconner commented 1 month ago

A suggestion for anyone seeing this later

create subscriptions/action client in __init__or on_start with a 0.0 timeout, then check status in on_enter with some timeout.

FlexBE / flexbe_behavior_engine

Running execution takes lot of time compared to ROS1 #23

Steps to Reproduce:

Expected Behavior:

Actual Behavior: