dora-rs / dora

DORA (Dataflow-Oriented Robotic Architecture) is middleware designed to streamline and simplify the creation of AI-based robotic applications. It offers low latency, composable, and distributed dataflow capabilities. Applications are modeled as directed graphs, also referred to as pipelines.
https://dora-rs.ai
Apache License 2.0
1.5k stars 79 forks source link

dora node packet loss #515

Closed RuPingCen closed 2 months ago

RuPingCen commented 4 months ago

Describe the bug The dora is V0.3.2.

I use two nodes. Node A is written in Python. Node B is written in c/c++.

Node A reads GPS data at a 20ms cycle and sends the data via a json string. Node B performs coordinate conversion by receiving data from node A, but I found that node B will lose data packets when receiving data from node A.

Here I print the seq number of the data pub by node A, the message count received by node B, and the time when the message was received. There was no packet loss at the beginning, but after running for a while, the message count received by node B was not equal to the seq value of the message.

And, the period of node B to receive data should normally be 20ms, which depends on the period of node A sending data. However, the following picture shows that the time difference between the two events is 40ms. (lower left corner)

here is the test code: https://github.com/dora-rs/autoware.universe/blob/feature/autoware_dora/localization/ekf_localizer/test_ekf_dataflow.yml

数据丢包

haixuanTao commented 4 months ago

For the latency, could you try removing the sleep here: https://github.com/dora-rs/autoware.universe/blob/db29af5314dbe4c3c9c5c79ab78e33f2a8bbd327/localization/ekf_localizer/src/gnss_ekf.cc#L65

RuPingCen commented 4 months ago

The _“EKF_fusionpthread“” thread is the sleep of the thread that sends data and has nothing to do with data reception. Data packet loss has occurred when receiving data from other nodes in the function *“int run(void dora_context) ”**.

phil-opp commented 4 months ago

Dora uses a default queue size of 10 for each input. If the receiver cannot keep up with the sender, dora deliberately drops some inputs (oldest first). If you set the RUST_LOG environment variable to debug you will see a "dropped X inputs because event queue was too full" in the command-line output.

You can set a custom queue size in your YAML file, e.g.: https://github.com/dora-rs/dora/blob/ea47a556a77f6c64f29ab07939d747edc8573671/examples/python-dataflow/dataflow.yml#L5-L8

RuPingCen commented 4 months ago

As you said I set the parameter queue_size: 100, but data packet loss still exists.

I check the logs for the python node "gnss" , which is normal pub the data with 4643. 2024-05-29 10-25-16 的屏幕截图

The second node "gnss_poser" with c/c++ finally received only 4229 data_seq.

2024-05-29 10-22-17 的屏幕截图

This phenomenon started to occur when receiving the 33rd data seq. 2024-05-29 10-23-10 的屏幕截图

here is the xxx.yaml file:

nodes:
  - id: gnss
    operator:
      python: ../../dora-hardware/vendors/gnss/CGI_610/CGI_610_driver_dora_with_file.py  #读取文件用于测试
      #python: ../../dora-hardware/vendors/gnss/CGI_610/CGI_610_driver_dora.py
      inputs:
        tick: dora/timer/millis/20
      outputs:
        - DoraSentence
        - DoraNavSatFix
        - DoraQuaternionStamped
        - DoraTwistStamped
    env:
      DEVICE_INDEX: 0

  - id: gnss_poser  #坐标系转化
    custom:
      source: ../../sensing/gnss_poser/build/gnss_poser_core
      inputs:
        DoraNavSatFix: gnss/DoraNavSatFix
      queue_size: 100 
      outputs:
        - DoraNavSatFix 

  - id: gnss_rviz2  #在RVIZ2中显示转换后的轨迹 (E-N-U)
    operator:
      python: ../../sensing/gnss_poser/src/gps_to_ros2.py
      inputs:
        DoraNavSatFix: gnss_poser/DoraNavSatFix
      queue_size: 100 

  - id: gnss_ekf  # c/c++节点
    custom:
      source: build/gnss_ekf
      inputs:
              DoraNavSatFix: gnss_poser/DoraNavSatFix
              DoraQuaternionStamped: gnss/DoraQuaternionStamped
      outputs:
            - DoraGnssPose
      queue_size: 100 

  - id: ekf_odom_rviz2  #在rviz中显示数据
    operator:
      python: src/ekf_odom_to_ros2.py
      inputs:
          DoraGnssPose: gnss_ekf/DoraGnssPose
      queue_size: 100 
bobd988 commented 4 months ago

Hu has also tested with 100K queue size today and can reproduce this issue.

phil-opp commented 4 months ago

Thanks for the details!

Could you please also share the full output of the terminal on the top left? I see that there are some error in the output, so the full content might help debugging this. Also, setting RUST_LOG=debug before dora up should expose more details in the logs, including information when events are dropped.

RuPingCen commented 4 months ago

The error in the terminal on the upper left corner is that we did not use a real GPS sensor during debugging, but used the data in the txt file to simulate GPS data. This error is generated after the data in the txt file has been read and the node exits. Therefore, this error should not affect data loss.

The details can be found in line 26 of file CGI_610_driver_dora_with_file.py( https://github.com/dora-rs/autoware.universe/blob/feature/autoware_dora/dora-hardware/vendors/gnss/CGI_610/CGI_610_driver_dora_with_file.py)

phil-opp commented 4 months ago

I just noticed that you added the queue_size parameter as a child of the the custom: object. It has no effect there. Instead, you need to add it as a child field to an input. For that, add another level of nesting underneath the input, specify the source mapping as a source field, and then add the queue_size as a second field. I opened a small PR for that: https://github.com/dora-rs/autoware.universe/pull/6. Please retry again with these changes.

starlitxiling commented 4 months ago

Hello! I tested this change and the packet loss problem still exists. image here is the yaml file:

nodes:
  - id: gnss
    operator:
      python: ../../dora-hardware/vendors/gnss/CGI_610/CGI_610_driver_dora_with_file.py  #读取文件用于测试
      #python: ../../dora-hardware/vendors/gnss/CGI_610/CGI_610_driver_dora.py
      inputs:
        # tick: dora/timer/millis/20
        tick:
          source: dora/timer/millis/20
          queue_size: 1000
      # queue_size: 100
      outputs:
        - DoraSentence
        - DoraNavSatFix
        - DoraQuaternionStamped
        - DoraTwistStamped
    env:
      DEVICE_INDEX: 0

  # - id: gnss_sub  #数据格式转换 json
  #   operator:
  #     python: ../../dora-hardware/vendors/gnss/CGI_610/nmea_subscribeTopic.py
  #     inputs:
  #             DoraSentence: gnss/DoraSentence
  #             DoraNavSatFix: gnss/DoraNavSatFix
  #             DoraQuaternionStamped: gnss/DoraQuaternionStamped
  #             DoraTwistStamped: gnss/DoraTwistStamped
  #     outputs:
  #       - DoraNavSatFix
  #       - DoraQuaternionStamped

  - id: gnss_poser  #坐标系转化
    custom:
      source: ../../sensing/gnss_poser/build/gnss_poser_core
      inputs:
        DoraNavSatFix: 
          source: gnss/DoraNavSatFix
          queue_size: 1000
      outputs:
        - DoraNavSatFix 

  - id: gnss_rviz2  #在RVIZ2中显示转换后的轨迹 (E-N-U)
    operator:
      python: ../../sensing/gnss_poser/src/gps_to_ros2.py
      inputs:
        DoraNavSatFix: 
          source: gnss_poser/DoraNavSatFix
          queue_size: 1000
##############################################################################################
  - id: gnss_ekf  # python 实现
    # custom:
    #   source: build/gnss_ekf
    operator:
      python: ../../localization/ekf_localizer/src/gnss_ekf.py
      inputs:
        DoraNavSatFix: 
          source: gnss_poser/DoraNavSatFix
          queue_size: 1000 
        DoraQuaternionStamped: 
          source: gnss/DoraQuaternionStamped
          queue_size: 1000
        tick: 
          source: dora/timer/millis/20
          queue_size: 1000 
      outputs:
            - DoraGnssPose

  # - id: gnss_ekf  # c/c++节点
  #   # custom:
  #   #   source: build/gnss_ekf
  #     inputs:
  #             DoraNavSatFix: gnss_poser/DoraNavSatFix
  #             DoraQuaternionStamped: gnss/DoraQuaternionStamped
  #     outputs:
  #           - DoraGnssPose

  - id: ekf_odom_rviz2  #在rviz中显示数据
    operator:
      python: src/ekf_odom_to_ros2.py
      inputs:
        DoraGnssPose: 
          source: gnss_ekf/DoraGnssPose
          queue_size: 1000
phil-opp commented 4 months ago

For debugging, it would be useful to have a minimal example that still triggers this issue. Could you try the following:

After these steps, we ideally have a very small example that we can try easily on our machines without working into the autoware build details.

starlitxiling commented 4 months ago

Hello,This is the complete reproduction process of the above example: https://github.com/dora-rs/autoware.universe/blob/feature/autoware_dora/localization/ekf_localizer/Packet_loss_test.md For minimal example,I write demo to reproduce this proble,the code is here: https://github.com/dora-rs/autoware.universe/tree/feature/autoware_dora/tools/C_Python_test. And in this process, you may also encounter issue516. Because the process involves c node, it is inevitable to encounter some build processes.

bobd988 commented 4 months ago

The second link is the minimal example. The link for all examples is

https://github.com/dora-rs/autoware.universe/tree/feature/autoware_dora/tools

Here are some minimal test tools helping reproduce issues.
Folder C_Python_test
It is a minimal example the one for reproduce #515 and #516. For instructions please go to folder and you will see the steps.

Folder C_node_test
It is a minimal example the one for reproduce #502. For instructions please go to folder and you will see the steps.
bobd988 commented 4 months ago

Here are some new results for #515 after making changes to philip's comments - "I just noticed that you added the queue_size parameter as a child of the the custom: object. It has no effect there. Instead, you need to add it as a child field to an input.". You can see that change does not solve the issue.


queue_size: 1000

First restart:
  First time: package loss 1
  Second time: package loss 0
  Third time: package loss 3
  Fourth time: package loss 0
  Fifth time: package loss 4

Second restart:
  First time: package loss 1
  Second time: package loss 43
  Third time: package loss 4
  Fourth time: package loss 4
  Fifth time: package loss 5

queue_size: 100

First restart:
  First time: package loss 2
  Second time: package loss 6
  Third time: package loss 3
  Fourth time: package loss 637
  Fifth time: package loss 3

Second restart:
  First time: package loss 193
  Second time: package loss 259
  Third time: package loss 1
  Fourth time: package loss 1
  Fifth time: package loss 4
bobd988 commented 4 months ago

The easiest way to reproduce issue is this minimal test example - https://github.com/dora-rs/autoware.universe/tree/feature/autoware_dora/tools/C_Python_test.

phil-opp commented 3 months ago

Ok, so I tried your example and I can reproduce the dropped messages. If you spawn the coordinator and daemon manually and set RUST_LOG=debug, you see a lot of "dropped X inputs because event queue was too full" log messages. This occurs for both the noda_A/tick input (a lot) and for the node_B/input-1.

When I increase the queue_size of node_B/input-1 to 10000, all 10k messages reach the Python node:

Input Data length: 3
Received messages: 10000/10000
=======================

The reason why this happens mostly when combining a sending C node with a Python receiver is probably that Python is slower, so it cannot keep up with the inputs.

phil-opp commented 3 months ago

I think a change such as #533 would make it more obvious what's happening.

haixuanTao commented 2 months ago

After further investigation, there were no packet loss and it was a problem in the C++ String with no null termination.