dora-rs / dora

DORA (Dataflow-Oriented Robotic Application) is middleware designed to streamline and simplify the creation of AI-based robotic applications. It offers low-latency, composable, and distributed dataflow capabilities. Applications are modeled as directed graphs, also referred to as pipelines.
https://dora-rs.ai
Apache License 2.0

Ignore-quicker-pending-drop-token #568

Closed haixuanTao closed 3 hours ago

haixuanTao commented 6 days ago

This PR reduces the time to wait before ignoring an unsent drop token from 30s to 1s. In Python, this error is linked to a drop token race condition with the GIL.
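
To illustrate why the length of this timeout matters at shutdown, here is a minimal sketch of a bounded wait for outstanding drop tokens. It is not dora's implementation: the function name `wait_for_drop_tokens`, the use of a `queue.Queue` as the token channel, and the choice of a single overall deadline are all assumptions made for the example.

```python
import queue
import time


def wait_for_drop_tokens(pending, incoming, timeout_s):
    """Wait until every token in `pending` arrives on `incoming`, or give up.

    Hypothetical helper: returns the set of tokens that never arrived and
    are therefore ignored once the deadline has passed.
    """
    pending = set(pending)
    deadline = time.monotonic() + timeout_s
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline reached, stop waiting
        try:
            token = incoming.get(timeout=remaining)
        except queue.Empty:
            break  # nothing arrived before the deadline
        pending.discard(token)
    return pending


if __name__ == "__main__":
    incoming = queue.Queue()
    incoming.put("token-a")  # one receiver reports back, the other never does
    start = time.monotonic()
    ignored = wait_for_drop_tokens({"token-a", "token-b"}, incoming, timeout_s=1.0)
    print(f"ignored {sorted(ignored)} after {time.monotonic() - start:.1f}s")
```

In this sketch a token that is never reported delays shutdown by the full timeout, so a 30s timeout can outlast a shorter stop grace period while a 1s timeout may not.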

haixuanTao commented 5 days ago

CI is blocked by: https://community.anaconda.cloud/t/cant-download-anaconda-say-violating-terms-of-service/76074/14

Opened at: https://github.com/conda-incubator/setup-miniconda/issues/357

haixuanTao commented 13 hours ago

So, @phil-opp, if you want to reproduce the race condition, you can use the following dataflow, which is based on the python-dataflow example:

nodes:
  - id: webcam
    custom:
      source: ./webcam.py
      inputs:
        tick:
          source: dora/timer/millis/50
          queue_size: 1000
      outputs:
        - image

  - id: object_detection
    custom:
      source: ./object_detection.py
      inputs:
        image: webcam/image
      outputs:
        - bbox

  - id: object_detection_1
    custom:
      source: ./object_detection.py
      inputs:
        image: webcam/image
      outputs:
        - bbox

  - id: object_detection_2
    custom:
      source: ./object_detection.py
      inputs:
        image: webcam/image
      outputs:
        - bbox

  - id: object_detection_3
    custom:
      source: ./object_detection.py
      inputs:
        image: webcam/image
      outputs:
        - bbox

  - id: object_detection_4
    custom:
      source: ./object_detection.py
      inputs:
        image: webcam/image
      outputs:
        - bbox

  - id: plot
    custom:
      source: ./plot.py
      inputs:
        image: webcam/image
        bbox: object_detection/bbox

  - id: plot_1
    custom:
      source: ./plot.py
      inputs:
        image: webcam/image
        bbox: object_detection/bbox

  - id: plot_2
    custom:
      source: ./plot.py
      inputs:
        image: webcam/image
        bbox: object_detection/bbox

  - id: plot_3
    custom:
      source: ./plot.py
      inputs:
        image: webcam/image
        bbox: object_detection/bbox
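
To vary how many nodes subscribe to `webcam/image`, the same dataflow could also be generated by a small script. This is only a sketch: the helper names `node_block` and `build_dataflow` and their parameters are hypothetical, and plain string formatting is used so that no YAML library is needed.

```python
def node_block(node_id, source, inputs, outputs=()):
    """Render one `custom` node entry in the same layout as the dataflow above."""
    lines = [
        f"  - id: {node_id}",
        "    custom:",
        f"      source: {source}",
        "      inputs:",
    ]
    for name, value in inputs.items():
        if isinstance(value, dict):
            # Nested form, e.g. an input with `source` and `queue_size`.
            lines.append(f"        {name}:")
            for key, val in value.items():
                lines.append(f"          {key}: {val}")
        else:
            lines.append(f"        {name}: {value}")
    if outputs:
        lines.append("      outputs:")
        lines.extend(f"        - {out}" for out in outputs)
    return "\n".join(lines)


def build_dataflow(n_detectors=5, n_plots=4):
    """Return the dataflow text with the given number of detector and plot nodes."""

    def suffix(i):
        # The first node has no suffix, the others get _1, _2, ...
        return "" if i == 0 else f"_{i}"

    tick = {"source": "dora/timer/millis/50", "queue_size": 1000}
    blocks = [node_block("webcam", "./webcam.py", {"tick": tick}, ["image"])]
    for i in range(n_detectors):
        blocks.append(node_block(
            f"object_detection{suffix(i)}", "./object_detection.py",
            {"image": "webcam/image"}, ["bbox"],
        ))
    for i in range(n_plots):
        blocks.append(node_block(
            f"plot{suffix(i)}", "./plot.py",
            {"image": "webcam/image", "bbox": "object_detection/bbox"},
        ))
    return "nodes:\n" + "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(build_dataflow())
```

With the default arguments this produces the same five detection nodes and four plot nodes as listed above.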

If you start the dataflow, wait a couple of seconds, and then send a stop signal (either with Ctrl-C or with the stop command), the nodes are not able to exit within the grace duration. They only exit after 30s, once the drop token timeout kicks in.
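
The start, wait, and stop sequence can be timed with a small script. The sketch below assumes a POSIX system and uses a hypothetical helper, `time_graceful_stop`; the command in the usage part is a stand-in child process, to be replaced with however the dataflow is actually launched.

```python
import signal
import subprocess
import sys
import time


def time_graceful_stop(cmd, run_for_s, grace_s):
    """Start `cmd`, let it run, send SIGINT, and time how long it takes to exit.

    Hypothetical helper. Returns (exited_in_time, seconds_until_exit). If the
    process is still alive after `grace_s` seconds it is killed, similar in
    spirit to the SIGKILL fallback reported in the error output.
    """
    proc = subprocess.Popen(cmd)
    time.sleep(run_for_s)
    start = time.monotonic()
    proc.send_signal(signal.SIGINT)  # SIGINT is the signal Ctrl-C delivers
    try:
        proc.wait(timeout=grace_s)
        exited_in_time = True
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        exited_in_time = False
    return exited_in_time, time.monotonic() - start


if __name__ == "__main__":
    # Stand-in process that just sleeps; not a dora command.
    stand_in = [sys.executable, "-c", "import time; time.sleep(60)"]
    ok, elapsed = time_graceful_stop(stand_in, run_for_s=2.0, grace_s=5.0)
    print(f"exited within grace period: {ok} ({elapsed:.2f}s)")
```

Comparing the measured time before and after a timeout change gives a quick check of whether the nodes now stop within the grace duration.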


[ERROR]
Dataflow 0190772a-f3e5-75a6-b7db-ddce5fad5a4e failed:

Node `object_detection` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
Node `object_detection_1` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
Node `object_detection_3` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
Node `object_detection_4` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
Node `object_detection_2` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)

I have tried many, many things to solve this, and so far the only way I can get a graceful exit is by reducing the timeout on those receivers.

I would love to have a better solution, but I haven't figured one out yet.