Closed haixuanTao closed 3 hours ago
@phil-opp, if you want to reproduce the race condition, you can use the following dataflow, which is based on the python-dataflow example:
nodes:
  - id: webcam
    custom:
      source: ./webcam.py
      inputs:
        tick:
          source: dora/timer/millis/50
          queue_size: 1000
      outputs:
        - image
  - id: object_detection
    custom:
      source: ./object_detection.py
      inputs:
        image: webcam/image
      outputs:
        - bbox
  - id: object_detection_1
    custom:
      source: ./object_detection.py
      inputs:
        image: webcam/image
      outputs:
        - bbox
  - id: object_detection_2
    custom:
      source: ./object_detection.py
      inputs:
        image: webcam/image
      outputs:
        - bbox
  - id: object_detection_3
    custom:
      source: ./object_detection.py
      inputs:
        image: webcam/image
      outputs:
        - bbox
  - id: object_detection_4
    custom:
      source: ./object_detection.py
      inputs:
        image: webcam/image
      outputs:
        - bbox
  - id: plot
    custom:
      source: ./plot.py
      inputs:
        image: webcam/image
        bbox: object_detection/bbox
  - id: plot_1
    custom:
      source: ./plot.py
      inputs:
        image: webcam/image
        bbox: object_detection/bbox
  - id: plot_2
    custom:
      source: ./plot.py
      inputs:
        image: webcam/image
        bbox: object_detection/bbox
  - id: plot_3
    custom:
      source: ./plot.py
      inputs:
        image: webcam/image
        bbox: object_detection/bbox
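Since the dataflow above is one webcam node plus several identical copies of the detection and plot nodes, it may be handy to generate it with a script when trying more or fewer copies. The following is only a sketch: `build_dataflow`, `num_detectors` and `num_plots` are hypothetical names that are not part of dora, and the nesting assumes the usual `custom` node layout of the python-dataflow example.

```python
def build_dataflow(num_detectors=5, num_plots=4):
    """Return the reproduction dataflow as YAML text (hypothetical helper)."""

    def node_id(base, index):
        # The first copy keeps the bare name; later copies get a numeric suffix.
        return base if index == 0 else f"{base}_{index}"

    lines = [
        "nodes:",
        "  - id: webcam",
        "    custom:",
        "      source: ./webcam.py",
        "      inputs:",
        "        tick:",
        "          source: dora/timer/millis/50",
        "          queue_size: 1000",
        "      outputs:",
        "        - image",
    ]
    for i in range(num_detectors):
        lines += [
            f"  - id: {node_id('object_detection', i)}",
            "    custom:",
            "      source: ./object_detection.py",
            "      inputs:",
            "        image: webcam/image",
            "      outputs:",
            "        - bbox",
        ]
    for i in range(num_plots):
        # As in the dataflow above, every plot node reads the bbox output
        # of the first detector only.
        lines += [
            f"  - id: {node_id('plot', i)}",
            "    custom:",
            "      source: ./plot.py",
            "      inputs:",
            "        image: webcam/image",
            "        bbox: object_detection/bbox",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_dataflow(), end="")
```

With the default arguments this should produce the same ten nodes as listed above; raising `num_detectors` might make the race easier to hit, though that is untested.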
After starting the dataflow, waiting a couple of seconds, and sending a stop signal (either with Ctrl-C or with the stop command), the nodes are not able to exit within the grace duration. They only exit after 30s, once the drop token timeout kicks in.
[ERROR]
Dataflow 0190772a-f3e5-75a6-b7db-ddce5fad5a4e failed:
Node `object_detection` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
Node `object_detection_1` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
Node `object_detection_3` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
Node `object_detection_4` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
Node `object_detection_2` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
I have tried many, many things to solve it, and so far the only way I can get a graceful exit is by reducing the timeout on those receivers.
I would love to have a better solution, but I haven't figured one out yet.
This PR reduces the time to wait before ignoring unsent drop tokens from 30s to 1s, because in Python this error is linked to a drop token race condition involving the GIL.
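To illustrate the idea of a bounded wait, here is a minimal sketch in plain Python. It is not dora's actual implementation: `wait_for_drop_tokens`, `pending`, `finished` and `timeout_s` are hypothetical names, and treating the timeout as one overall deadline (rather than a per-receive timeout) is an assumption.

```python
import queue
import threading
import time


def wait_for_drop_tokens(pending, finished, timeout_s=1.0):
    """Wait until every token in `pending` has been reported on `finished`.

    Returns the set of tokens still outstanding when the wait ended
    (empty if everything was released in time). Hypothetical helper.
    """
    pending = set(pending)
    deadline = time.monotonic() + timeout_s
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline reached: stop waiting and ignore the rest
        try:
            token = finished.get(timeout=remaining)
        except queue.Empty:
            break  # nothing arrived before the deadline
        pending.discard(token)
    return pending


if __name__ == "__main__":
    finished = queue.Queue()
    # Simulate one token being released shortly after shutdown starts;
    # the second token is never released.
    threading.Timer(0.05, finished.put, args=("token-a",)).start()
    leftover = wait_for_drop_tokens({"token-a", "token-b"}, finished, timeout_s=0.2)
    print("ignored after timeout:", sorted(leftover))
```

With a short timeout, a token that is never released only delays shutdown by that amount, which is the trade-off this PR makes by going from 30s to 1s.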