dora-rs / dora

DORA (Dataflow-Oriented Robotic Architecture) is middleware designed to streamline and simplify the creation of AI-based robotic applications. It offers low-latency, composable, and distributed dataflow capabilities. Applications are modeled as directed graphs, also referred to as pipelines.
https://dora-rs.ai
Apache License 2.0

failed to stop dataflow #292

Closed meua closed 1 month ago

meua commented 1 year ago

Describe the bug
dora-daemon hangs due to a heartbeat timeout while dora-coordinator keeps running normally. After I restart dora-daemon, the dataflow can no longer be stopped with dora stop uuid.

(dora3.7) jarvis@jia:~/coding/dora_home/dora$ conda activate py310
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli up
started dora coordinator
started dora daemon

(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli -V
dora-cli 0.2.3-rc6
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli check
Dora Coordinator: ok
Dora Daemon: ok

(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli start examples/python-operator-dataflow/dataflow.yml --attach --hot-reload
10af7c98-604d-4808-b48a-7e028cb3d733
  2023-05-19T03:53:57.743423Z  WARN dora_coordinator: daemon at `` did not react as expected to watchdog message

Caused by:
   0: failed to send watchdog message to daemon
   1: Broken pipe (os error 32)

Location:
    /home/jarvis/coding/dora_home/dora/binaries/coordinator/src/lib.rs:550:10
    at binaries/coordinator/src/lib.rs:468

Open a new terminal and kill dora-daemon to simulate the daemon process hanging abnormally:

(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli list
Running dataflows:
- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
(py310) jarvis@jia:~/coding/dora_home/dora$ ps -ef | grep dora
jarvis     22117       1  0 11:41 pts/12   00:00:00 dora-coordinator
jarvis     22131       1  0 11:41 pts/12   00:00:01 dora-daemon
jarvis     24461   18206  0 11:53 pts/12   00:00:00 dora-cli start dataflow.yml --attach --hot-reload
jarvis     24464   22131  7 11:53 pts/12   00:00:01 python3 -c import dora; dora.start_runtime() # webcam
jarvis     24467   22131  8 11:53 pts/12   00:00:01 python3 -c import dora; dora.start_runtime() # plot
jarvis     24598   22333  0 11:53 pts/3    00:00:00 grep --color=auto dora
(py310) jarvis@jia:~/coding/dora_home/dora$ kill -15 22131
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli list
Running dataflows:
- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli stop 
> Choose dataflow to stop: [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
no daemon connection
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli up
started dora daemon
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli check
Dora Coordinator: ok
Dora Daemon: ok

(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli list
Running dataflows:
- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli stop
> Choose dataflow to stop: [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
failed to stop dataflow
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli list
Running dataflows:
- [nappy-back] 10af7c98-604d-4808-b48a-7e028cb3d733
(py310) jarvis@jia:~/coding/dora_home/dora$ dora-cli -V
dora-cli 0.2.3-rc6
(py310) jarvis@jia:~/coding/dora_home/dora$ 

To Reproduce
Steps to reproduce the behavior:

  1. Start the dora coordinator and daemon: dora-cli up
  2. Start a new dataflow: dora-cli start examples/python-operator-dataflow/dataflow.yml --attach --hot-reload
  3. Kill dora-daemon: kill -15 pid_dora_daemon
  4. Start the daemon again: dora-cli up
  5. Stop the dataflow: dora-cli stop uuid_your_dataflow
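
The steps above can be sketched as a small script. This is only an illustration under assumptions not confirmed in this thread: that `dora-cli start` without `--attach` prints the dataflow UUID on its first line and then returns, and that the daemon process is named `dora-daemon` so `pgrep` can find it. The names `run`, `reproduce_stop_failure`, and `DRY_RUN` are hypothetical helpers; with `DRY_RUN=1` (the default) the commands are only printed.

```shell
#!/bin/sh
# Hypothetical reproduction helper. DRY_RUN=1 (default) only prints commands.
DRY_RUN="${DRY_RUN:-1}"

# Print a command line, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# $1: path to the dataflow YAML file.
reproduce_stop_failure() {
    dataflow_yml="$1"

    # 1. Start the coordinator and the daemon.
    run dora-cli up

    # 2. Start a dataflow and remember its UUID.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ dora-cli start $dataflow_yml"
        uuid="<uuid_your_dataflow>"
        daemon_pid="<pid_dora_daemon>"
    else
        # Assumption: the first output line is the UUID.
        uuid="$(dora-cli start "$dataflow_yml" | head -n 1)"
        # Assumption: the newest process named dora-daemon is the one to kill.
        daemon_pid="$(pgrep -n -x dora-daemon)"
    fi

    # 3. Kill the daemon with SIGTERM to simulate it going away.
    run kill -15 "$daemon_pid"

    # 4. Start a new daemon; the old dataflow is still listed.
    run dora-cli up
    run dora-cli list

    # 5. Try to stop the dataflow; this is the step reported to fail.
    run dora-cli stop "$uuid"
}
```

Calling `reproduce_stop_failure examples/python-operator-dataflow/dataflow.yml` prints the command sequence; setting `DRY_RUN=0` first executes it against a local dora installation.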

Expected behavior
I expect dora-coordinator and dora-daemon to live and die together and to restart automatically when the heartbeat times out. Alternatively, when dora-daemon hangs, the dataflow should be destroyed as well.
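
As a rough illustration of the "restart automatically" part, an external watchdog could look like the sketch below. It is a user-side workaround, not a dora feature. `ensure_running` is a hypothetical helper, and the commented usage assumes that `dora-cli check` exits with a non-zero status when the daemon is unreachable, which this thread does not confirm. As the transcript above shows, restarting the daemon by itself did not make the stale dataflow stoppable.

```shell
# Run a health check; if it fails, run a restart command.
# $1: health-check command line, $2: restart command line (both run via sh -c).
# Returns 0 if the check passed, 1 if a restart was attempted.
ensure_running() {
    if sh -c "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "health check failed, running: $2" >&2
    sh -c "$2"
    return 1
}

# Hypothetical usage (assumes `dora-cli check` signals failure through its
# exit status):
#
#   while true; do
#       ensure_running "dora-cli check" "dora-cli up"
#       sleep 5
#   done
```

Passing the check and restart commands as arguments keeps the helper independent of any particular dora version or CLI name.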


haixuanTao commented 1 year ago

Can I ask why you are killing the daemon?

We do not support auto-restarting the daemon at the moment.

meua commented 1 year ago

> Can I ask why you are killing the daemon?
>
> We do not support auto-restarting the daemon at the moment.

Because custom nodes and operators can, for various reasons, cause dora-daemon to hang through no fault of its own. I kill the dora-daemon process to simulate this situation.

haixuanTao commented 1 year ago

Do you have any ideas or context you can share about why dora-daemon hangs?

meua commented 1 year ago

> Do you have any ideas or context you can share about why dora-daemon hangs?

I am not running in source debug mode. After dora up, I run RUST_LOG=true dora start graphs/tutorials/webcam.yaml --attach --hot-reload --name webcam, and the dataflow cannot be stopped:

(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
Running dataflows:
- [webcam] 2eeba0b6-4cfa-438a-bc7f-0747664e06f3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora stop
> Choose dataflow to stop: [webcam] 2eeba0b6-4cfa-438a-bc7f-0747664e06f3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
Running dataflows:
- [webcam] 2eeba0b6-4cfa-438a-bc7f-0747664e06f3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora -V
dora-cli 0.2.3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora logs 2eeba0b6-4cfa-438a-bc7f-0747664e06f3 webcam
>     │ Logs from webcam.
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ could not get webcam.
   2 │ could not get webcam.
   3 │ could not get webcam.
   4 │ could not get webcam.
   5 │ could not get webcam.
   6 │ could not get webcam.
   7 │ could not get webcam.
   8 │ could not get webcam.
   9 │ could not get webcam.
  10 │ could not get webcam.
  11 │ could not get webcam.
  12 │ could not get webcam.
  13 │ could not get webcam.
  14 │ could not get webcam.

haixuanTao commented 1 month ago

This should have been fixed with the grace duration.
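
The thread does not describe how dora implements its grace duration. As a general illustration of the idea only, a stop with a grace period usually means: ask the process to exit, wait a bounded time, then force it. The helper `stop_with_grace` below is hypothetical and is not dora's code.

```shell
# Ask a process to exit, wait up to a grace period, then force-kill it.
# $1: PID, $2: grace period in whole seconds.
# Returns 0 if the process was gone by the end of the grace period,
# 1 if SIGKILL had to be sent.
# Note: kill -0 still succeeds for an exited child that the calling shell
# has not reaped yet, so the return value is most reliable for processes
# that are not children of the calling shell.
stop_with_grace() {
    swg_pid="$1"
    swg_left="$2"

    # Polite request first; if the PID does not exist, there is nothing to do.
    kill -TERM "$swg_pid" 2>/dev/null || return 0

    # Poll once per second until the grace period is used up.
    while [ "$swg_left" -gt 0 ]; do
        kill -0 "$swg_pid" 2>/dev/null || return 0
        sleep 1
        swg_left=$((swg_left - 1))
    done

    # Final check, then force termination.
    kill -0 "$swg_pid" 2>/dev/null || return 0
    kill -KILL "$swg_pid" 2>/dev/null
    return 1
}
```

For example, `stop_with_grace 24464 5` would give the process with PID 24464 up to five seconds to exit after SIGTERM before sending SIGKILL.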