dora-rs / dora

DORA (Dataflow-Oriented Robotic Architecture) is middleware designed to streamline and simplify the creation of AI-based robotic applications. It offers low latency, composable, and distributed dataflow capabilities. Applications are modeled as directed graphs, also referred to as pipelines.
https://dora-rs.ai
Apache License 2.0
1.54k stars, 89 forks

Accidentally dora command is unresponsive and stuck #253

Closed meua closed 3 weeks ago

meua commented 1 year ago

Describe the bug: The dora command occasionally becomes unresponsive and gets stuck.

To Reproduce Steps to reproduce the behavior:

  1. In a few cases, the hang occurs when executing the commands dora up, dora start dataflow.yaml, dora stop, and dora destroy.

Environments (please complete the following information):

phil-opp commented 1 year ago

Could you give us more details on how to reproduce this issue?

haixuanTao commented 1 year ago

I think I can reproduce the issue with:

dora up
# started dora coordinator
# started dora daemon

dora destroy
# Send destroy command to dora-coordinator

dora up # <--- This hangs

I think it is due to the coordinator waiting for something, which makes it unable to respond to other requests.
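The suspected failure mode (one event loop that blocks on a pending reply and so stops serving everything else) can be illustrated with a minimal sketch. This is not dora's coordinator code: the `coordinator`, `send`, and `demo` functions, the request names, and the timeouts are all hypothetical, and the sketch only assumes that requests are handled one at a time.

```python
import asyncio


async def coordinator(requests: asyncio.Queue, daemon_ack: asyncio.Event) -> None:
    """Hypothetical single-task coordinator: handles one request at a time."""
    while True:
        name, reply = await requests.get()
        if name == "destroy":
            # Blocks the whole loop until the daemon confirms. If the
            # confirmation never arrives, no later request is ever handled.
            await daemon_ack.wait()
        reply.set_result(f"{name}: ok")


async def send(requests: asyncio.Queue, name: str, timeout: float) -> str:
    """Hypothetical CLI call: enqueue a request and wait for its reply."""
    reply = asyncio.get_running_loop().create_future()
    await requests.put((name, reply))
    try:
        return await asyncio.wait_for(reply, timeout)
    except asyncio.TimeoutError:
        return f"{name}: no reply (hung)"


async def demo() -> list:
    requests = asyncio.Queue()
    daemon_ack = asyncio.Event()  # never set: simulates a lost confirmation
    task = asyncio.create_task(coordinator(requests, daemon_ack))
    results = [
        await send(requests, "up", 0.2),
        await send(requests, "destroy", 0.2),
        await send(requests, "up", 0.2),
    ]
    task.cancel()
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the first `up` is answered, `destroy` never gets a reply, and the second `up` also times out because the loop is still parked on the missing confirmation. Handling each request in its own task, or bounding the wait with a timeout, would keep the loop responsive.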

phil-opp commented 1 year ago

Hmm, I tried it multiple times but I couldn't reproduce the issue on the main branch. Which dora version are you using @haixuanTao ?

haixuanTao commented 1 year ago

Yep, I will investigate on my end if you cannot reproduce it. I used the main branch.

phil-opp commented 1 year ago

Thanks!

meua commented 1 year ago

(image) I reproduced the problem again; the steps are as shown above.

The premise is that dora start has an exception, as shown below: image

meua commented 1 year ago

@phil-opp

meua commented 1 year ago

(image) I reproduced the problem again; the steps are as shown above.

The premise is that dora start has an exception, as shown below: image

After this, dora-cli no longer responds.

haixuanTao commented 1 year ago

I think it's probably linked to the yolov5 operator not being able to access GitHub from behind the GFW, which leaves the initialisation function stuck. But it's going to be very hard for Philipp to reproduce.

haixuanTao commented 1 year ago

I retested this issue; this is the trace output:

(base) ~/D/C/dora ❯❯❯ RUST_LOG=trace dora destroy                                           (base) fix-coordinator-loop ✭
  2023-04-25T08:21:11.484181Z TRACE dora_coordinator::control: Control connection closed
    at binaries/coordinator/src/control.rs:90

  2023-04-25T08:21:11.484197Z TRACE dora_coordinator: Handling event Control(IncomingRequest { request: Destroy, reply_sender: Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) } })
    at binaries/coordinator/src/lib.rs:142

  2023-04-25T08:21:11.484227Z  INFO dora_coordinator: Received destroy command
    at binaries/coordinator/src/lib.rs:403

  2023-04-25T08:21:11.484359Z  INFO dora_daemon: received destroy command -> exiting
    at binaries/daemon/src/lib.rs:331
    in dora_daemon::run_inner with self.machine_id: 

Send destroy command to dora-coordinator
  2023-04-25T08:21:11.484604Z TRACE dora_coordinator::control: Control connection closed
    at binaries/coordinator/src/control.rs:90

It seems to be due to this TRACE: Control connection closed which happens because there is an ErrorKind::UnexpectedEof

But looking at the running processes, the dora daemon has exited.

This is probably linked to an error when the dora daemon sends the coordinator its confirmation that it was successfully destroyed.

phil-opp commented 1 year ago

It seems to be due to this TRACE: Control connection closed which happens because there is an ErrorKind::UnexpectedEof

This is expected, as the CLI closes its control connection to the coordinator when it exits.
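Why a closed connection shows up as an end-of-file error on the reading side can be shown with a small sketch. It uses a plain socket pair, not dora's control protocol: the `read_exact` helper, the 8-byte `DESTROY!` message, and the variable names are hypothetical. In Rust, `std::io::Read::read_exact` reports the same condition as `ErrorKind::UnexpectedEof`; whether the coordinator reads its control connection that way is an assumption here.

```python
import socket


def read_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, or raise EOFError if the peer closes first."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:  # recv returns b"" once the peer has closed its end
            raise EOFError("connection closed before a full message arrived")
        buf += chunk
    return buf


def demo() -> str:
    cli, coordinator = socket.socketpair()
    cli.sendall(b"DESTROY!")  # hypothetical 8-byte request
    cli.close()  # the CLI exits after sending its request
    with coordinator:
        request = read_exact(coordinator, 8)
        try:
            read_exact(coordinator, 8)  # waiting for the next request
        except EOFError:
            return f"handled {request!r}, then control connection closed"
    return "unexpected: peer still connected"


if __name__ == "__main__":
    print(demo())
```

The request itself is read in full; the error only appears when the reader waits for a further message that will never come, which is why the trace line by itself does not indicate a fault.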

phil-opp commented 1 year ago

The premise is that dora start has an exception, as shown below: image

This seems to be the real issue here. The python operator seems to require GLIBCXX_3.4.29 (required by matplotlib), but it is not found. This error brings down the whole runtime node. I'm not sure why the daemon does not detect this error, but my guess is that it is stuck waiting for the node to finish initialization (for the synchronized start introduced in #236).

So I think there are two things that we need to look into:

haixuanTao commented 1 year ago

I opened #271 to track this issue: Why doesn't the dora daemon detect the operator/node initialization error?
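The guess above (a daemon that waits for every node to report readiness and never notices that one of them died during initialization) can be sketched as follows. This is an illustration under assumptions, not dora's implementation and not necessarily what #271 changed: `node`, `wait_until_ready`, the `watch_exit` flag, the error text, and the timeouts are all hypothetical.

```python
import asyncio


async def node(ready: asyncio.Event, fail: bool) -> None:
    """Hypothetical node: signals readiness unless initialization fails."""
    await asyncio.sleep(0.01)
    if fail:
        raise RuntimeError("missing shared library")
    ready.set()


async def wait_until_ready(name: str, fail: bool, watch_exit: bool) -> str:
    """Hypothetical daemon-side wait for one node before a synchronized start."""
    ready = asyncio.Event()
    node_task = asyncio.create_task(node(ready, fail))
    waiters = {asyncio.create_task(ready.wait())}
    if watch_exit:
        # Also wake up when the node itself terminates.
        waiters.add(node_task)
    # The timeout only keeps the demo finite; a real wait could block forever.
    _, pending = await asyncio.wait(
        waiters, timeout=0.2, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    if not ready.is_set() and node_task.done():
        error = node_task.exception()  # also marks the failure as observed
        if watch_exit:
            return f"{name}: failed ({error})"
    if ready.is_set():
        return f"{name}: ready"
    return f"{name}: still waiting (hang)"


async def demo() -> list:
    return [
        await wait_until_ready("webcam", fail=False, watch_exit=False),
        await wait_until_ready("yolov5", fail=True, watch_exit=False),
        await wait_until_ready("yolov5", fail=True, watch_exit=True),
    ]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Waiting on the readiness signal alone leaves the failed node indistinguishable from a slow one. Waiting on "ready or exited", whichever comes first, turns the hang into a reportable error.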

phil-opp commented 1 year ago

Does this issue still happen on the latest version (i.e. with #271 merged)?

meua commented 1 year ago

Does this issue still happen on the latest version (i.e. with #271 merged)?

The situation described above has not happened again, but there is still a case where dora stop cannot stop a dataflow. This happens when an exception occurs inside an operator that the dataflow depends on, as shown below:

(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
Running dataflows:
- [YOLOv8] 4aba7bb7-7966-4839-921d-72c575f7ea33
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora stop 
> Choose dataflow to stop: [YOLOv8] 4aba7bb7-7966-4839-921d-72c575f7ea33
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
Running dataflows:
- [YOLOv8] 4aba7bb7-7966-4839-921d-72c575f7ea33
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora -V
dora-cli 0.2.3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ 

vi webcam_yolov8.yaml

nodes:
  - id: webcam
    operator:
      python: ../../operators/webcam_op.py
      inputs:
        tick: dora/timer/millis/100
      outputs:
        - image
    env:
      DEVICE_INDEX: 2

  - id: yolov8
    operator: 
      outputs:
        - bbox
      inputs:
        image: webcam/image
      python: ../../operators/yolov8_op.py
    env:
      PYTORCH_DEVICE: "cuda"
#      YOLOV8_PATH: $DORA_DEP_HOME/dependencies/YOLOv8/
#      YOLOV8_WEIGHT_PATH: $DORA_DEP_HOME/dependencies/YOLOv8/weights/yolov8n.pt

  - id: plot
    operator:
      python: ../../operators/plot.py
      inputs:
        image: webcam/image
        obstacles_bbox: yolov8/bbox

(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ RUST_LOG=true dora start graphs/tutorials/webcam_yolov8.yaml --attach --hot-reload --name YOLOv8
4aba7bb7-7966-4839-921d-72c575f7ea33
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ 

dora logs 4aba7bb7-7966-4839-921d-72c575f7ea33 yolov8

...
─────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     │ Logs from yolov8.
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Ultralytics YOLOv8.0.122 🚀 Python-3.7.16 torch-1.11.0 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12037MiB)
   2 │ YOLOv8n summary (fused): 168 layers, 3151904 parameters, 0 gradients, 8.7 GFLOPs
   3 │ ^Mval: Scanning /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/labels/val2017.cache... 0 images, 0 backgrounds, 5000 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]^Mva
l: Scanning /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/labels/val2017.cache... 0 images, 0 backgrounds, 5000 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]
   4 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000139.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000139.jpg'
   5 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000285.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000285.jpg'
   6 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000632.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000632.jpg'
   7 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000724.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000724.jpg'
   8 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000776.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000776.jpg'
   9 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000785.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000785.jpg'
  10 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000802.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000802.jpg'
  11 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000872.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000872.jpg'
  12 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000885.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000885.jpg'
  13 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001000.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001000.jpg'
  14 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001268.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001268.jpg'
  15 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001296.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001296.jpg'
  16 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001353.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001353.jpg'
  17 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001425.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001425.jpg'
  18 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001490.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001490.jpg'
  19 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001503.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001503.jpg'
  20 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001532.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001532.jpg'
  21 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001584.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001584.jpg'
  22 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001675.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001675.jpg'
  23 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001761.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001761.jpg'
...

meua commented 1 year ago

Does this issue still happen on the latest version (i.e. with #271 merged)?

The problem still exists in v0.2.3: dora-cli is unresponsive.

(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora destroy
  2023-06-27T09:32:54.865121Z  WARN dora_daemon::node_communication: failed to send event to daemon

Location:
    /home/runner/work/dora/dora/binaries/daemon/src/node_communication/mod.rs:490:26
    at binaries/daemon/src/node_communication/mod.rs:253

  2023-06-27T09:32:54.865152Z  WARN dora_daemon::node_communication: failed to receive reply from daemon

Location:
    /home/runner/work/dora/dora/binaries/daemon/src/node_communication/mod.rs:494:30
    at binaries/daemon/src/node_communication/mod.rs:253

Send destroy command to dora-coordinator
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list

To Reproduce Steps to reproduce the behavior:

  1. Start the dora daemon: dora up
  2. Start a new dataflow: dora start graphs/tutorials/webcam_yolov5.yaml --attach

Screenshots or Video: image, WeCom screenshot (企业微信截图_16878598167164)

Environments (please complete the following information):

You need to kill the coordinator and restart it to return to normal. image