Closed meua closed 3 weeks ago
Could you give us more details on how to reproduce this issue?
I think I can reproduce the issue with:
dora up
# started dora coordinator
# started dora daemon
dora destroy
# Send destroy command to dora-coordinator
dora up # <--- This hangs
I think it is due to the coordinator waiting for something, which makes it unable to respond to other requests.
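The suspected failure mode can be illustrated with a small asyncio sketch. This is not dora's coordinator code: `handle_destroy`, `event_loop`, and the never-resolved confirmation future are hypothetical names made up for illustration. It shows a loop that handles one request at a time and awaits a confirmation that never arrives, so the following request is never served, and how bounding the wait with a timeout keeps the loop responsive.

```python
import asyncio
from typing import List, Optional


async def handle_destroy(confirmation: asyncio.Future, timeout: Optional[float]) -> str:
    """Wait for a (hypothetical) daemon confirmation; timeout=None waits forever."""
    try:
        await asyncio.wait_for(confirmation, timeout)
        return "destroyed"
    except asyncio.TimeoutError:
        return "destroy timed out"


async def event_loop(requests: asyncio.Queue, timeout: Optional[float]) -> List[str]:
    """Handle requests sequentially; a handler that never returns blocks all later ones."""
    replies = []
    while True:
        request = await requests.get()
        if request == "shutdown":
            return replies
        if request == "destroy":
            # Simulate a confirmation that is never sent.
            never_confirmed = asyncio.get_running_loop().create_future()
            replies.append(await handle_destroy(never_confirmed, timeout))
        else:
            replies.append(f"ok: {request}")


async def demo(timeout: Optional[float]) -> List[str]:
    requests: asyncio.Queue = asyncio.Queue()
    for request in ("destroy", "up", "shutdown"):
        requests.put_nowait(request)
    # Guard the whole loop so the demo itself cannot hang.
    try:
        return await asyncio.wait_for(event_loop(requests, timeout), 0.5)
    except asyncio.TimeoutError:
        return ["event loop hung"]
```

Calling `asyncio.run(demo(None))` models the hang described above, while `asyncio.run(demo(0.05))` lets the later `up` request through after the destroy wait gives up.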
Hmm, I tried it multiple times but I couldn't reproduce the issue on the main
branch. Which dora version are you using @haixuanTao ?
Yes, I will investigate on my end since you cannot reproduce it. I used the main branch.
Thanks!
@phil-opp
I reproduced the problem again; the steps are as shown above.
The precondition is that dora start raises an exception, as shown below:
After this, dora-cli stops responding.
I think it's probably linked to the yolov5 operator not being able to access GitHub from behind the GFW, which leaves the initialisation function stuck. But it's going to be very hard for Philipp to reproduce.
Having retested this issue, this is the stack trace:
(base) ~/D/C/dora ❯❯❯ RUST_LOG=trace dora destroy (base) fix-coordinator-loop ✭
2023-04-25T08:21:11.484181Z TRACE dora_coordinator::control: Control connection closed
at binaries/coordinator/src/control.rs:90
2023-04-25T08:21:11.484197Z TRACE dora_coordinator: Handling event Control(IncomingRequest { request: Destroy, reply_sender: Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) } })
at binaries/coordinator/src/lib.rs:142
2023-04-25T08:21:11.484227Z INFO dora_coordinator: Received destroy command
at binaries/coordinator/src/lib.rs:403
2023-04-25T08:21:11.484359Z INFO dora_daemon: received destroy command -> exiting
at binaries/daemon/src/lib.rs:331
in dora_daemon::run_inner with self.machine_id:
Send destroy command to dora-coordinator
2023-04-25T08:21:11.484604Z TRACE dora_coordinator::control: Control connection closed
at binaries/coordinator/src/control.rs:90
It seems to be due to this TRACE: Control connection closed, which happens because there is an ErrorKind::UnexpectedEof.
However, looking at the running processes, the dora daemon has exited.
This is probably linked to an error when the dora daemon sends the coordinator its confirmation that it was destroyed successfully.
It seems to be due to this TRACE: Control connection closed, which happens because there is an ErrorKind::UnexpectedEof
This is expected, as the CLI closes its control connection to the coordinator when it exits.
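A minimal sketch of why this is expected, using a local socket pair rather than dora's actual control protocol (the `destroy` payload and the helper names are made up for illustration). Once the client side closes, the server's next read reaches end-of-file, which roughly corresponds to the UnexpectedEof seen in the trace.

```python
import socket
from typing import List


def read_request(conn: socket.socket) -> str:
    """Return the next request, or 'connection closed' if the peer hung up."""
    data = conn.recv(1024)
    if not data:
        # An empty read means the peer closed its end (end-of-file).
        return "connection closed"
    return data.decode()


def demo() -> List[str]:
    server, client = socket.socketpair()
    with server:
        client.sendall(b"destroy")
        first = read_request(server)
        client.close()  # the client exits and drops its connection
        second = read_request(server)
    return [first, second]
```

The second read is not an error in the server; it only signals that the client has gone away after its request was handled.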
The premise is that dora start has an exception, as shown below:
This seems to be the real issue here. The python operator seems to require GLIBCXX_3.4.29 (required by matplotlib), but it is not found. This error brings down the whole runtime node. I'm not sure why the daemon does not detect this error, but my guess is that it is stuck waiting for the node to finish initialization (for the synchronized start introduced in #236).
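The guess above can be sketched as follows. This is not the daemon's implementation: it assumes, purely for illustration, that a node is a child process that prints a line `ready` once initialization is done. The point is that the waiter treats "the child exited" as an outcome of its own instead of waiting indefinitely for a readiness signal that can no longer arrive.

```python
import subprocess
import sys


def wait_for_ready(child: subprocess.Popen) -> str:
    """Wait until the child reports readiness or exits without doing so."""
    for line in child.stdout:
        if line.strip() == "ready":
            return "ready"
    # stdout reached end-of-file: the child exited before reporting readiness.
    return f"exited during init with code {child.wait()}"


def demo(script: str) -> str:
    """Run a small Python script as a stand-in for a node and wait for it."""
    child = subprocess.Popen(
        [sys.executable, "-c", script],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    try:
        return wait_for_ready(child)
    finally:
        child.stdout.close()
        child.wait()
```

A stand-in node that fails on import, for example `demo('raise ImportError("GLIBCXX_3.4.29 not found")')`, is reported as having exited during initialization rather than leaving the waiter blocked.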
So I think there are two things that we need to look into:
I opened #271 to track this issue: Why doesn't the dora daemon detect the operator/node initialization error?
Does this issue still happen on the latest version (i.e. with #271 merged)?
The situation described above has not happened again, but there is still a case where dora stop cannot stop a dataflow. It occurs when an exception is raised inside an operator that the dataflow depends on, as shown below:
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
Running dataflows:
- [YOLOv8] 4aba7bb7-7966-4839-921d-72c575f7ea33
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora stop
> Choose dataflow to stop: [YOLOv8] 4aba7bb7-7966-4839-921d-72c575f7ea33
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
Running dataflows:
- [YOLOv8] 4aba7bb7-7966-4839-921d-72c575f7ea33
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora -V
dora-cli 0.2.3
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$
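To check programmatically whether a stop request took effect, the output of `dora list` before and after can be compared. The parser below is a sketch that assumes the output format shown in this session (`- [name] uuid`); that format is not taken from any documentation, and the function names are hypothetical.

```python
import re
from typing import Dict

# One dataflow per line, e.g. "- [YOLOv8] 4aba7bb7-7966-4839-921d-72c575f7ea33"
DATAFLOW_LINE = re.compile(r"^-\s+\[(?P<name>[^\]]*)\]\s+(?P<uuid>[0-9a-f-]{36})$")


def parse_dora_list(output: str) -> Dict[str, str]:
    """Map dataflow UUID to name for every dataflow line in the output."""
    dataflows = {}
    for line in output.splitlines():
        match = DATAFLOW_LINE.match(line.strip())
        if match:
            dataflows[match["uuid"]] = match["name"]
    return dataflows


def still_running(uuid: str, list_output_after_stop: str) -> bool:
    """True if the dataflow is still listed after a stop was requested."""
    return uuid in parse_dora_list(list_output_after_stop)
```

Applied to the session above, `still_running` returns `True` for `4aba7bb7-7966-4839-921d-72c575f7ea33`, which is the symptom being reported.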
vi webcam_yolov8.yaml
nodes:
  - id: webcam
    operator:
      python: ../../operators/webcam_op.py
      inputs:
        tick: dora/timer/millis/100
      outputs:
        - image
    env:
      DEVICE_INDEX: 2
  - id: yolov8
    operator:
      outputs:
        - bbox
      inputs:
        image: webcam/image
      python: ../../operators/yolov8_op.py
    env:
      PYTORCH_DEVICE: "cuda"
      # YOLOV8_PATH: $DORA_DEP_HOME/dependencies/YOLOv8/
      # YOLOV8_WEIGHT_PATH: $DORA_DEP_HOME/dependencies/YOLOv8/weights/yolov8n.pt
  - id: plot
    operator:
      python: ../../operators/plot.py
      inputs:
        image: webcam/image
        obstacles_bbox: yolov8/bbox
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ RUST_LOG=true dora start graphs/tutorials/webcam_yolov8.yaml --attach --hot-reload --name YOLOv8
4aba7bb7-7966-4839-921d-72c575f7ea33
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$
dora logs 4aba7bb7-7966-4839-921d-72c575f7ea33 yolov8
...
─────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│ Logs from yolov8.
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Ultralytics YOLOv8.0.122 🚀 Python-3.7.16 torch-1.11.0 CUDA:0 (NVIDIA GeForce RTX 3080 Ti, 12037MiB)
2 │ YOLOv8n summary (fused): 168 layers, 3151904 parameters, 0 gradients, 8.7 GFLOPs
3 │ ^Mval: Scanning /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/labels/val2017.cache... 0 images, 0 backgrounds, 5000 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]^Mva
l: Scanning /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/labels/val2017.cache... 0 images, 0 backgrounds, 5000 corrupt: 100%|██████████| 5000/5000 [00:00<?, ?it/s]
4 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000139.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000139.jpg'
5 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000285.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000285.jpg'
6 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000632.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000632.jpg'
7 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000724.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000724.jpg'
8 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000776.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000776.jpg'
9 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000785.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000785.jpg'
10 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000802.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000802.jpg'
11 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000872.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000872.jpg'
12 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000885.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000000885.jpg'
13 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001000.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001000.jpg'
14 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001268.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001268.jpg'
15 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001296.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001296.jpg'
16 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001353.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001353.jpg'
17 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001425.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001425.jpg'
18 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001490.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001490.jpg'
19 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001503.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001503.jpg'
20 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001532.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001532.jpg'
21 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001584.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001584.jpg'
22 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001675.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001675.jpg'
23 │ val: WARNING ⚠️ /home/jarvis/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001761.jpg: ignoring corrupt image/label: [Errno 2] No such file or directory: '/home/jarv
is/coding/pyhome/mikel-brostrom/yolo_tracking/datasets/coco/images/val2017/000000001761.jpg'
...
Does this issue still happen on the latest version (i.e. with #271 merged)?
The problem still exists in v0.2.3: dora-cli is unresponsive.
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora destroy
2023-06-27T09:32:54.865121Z WARN dora_daemon::node_communication: failed to send event to daemon
Location:
/home/runner/work/dora/dora/binaries/daemon/src/node_communication/mod.rs:490:26
at binaries/daemon/src/node_communication/mod.rs:253
2023-06-27T09:32:54.865152Z WARN dora_daemon::node_communication: failed to receive reply from daemon
Location:
/home/runner/work/dora/dora/binaries/daemon/src/node_communication/mod.rs:494:30
at binaries/daemon/src/node_communication/mod.rs:253
Send destroy command to dora-coordinator
(dora3.7) jarvis@jia:~/coding/pyhome/github.com/dora-rs/dora-drives$ dora list
To Reproduce Steps to reproduce the behavior:
dora up
dora start graphs/tutorials/webcam_yolov5.yaml --attach
Screenshots or Video
Environments (please complete the following information):
ubuntu 20.04 LTS
v0.2.3
You need to kill the coordinator and restart it to return to normal.
Describe the bug: The dora command occasionally becomes unresponsive and gets stuck.
To Reproduce Steps to reproduce the behavior:
dora up
dora start dataflow.yaml
dora stop
dora destroy
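When running reproduction steps like these from a script, a per-step timeout makes a hang visible instead of blocking the script. The helper below is a generic sketch; `run_step` and the timeout values are illustrative choices, not part of dora.

```python
import subprocess
from typing import List


def run_step(command: List[str], timeout_s: float) -> str:
    """Run one command and report 'ok', 'failed (<code>)', or 'hung'."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return "hung"
    return "ok" if result.returncode == 0 else f"failed ({result.returncode})"
```

For example, `run_step(["dora", "up"], 30)` followed by `run_step(["dora", "destroy"], 30)` would report `hung` for whichever step stops responding. Note that `dora stop` prompts for a dataflow in the session shown earlier, so it may need input to be driven this way.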