blakeblackshear / frigate

NVR with realtime local object detection for IP cameras
https://frigate.video
MIT License
18.94k stars 1.73k forks source link

Process shutdown deadlocks (possibly due to TPU error) #1762

Closed transplier closed 3 years ago

transplier commented 3 years ago

Describe the bug I'm affected by https://github.com/blakeblackshear/frigate/discussions/1194 - the first time frigate starts on my machine after a cold boot, the USB TPU isn't found. In this situation frigate tries to shut itself down, but it ends up hanging forever. I need to manually docker restart frigate to fix it. There appear to be several zombie processes:

user@nvr:~$ docker exec -it frigate /bin/bash
root@3a2bef06ed45:/opt/frigate# ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.1  6.2 1360364 114444 ?      Ssl  12:56   0:06 python3 -u -m frigate
root          19  0.0  0.1  16412  2416 ?        Ss   12:56   0:00 nginx: master process /usr/sbin/nginx
nobody        20  0.2  0.3  17584  6224 ?        S    12:56   0:12 nginx: worker process
root          33  0.0  0.5  14508  9460 ?        S    12:56   0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(35)
root          37  0.0  0.0      0     0 ?        Z    12:56   0:00 [frigate.process] <defunct>
root          38  0.0  3.7 933596 68572 ?        Sl   12:56   0:00 frigate.process:cave
root          41  0.0  0.0      0     0 ?        Z    12:56   0:00 [frigate.process] <defunct>
root          43  0.0  0.0      0     0 ?        Z    12:56   0:00 [frigate.process] <defunct>
root          48  0.0  0.0      0     0 ?        Z    12:56   0:01 [python3] <defunct>
root          56  0.0  0.0      0     0 ?        Z    12:56   0:01 [python3] <defunct>
root         318  0.3  0.1   4248  3456 pts/0    Ss   14:37   0:00 /bin/bash
root         326  0.0  0.1   5904  2924 pts/0    R+   14:37   0:00 ps aux

It appears process 38 (frigate.process:cave) is the culprit. It's in interruptible sleep but doesn't react to SIGTERM. I have to SIGKILL it, whereupon it terminates and the docker container restarts properly.

Version of frigate 0.8.4-5043040

Config file Let me know if you need the config. It's fairly gigantic.

Frigate container logs

frigate.edgetpu                INFO    : No EdgeTPU detected.
Process detector:coral:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tflite_runtime/interpreter.py", line 152, in load_delegate
    delegate = Delegate(library, options)
  File "/usr/local/lib/python3.8/dist-packages/tflite_runtime/interpreter.py", line 111, in __init__
    raise ValueError(capture.message)
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/frigate/frigate/edgetpu.py", line 124, in run_detector
    object_detector = LocalObjectDetector(tf_device=tf_device, num_threads=num_threads)
  File "/opt/frigate/frigate/edgetpu.py", line 63, in __init__
    edge_tpu_delegate = load_delegate('libedgetpu.so.1.0', device_config)
  File "/usr/local/lib/python3.8/dist-packages/tflite_runtime/interpreter.py", line 154, in load_delegate
    raise ValueError('Failed to load delegate from {}\n{}'.format(
ValueError: Failed to load delegate from libedgetpu.so.1.0

frigate.watchdog               INFO    : Detection appears to have stopped. Exiting frigate...
frigate.app                    INFO    : Stopping...
frigate.object_processing      INFO    : Exiting object processor...
frigate.events                 INFO    : Exiting event processor...
frigate.record                 INFO    : Exiting recording maintenance...
frigate.events                 INFO    : Exiting event cleanup...
frigate.watchdog               INFO    : Exiting watchdog...
frigate.stats                  INFO    : Exiting watchdog...
peewee.sqliteq                 INFO    : writer received shutdown request, exiting.
root                           INFO    : Waiting for detection process to exit gracefully...
frigate.video                  INFO    : garage: exiting subprocess

Frigate stats

502 Bad Gateway

FFprobe from your camera

Run the following command and paste output below

ffprobe <stream_url>

Computer Hardware

Camera Info:

Additional context Add any other context about the problem here.

ozett commented 3 years ago

Config file Let me know if you need the config. It's fairly gigantic.

Maybe the issue-template can contain a dropdown for gigiantic configs? Heres the markup-code sample:

<details><summary>Click here for Config-Details</summary>
\```
CONFIG information comes here
in full Detail
and gigantic rows
\```
</details>

Try it:

Click here for Config-Details ``` CONFIG information comes here in full Detail and gigantic rows ```
ozett commented 3 years ago

i have similar shutdowns of the frigate-container when my second m2.coral.tpu has some unknown error. with not restart of frigate. it does not overcome the faulty second coral.tpu. i see similar messages ValueError: Failed to load delegate from libedgetpu.so.1.0 but have not yet sorted them out. Restarting the whole ubunut-maschine helps. maybe i find a better way. would like to see frigate handling coral.tpu errors with stable restart or the core-framework or fallback to CPU-detection

blakeblackshear commented 3 years ago

It should exit the container and restart. For most users falling back to CPU would saturate all resources and break everything else running on the server.

ozett commented 3 years ago

It should exit the container and restart.

on my occasion it shuts down and does not restart. as i have more debug-information i will come back how to overcame that exit without restart

For most users falling back to CPU would saturate all resources and break everything else running on the server.

in case of error and exiting my second coral.tpu stays not available and frigate restarts, and shuts down immeditatly again. thats a race-condition that leads to misfunktion of whole frigate but could be overcome with a configurable failure-counter on that hardware-issue with a CPU-fallback. that would keep the system not going in shutdown-circle. at least in my case

transplier commented 3 years ago

For the moment I just have a script run docker exec frigate ps aux every few minutes looking for zombie processes. If it finds any it does a docker restart frigate. It hasn't happened since I filed the ticket (of course, sigh) but I'll try to get a stack trace of the stuck process next time.

transplier commented 3 years ago

Slightly different repro today. Something went wrong after a couple weeks uptime:

watchdog.doorbell              INFO    : No frames received from doorbell in 20 seconds. Exiting ffmpeg...
watchdog.doorbell              INFO    : Waiting for ffmpeg to exit gracefully...
watchdog.doorbell              INFO    : FFmpeg didnt exit. Force killing...
frigate.video                  INFO    : doorbell: ffmpeg sent a broken frame. memoryview assignment: lvalue and rvalue have different structures
frigate.video                  INFO    : doorbell: ffmpeg process is not running. exiting capture thread...
[mov,mp4,m4a,3gp,3g2,mj2 @ 0x55931e0d9140] moov atom not found
/tmp/cache/doorbell-20210927115124.mp4: Invalid data found when processing input
frigate.events                 INFO    : bad file: doorbell-20210927115124.mp4
ffmpeg.doorbell.detect         ERROR   : [h264 @ 0x55ae1dff1980] Failed setup for format vaapi_vld: hwaccel initialisation returned error.
ffmpeg.doorbell.detect         ERROR   : More than 1000 frames duplicated
watchdog.doorbell              INFO    : No frames received from doorbell in 20 seconds. Exiting ffmpeg...
watchdog.doorbell              INFO    : Waiting for ffmpeg to exit gracefully...
frigate.video                  INFO    : doorbell: ffmpeg sent a broken frame. memoryview assignment: lvalue and rvalue have different structures
frigate.video                  INFO    : doorbell: ffmpeg process is not running. exiting capture thread...
ffmpeg.doorbell.detect         ERROR   : rtsp://admin:7GNKRkVaN4Ajql10@192.168.77.50/cam/realmonitor?channel=1&subtype=1: Immediate exit requested

Frigate didn't auto restart, had to do it manually. No zombies this time though.

blakeblackshear commented 3 years ago

The latest RC builds use a much more robust way to clean up processes and exit the container. If you aren't on a recent build, try upgrading.

transplier commented 3 years ago

Will do!

ozett commented 3 years ago

i have two corals, running fine until update to rc-5. now one is not detected, but its there. any hints do debug before a hard-reset? made a soft-restart of the whole ubuntu, but that does not helped.

my docker-compose has this in it:

    devices:
      - /dev/bus/usb:/dev/bus/usb
      - /dev/dri/renderD128              # for intel hwaccel, needs to be updated for your hardware
      - /dev/dri:/dev/dri                # help hw decoding?
      - /dev/apex_0:/dev/apex_0                      # corel-board pci 1
      - /dev/apex_1:/dev/apex_1                      # corel-board pci 2
root@ub2004-fri5:/usr/src# lspci -nn | grep 089a
01:00.0 System peripheral [0880]: Global Unichip Corp. Coral Edge TPU [1ac1:089a]
03:00.0 System peripheral [0880]: Global Unichip Corp. Coral Edge TPU [1ac1:089a]

root@ub2004-fri5:/usr/src# ls -la /dev/ap*
crw-rw---- 1 root apex 120, 0 Sep 28 07:41 /dev/apex_0
crw-rw---- 1 root apex 120, 1 Sep 28 07:41 /dev/apex_1

outside the container on the host:

root@ub2004-fri5:/usr/src# apt show libedgetpu1-max
Package: libedgetpu1-max
Version: 16.0
Priority: optional
Section: misc
Source: libedgetpu
Maintainer: Coral <coral-support@google.com>
Installed-Size: 1210 kB
Provides: libedgetpu1 (= 16.0)
Depends: libc6, libgcc1, libstdc++6, libusb-1.0-0, debconf (>= 0.5) | debconf-2.0
Conflicts: libedgetpu1, libedgetpu1-legacy
Homepage: https://coral.ai/
Download-Size: 388 kB
APT-Sources: https://packages.cloud.google.com/apt coral-edgetpu-stable/main amd64 Packages
Description: Support library for Edge TPU
 Support library (max speed) for the Edge TPU

the log shows, that only one Coral is detected by frigate:


frigate    | [2021-09-28 08:39:59] frigate.app                    INFO    : Capture process started for 6-einfahrt-134: 296
frigate    | [2021-09-28 08:39:59] frigate.app                    INFO    : Capture process started for 7-hof-108: 303
frigate    | [2021-09-28 08:39:59] ws4py                          INFO    : Using epoll
frigate    | [2021-09-28 08:39:59] detector.coral-2               INFO    : Starting detection process: 219
frigate    | [2021-09-28 08:39:59] frigate.edgetpu                INFO    : Attempting to load TPU as pci:1
frigate    | [2021-09-28 08:40:12] frigate.edgetpu                ERROR   : No EdgeTPU was detected. If you do not have a Coral device yet, you must configure CPU detectors.
frigate    | Process detector:coral-2:
frigate    | Traceback (most recent call last):
frigate    |   File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 160, in load_delegate
frigate    |     delegate = Delegate(library, options)
frigate    |   File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 119, in __init__
frigate    |     raise ValueError(capture.message)
frigate    | ValueError
frigate    |
frigate    | During handling of the above exception, another exception occurred:
frigate    |
frigate    | Traceback (most recent call last):
frigate    |   File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
frigate    |     self.run()
frigate    |   File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
frigate    |     self._target(*self._args, **self._kwargs)
frigate    |   File "/opt/frigate/frigate/edgetpu.py", line 156, in run_detector
frigate    |     object_detector = LocalObjectDetector(
frigate    |   File "/opt/frigate/frigate/edgetpu.py", line 64, in __init__
frigate    |     edge_tpu_delegate = load_delegate("libedgetpu.so.1.0", device_config)
frigate    |   File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 162, in load_delegate
frigate    |     raise ValueError('Failed to load delegate from {}\n{}'.format(
frigate    | ValueError: Failed to load delegate from libedgetpu.so.1.0
frigate    |
frigate    | [2021-09-28 08:40:14] peewee.sqliteq                 DEBUG   : received query INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)
frigate    | [2021-09-28 08:40:14] peewee                         DEBUG   : ('INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)', ['1632811201.0-iva82h', 'haustuer-228', '/media/frigate/recordings/2021-09/28/08/haustuer-228/40.01.mp4', 1632811201.0, 1632811209.24, 8.24])
frigate    | [2021-09-28 08:40:14] peewee.sqliteq                 DEBUG   : received query INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)
frigate    | [2021-09-28 08:40:14] peewee                         DEBUG   : ('INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)', ['1632811201.0-mwc1mz', 'cat-kueche-130', '/media/frigate/recordings/2021-09/28/08/cat-kueche-130/40.01.mp4', 1632811201.0, 1632811211.015, 10.015])
frigate    | [2021-09-28 08:40:15] peewee.sqliteq                 DEBUG   : received query INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)
frigate    | [2021-09-28 08:40:15] peewee                         DEBUG   : ('INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)', ['1632811201.0-bgjioz', '6-einfahrt-134', '/media/frigate/recordings/2021-09/28/08/6-einfahrt-134/40.01.mp4', 1632811201.0, 1632811210.967, 9.967])
frigate    | [2021-09-28 08:40:15] peewee.sqliteq                 DEBUG   : received query INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)
frigate    | [2021-09-28 08:40:15] peewee                         DEBUG   : ('INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)', ['1632811201.0-k8rc1c', 'haustuer-81', '/media/frigate/recordings/2021-09/28/08/haustuer-81/40.01.mp4', 1632811201.0, 1632811209.013, 8.013])
frigate    | [2021-09-28 08:40:15] peewee.sqliteq                 DEBUG   : received query INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)
frigate    | [2021-09-28 08:40:15] peewee                         DEBUG   : ('INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)', ['1632811200.0-wul75h', 'cat-hilook-132', '/media/frigate/recordings/2021-09/28/08/cat-hilook-132/40.00.mp4', 1632811200.0, 1632811209.954, 9.954])
frigate    | [2021-09-28 08:40:15] peewee.sqliteq                 DEBUG   : received query INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)
frigate    | [2021-09-28 08:40:15] peewee                         DEBUG   : ('INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)', ['1632811200.0-hnzz9z', '5-strasse-117', '/media/frigate/recordings/2021-09/28/08/5-strasse-117/40.00.mp4', 1632811200.0, 1632811210.004, 10.004])
frigate    | [2021-09-28 08:40:15] peewee.sqliteq                 DEBUG   : received query INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)
frigate    | [2021-09-28 08:40:15] peewee                         DEBUG   : ('INSERT INTO "recordings" ("id", "camera", "path", "start_time", "end_time", "duration") VALUES (?, ?, ?, ?, ?, ?)', ['1632811200.0-sdmy28', '4-strasse-119', '/media/frigate/recordings/2021-09/28/08/4-strasse-119/40.00.mp4', 1632811200.0, 1632811207.96, 7.96])
frigate    | [2021-09-28 08:40:19] frigate.watchdog               INFO    : Detection appears to have stopped. Exiting frigate...
frigate    | [cont-finish.d] executing container finish scripts...
frigate    | [cont-finish.d] done.
frigate    | [s6-finish] waiting for services.
frigate    | [s6-finish] sending all processes the TERM signal.
frigate    | [s6-finish] sending all processes the KILL signal and exiting.
frigate exited with code 0
ozett commented 3 years ago

i got the second pci-coral back, without hard-reset, this hint helped. did this only with software on command-line on the ongoing powered system.

doing this:

lspci -nn | grep 089a
echo "1" > /sys/bus/pci/devices/0000\:03\:00.0/remove
echo "1" > /sys/bus/pci/rescan
dmesg

[ 9090.099545] pci 0000:03:00.0: [1ac1:089a] type 00 class 0x0000ff
[ 9090.099623] pci 0000:03:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit pref]
[ 9090.099654] pci 0000:03:00.0: reg 0x18: [mem 0x00000000-0x000fffff 64bit pref]
[ 9090.099989] pci 0000:03:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x1 link at 0000:00:1c.1 (capable of 4.000 Gb/s with 5 GT/s x1 link)
[ 9090.126098] pci 0000:03:00.0: BAR 2: assigned [mem 0xf0000000-0xf00fffff 64bit pref]
[ 9090.126136] pci 0000:03:00.0: BAR 0: assigned [mem 0xf0100000-0xf0103fff 64bit pref]

maybe that could be a fallback for frigate itself after loosing hardware? maybe with a configurable success/failure counter, witch switches after 10 times fallback to CPU?

and not shutting all down forever after this error-circle?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.