blakeblackshear / frigate

NVR with realtime local object detection for IP cameras
https://frigate.video
MIT License

Coral USB Resets on Proxmox #2607

Closed flowchartsman closed 2 years ago

flowchartsman commented 2 years ago

Describe the problem you are having

Went outside tonight to test an automation on person detected in the front yard zone. Nothing. Opened debug view with all options turned on. No bounding boxes drawn of any kind. No detection and no events.

Logs only show this, from earlier:

[2022-01-07 21:08:16] frigate.watchdog               INFO    : Detection appears to be stuck. Restarting detection process...
[2022-01-07 21:08:16] root                           INFO    : Waiting for detection process to exit gracefully...
[2022-01-07 21:08:46] root                           INFO    : Detection process didnt exit. Force killing...
[2022-01-07 21:08:46] detector.coral                 INFO    : Starting detection process: 73182
[2022-01-07 21:08:46] frigate.edgetpu                INFO    : Attempting to load TPU as usb
[2022-01-07 21:08:49] frigate.edgetpu                INFO    : TPU found

And nothing until my websocket connection at 2022-01-08 00:03:17

Proxmox logs indicate the USB device has been "reset"

[Thu Jan  6 15:51:33 2022] usb 1-6: reset full-speed USB device number 6 using xhci_hcd
[Fri Jan  7 21:08:35 2022] usb 2-6: reset SuperSpeed USB device number 3 using xhci_hcd
[Fri Jan  7 21:08:35 2022] usb 2-6: LPM exit latency is zeroed, disabling LPM.

Frigate now appears to be in a state where it is running, but simply not detecting anything.
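To see how often the host logs these resets, the dmesg lines can be filtered mechanically. The following is a minimal sketch, assuming the kernel messages keep the wording shown above; the function name `find_usb_resets` and the `SAMPLE` text are illustrative, not part of Frigate or Proxmox.

```python
import re

# Matches kernel lines such as
#   "usb 2-6: reset SuperSpeed USB device number 3 using xhci_hcd"
RESET_RE = re.compile(
    r"usb (?P<port>[\d.-]+): reset (?P<speed>[\w-]+) USB device "
    r"number (?P<devnum>\d+) using (?P<driver>[\w-]+)"
)


def find_usb_resets(dmesg_text):
    """Return one dict per USB reset line found in dmesg-style text."""
    resets = []
    for line in dmesg_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            resets.append(match.groupdict())
    return resets


# Sample copied from the host log quoted in this report.
SAMPLE = """\
[Thu Jan  6 15:51:33 2022] usb 1-6: reset full-speed USB device number 6 using xhci_hcd
[Fri Jan  7 21:08:35 2022] usb 2-6: reset SuperSpeed USB device number 3 using xhci_hcd
[Fri Jan  7 21:08:35 2022] usb 2-6: LPM exit latency is zeroed, disabling LPM.
"""

if __name__ == "__main__":
    for reset in find_usb_resets(SAMPLE):
        print(reset["port"], reset["speed"], reset["driver"])
```

Comparing the timestamps of the matched lines against the Frigate watchdog messages would show whether each "Detection appears to be stuck" entry lines up with a reset on the host.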

Some digging has turned up only sparse suggestions to use a shorter, better-quality cable. However, this cable is only 2-3 inches long, and supposedly rated for 10Gbps data:

https://www.amazon.com/gp/product/B08NPSX7FF/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&th=1

The machine is a Beelink Mini PC SEi10 with 16GB RAM, running the latest Proxmox with Frigate in LXC, and it functions perfectly well for most of the day.

I'm confused as to where to go from here. Is this an issue with LPM perhaps?

Version

0.9.4-26AE608

Frigate config file

#logger:
#  default: debug
#  logs:
#    peewee: error
mqtt:
  host: <hass host>
  user: <mqtt user>
  topic_prefix: frigate
  password: "{FRIGATE_MQTT_PASSWORD}"
ffmpeg:
  global_args:
    -hide_banner
    -loglevel error
  input_args:
    - -avoid_negative_ts
    - make_zero
    - -fflags
    - nobuffer+genpts+discardcorrupt
    - -flags
    - low_delay
    - -strict
    - experimental
    - -analyzeduration
    - 1000M
    - -probesize
    - 1000M
    - -rw_timeout
    - "5000000"
  hwaccel_args:
    - -hwaccel
    - qsv
    - -qsv_device
    - /dev/dri/renderD128
    - -hwaccel_output_format
    - yuv420p
detectors:
  coral:
    type: edgetpu
    device: usb
model:
  labelmap:
    2: vehicle
    3: vehicle
    5: vehicle
    7: vehicle
    15: animal
    16: animal
    17: animal
objects:
  track:
    - person
    - vehicle
record:
  enabled: True
  # Do not retain non-events
  retain_days: 0
  events:
    max_seconds: 300
    pre_capture: 5
    post_capture: 5
    retain:
      default: 5
snapshots:
  enabled: True
  timestamp: True
  bounding_box: False
  crop: False
  retain:
    default: 4
birdseye:
  enabled: True
  width: 1280
  height: 720
  quality: 8
  mode: objects
cameras:
  front:
    ffmpeg:
      inputs:
        - path: http://<camera ip>/flv?port=1935&app=bcs&stream=channel0_ext.bcs&user={FRIGATE_CAM_USER}&password={FRIGATE_CAM_PASSWORD}
          roles:
            - record
            - detect
            - rtmp
    detect:
        width: 896
        height: 512
        fps: 5
    zones:
      driveway:
        coordinates: 0,298,116,207,184,138,273,57,381,45,282,161,190,290,117,412
        objects:
          - vehicle
      yard:
        coordinates: 0,512,0,166,57,114,252,55,421,24,647,21,896,50,896,512
        objects:
          - person
    motion:
      mask:
        - 714,16,477,19,236,56,0,128,0,0,896,0,896,49
    mqtt:
      required_zones:
        - driveway
        - yard
    record:
        events:
          required_zones:
            - yard
            - driveway
    snapshots:
      required_zones:
        - yard
        - driveway
  back:
    ffmpeg:
      inputs:
        - path: http://<camera ip>/flv?port=1935&app=bcs&stream=channel0_main.bcs&user={FRIGATE_CAM_USER}&password={FRIGATE_CAM_PASSWORD}
          roles:
            - record
        - path: http://<camera ip>/flv?port=1935&app=bcs&stream=channel0_ext.bcs&user={FRIGATE_CAM_USER}&password={FRIGATE_CAM_PASSWORD}
          roles:
            - detect
            - rtmp
    detect:
        width: 896
        height: 672
        fps: 5
    zones:
      yard:
        coordinates: 0,672,0,70,434,30,896,88,896,672
        objects:
          - person
    mqtt:
      required_zones:
        - yard
    record:
        events:
          required_zones:
            - yard
    snapshots:
      required_zones:
        - yard
  bedside:
    ffmpeg:
      inputs:
        - path: http://<camera ip>/flv?port=1935&app=bcs&stream=channel0_main.bcs&user={FRIGATE_CAM_USER}&password={FRIGATE_CAM_PASSWORD}
          roles:
            - record
        - path: http://<camera ip>/flv?port=1935&app=bcs&stream=channel0_ext.bcs&user={FRIGATE_CAM_USER}&password={FRIGATE_CAM_PASSWORD}
          roles:
            - detect
            - rtmp
    detect:
        width: 896
        height: 672
        fps: 5
    motion:
      mask:
        - 518,0,398,34,250,104,356,142,478,123,530,297,598,288,564,114,661,87,824,146,833,101,639,0
    zones:
      yard:
        coordinates: 896,672,896,191,631,127,0,200,0,672
        objects:
          - person
    mqtt:
      required_zones:
        - yard
    record:
        events:
          required_zones:
            - yard
    snapshots:
      required_zones:
        - yard
  garage:
    ffmpeg:
      inputs:
        - path: http://<camera ip>/flv?port=1935&app=bcs&stream=channel0_main.bcs&user={FRIGATE_CAM_USER}&password={FRIGATE_CAM_PASSWORD}
          roles:
            - record
        - path: http://<camera ip>/flv?port=1935&app=bcs&stream=channel0_ext.bcs&user={FRIGATE_CAM_USER}&password={FRIGATE_CAM_PASSWORD}
          roles:
            - detect
            - rtmp
    detect:
        width: 896
        height: 672
        fps: 5
    motion:
      mask:
        - 0,0,406,0,405,71,365,106,334,140,305,200,276,229,268,307,328,426,405,578,347,672,0,672
        - 829,0,896,0,896,49
    zones:
      yard:
        coordinates: 725,306,670,379,632,473,582,672,361,604,329,495,316,437,247,389,225,334,217,273,361,68,609,58,726,72,727,21,896,48,896,240
        objects:
          - person
    mqtt:
      required_zones:
        - yard
    record:
        events:
          required_zones:
            - yard
    snapshots:
      required_zones:
        - yard

Relevant log output

:see above:

FFprobe output from your camera

n/a

Frigate stats

n/a

Operating system

Proxmox

Install method

Docker Compose

Coral version

USB

Network connection

Wired

Camera make and model

reolinks

Any other information that may be helpful

No response

flowchartsman commented 2 years ago

The only similar thing I was able to find was: https://github.com/google-coral/edgetpu/issues/166

Related?

flowchartsman commented 2 years ago

On a hunch this was a usb autosuspend issue, I've followed the advice here and disabled it and rebooted. cat /sys/module/usbcore/parameters/autosuspend on the proxmox host now returns -1, when before it returned 2, but in the proxmox dmesg, yet again I see:

[  +2.991700] usb 2-6: reset SuperSpeed USB device number 2 using xhci_hcd
[  +0.020168] usb 2-6: LPM exit latency is zeroed, disabling LPM.

And, going outside and waving my arms around, still no bounding boxes, no events, so I'm fresh out of ideas.
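The autosuspend check described above can be scripted so it is easy to repeat after each reboot. This is a minimal sketch, assuming the sysfs path quoted in the comment and that a negative value means autosuspend is disabled; the function name `autosuspend_disabled` is illustrative. It only reads the global default and says nothing about per-device power settings.

```python
from pathlib import Path

# Path quoted in the comment above; it is read on the Proxmox host.
AUTOSUSPEND_PATH = Path("/sys/module/usbcore/parameters/autosuspend")


def autosuspend_disabled(path=AUTOSUSPEND_PATH):
    """Return True if the usbcore autosuspend delay is negative (disabled)."""
    value = int(Path(path).read_text().strip())
    return value < 0


if __name__ == "__main__":
    if AUTOSUSPEND_PATH.exists():
        print("autosuspend disabled:", autosuspend_disabled())
    else:
        print("usbcore autosuspend parameter not found on this system")
```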

ozett commented 2 years ago

related? https://github.com/blakeblackshear/frigate/issues/2178#issuecomment-962084817 https://github.com/blakeblackshear/frigate/issues/1807 use a powered USB hub?

flowchartsman commented 2 years ago

related? #2178 (comment) #1807

I'm not sure, but it doesn't seem like it. In both of those cases, the users cannot use their Coral. I can, but it stops working.

use a powered USB hub?

Maybe? But I need something more solid to go on to decide if I need to spend money on a new hub, a new cable, or both. "Maybe it's using too much power" or "maybe you need a powered hub" aren't very satisfying answers. Is there any way to know more?

flowchartsman commented 2 years ago

Went ahead and wrote to the coral.ai help address, and they said it sounded like it might be heat related. Given that the Frigate dockerfile runs it with the max performance library, this might make sense on the surface, but I'm still a little suspicious, since this article didn't seem to indicate any heat problems, and I would think this would be more widespread otherwise. I will say that my unit seems quite hot compared to the observations in the article (measurements to follow).

I ordered a passive copper heatsink that should cover the unit, and which comes with a binding pad to test this out. Hopefully the addition of a heatsink will at least provide a data point for or against temperature being the issue. If it does seem to be heat, that seems to suggest my unit is hotter than normal for unknown reasons. It would either be a defect in my unit or something off about my environment, and I would suspect the environment before the hardware. Very confusing.

blakeblackshear commented 2 years ago

I have been running mine at max speed for years and never seen a heat related failure even in a warm server rack.

flowchartsman commented 2 years ago

As I said, I’m pretty skeptical that’s what’s happening, but I don’t have any viable, testable alternatives at the moment. All I know is that the device is “reset” and frigate stops working. I really want this to work, so I won’t stop digging on it until I have a viable solution, but at least the coral.ai help address gave me something to go on.

sakalauskas commented 2 years ago

Looks like my Coral USB is resetting for me as well:

proxmox logs:

reset high-speed USB device number 3 using ehci-pci

I only noticed this after I started optimizing my containers. It seems like Coral is not really being used. When I was running Frigate with Coral on another machine, I didn't notice 20-40% CPU usage for frigate.detectors.coral.

EDITED: Actually, the Coral is working correctly, even though it is disconnecting. Changing the detector to CPU, the load increases to 200-400%. The load is simply because there is a lot of motion due to Christmas lights.

flowchartsman commented 2 years ago

I bought a powered hub and a new cable, but now Frigate cannot see the Coral. Likely because I need to change the mounts, but this brings up a question I thought I'd ask here: is it possible there is some conflict with the way the qemu-based VM (hassos) and the Frigate LXC container are getting USB passed through?

The Home Assistant container has a single USB device mapped through in Proxmox (a Zigbee/Z-Wave combination stick), while the LXC container (privileged) is getting the entire bus:

/etc/pve/lxc/<container number>.conf

lxc.mount.entry: /dev/bus/usb/002/ dev/bus/usb/002/ none bind,optional,create=dir 0,0
lxc.cgroup2.devices.allow: c 189:* rwm
lxc.apparmor.profile: unconfined
lxc.cgroup2.devices.allow: a
lxc.cap.drop:
lxc.mount.auto: cgroup:rw
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.cgroup2.devices.allow: c 29:0 rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
lxc.mount.entry: /dev/fb0 dev/fb0 none bind,optional,create=file

compose

    devices:
      - /dev/bus/usb:/dev/bus/usb

I recall seeing in another issue here that the Coral requires the entire bus, and was wondering why that is, and whether it could be the source of conflicts.
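Because the container config above binds only `/dev/bus/usb/002/`, a Coral that moves to a different bus (for example behind a new hub) would not be visible inside the container. The following is a rough sketch of that check, assuming `lsusb` output in the usual `Bus NNN Device NNN: ID vvvv:pppp` form and using the two vendor:product IDs that other commenters report for the Coral in this thread. All function names are illustrative, and a config that binds the whole `/dev/bus/usb` tree is not handled.

```python
import re

# Vendor:product IDs reported for the Coral USB elsewhere in this thread.
CORAL_IDS = {"1a6e:089a", "18d1:9302"}

LSUSB_RE = re.compile(r"Bus (\d{3}) Device (\d{3}): ID ([0-9a-f]{4}:[0-9a-f]{4})")
MOUNT_RE = re.compile(r"^lxc\.mount\.entry:\s*/dev/bus/usb/(\d{3})/?\s")


def coral_buses(lsusb_text):
    """Return the set of bus numbers (e.g. {"002"}) that hold a Coral."""
    return {
        m.group(1)
        for m in LSUSB_RE.finditer(lsusb_text)
        if m.group(3) in CORAL_IDS
    }


def mounted_buses(lxc_conf_text):
    """Return the bus numbers bind-mounted by lxc.mount.entry lines."""
    buses = set()
    for line in lxc_conf_text.splitlines():
        m = MOUNT_RE.match(line.strip())
        if m:
            buses.add(m.group(1))
    return buses


def unmounted_coral_buses(lsusb_text, lxc_conf_text):
    """Return buses that hold a Coral but are not mounted in the container."""
    return coral_buses(lsusb_text) - mounted_buses(lxc_conf_text)
```

An empty result means every bus holding a Coral is already bind-mounted; a non-empty one names the bus that the `lxc.mount.entry` line would need to cover.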

I am really regretting buying a single box to run proxmox, and am very close to giving up and just running this on a PI4 I have sitting around if only so I can be done with it until the M2 Coral and the Odyssey boxes come back in stock.

cmmh commented 2 years ago

For what it's worth, I have the Coral USB device exposed through Proxmox and I do not get any errors. I get errors, ironically, on other devices, but not the Coral. I also pass through an NVIDIA Tesla M4 without issue.
I did have to tweak some settings to make the pass-through work for both the PCIe device and USB.

jcastro commented 2 years ago

I'm setting up the Coral USB as it's being shown by the lsusb command: Bus 002 Device 002: ID 1a6e:089a Global Unichip Corp. Then I restart to apply the changes and the vendor/product ID changes to Bus 002 Device 003: ID 18d1:9302 Google Inc. Is anyone experiencing a similar issue, or does anyone know what's happening?
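The two IDs quoted above can be told apart in a script, which helps when deciding which one a passthrough rule currently matches. This is a small sketch; the state labels are this example's own shorthand, and reading the ID change as "before and after the device has been initialized by software" is an assumption, not something confirmed in this comment.

```python
import re

# Labels are this sketch's own shorthand, not official terminology.
CORAL_STATES = {
    "1a6e:089a": "before initialization (Global Unichip Corp.)",
    "18d1:9302": "after initialization (Google Inc.)",
}

LSUSB_LINE = re.compile(
    r"Bus (?P<bus>\d+) Device (?P<dev>\d+): ID (?P<usb_id>[0-9a-f]{4}:[0-9a-f]{4})"
)


def coral_states(lsusb_text):
    """Return (bus, device, state) for every Coral ID found in lsusb output."""
    found = []
    for m in LSUSB_LINE.finditer(lsusb_text):
        state = CORAL_STATES.get(m.group("usb_id"))
        if state is not None:
            found.append((m.group("bus"), m.group("dev"), state))
    return found


if __name__ == "__main__":
    print(coral_states("Bus 002 Device 003: ID 18d1:9302 Google Inc."))
```

Note that the device number changes along with the ID (002 to 003 above), so a rule keyed on either the ID or the device number alone would stop matching after the change.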

flowchartsman commented 2 years ago

I gave up and purchased a Seeed Odyssey, as recommended in the docs. I put Ubuntu Server on it bare metal, exclusively to run Frigate, and connected my Coral to a powered USB3 hub with a high-speed cable. I am still getting errors, the latest of which is this:

[2022-01-30 16:39:58] frigate.app                    INFO    : Camera processor started for front: 228
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 160, in load_delegate
    delegate = Delegate(library, options)
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 119, in __init__
    raise ValueError(capture.message)
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/frigate/frigate/edgetpu.py", line 156, in run_detector
    object_detector = LocalObjectDetector(
  File "/opt/frigate/frigate/edgetpu.py", line 64, in __init__
    edge_tpu_delegate = load_delegate("libedgetpu.so.1.0", device_config)
  File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 162, in load_delegate
    raise ValueError('Failed to load delegate from {}\n{}'.format(
ValueError: Failed to load delegate from libedgetpu.so.1.0

I'm starting to think either my Coral is defective, the Coral is a poor choice of hardware, or there is some underlying bug with the software or supporting libraries, but in any case it's nothing I can seem to do anything about. This is a very frustrating experience, and I don't even have any insight on how to improve it. Hope someone can help me out.

blakeblackshear commented 2 years ago

I would suggest trying the getting started tutorial here: https://coral.ai/docs/accelerator/get-started

If that doesn't work, you may have a defective device. This would be the first time I have heard of that happening.

flowchartsman commented 2 years ago

Your confidence is reassuring, since you have vastly more experience with this device than I do. However, is there anything on the page you linked which the Frigate image has not already done, save for me manually running a model? Are you suggesting perhaps that I try to get it to crash by running it through its paces on the command line? Perhaps a loop of image classification?

These errors are definitely not constant, though they have happened repeatedly today alone. I've looked up the error, which seems to suggest the device is unavailable or the libraries are not installed correctly, neither of which makes sense if it works sometimes. One mention of it in the issues here suggested the device not getting enough power, which I thought I would have solved with the powered hub. I might even be inclined to ignore them if Frigate hadn't failed to notify me of a package delivery today, after perfectly reporting the first three vehicle alerts.
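The "loop of image classification" idea could be structured as a small harness that calls an inference function repeatedly and records which iterations fail. This is only a sketch: `infer` is a placeholder for whatever call actually exercises the Coral (for instance a function wrapping the classification example from the Coral tutorial, which is an assumption here), and `flaky_infer` exists purely to demonstrate the harness without hardware.

```python
import time


def stress(infer, iterations, pause_s=0.0):
    """Call infer() repeatedly and return (successes, failures).

    failures is a list of (iteration_index, repr(exception)) tuples.
    """
    successes = 0
    failures = []
    for i in range(iterations):
        try:
            infer()
            successes += 1
        except Exception as exc:  # record the failure and keep going
            failures.append((i, repr(exc)))
        if pause_s:
            time.sleep(pause_s)
    return successes, failures


def flaky_infer(counter=[0]):
    """Stand-in for a real inference call: fails on every 4th invocation."""
    counter[0] += 1
    if counter[0] % 4 == 0:
        raise ValueError("Failed to load delegate (simulated)")


if __name__ == "__main__":
    ok, failed = stress(flaky_infer, iterations=8)
    print(ok, "succeeded;", len(failed), "failed at", [i for i, _ in failed])
```

Logging the failure indices with timestamps would make it possible to compare them against the USB reset lines in dmesg.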

flowchartsman commented 2 years ago

Both of the package notification failures that I recall also happened at night. Is it possible that these crashes are transitory and unrelated and that there is poor model performance at night, when the camera switches to black and white?

blakeblackshear commented 2 years ago

These errors are related to the communication with the device. If it only happens sometimes, it may be related to times of high utilization or high temperatures. I have never had that problem with my USB devices even when running at 100fps for weeks on end in a warm server cabinet. The metal part of the USB is a heat sink that should be connected with a thermal pad to the chip. Given you have tried this on multiple machines and one was bare metal, I still think this points to a defective device. There isn't anything I can think of that could cause black and white to make a difference. The part of frigate that uses the Coral runs in a dedicated process and only has one isolated job of processing raw image data in memory.

lensherm commented 2 years ago

I'm setting up the Coral USB as it's being shown by the lsusb command: Bus 002 Device 002: ID 1a6e:089a Global Unichip Corp. Then I restart to apply the changes and the vendor/product ID changes to Bus 002 Device 003: ID 18d1:9302 Google Inc. Is anyone experiencing a similar issue, or does anyone know what's happening?

I have two USB Corals and I'm seeing identical symptoms as you. My setup is Proxmox-->Ubuntu Server-->Docker

jcastro commented 2 years ago

@lensherm very weird indeed! Looks like once Frigate has access to the USB device it might be loading some drivers, and the vendor/product ID changes because of that? I'm just guessing from complete ignorance here. I'm really looking forward to finding a solution to this.

cmmh commented 2 years ago

I think this stackoverflow posting describes what you're experiencing:

https://stackoverflow.com/questions/56632485/coral-google-edge-tpu-usb-accelerator-not-recognized-virtualbox-workaround

lensherm commented 2 years ago

I think this stackoverflow posting describes what you're experiencing:

https://stackoverflow.com/questions/56632485/coral-google-edge-tpu-usb-accelerator-not-recognized-virtualbox-workaround

That looks eerily similar to what I'm seeing. Will keep digging, when I have the time. For now, I have to pass through the "wrong" device to the VM, start it up, shut it down, pass in the updated correct device, start up the VM and all is fine with the world, until the next power down of the ProxMox machine.

simone-desantis commented 2 years ago

Similar setup: I am using a Coral USB, Proxmox, an Ubuntu 21 VM, and Frigate in Docker. I am passing the USB device through the Proxmox UI (no special command issued). I have always seen the ID change that someone mentioned (even when using the Coral on a Raspberry Pi), so I thought it was expected. I also see instances of this in the Proxmox log: reset SuperSpeed USB device number 3 using xhci_hcd. And, same as others here, I see the error message: Detection appears to be stuck. Restarting detection process...

jcastro commented 2 years ago

Someone from the Coral support team sent me this link. I still need to look at it, but wanted to share in case anyone is able to fix it: https://www.reddit.com/r/Proxmox/comments/nmsknx/proxmox_vm_ubuntu_2004_connect_google_coral_usb/ (look for the comment with the solution, not the main comment)

simone-desantis commented 2 years ago

@jcastro Okay, passing through the PCIe USB controller seems to be stable; it has been running fine for some hours now. I am wondering if there is any configuration that would make passing through just the USB port work too.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ronluna commented 1 year ago

I'm experiencing the same behavior on Proxmox 7. It does not matter how the USB device is passed through to the VM (by device vendor/ID, by USB port, or by passing the entire PCI device to the VM); the result is always the same, even after adding a powered USB 3 hub. The Coral USB will perform between 1-3 requests and after that it will start flashing/blinking white. dmesg will show the following:

[ 19.464072] usb 1-1: config 1 interface 0 altsetting 0 bulk endpoint 0x81 has invalid maxpacket 1024
[ 19.464073] usb 1-1: config 1 interface 0 altsetting 0 bulk endpoint 0x82 has invalid maxpacket 1024
[ 19.464076] usb 1-1: New USB device found, idVendor=18d1, idProduct=9302, bcdDevice= 1.00
[ 19.464078] usb 1-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[ 21.474247] usb 1-1: reset high-speed USB device number 3 using ehci-pci
[ 24.250245] usb 1-1: reset high-speed USB device number 3 using ehci-pci
[ 26.950252] usb 1-1: reset high-speed USB device number 3 using ehci-pci
[ 29.754260] usb 1-1: reset high-speed USB device number 3 using ehci-pci
[ 29.916573] usb 1-1: usbfs: process 1256 (python3) did not claim interface 0 before use

any ideas?

NickM-27 commented 1 year ago

the Coral USB will perform between 1-3 requests and after that it will start flashing/blinking white.

The coral blinking means it's working / processing and is normal behavior.

As far as the error, not sure why that would happen.

ronluna commented 1 year ago

What I've noticed is that the Coral USB will blink while processing a request and then stop blinking, although this is a steady blinking/flashing. It will stop processing requests right after the error in my previous comment shows up in dmesg. If I unplug the Coral and plug it back in, it gets recognized again and detects objects once or twice; then the error shows again in dmesg, it won't process any more requests, and it continues blinking/flashing indefinitely whether there is motion/an object or not.

adamburgoyne commented 10 months ago

Shame this is still an issue over a year later. At least Frigate is now detecting the coral and rebooting detection - so it does continue to work but just has 30 blips of no detection throughout the day and the associated spam logs. I'm not sure if it's the version of Proxmox or just unavoidable, I've tried a separate VM to the HA addon but it's the same problem.

M700 tiny so can't pass through the usb controller. Anyone ever find a solution?

michalcharvat commented 10 months ago

I don't have a problem using the Coral in a Proxmox LXC container with Docker for a few hours; however, after some time it apparently disconnects. I see almost the same behaviour on a VMware ESXi VM. Currently I am thinking about replacing it with an M.2 TPU, but I'm not sure if that would solve my issue...

bob454522 commented 8 months ago

@michalcharvat - the M.2 TPU will NOT work on ESXi. It's been documented in other posts around the web, and it matches my own experience: the M.2 Coral issue revolves around the fact that the TPU has to be flashed each time it's power cycled, and ESXi does not handle the PCIe device ID changing during this flash process (passthrough then needs to be redone for the new device ID after the flash, which requires a reboot, and so the entire process loops over again).

I would very much like to know, @blakeblackshear, your exact hardware setup (the one you describe as working great for years in a warm server cabinet, frequently at 100 FPS of Coral detection), i.e. what is the hardware, and how are you running Frigate? (Is Docker used? Any virtualization on top of Docker?) Any details you can provide would be very helpful, as I've been troubleshooting these issues for nearly a month. (By the way, these issues are not Frigate's fault; Frigate is excellent software. They are Google/Coral issues.)

The only stable Frigate setup I have gotten is when I use an older USB-A to USB-C cable (from various PCIe USB3 cards I'm testing). The cable only allows USB2 / 480mbit speeds; however, with this "limit" Frigate and the Coral have been stable for WEEKS of uptime, although inference speed is 30-35ms and inference FPS is limited to about 30 FPS (which is better than Coral reboots / an unstable Coral).

I have several high quality supermicro servers (x9 and x11 based dual cpu servers) + a nvidia p1000 gpu , and 4x different USB3 pcie cards (along with 3x usb corals and 1x pcie based coral) -- all for testing / trying to get a stable setup. (*note im only using one of these each at a time, not mixing multiples, But this is what I've accumulated over the past month)

thanks!

NickM-27 commented 8 months ago

The 0.13 docs have a new getting started guide that outlines exactly how to set up in a way similar to the way Blake runs (it was written by Blake) https://deploy-preview-6262--frigate-docs.netlify.app/guides/getting_started

bob454522 commented 8 months ago

The 0.13 docs have a new getting started guide that outlines exactly how to set up in a way similar to the way Blake runs (it was written by Blake) https://deploy-preview-6262--frigate-docs.netlify.app/guides/getting_started

thanks, that is helpful - i am more curious as to the hardware he is using though (i assume debian 12 is baremetal, not a guest , and what kind of USB3 pcie card or is a USB3 powered hub in use ).

but for now will be re-doing my tests with debian bookworm as the OS (guest on esxi). will update if debian makes a difference (vs ubuntu). tks

NickM-27 commented 8 months ago

The hardware is described in the docs as well https://docs.frigate.video/frigate/hardware#server

michalcharvat commented 8 months ago

@bob454522 my current setup is an M.2 TPU in the Wi-Fi slot of an old HP 705 G4 mini, inside Proxmox 7.4 in an LXC container. I don't remember why, but I was not able to run it on Proxmox 8.

leccelecce commented 8 months ago

Some of the issues being discussed in this thread are not necessarily anything to do with Proxmox, but rather Debian or Linux itself.

I initially ran Frigate in an LXC container on Proxmox 7, and saw these USB issues (both with and without the PCI object passed through).

[Thu Dec 21 23:37:35 2023] usb 2-1: reset SuperSpeed USB device number 3 using xhci_hcd
[Thu Dec 21 23:37:35 2023] usb 2-1: LPM exit latency is zeroed, disabling LPM.

I recently reinstalled Frigate on vanilla Debian 12.4, on a bare-metal Dell Optiplex 3080 Micro Form Factor, and moved my USB Coral over to it. I still get exactly the same messages I got in Proxmox.

Obviously if people have issues to do with PCI passthrough or running in VMs, that's potentially different, but a few people here are possibly blaming Proxmox or LXC when vanilla Debian bare metal still has an issue.

bob454522 commented 8 months ago

This is great to hear. I have been troubleshooting these exact issues for weeks now, to the point where I now have 3x USB Corals and 1x M.2 Coral in a PCIe converter. I also have 4x different PCIe USB cards (as passing through the PCIe USB card is the only stable option, which does make sense). I also bought a USB cable tester (less of a tester and more of a pin-out / pin-continuity detector, so that I'm able to determine whether a given USB cable can support USB at 480 megabit, 5 gigabit, or 10 gigabit).

All of my hardware is server-grade Supermicro X9 or X11 based boards, with 256GB+ of ECC RAM, enterprise SSDs, etc. (in server chassis, with dual Supermicro power supplies).

Im using esxi (and baremetal at times only for short tests, as i only plan to run this as a virtual machine).

A key finding I have recently discovered: the setup is very stable if you FORCE USB2 speeds to the Coral (480mbit); using a non-USB3 cable accomplishes this. Of course the inference latency increases quite a bit, and thus the total detection FPS is reduced a lot, but the entire setup is very stable. I've even tested 2x Corals on one Frigate at 480mbit, and it's very stable, though not 2x the Coral performance. (I've also tested 2x Frigate containers, 1x Coral to each, both sharing the P1000, and that too is very stable for days.)

(I also recall reading another post here from a user who was having Coral reset issues and thought the Google-included USB cable was the issue: when he switched to a different cable the entire system became stable and the Coral ran much cooler. However, it was pointed out that he was now just running the Coral at USB2 rather than USB3 speeds due to the cable.)

So it's really starting to seem like it's some kind of USB throughput issue, possibly involving the underlying OS, or perhaps how Docker interacts with the USB driver? (It also seems like Frigate is pushing a lot of bandwidth to the Coral, which is of course expected / normal.)

At first it very much seemed like a USB power issue, but I have tested and controlled for that, and it does not seem to be a lack of power/current from the USB port to the Coral.

forcing usb at 480mbit - very stable for days (even seen 1 week uptime on frigate container);

# lsusb -t
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 480M
|__ Port 2: Dev 3, If 0, Class=Vendor Specific Class, Driver=usbfs, 480M
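The negotiated link speed can be pulled out of `lsusb -t` output so that a 480M versus 5000M link is easy to spot. The following is a minimal sketch, assuming the Coral shows up as a "Vendor Specific Class" device as it does in the output above (other devices may use that class too); the function name `vendor_specific_links` is illustrative.

```python
import re

# Matches device lines of `lsusb -t`, e.g.
#   "|__ Port 2: Dev 3, If 0, Class=Vendor Specific Class, Driver=usbfs, 480M"
# Root hub lines have no "If N," field and are therefore skipped.
TREE_RE = re.compile(
    r"Port (?P<port>\d+): Dev (?P<dev>\d+), If \d+, "
    r"Class=(?P<cls>[^,]+), Driver=(?P<driver>[^,]*), (?P<speed>\d+)M"
)


def vendor_specific_links(lsusb_tree_text):
    """Return (dev, driver, speed_mbps) for each Vendor Specific Class device."""
    links = []
    for m in TREE_RE.finditer(lsusb_tree_text):
        if m.group("cls") == "Vendor Specific Class":
            links.append((int(m.group("dev")), m.group("driver"), int(m.group("speed"))))
    return links


SAMPLE = """\
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 480M
|__ Port 2: Dev 3, If 0, Class=Vendor Specific Class, Driver=usbfs, 480M
"""

if __name__ == "__main__":
    for dev, driver, speed in vendor_specific_links(SAMPLE):
        print(f"Dev {dev} via {driver or 'no driver'}: {speed} Mbit/s")
```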

If I use USB3 at 5gbit to the Coral (lsusb -t), Frigate will run anywhere from 2min to 30min, with excellent performance, but will then restart (just the Frigate container), with a few of these in dmesg:


dmesg;
usb 2-1.3: reset SuperSpeed USB device number 4 using xhci_hcd
usb 2-1.3: LPM exit latency is zeroed, disabling LPM.

If I use a USB card capable of 10gbit (but still linked at the Coral's max of 5gbit), Frigate will run anywhere from 1min to 4min, then restart (just the Frigate container, not the OS / VM), with several of these in dmesg:

lsusb -t
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 10000M
|__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 10000M
|__ Port 3: Dev 8, If 0, Class=Vendor Specific Class, Driver=usbfs, 5000M
|__ Port 2: Dev 3, If 0, Class=Hub, Driver=hub/4p, 10000M
|__ Port 3: Dev 5, If 0, Class=Vendor Specific Class, Driver=, 5000M

dmesg;
xhci_hcd 0000:13:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
xhci_hcd 0000:13:00.0: Looking for event-dma 000000000144a060 trb-start 000000000144bfe0 trb-end 000000000144bfe0 seg-start 000000000144b000 seg-end 000000000144bff0

next up- i will be testing with using debian 12.x as the guest OS (still will use docker) , then if not resolved will test with same hardware but on baremetal.

I do have much higher-bandwidth services / hardware running on ESXi / vSphere using PCIe passthrough, unrelated to Frigate, with years of uptime, so I don't think ESXi passthrough is the problem / limit here (but it is possible!).

bob454522 commented 8 months ago

to update my post above- im seeing the same exact USB errors on debian 12 , as i do ubuntu 20 and 22 - (this when im using the coral linked at usb3, 5gbit, on a pcie usb card capable of 10gbit - this is the card im using: https://a.co/d/cCuS298 + with a 10gbit capable usbC to usbC cable to the coral ) - see below :

root@dockerD-deb12:/vmDS/noREPLCssdz2# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:        12
Codename:       bookworm

root@dockerD-deb12:/vmDS/noREPLCssdz2# lsusb
Bus 002 Device 003: ID 2109:0822 VIA Labs, Inc. USB3.1 Hub
Bus 002 Device 004: ID 18d1:9302 Google Inc.
Bus 002 Device 002: ID 2109:0822 VIA Labs, Inc. USB3.1 Hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 003: ID 2109:2822 VIA Labs, Inc. USB2.0 Hub
Bus 001 Device 002: ID 2109:2822 VIA Labs, Inc. USB2.0 Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

root@dockerD-deb12:/vmDS/noREPLCssdz2# lsusb -t
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 10000M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 10000M
        |__ Port 3: Dev 4, If 0, Class=Vendor Specific Class, Driver=usbfs, 5000M
    |__ Port 2: Dev 3, If 0, Class=Hub, Driver=hub/4p, 10000M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/2p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M
    |__ Port 2: Dev 3, If 0, Class=Hub, Driver=hub/4p, 480M
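
The 5000M entry in the tree above is the Coral's negotiated link speed. As a rough sketch, the same value can also be read from sysfs without parsing lsusb output. The function name usb_speed and the SYSFS_USB override are illustrative assumptions, not part of any existing tool.

```shell
# Print "<sysfs device name> <speed in Mbit/s>" for every USB device matching VID and PID.
# SYSFS_USB is a hypothetical override for testing; it defaults to the real sysfs path.
usb_speed() {
    vid=$1
    pid=$2
    root=${SYSFS_USB:-/sys/bus/usb/devices}
    for dev in "$root"/*; do
        # Interface entries (e.g. 2-1.3:1.0) have no idVendor file, so skip them.
        [ -r "$dev/idVendor" ] || continue
        if [ "$(cat "$dev/idVendor")" = "$vid" ] && [ "$(cat "$dev/idProduct")" = "$pid" ]; then
            echo "$(basename "$dev") $(cat "$dev/speed")"
        fi
    done
}

# Usage (IDs taken from the lsusb output above):
#   usb_speed 18d1 9302
```

A value of 5000 corresponds to a SuperSpeed (USB 3) link, while 480 would mean the device fell back to USB 2.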

dmesg 
...
[ 1540.800054] IPv6: ADDRCONF(NETDEV_CHANGE): veth01f4285: link becomes ready
[ 1540.800135] br-47a4b4dc8786: port 2(veth01f4285) entered blocking state
[ 1540.800146] br-47a4b4dc8786: port 2(veth01f4285) entered forwarding state
[ 1546.090540] usb 2-1.3: reset SuperSpeed USB device number 4 using xhci_hcd
[ 1546.109851] usb 2-1.3: LPM exit latency is zeroed, disabling LPM.
[ 1550.188998] xhci_hcd 0000:13:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
[ 1550.189078] xhci_hcd 0000:13:00.0: Looking for event-dma 0000000035485000 trb-start 0000000035484fe0 trb-end 0000000035484fe0 seg-start 0000000035484000 seg-end 0000000035484ff0
[ 1550.189090] xhci_hcd 0000:13:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
[ 1550.189161] xhci_hcd 0000:13:00.0: Looking for event-dma 0000000035485010 trb-start 0000000035484fe0 trb-end 0000000035484fe0 seg-start 0000000035484000 seg-end 0000000035484ff0
[ 1550.189177] xhci_hcd 0000:13:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
[ 1550.189248] xhci_hcd 0000:13:00.0: Looking for event-dma 0000000035485020 trb-start 0000000035484fe0 trb-end 0000000035484fe0 seg-start 0000000035484000 seg-end 0000000035484ff0
[ 1550.189260] xhci_hcd 0000:13:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
[ 1550.189316] xhci_hcd 0000:13:00.0: Looking for event-dma 0000000035485030 trb-start 0000000035484fe0 trb-end 0000000035484fe0 seg-start 0000000035484000 seg-end 0000000035484ff0
[ 1550.189440] xhci_hcd 0000:13:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
[ 1550.191946] xhci_hcd 0000:13:00.0: Looking for event-dma 0000000035485040 trb-start 0000000035484fe0 trb-end 0000000035484fe0 seg-start 0000000035484000 seg-end 0000000035484ff0
[ 1550.191959] xhci_hcd 0000:13:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
[ 1550.194319] xhci_hcd 0000:13:00.0: Looking for event-dma 0000000035485050 trb-start 0000000035484fe0 trb-end 0000000035484fe0 seg-start 0000000035484000 seg-end 0000000035484ff0
[ 1550.194331] xhci_hcd 0000:13:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
[ 1550.196767] xhci_hcd 0000:13:00.0: Looking for event-dma 0000000035485060 trb-start 0000000035484fe0 trb-end 0000000035484fe0 seg-start 0000000035484000 seg-end 0000000035484ff0

(over the next week or so will test this same hardware on baremetal)
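
For anyone comparing setups, a small filter like the following can summarize how often the two messages shown above appear in a kernel log. This is only a sketch: the function name count_usb_errors is made up for illustration, and the match strings are copied from the dmesg output in this thread.

```shell
# Count USB reset messages and xHCI TRB errors in kernel log text read from stdin.
count_usb_errors() {
    awk '
        /reset SuperSpeed USB device/ { resets++ }
        /ERROR Transfer event TRB DMA ptr not part of current TD/ { trb++ }
        END { printf "resets=%d trb_errors=%d\n", resets, trb }
    '
}

# Usage:
#   dmesg | count_usb_errors
```

Running it before and after a change (different guest OS, different port, bare metal) gives a quick way to tell whether the error rate actually moved.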

goldserve commented 6 months ago

bob454522, have you made any other discoveries? I'm beginning to think it is an underlying OS issue. Is there another OS, other than Debian/Ubuntu, that is more stable and can run Docker?

KevSex commented 6 months ago

I've experienced similar issues related to these errors:

reset SuperSpeed USB device number 4 using xhci_hcd
LPM exit latency is zeroed, disabling LPM.

I had a look at disabling LPM on the device through the following grub args:

usbcore.autosuspend=-1 <- To avoid kernel putting USB devices into power-save
usbcore.quirks=18d1:9302:k <- USB_QUIRK_NO_LPM where the entry is in the form VendorID:ProductID:Flags

I ended up doing this on both Proxmox and the Docker VM just to be sure

I no longer receive the disabling LPM message.

Thought I'd post to rule out any issues with the kernel attempting to run some power saving on the USB device.
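
After a reboot, one way to confirm that the two settings are actually in effect is to read them back from sysfs. The sketch below assumes the kernel exposes both values under /sys/module/usbcore/parameters (the quirks entry may be missing on older kernels); the function name check_usbcore_params and the SYSFS_USBCORE override are hypothetical.

```shell
# Print the usbcore settings the running kernel is using.
# SYSFS_USBCORE is a hypothetical override for testing; it defaults to the real sysfs path.
check_usbcore_params() {
    params=${SYSFS_USBCORE:-/sys/module/usbcore/parameters}
    echo "autosuspend=$(cat "$params/autosuspend")"
    echo "quirks=$(cat "$params/quirks")"
}

# Usage:
#   check_usbcore_params
#   cat /proc/cmdline    # shows the full command line the kernel booted with
```

If the arguments were applied, the output should show the values that were added to the kernel command line.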

leccelecce commented 5 months ago

@KevSex thank you, you've cleared my issue on my Dell 3080 Micro / Debian Bookworm with USB Coral!

It was a simple case of editing /etc/default/grub, adding those usbcore.* settings, running update-grub, and rebooting.

Wvanlaa commented 4 months ago

@KevSex do you know if there is a "step by step" guide on how to change those parameters, or can you elaborate on what you did (and how you did it)? I'm also suffering from:

Apr 29 06:14:35 pve kernel: usb 4-2: reset SuperSpeed USB device number 4 using xhci_hcd
Apr 29 06:14:35 pve kernel: usb 4-2: LPM exit latency is zeroed, disabling LPM

I KNOW I have seen an article somewhere that does a "step by step" around this, I just cannot find it anymore. So a bit of help would be highly appreciated.

KevSex commented 4 months ago

Edit your grub file

nano /etc/default/grub

Look for the line starting with GRUB_CMDLINE_LINUX_DEFAULT. Within the quotes, append the two parameters usbcore.autosuspend=-1 usbcore.quirks=18d1:9302:k

Mine looks like: GRUB_CMDLINE_LINUX_DEFAULT="quiet usbcore.autosuspend=-1 usbcore.quirks=18d1:9302:k"

Save and exit file

Update grub

update-grub

Reboot
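
For repeat installs, the manual edit above can be scripted. The following is a minimal sketch, assuming GNU sed and a GRUB_CMDLINE_LINUX_DEFAULT line with double-quoted contents; the function name add_grub_params is made up for illustration. Back up the file before trying it.

```shell
# Append kernel parameters to GRUB_CMDLINE_LINUX_DEFAULT in a grub defaults file.
# Usage: add_grub_params FILE PARAM...
add_grub_params() {
    file=$1
    shift
    for param in "$@"; do
        # Skip parameters that are already on the line, so re-running is harmless.
        if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
            continue
        fi
        # Insert the parameter just before the closing quote (requires GNU sed for -i).
        sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${param}\"|" "$file"
    done
}

# Usage (as root), followed by the same update-grub and reboot steps as above:
#   cp /etc/default/grub /etc/default/grub.bak
#   add_grub_params /etc/default/grub usbcore.autosuspend=-1 usbcore.quirks=18d1:9302:k
```

Note that this only covers hosts that boot through GRUB. Proxmox installs that boot via systemd-boot are understood to keep the kernel command line elsewhere, so check which bootloader is in use first.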

snh commented 1 week ago

usbcore.quirks=18d1:9302:k <- USB_QUIRK_NO_LPM where the entry is in the form VendorID:ProductID:Flags

usbcore.quirks=18d1:9302:k resolved the issue for me; I didn't need usbcore.autosuspend=-1 in my case.

Thanks for the fix @KevSex!