blakeblackshear / frigate

NVR with realtime local object detection for IP cameras
https://frigate.video
MIT License

[Detector Support]: PCIe Edge TPU is found, then "Detection appears to be stuck" and things go downhill from there #5910

Closed kevinmilner closed 1 year ago

kevinmilner commented 1 year ago

Describe the problem you are having

Upon startup, the PCIe TPU is found, but then the detector is reported as stuck, is killed, and never comes back to life. I am running this as an app on TrueNAS SCALE (Bluefin) using TrueCharts (apologies for the verbose auto-generated config file below). I have the drivers successfully installed on the host system.

My computer (an HP Z230 workstation) doesn't have any mPCIe slots, so I am using the Ableconn PEX-MP117 adapter; many people on Amazon have reported success using this adapter with the Coral.

I have also tried passing the PCIe card through to a VM and using a simpler config with docker-compose, but got the same errors (I originally thought they were due to PCIe passthrough issues, but their occurring on the bare-metal install makes me think otherwise). When I was doing tests with the VM, I did not have the drivers installed on the host (only in the VM). When switching to bare metal, I disabled VM PCIe passthrough, so the VM shouldn't be mucking with things.

Version

0.11.1-2eada21

Frigate config file

{
  "birdseye": {
    "enabled": true,
    "height": 720,
    "mode": "objects",
    "quality": 8,
    "width": 1280
  },
  "cameras": {
    "front_porch": {
      "best_image_timeout": 60,
      "birdseye": {
        "enabled": true,
        "mode": "objects"
      },
      "detect": {
        "enabled": true,
        "fps": 5,
        "height": 720,
        "max_disappeared": 25,
        "stationary": {
          "interval": 0,
          "max_frames": {
            "default": null,
            "objects": {}
          },
          "threshold": 50
        },
        "width": 1280
      },
      "ffmpeg": {
        "global_args": [
          "-hide_banner",
          "-loglevel",
          "warning"
        ],
        "hwaccel_args": [],
        "input_args": [
          "-avoid_negative_ts",
          "make_zero",
          "-fflags",
          "+genpts+discardcorrupt",
          "-rtsp_transport",
          "tcp",
          "-timeout",
          "5000000",
          "-use_wallclock_as_timestamps",
          "1"
        ],
        "inputs": [
          {
            "global_args": [],
            "hwaccel_args": [],
            "input_args": [],
            "path": "rtsp://USER:PASS@192.168.4.24:10554/Streaming/Channels/102/",
            "roles": [
              "record",
              "rtmp",
              "detect"
            ]
          }
        ],
        "output_args": {
          "detect": [
            "-f",
            "rawvideo",
            "-pix_fmt",
            "yuv420p"
          ],
          "record": [
            "-f",
            "segment",
            "-segment_time",
            "10",
            "-segment_format",
            "mp4",
            "-reset_timestamps",
            "1",
            "-strftime",
            "1",
            "-c",
            "copy",
            "-an"
          ],
          "rtmp": [
            "-c",
            "copy",
            "-f",
            "flv"
          ]
        }
      },
      "ffmpeg_cmds": [
        {
          "cmd": "ffmpeg -hide_banner -loglevel warning -avoid_negative_ts make_zero -fflags +genpts+discardcorrupt -rtsp_transport tcp -timeout 5000000 -use_wallclock_as_timestamps 1 -i rtsp://USER:PASS@192.168.4.24:10554/Streaming/Channels/102/ -c copy -f flv rtmp://127.0.0.1/live/front_porch -r 5 -s 1280x720 -f rawvideo -pix_fmt yuv420p pipe:",
          "roles": [
            "record",
            "rtmp",
            "detect"
          ]
        }
      ],
      "live": {
        "height": 720,
        "quality": 8
      },
      "motion": {
        "contour_area": 30,
        "delta_alpha": 0.2,
        "frame_alpha": 0.2,
        "frame_height": 50,
        "improve_contrast": false,
        "mask": "",
        "mqtt_off_delay": 30,
        "threshold": 25
      },
      "mqtt": {
        "bounding_box": true,
        "crop": true,
        "enabled": true,
        "height": 270,
        "quality": 70,
        "required_zones": [],
        "timestamp": true
      },
      "name": "front_porch",
      "objects": {
        "filters": {
          "person": {
            "mask": null,
            "max_area": 24000000,
            "max_ratio": 24000000,
            "min_area": 0,
            "min_ratio": 0,
            "min_score": 0.5,
            "threshold": 0.7
          }
        },
        "mask": "",
        "track": [
          "person"
        ]
      },
      "record": {
        "enabled": false,
        "events": {
          "objects": null,
          "post_capture": 5,
          "pre_capture": 5,
          "required_zones": [],
          "retain": {
            "default": 10,
            "mode": "motion",
            "objects": {}
          }
        },
        "expire_interval": 60,
        "retain": {
          "days": 0,
          "mode": "all"
        },
        "retain_days": null
      },
      "rtmp": {
        "enabled": true
      },
      "snapshots": {
        "bounding_box": true,
        "clean_copy": true,
        "crop": false,
        "enabled": false,
        "height": null,
        "quality": 70,
        "required_zones": [],
        "retain": {
          "default": 10,
          "mode": "motion",
          "objects": {}
        },
        "timestamp": false
      },
      "timestamp_style": {
        "color": {
          "blue": 255,
          "green": 255,
          "red": 255
        },
        "effect": null,
        "format": "%m/%d/%Y %H:%M:%S",
        "position": "tl",
        "thickness": 2
      },
      "ui": {
        "dashboard": true,
        "order": 0
      },
      "zones": {}
    }
  },
  "database": {
    "path": "/db/frigate.db"
  },
  "detect": {
    "enabled": true,
    "fps": 5,
    "height": 720,
    "max_disappeared": null,
    "stationary": {
      "interval": 0,
      "max_frames": {
        "default": null,
        "objects": {}
      },
      "threshold": null
    },
    "width": 1280
  },
  "detectors": {
    "coral": {
      "device": "pci:0",
      "num_threads": 3,
      "type": "edgetpu"
    }
  },
  "environment_vars": {},
  "ffmpeg": {
    "global_args": [
      "-hide_banner",
      "-loglevel",
      "warning"
    ],
    "hwaccel_args": [],
    "input_args": [
      "-avoid_negative_ts",
      "make_zero",
      "-fflags",
      "+genpts+discardcorrupt",
      "-rtsp_transport",
      "tcp",
      "-timeout",
      "5000000",
      "-use_wallclock_as_timestamps",
      "1"
    ],
    "output_args": {
      "detect": [
        "-f",
        "rawvideo",
        "-pix_fmt",
        "yuv420p"
      ],
      "record": [
        "-f",
        "segment",
        "-segment_time",
        "10",
        "-segment_format",
        "mp4",
        "-reset_timestamps",
        "1",
        "-strftime",
        "1",
        "-c",
        "copy",
        "-an"
      ],
      "rtmp": [
        "-c",
        "copy",
        "-f",
        "flv"
      ]
    }
  },
  "live": {
    "height": 720,
    "quality": 8
  },
  "logger": {
    "default": "info",
    "logs": {}
  },
  "model": {
    "height": 320,
    "labelmap": {},
    "labelmap_path": null,
    "path": null,
    "width": 320
  },
  "motion": null,
  "mqtt": {
    "client_id": "frigate",
    "host": "192.168.5.39",
    "password": "PASS",
    "port": 1883,
    "stats_interval": 60,
    "tls_ca_certs": null,
    "tls_client_cert": null,
    "tls_client_key": null,
    "tls_insecure": null,
    "topic_prefix": "frigate",
    "user": "USER"
  },
  "objects": {
    "filters": null,
    "mask": "",
    "track": [
      "person"
    ]
  },
  "plus": {
    "enabled": false
  },
  "record": {
    "enabled": false,
    "events": {
      "objects": null,
      "post_capture": 5,
      "pre_capture": 5,
      "required_zones": [],
      "retain": {
        "default": 10,
        "mode": "motion",
        "objects": {}
      }
    },
    "expire_interval": 60,
    "retain": {
      "days": 0,
      "mode": "all"
    },
    "retain_days": null
  },
  "rtmp": {
    "enabled": true
  },
  "snapshots": {
    "bounding_box": true,
    "clean_copy": true,
    "crop": false,
    "enabled": false,
    "height": null,
    "quality": 70,
    "required_zones": [],
    "retain": {
      "default": 10,
      "mode": "motion",
      "objects": {}
    },
    "timestamp": false
  },
  "timestamp_style": {
    "color": {
      "blue": 255,
      "green": 255,
      "red": 255
    },
    "effect": null,
    "format": "%m/%d/%Y %H:%M:%S",
    "position": "tl",
    "thickness": 2
  },
  "ui": {
    "use_experimental": false
  }
}

This simpler config also generates the same errors, but was used within a VM with PCIe passthrough (the above was bare metal):

mqtt:
  host: 192.168.5.39
  user: USER
  password: PASS

detectors:
  coral1:
    type: edgetpu
    device: pci

database:
  # The path to store the SQLite DB (default: shown below)
  path: /db/frigate.db

cameras:
  front_porch:
    ffmpeg: 
      inputs:
        - path: rtsp://USER:PASS@192.168.4.24:10554/Streaming/Channels/102/
          roles:
            - detect
            - rtmp
    rtmp:
      enabled: True # <-- RTMP should be disabled if your stream is not H264
    detect:
      width: 352
      height: 240

docker-compose file or Docker CLI command

Automatically generated by the Truecharts app, but I had the same exact errors with this configuration running in a VM:

version: "3.9"
services:
  frigate:
    container_name: frigate
    privileged: true # this may not be necessary for all setups
    restart: unless-stopped
    image: blakeblackshear/frigate:stable
    shm_size: "128mb" # update for your cameras based on calculation above
    devices:
      - /dev/apex_0:/dev/apex_0 # passes a PCIe Coral, follow driver instructions here https://coral.ai/docs/m2/get-started/#2a-on-linux
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /home/kevin/frigate/config.yml:/config/config.yml:ro
      - /mnt/surveillance/frigate:/media/frigate
      - /home/kevin/frigate/db:/db
      - type: tmpfs # Optional: 1GB of memory, reduces SSD/SD Card wear
        target: /tmp/cache
        tmpfs:
          size: 1000000000
    ports:
      - "5000:5000"
      - "1935:1935" # RTMP feeds
    environment:
      FRIGATE_RTSP_PASSWORD: "PASS"

Relevant log output

2023-04-04 21:53:47.217570+00:00[s6-init] making user provided files available at /var/run/s6/etc...exited 0.
2023-04-04 21:53:47.686818+00:00[s6-init] ensuring user provided files have correct perms...exited 0.
2023-04-04 21:53:47.687555+00:00[fix-attrs.d] applying ownership & permissions fixes...
2023-04-04 21:53:47.688069+00:00[fix-attrs.d] done.
2023-04-04 21:53:47.688565+00:00[cont-init.d] executing container initialization scripts...
2023-04-04 21:53:47.689051+00:00[cont-init.d] done.
2023-04-04 21:53:47.689554+00:00[services.d] starting services
2023-04-04 21:53:47.693761+00:00[services.d] done.
2023-04-04 21:53:48.988959+00:00[2023-04-04 14:53:48] frigate.app                    INFO    : Starting Frigate (0.11.1-2eada21)
2023-04-04 21:53:49.001110+00:00Starting migrations
2023-04-04 21:53:49.001164+00:00[2023-04-04 14:53:49] peewee_migrate                 INFO    : Starting migrations
2023-04-04 21:53:49.005148+00:00There is nothing to migrate
2023-04-04 21:53:49.005192+00:00[2023-04-04 14:53:49] peewee_migrate                 INFO    : There is nothing to migrate
2023-04-04 21:53:49.018995+00:00[2023-04-04 14:53:49] detector.coral                 INFO    : Starting detection process: 223
2023-04-04 21:53:49.020941+00:00[2023-04-04 14:53:49] frigate.app                    INFO    : Output process started: 225
2023-04-04 21:53:49.023759+00:00[2023-04-04 14:53:49] ws4py                          INFO    : Using epoll
2023-04-04 21:53:49.025951+00:00[2023-04-04 14:53:49] frigate.app                    INFO    : Camera processor started for front_porch: 230
2023-04-04 21:53:49.032629+00:00[2023-04-04 14:53:49] frigate.app                    INFO    : Capture process started for front_porch: 234
2023-04-04 21:53:49.057347+00:00[2023-04-04 14:53:49] frigate.edgetpu                INFO    : Attempting to load TPU as pci:0
2023-04-04 21:53:49.084478+00:00[2023-04-04 14:53:49] frigate.edgetpu                INFO    : TPU found
2023-04-04 21:53:49.329886+00:00[2023-04-04 14:53:49] ws4py                          INFO    : Using epoll
2023-04-04 21:54:09.341025+00:00[2023-04-04 14:54:09] frigate.watchdog               INFO    : Detection appears to be stuck. Restarting detection process...
2023-04-04 21:54:09.341083+00:00[2023-04-04 14:54:09] root                           INFO    : Waiting for detection process to exit gracefully...
2023-04-04 21:54:39.368925+00:00[2023-04-04 14:54:39] root                           INFO    : Detection process didnt exit. Force killing...
2023-04-04 21:54:52.249741+00:00[2023-04-04 14:54:52] detector.coral                 INFO    : Starting detection process: 309
2023-04-04 21:55:05.156839+00:00[2023-04-04 14:54:52] frigate.edgetpu                INFO    : Attempting to load TPU as pci:0
2023-04-04 21:55:05.157069+00:00[2023-04-04 14:55:05] frigate.edgetpu                ERROR   : No EdgeTPU was detected. If you do not have a Coral device yet, you must configure CPU detectors.
2023-04-04 21:55:05.157097+00:00Process detector:coral:
2023-04-04 21:55:05.158282+00:00Traceback (most recent call last):
2023-04-04 21:55:05.158333+00:00File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 160, in load_delegate
2023-04-04 21:55:05.158349+00:00delegate = Delegate(library, options)
2023-04-04 21:55:05.158362+00:00File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 119, in __init__
2023-04-04 21:55:05.158382+00:00raise ValueError(capture.message)
2023-04-04 21:55:05.158399+00:00ValueError
2023-04-04 21:55:05.158415+00:002023-04-04T21:55:05.158415440Z
2023-04-04 21:55:05.158428+00:00During handling of the above exception, another exception occurred:
2023-04-04 21:55:05.158435+00:002023-04-04T21:55:05.158435456Z
2023-04-04 21:55:05.158442+00:00Traceback (most recent call last):
2023-04-04 21:55:05.158455+00:00File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
2023-04-04 21:55:05.158463+00:00self.run()
2023-04-04 21:55:05.158470+00:00File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
2023-04-04 21:55:05.158477+00:00self._target(*self._args, **self._kwargs)
2023-04-04 21:55:05.158494+00:00File "/opt/frigate/frigate/edgetpu.py", line 135, in run_detector
2023-04-04 21:55:05.158504+00:00object_detector = LocalObjectDetector(
2023-04-04 21:55:05.158513+00:00File "/opt/frigate/frigate/edgetpu.py", line 43, in __init__
2023-04-04 21:55:05.158525+00:00edge_tpu_delegate = load_delegate("libedgetpu.so.1.0", device_config)
2023-04-04 21:55:05.158539+00:00File "/usr/lib/python3/dist-packages/tflite_runtime/interpreter.py", line 162, in load_delegate
2023-04-04 21:55:05.158547+00:00raise ValueError('Failed to load delegate from {}\n{}'.format(
2023-04-04 21:55:05.158554+00:00ValueError: Failed to load delegate from libedgetpu.so.1.0
2023-04-04 21:55:05.158564+00:002023-04-04T21:55:05.158564991Z
2023-04-04 21:55:12.247642+00:00[2023-04-04 14:55:12] frigate.watchdog               INFO    : Detection appears to have stopped. Exiting frigate...
2023-04-04 21:55:12.330284+00:00[cont-finish.d] executing container finish scripts...
2023-04-04 21:55:12.330767+00:00[cont-finish.d] done.
2023-04-04 21:55:12.330988+00:00[s6-finish] waiting for services.
2023-04-04 21:55:12.356730+00:00[2023-04-04 14:55:12] frigate.video                  ERROR   : front_porch: Unable to read frames from ffmpeg process.
2023-04-04 21:55:12.356797+00:00[2023-04-04 14:55:12] frigate.video                  ERROR   : front_porch: ffmpeg process is not running. exiting capture thread...
2023-04-04 21:55:12.534563+00:00[s6-finish] sending all processes the TERM signal.
2023-04-04 21:55:15.543725+00:00[s6-finish] sending all processes the KILL signal and exiting.

Operating system

Debian

Install method

Docker Compose

Coral version

PCIe

Any other information that may be helpful

I see the following in dmesg. Same exact error messages when run on bare metal as when I tried it inside of a VM:

[ 9770.048831] x86/PAT: frigate.detecto:596252 map pfn RAM range req uncached-minus for [mem 0x10041c000-0x10041ffff], got write-back
[ 9774.204362] apex 0000:02:00.0: Apex performance not throttled due to temperature
[ 9779.324380] apex 0000:02:00.0: Apex performance not throttled due to temperature
[ 9784.444380] apex 0000:02:00.0: Apex performance not throttled due to temperature
[ 9789.564414] apex 0000:02:00.0: Apex performance not throttled due to temperature
[ 9794.688582] apex 0000:02:00.0: Apex performance not throttled due to temperature
[ 9799.808556] apex 0000:02:00.0: Apex performance not throttled due to temperature
[ 9804.924633] apex 0000:02:00.0: Apex performance not throttled due to temperature
[ 9810.044713] apex 0000:02:00.0: Apex performance not throttled due to temperature
[ 9815.164748] apex 0000:02:00.0: Apex performance not throttled due to temperature
[ 9820.284809] apex 0000:02:00.0: Apex performance not throttled due to temperature
[ 9833.240937] apex 0000:02:00.0: RAM did not enable within timeout (12000 ms)
[ 9833.248782] apex 0000:02:00.0: Apex performance not throttled due to temperature
[ 9846.145137] apex 0000:02:00.0: RAM did not enable within timeout (12000 ms)
[ 9846.152145] apex 0000:02:00.0: Error in device open cb: -110
[ 9846.157896] apex 0000:02:00.0: Apex performance not throttled due to temperature

NickM-27 commented 1 year ago

If it happens on the host OS as well as in a VM, then I'd say it is most likely a hardware issue. I'm not sure if it is the adapter, but I've not seen this behavior much at all with PCIe-based Corals.

kevinmilner commented 1 year ago

> If it happens on the host OS as well as in a VM then I'd say it is most likely a hardware issue, not sure if it is the adapter but I've not seen this behavior much at all with PCIe based corals.

Thanks for the reply. I tried a different PCIe slot, but got the same result. One more data point: I tailed the logs and dmesg at the same time and discovered some things about the timing of the various log messages. At the point of this log:

frigate.watchdog               INFO    : Detection appears to be stuck. Restarting detection process...

...the only relevant dmesg messages were:

[  284.222972] x86/PAT: frigate.detecto:59726 map pfn RAM range req uncached-minus for [mem 0x10041c000-0x10041ffff], got write-back
[  292.125253] apex 0000:02:00.0: Apex performance not throttled due to temperature
[  297.245143] apex 0000:02:00.0: Apex performance not throttled due to temperature

That "not throttled" message keeps repeating for a while, then the RAM errors don't hit until after this log line:

root                           INFO    : Detection process didnt exit. Force killing...

That line is followed a few seconds later in dmesg by:

[  347.469109] apex 0000:02:00.0: RAM did not enable within timeout (12000 ms)
[  347.476731] apex 0000:02:00.0: Apex performance not throttled due to temperature
[  360.361156] apex 0000:02:00.0: RAM did not enable within timeout (12000 ms)
[  360.368172] apex 0000:02:00.0: Error in device open cb: -110

So maybe the RAM error is a symptom of the detection process being force-killed and might not itself be a hardware issue? Are there any configuration issues that can cause the detection process to stall like that? Things like improper resolution, ffmpeg flags, etc.? This is my first go at things, so my config could definitely be wrong. I'll try to get a simple test case working with a CPU detector first to verify that's not the issue.
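
For anyone repeating this kind of timing comparison, here is a minimal Python sketch that pulls the `apex` lines out of dmesg output and labels them, so they are easier to line up against the Frigate log. The labels (`thermal_poll`, `ram_timeout`, `open_failed`) and the function names are made up for illustration and are not part of Frigate or the driver. The sample text is copied from the dmesg excerpts above. Note that dmesg timestamps are seconds since boot, and that -110 is the Linux errno value for ETIMEDOUT.

```python
import re

# Matches kernel log lines such as:
# [  347.469109] apex 0000:02:00.0: RAM did not enable within timeout (12000 ms)
APEX_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+apex\s+(?P<dev>\S+):\s+(?P<msg>.*)$"
)


def classify(msg):
    """Label an apex driver message by the substrings seen in this thread.

    The label names are hypothetical, chosen only for this example.
    """
    if "RAM did not enable within timeout" in msg:
        return "ram_timeout"
    if "Error in device open" in msg:
        return "open_failed"
    if "not throttled due to temperature" in msg:
        return "thermal_poll"
    return "other"


def apex_events(dmesg_text):
    """Return (seconds_since_boot, pci_address, label) for each apex line."""
    events = []
    for line in dmesg_text.splitlines():
        m = APEX_LINE.match(line.strip())
        if m:
            events.append(
                (float(m.group("ts")), m.group("dev"), classify(m.group("msg")))
            )
    return events


# Sample copied from the dmesg output quoted in this comment.
SAMPLE = """\
[  284.222972] x86/PAT: frigate.detecto:59726 map pfn RAM range req uncached-minus for [mem 0x10041c000-0x10041ffff], got write-back
[  292.125253] apex 0000:02:00.0: Apex performance not throttled due to temperature
[  297.245143] apex 0000:02:00.0: Apex performance not throttled due to temperature
[  347.469109] apex 0000:02:00.0: RAM did not enable within timeout (12000 ms)
[  347.476731] apex 0000:02:00.0: Apex performance not throttled due to temperature
[  360.361156] apex 0000:02:00.0: RAM did not enable within timeout (12000 ms)
[  360.368172] apex 0000:02:00.0: Error in device open cb: -110
"""

if __name__ == "__main__":
    for ts, dev, kind in apex_events(SAMPLE):
        print(f"{ts:10.3f}  {dev}  {kind}")
```

On a live system the same function could be fed the output of `dmesg` instead of `SAMPLE`.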

NickM-27 commented 1 year ago

Like I said I've not seen this with PCIe based coral. There are a handful of cases of this error happening with the USB coral and in those cases it was:

  1. an error with proxmox hardware mapping, some users reported passing the entire USB BUS vs just the single port fixed it
  2. an error with hardware, changing USB ports fixed it
  3. an issue with the USB coral not receiving enough power, a powered USB hub fixed it

So, for a PCIe Coral it's not clear, but I really doubt it has anything to do with ffmpeg or Frigate itself.

kevinmilner commented 1 year ago

OK, thanks for the info. When you say "this error happening", are you referring to the log message "Detection appears to be stuck", the dmesg message "Apex performance not throttled due to temperature", and/or the dmesg message "RAM did not enable within timeout (12000 ms)"? Is "not throttled due to temperature" a common message or an indication of a problem? (I know the other two are problems, but I'm not sure about that one.)

My CPU-only test did fire up just fine, so I think you're right about it being hardware. I'm just trying to understand what's normal and what's not in the logs while I go down the hardware debugging rabbit hole. Step 1 will be a new adapter since that's a lot easier to get than a new TPU.

Really appreciate the time and the quick responses!

NickM-27 commented 1 year ago

> OK, thanks for the info. When you say "this error happening", are you referring to the log message "Detection appears to be stuck", the dmesg message "Apex performance not throttled due to temperature", and/or the dmesg message "RAM did not enable within timeout (12000 ms)"? Is "not throttled due to temperature" a common message or an indication of a problem? (I know the other two are problems, but I'm not sure about that one.)

I am referring to detection being stuck.

kevinmilner commented 1 year ago

Haven't figured this out yet, but I can confirm that it's a Coral issue and not related to Frigate, so I'm closing this issue.

For anyone who finds this, I'll be trying to figure this out over on the coral issue tracker: https://github.com/google-coral/edgetpu/issues/741#event-8939904810

krair commented 1 year ago

FWIW, I ran into the same issue on my machine running the A+E key Coral TPU using a PCIe card "converter". I am running frigate in AlmaLinux 9.2 via rootless Podman 4.4.1. On the host I am running the gasket-dkms module.

Digging around I found this which did solve the issue for me https://github.com/google-coral/edgetpu/issues/345#issuecomment-1656841523

I already had the pcie_aspm=off kernel option configured, but it was removing and rescanning the device that seemed to "reset" the Coral and allowed the container to detect the TPU once again.
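
A rough Python sketch of a remove-and-rescan, assuming the standard Linux sysfs PCI interface (`/sys/bus/pci/devices/<address>/remove` and `/sys/bus/pci/rescan`). This is not necessarily the exact procedure from the linked comment. The function name and the `sysfs_root` parameter are invented for this example, and the address `0000:02:00.0` is the one from the dmesg output earlier in this thread, so yours will likely differ (check `lspci -D`). It needs root, and stopping the Frigate container first is probably wise.

```python
from pathlib import Path


def reset_pci_device(address, sysfs_root="/sys"):
    """Remove a PCI device from the kernel's device tree, then rescan the bus.

    `address` is the full PCI address, e.g. "0000:02:00.0".
    `sysfs_root` is a hypothetical parameter added so the function can be
    pointed at a scratch directory instead of the real /sys.
    """
    root = Path(sysfs_root)
    remove_file = root / "bus" / "pci" / "devices" / address / "remove"
    rescan_file = root / "bus" / "pci" / "rescan"

    if not remove_file.exists():
        raise FileNotFoundError(f"no PCI device at {address}")

    # Writing "1" asks the kernel to detach the device and its driver.
    remove_file.write_text("1")
    # Writing "1" asks the kernel to re-enumerate all PCI buses, which
    # should rediscover the device and rebind its driver.
    rescan_file.write_text("1")
    return remove_file, rescan_file
```

Usage on a real host would be `reset_pci_device("0000:02:00.0")` run as root, followed by restarting the container.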

roycamp commented 1 year ago

I had the same error message and it was a reproducible issue for me related to temperature. I had been using the PCIe TPU reliably with a single core. Once I added the Dual TPU B+M adapter key and started leveraging both cores, temperatures would climb to 100 and then the errors would start.

NickM-27 commented 1 year ago

The driver does an automatic full shutdown at 90 by default, I believe, so that would make sense. I have my dual Corals cooled by a 40mm Noctua fan.
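
To keep an eye on this, a small Python sketch along these lines could poll the temperature. It assumes the driver exposes the reading at `/sys/class/apex/apex_0/temp` as an integer in millidegrees Celsius; verify that path and unit on your own system. The 90 default below is taken from the "I believe" figure in this comment and is assumed to be in Celsius, not a confirmed driver value. The function names and the `margin_c` parameter are hypothetical.

```python
from pathlib import Path


def read_apex_temp_c(temp_file="/sys/class/apex/apex_0/temp"):
    """Return the Edge TPU temperature in degrees Celsius.

    Assumes the file holds an integer in millidegrees Celsius.
    """
    raw = Path(temp_file).read_text().strip()
    return int(raw) / 1000.0


def is_near_shutdown(temp_c, shutdown_c=90.0, margin_c=10.0):
    """True when temp_c is within margin_c of the assumed shutdown point.

    shutdown_c=90.0 is an unverified figure quoted from this thread.
    """
    return temp_c >= shutdown_c - margin_c
```

For example, `is_near_shutdown(read_apex_temp_c())` could be run from a cron job or a simple loop to warn before the errors start.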