[Solved] [Detector Support]: TensorRT failure 0.13.0-beta2-tensorrt

503Dev commented 1 year ago

Describe the problem you are having

Upon switching to 0.13.0-beta2-tensorrt, the tensorrt detector no longer works correctly. I have forced the models to rebuild and ensured they are in the model_cache folder as expected. The engine loads and I can see the processes in nvidia-smi as expected, but the camera streams are unavailable and the log produces errors.

Version

0.13.0-beta2-tensorrt

Frigate config file

mqtt:
  host: 192.168.68.193
  user: user
  password: pass

detectors:
  tensorrt:
    type: tensorrt
    device: 0

model:
  path: /config/model_cache/tensorrt/yolov7-tiny-416.trt
  input_tensor: nchw
  input_pixel_format: rgb
  width: 416
  height: 416

go2rtc:
  streams:
    sala:
      - rtsp://192.168.68.50:555/live/ch1
    entrada:
      - rtsp://192.168.68.51:554/live/ch1
    cocina:
      - rtsp://192.168.68.52:554/live/ch1
    oficina:
      - rtsp://192.168.68.54:554/live/ch1
    servicio:
      - rtsp://192.168.68.55:554/live/ch1
  webrtc:
    candidates:
      - 192.168.68.191:8555
      - stun:8555

ffmpeg:
  hwaccel_args: preset-nvidia-h264
  output_args:
    record: preset-record-generic-audio-copy
birdseye:
  enabled: True
  mode: objects

snapshots:
  enabled: True
  clean_copy: True
  timestamp: False
  retain:
    default: 5

ui:
  live_mode: webrtc
  use_experimental: True
cameras:
  sala: 
    ffmpeg:
      inputs:
        - path: rtsp://127.0.0.1:8554/sala 
          input_args: preset-rtsp-restream
          roles:
            - record
            - detect
            - rtmp
    rtmp:
      enabled: False  
    detect:
      width: 1280  
      height: 720 
      fps: 5
    record: 
      enabled: True
      retain:
        days: 10
        mode: active_objects
    motion:
      mask:
        - 386,151,377,93,441,0,62,0,162,317
        - 436,0,1280,0,1203,290,452,116
    birdseye:
      enabled: True
      mode: motion

...<additional cameras identical in config>

docker-compose file or Docker CLI command

version: "3.9"
services:
  frigate:
    container_name: frigate
    privileged: true 
    restart: unless-stopped
    image: ghcr.io/blakeblackshear/frigate:0.13.0-beta2-tensorrt
    deploy:    
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1 # number of GPUs
              capabilities: [gpu]

    group_add:
      - '106'
    shm_size: "164mb" 
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./config/:/config/
      - /media/zero/NAS1/Public/frigate/:/media/frigate
      - ./trt-models/trt-models/:/trt-models
      - type: tmpfs 
        target: /tmp/cache
        tmpfs:
          size: 1000000000
    ports:
      - "5000:5000"
      - "1936:1935" # RTMP feeds
      - "8555:8555/tcp" # WebRTC over tcp
    environment:
       FRIGATE_RTSP_PASSWORD: "password"
       PLUS_API_KEY: "<apikey>"
       USE_FP16: false
       YOLO_MODELS: yolov7-tiny-416, yolov7x-320

Relevant log output

frigate  | 2023-10-19 15:46:54.109775125  [2023-10-19 15:46:54] frigate.detectors.plugins.tensorrt INFO    : Loaded engine size: 34 MiB
frigate  | 2023-10-19 15:46:54.276330589  [INFO] Starting go2rtc healthcheck service...
frigate  | 2023-10-19 15:46:55.024365914  [2023-10-19 15:46:55] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +193, GPU +74, now: CPU 540, GPU 214 (MiB)
frigate  | 2023-10-19 15:46:55.303936686  [2023-10-19 15:46:55] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuDNN: CPU +110, GPU +44, now: CPU 650, GPU 258 (MiB)
frigate  | 2023-10-19 15:46:55.308282444  [2023-10-19 15:46:55] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +34, now: CPU 0, GPU 34 (MiB)
frigate  | 2023-10-19 15:46:55.318826088  [2023-10-19 15:46:55] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 616, GPU 250 (MiB)
frigate  | 2023-10-19 15:46:55.321386239  [2023-10-19 15:46:55] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 616, GPU 258 (MiB)
frigate  | 2023-10-19 15:46:55.321704055  [2023-10-19 15:46:55] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +13, now: CPU 0, GPU 47 (MiB)
frigate  | 2023-10-19 15:46:55.321714919  [2023-10-19 15:46:55] frigate.detectors.plugins.tensorrt WARNING : CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
frigate  | 2023-10-19 15:47:07.740023846  Process camera_processor:sala:
frigate  | 2023-10-19 15:47:07.740031186  Traceback (most recent call last):
frigate  | 2023-10-19 15:47:07.740034150    File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
frigate  | 2023-10-19 15:47:07.740036428      self.run()
frigate  | 2023-10-19 15:47:07.740039101    File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
frigate  | 2023-10-19 15:47:07.740041546      self._target(*self._args, **self._kwargs)
frigate  | 2023-10-19 15:47:07.740044044    File "/opt/frigate/frigate/video.py", line 501, in track_camera
frigate  | 2023-10-19 15:47:07.740046217      process_frames(
frigate  | 2023-10-19 15:47:07.740048679    File "/opt/frigate/frigate/video.py", line 876, in process_frames
frigate  | 2023-10-19 15:47:07.740078243      detect(
frigate  | 2023-10-19 15:47:07.740081268    File "/opt/frigate/frigate/video.py", line 575, in detect
frigate  | 2023-10-19 15:47:07.740084012      region_detections = object_detector.detect(tensor_input)
frigate  | 2023-10-19 15:47:07.740086494    File "/opt/frigate/frigate/object_detection.py", line 225, in detect
frigate  | 2023-10-19 15:47:07.740125316      (self.labels[int(d[0])], float(d[1]), (d[2], d[3], d[4], d[5]))
frigate  | 2023-10-19 15:47:07.740128036  KeyError: -16
frigate  | 2023-10-19 15:47:07.810600771  Process camera_processor:servicio:
frigate  | 2023-10-19 15:47:07.810609005  Traceback (most recent call last):
frigate  | 2023-10-19 15:47:07.810612031    File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
frigate  | 2023-10-19 15:47:07.810614445      self.run()
frigate  | 2023-10-19 15:47:07.810617249    File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
frigate  | 2023-10-19 15:47:07.810619625      self._target(*self._args, **self._kwargs)
frigate  | 2023-10-19 15:47:07.810622105    File "/opt/frigate/frigate/video.py", line 501, in track_camera
frigate  | 2023-10-19 15:47:07.810624234      process_frames(
frigate  | 2023-10-19 15:47:07.810626773    File "/opt/frigate/frigate/video.py", line 876, in process_frames
frigate  | 2023-10-19 15:47:07.810628798      detect(
frigate  | 2023-10-19 15:47:07.810631300    File "/opt/frigate/frigate/video.py", line 575, in detect
frigate  | 2023-10-19 15:47:07.810633680      region_detections = object_detector.detect(tensor_input)
frigate  | 2023-10-19 15:47:07.810636173    File "/opt/frigate/frigate/object_detection.py", line 225, in detect
frigate  | 2023-10-19 15:47:07.810638602      (self.labels[int(d[0])], float(d[1]), (d[2], d[3], d[4], d[5]))
frigate  | 2023-10-19 15:47:07.810640611  KeyError: -15
frigate  | 2023-10-19 15:47:07.856840496  Process camera_processor:oficina:
frigate  | 2023-10-19 15:47:07.856848530  Traceback (most recent call last):
frigate  | 2023-10-19 15:47:07.856851365    File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
frigate  | 2023-10-19 15:47:07.856853601      self.run()
frigate  | 2023-10-19 15:47:07.856856238    File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
frigate  | 2023-10-19 15:47:07.856858609      self._target(*self._args, **self._kwargs)
frigate  | 2023-10-19 15:47:07.856861043    File "/opt/frigate/frigate/video.py", line 501, in track_camera
frigate  | 2023-10-19 15:47:07.856863227      process_frames(
frigate  | 2023-10-19 15:47:07.856865785    File "/opt/frigate/frigate/video.py", line 876, in process_frames
frigate  | 2023-10-19 15:47:07.856867772      detect(
frigate  | 2023-10-19 15:47:07.856870267    File "/opt/frigate/frigate/video.py", line 575, in detect
frigate  | 2023-10-19 15:47:07.856872718      region_detections = object_detector.detect(tensor_input)
frigate  | 2023-10-19 15:47:07.856875303    File "/opt/frigate/frigate/object_detection.py", line 225, in detect
frigate  | 2023-10-19 15:47:07.856877745      (self.labels[int(d[0])], float(d[1]), (d[2], d[3], d[4], d[5]))
frigate  | 2023-10-19 15:47:07.856879689  KeyError: -11
frigate  | 2023-10-19 15:47:08.231833430  Process camera_processor:principal:
frigate  | 2023-10-19 15:47:08.233653540  Traceback (most recent call last):
frigate  | 2023-10-19 15:47:08.234002313    File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
frigate  | 2023-10-19 15:47:08.234007040      self.run()
frigate  | 2023-10-19 15:47:08.234090934    File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
frigate  | 2023-10-19 15:47:08.234094637      self._target(*self._args, **self._kwargs)
frigate  | 2023-10-19 15:47:08.234188914    File "/opt/frigate/frigate/video.py", line 501, in track_camera
frigate  | 2023-10-19 15:47:08.234192467      process_frames(
frigate  | 2023-10-19 15:47:08.234275971    File "/opt/frigate/frigate/video.py", line 876, in process_frames
frigate  | 2023-10-19 15:47:08.234279148      detect(
frigate  | 2023-10-19 15:47:08.234357523    File "/opt/frigate/frigate/video.py", line 575, in detect
frigate  | 2023-10-19 15:47:08.234360885      region_detections = object_detector.detect(tensor_input)
frigate  | 2023-10-19 15:47:08.234442091    File "/opt/frigate/frigate/object_detection.py", line 225, in detect
frigate  | 2023-10-19 15:47:08.234445820      (self.labels[int(d[0])], float(d[1]), (d[2], d[3], d[4], d[5]))

Operating system

Debian

Install method

Docker Compose

Coral version

Other

Any other information that may be helpful

TensorRT via PCI-E Nvidia GPU. Configuration works fine with 0.12 using TensorRT. I tested with CPU detectors on 0.13.0-beta2 to validate, and it functions as expected.

NickM-27 commented 1 year ago

it's having an issue with the labelmap, not sure why as other users have not seen this issue
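
For readers following along, here is a minimal sketch of how this kind of KeyError can arise. It is not Frigate's actual code: it assumes the labelmap is a plain dict keyed by integer class ID (the traceback only shows `self.labels[int(d[0])]` failing), and the `LABELS` contents and the `to_detections` helper are hypothetical names made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical labelmap, assumed to be a dict keyed by integer class ID.
LABELS: Dict[int, str] = {0: "person", 1: "bicycle", 2: "car"}

Detection = Tuple[str, float, Tuple[float, float, float, float]]


def to_detections(raw: List[List[float]], labels: Dict[int, str]) -> List[Detection]:
    """Turn raw rows [class_id, score, four box values] into labelled tuples.

    Rows whose class ID is missing from the labelmap are skipped instead of
    raising KeyError.
    """
    out: List[Detection] = []
    for d in raw:
        class_id = int(d[0])
        if class_id not in labels:
            # A plain labels[class_id] lookup would raise KeyError here,
            # which is the failure shown in the traceback.
            continue
        out.append((labels[class_id], float(d[1]), (d[2], d[3], d[4], d[5])))
    return out


raw_rows = [
    [0.0, 0.91, 0.1, 0.2, 0.5, 0.6],    # valid class ID
    [-16.0, 0.30, 0.0, 0.0, 0.0, 0.0],  # negative class ID, like the one in the log
]
detections = to_detections(raw_rows, LABELS)
print(detections)
```

Skipping such rows would only hide the symptom. A negative class ID suggests the detector output itself is invalid, so a check like this is mainly useful for confirming where the bad values come from.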

503Dev commented 1 year ago

> it's having an issue with the labelmap, not sure why as other users have not seen this issue

Thank you. I suspected this, since I have worked with custom models in the past, defined my own labelmaps, and recognized the error. What confuses me is that these are the stock models, not custom ones, so I have no custom labelmap to define and I am not sure how to proceed or fix this.

I attempted to grab the labelmap.txt from the repo, drop it in my model_cache folder, and specify it via the config, but the same error happens.

NickM-27 commented 1 year ago

I don't know why it would happen unless maybe something went wrong while the model was generated, like I said others have not seen this issue. Does running the 320 model work for you?

503Dev commented 1 year ago

> I don't know why it would happen unless maybe something went wrong while the model was generated, like I said others have not seen this issue. Does running the 320 model work for you?

It does not. I am going to back up and then purge all of it and do a fresh setup from scratch. I am coming from a very weathered 0.12 install with extensive previous use of custom models, etc. Maybe something in the database or in my files is irregular. I will attempt a full fresh run and update shortly.

NickM-27 commented 1 year ago

it won't be the database file itself but yeah could be something weird

503Dev commented 1 year ago

> it won't be the database file itself but yeah could be something weird

Well, I am perplexed. Just wiped everything and did a fresh pull and started with a minimum viable config. Everything starts smooth, detect process launches, the tensorrt engine loads ok and then same exact thing:

frigate  | 2023-10-19 18:17:35.384802509  Process camera_processor:sala:
frigate  | 2023-10-19 18:17:35.386412649  Traceback (most recent call last):
frigate  | 2023-10-19 18:17:35.386459460    File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
frigate  | 2023-10-19 18:17:35.386462508      self.run()
frigate  | 2023-10-19 18:17:35.386465280    File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
frigate  | 2023-10-19 18:17:35.386467735      self._target(*self._args, **self._kwargs)
frigate  | 2023-10-19 18:17:35.386482408    File "/opt/frigate/frigate/video.py", line 506, in track_camera
frigate  | 2023-10-19 18:17:35.386484884      process_frames(
frigate  | 2023-10-19 18:17:35.386487348    File "/opt/frigate/frigate/video.py", line 881, in process_frames
frigate  | 2023-10-19 18:17:35.386489451      detect(
frigate  | 2023-10-19 18:17:35.386491886    File "/opt/frigate/frigate/video.py", line 580, in detect
frigate  | 2023-10-19 18:17:35.386494419      region_detections = object_detector.detect(tensor_input)
frigate  | 2023-10-19 18:17:35.386496948    File "/opt/frigate/frigate/object_detection.py", line 225, in detect
frigate  | 2023-10-19 18:17:35.386499442      (self.labels[int(d[0])], float(d[1]), (d[2], d[3], d[4], d[5]))
frigate  | 2023-10-19 18:17:35.386511808  KeyError: -14

In the Frigate web UI the image is blank (no frames received), but on the system page I can see the inference speed updating properly and the detector process showing activity and FPS. Very unusual. I am stumped as to what is going on; I followed the 0.13-beta2 docs and started fresh with all minimal configs.

I tested both 0.13-beta1 and 0.13-beta2 and the result was the same.

NickM-27 commented 1 year ago

Maybe @NateMeyer has a better idea.

What driver version are you running?

503Dev commented 1 year ago

> Maybe @NateMeyer has a better idea.
>
> What driver version are you running?

NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0

NateMeyer commented 1 year ago

0.13 updated the TensorRT library and now needs driver 530 or later.
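
As a rough illustration of that requirement, the sketch below parses an `nvidia-smi` header line like the one posted above and compares the major driver version against 530. The function names `driver_major` and `driver_is_new_enough` are hypothetical, and the header format is assumed to match the line quoted in this thread.

```python
import re
from typing import Optional

# Minimum major driver version, taken from the comment above.
MIN_DRIVER_MAJOR = 530


def driver_major(smi_header: str) -> Optional[int]:
    """Extract the major driver version from an nvidia-smi header line."""
    match = re.search(r"Driver Version:\s*(\d+)", smi_header)
    return int(match.group(1)) if match else None


def driver_is_new_enough(smi_header: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """Return True only if a driver version was found and meets the minimum."""
    major = driver_major(smi_header)
    return major is not None and major >= minimum


# Header line as reported earlier in this thread.
header = "NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0"
print(driver_major(header))          # 525
print(driver_is_new_enough(header))  # False
```

Running a check like this before starting the container would have flagged the 525 driver as too old for the 0.13 TensorRT image.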

503Dev commented 1 year ago

@NateMeyer Thank you. This can be closed as absolute user error. I actually read that and assumed mine were up to date because I updated everything on my fresh Debian 12 install. Of course, that's a dumb assumption: my drivers were up to date with the repo, but the repo drivers for Nvidia are not 530+ because the newer ones aren't flagged as stable.

Anyway, I had to purge those + Nouveau and then manually use the .run installer from the Nvidia site. Rebooted and tested, works like a charm. Performance is significantly better on 0.13.0-beta2 vs 0.12.x.
