[Detector Support]: Fatal Python error: Segmentation fault

usafle commented 7 months ago

Describe the problem you are having

Launching v13 with NVIDIA branch causes a bootloop with the above error and no other explanation. I was told in a different support ticket that my NVIDIA driver version was too new

I have since downgraded to Driver v535.129.03 which is supposedly stable according to the last ticket I opened (https://github.com/blakeblackshear/frigate/issues/9575)

The error still is present.

Version

v13

Frigate config file

mqtt:
  enabled: true
  host: 192.168.1.102
  user: frigate
  password: PASSWORD

# detectors:
#  cpu1:
#    type: cpu
#    num_threads: 2

# birdseye:
#   enabled: True
#   restream: false
#   mode: continuous
#   width: 1280
#   height: 720
#   quality: 8

go2rtc:
  streams:
    Rear_Deck:
      - rtsp://admin:PASSWORD@192.168.1.114:554/h264Preview_01_main
    Rear_Deck_sub:
      - rtsp://admin:PASWWROD@192.168.1.114:554/h264Preview_01_sub
    Garage_Camera:
      - rtsp://admin:PASSWORD@192.168.1.215:554/cam/realmonitor?channel=1&subtype=0
    Garage_Camera_sub:
     - rtsp://admin:PASSWORD@192.168.1.215:554/cam/realmonitor?channel=1&subtype=1

ffmpeg:
  hwaccel_args: preset-nvidia-h265

rtmp:
  enabled: False 

cameras:
############## REAR DECK ##################
  Rear_Deck:
    ffmpeg:
      inputs:
        - path: rtsp://127.0.0.1:8554/Rear_Deck_sub
          input_args: preset-rtsp-restream
          roles:
            - detect
        - path: rtsp://127.0.0.1:8554/Rear_Deck
          input_args: preset-rtsp-restream
          roles:
            - record
      output_args:
        record: -f segment -segment_time 10 -segment_format mp4 -reset_timestamps 1 -strftime 1 -c:v copy -c:a aac
    objects:
      track:
        - person
        - dog
        - bird
        - cat
    detect:
      width: 1280
      height: 720
      fps: 4
    record:
      enabled: True
      events:
        retain:
          default: 2
    snapshots:
      enabled: True

  Garage_Camera:
    ffmpeg:
      inputs:
        - path: rtsp://127.0.0.1:8554/Garage_Camera_sub
          input_args: preset-rtsp-restream
          roles:
            - detect
        - path: rtsp://127.0.0.1:8554/Garage_Camera
          input_args: preset-rtsp-restream
          roles:
            - record
      output_args:
        record: -f segment -segment_time 10 -segment_format mp4 -reset_timestamps 1 -strftime 1 -c:v copy -c:a aac
        # record: preset-record-generic-audio-aac
    objects:
      track:
        - person
        - dog
        - cat
        - car
        - package
    detect:
      width: 1280
      height: 720
      fps: 4
    record:
      enabled: True
      events:
        retain:
          default: 2
    snapshots:
      enabled: True

docker-compose file or Docker CLI command

docker run
  -d
  --name='frigate'
  --net='bridge'
  -e TZ="America/New_York"
  -e HOST_OS="Unraid"
  -e HOST_HOSTNAME="CozsNAS"
  -e HOST_CONTAINERNAME="frigate"
  -e 'FRIGATE_RTSP_PASSWORD'='********!'
  -e 'PLUS_API_KEY'='********'
  -e 'NVIDIA_VISIBLE_DEVICES'='GPU-53a3b891-6d7b-8fe8-bd57-9467c8797875'
  -e 'NVIDIA_DRIVER_CAPABILITIES'='compute,utility,video'
  -e 'YOLO_MODELS'='yolov4-416,yolov4-tiny-416'
  -e 'USE_FP16'='false'
  -e 'TRT_MODEL_PREP_DEVICE'='0'
  -l net.unraid.docker.managed=dockerman
  -l net.unraid.docker.webui='http://[IP]:[PORT:5000]'
  -l net.unraid.docker.icon='https://raw.githubusercontent.com/yayitazale/unraid-templates/main/frigate.png'
  -p '5000:5000/tcp'
  -p '8554:8554/tcp'
  -p '8555:8555/tcp'
  -p '8555:8555/udp'
  -p '1984:1984/tcp'
  -v '/mnt/user/appdata/frigate':'/config':'rw'
  -v '/mnt/user/Frigate Recordings/':'/media/frigate':'rw'
  -v '/etc/localtime':'/etc/localtime':'rw'
  --shm-size=256mb
  --mount type=tmpfs,target=/tmp/cache,tmpfs-size=1000000000
  --restart unless-stopped
  --gpus=all 'ghcr.io/blakeblackshear/frigate:stable-tensorrt'

d4d83afad657559c53468c5ebc065e6caf904150a1c06d31719cfa84768c6afa

Relevant log output

2024-02-11 11:51:47.953981276  Fatal Python error: Segmentation fault
2024-02-11 11:51:47.953989283  
2024-02-11 11:51:47.953991367  Thread 0x00001506c59ee6c0 (most recent call first):
2024-02-11 11:51:47.954045897    File "/usr/lib/python3.9/threading.py", line 312 in wait
2024-02-11 11:51:47.954133590    File "/usr/lib/python3.9/multiprocessing/queues.py", line 233 in _feed
2024-02-11 11:51:47.954194783    File "/usr/lib/python3.9/threading.py", line 892 in run
2024-02-11 11:51:47.954276894    File "/usr/lib/python3.9/threading.py", line 954 in _bootstrap_inner
2024-02-11 11:51:47.954332450    File "/usr/lib/python3.9/threading.py", line 912 in _bootstrap
2024-02-11 11:51:47.954343287  
2024-02-11 11:51:47.954345312  Current thread 0x00001506ea62f740 (most recent call first):
2024-02-11 11:51:47.954442917    File "/opt/frigate/frigate/detectors/plugins/tensorrt.py", line 168 in <listcomp>
2024-02-11 11:51:47.954540476    File "/opt/frigate/frigate/detectors/plugins/tensorrt.py", line 167 in _do_inference
2024-02-11 11:51:47.954632394    File "/opt/frigate/frigate/detectors/plugins/tensorrt.py", line 286 in detect_raw
2024-02-11 11:51:47.954722466    File "/opt/frigate/frigate/object_detection.py", line 75 in detect_raw
2024-02-11 11:51:47.954846892    File "/opt/frigate/frigate/object_detection.py", line 125 in run_detector
2024-02-11 11:51:47.954996257    File "/usr/lib/python3.9/multiprocessing/process.py", line 108 in run
2024-02-11 11:51:47.955127573    File "/usr/lib/python3.9/multiprocessing/process.py", line 315 in _bootstrap
2024-02-11 11:51:47.955272691    File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 71 in _launch
2024-02-11 11:51:47.955386735    File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 19 in __init__
2024-02-11 11:51:47.955491455    File "/usr/lib/python3.9/multiprocessing/context.py", line 277 in _Popen
2024-02-11 11:51:47.955586815    File "/usr/lib/python3.9/multiprocessing/context.py", line 224 in _Popen
2024-02-11 11:51:47.955687173    File "/usr/lib/python3.9/multiprocessing/process.py", line 121 in start
2024-02-11 11:51:47.955783568    File "/opt/frigate/frigate/object_detection.py", line 183 in start_or_restart
2024-02-11 11:51:47.955869434    File "/opt/frigate/frigate/object_detection.py", line 151 in __init__
2024-02-11 11:51:47.955992198    File "/opt/frigate/frigate/app.py", line 453 in start_detectors
2024-02-11 11:51:47.956082578    File "/opt/frigate/frigate/app.py", line 683 in start
2024-02-11 11:51:47.956155258    File "/opt/frigate/frigate/__main__.py", line 17 in <module>
2024-02-11 11:51:47.956251165    File "/usr/lib/python3.9/runpy.py", line 87 in _run_code
2024-02-11 11:51:47.956342911    File "/usr/lib/python3.9/runpy.py", line 197 in _run_module_as_main
2024-02-11 11:51:49.659338836  [INFO] Starting go2rtc healthcheck service...
2024-02-11 11:52:02.661642001  [2024-02-11 11:52:02] frigate.watchdog               INFO    : Detection appears to be stuck. Restarting detection process...
2024-02-11 11:52:02.685462534  [2024-02-11 11:52:02] detector.tensorrt              INFO    : Starting detection process: 1255
2024-02-11 11:52:02.984562929  [2024-02-11 11:52:02] frigate.detectors.plugins.tensorrt INFO    : Loaded engine size: 39 MiB
2024-02-11 11:52:03.343516305  [2024-02-11 11:52:03] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +8, now: CPU 158, GPU 230 (MiB)
2024-02-11 11:52:03.356055956  [2024-02-11 11:52:03] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 160, GPU 240 (MiB)
2024-02-11 11:52:03.360742717  [2024-02-11 11:52:03] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +40, now: CPU 0, GPU 40 (MiB)
2024-02-11 11:52:03.369792099  [2024-02-11 11:52:03] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 120, GPU 232 (MiB)
2024-02-11 11:52:03.370187842  [2024-02-11 11:52:03] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 120, GPU 240 (MiB)
2024-02-11 11:52:03.370321313  [2024-02-11 11:52:03] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +13, now: CPU 0, GPU 53 (MiB)

Operating system

UNRAID

Install method

Docker Compose

Coral version

CPU (no coral)

Any other information that may be helpful

No response

NickM-27 commented 7 months ago

what GPU do you have again?

usafle commented 7 months ago

Edited Post to remove the Frigtate Plus API key and RTSP password that was visible in the Docker CLI command.

usafle commented 7 months ago

No one has ANY suggestions (besides buying a CORAL device) on how to get my Frigtate instance up and running??

NickM-27 commented 7 months ago

seg faults are difficult because it is usually something related to the host or the hardware and there is no info about what is going wrong.

From your previous post logs we can see that as soon as the model is initialized there is a seg fault indicating some failure to communicate correctly. Many users use this type of setup on unraid so it seems there is nothing particular about that. You could try a memtest and see if perhaps system memory is failing.

usafle commented 7 months ago

memtest complete. 0 errors.

Next suggestion please?

hvardhan20 commented 6 months ago

I have the same error. Need help I have a eufy cam2 pro which only sends a stream when a motion is detected. I suspect this could be a potential cause. Any thoughts?

jdgiddings commented 5 months ago

I'm experiencing the exact same error on TrueNAS Scale w/ GTX 1060

usafle commented 5 months ago

@hvardhan20 and @jdgiddings - I hope you both get a response but, if my past experience holds true, it doesn't look good. CPU detection worked fine. GPU detection worked fine...... until they bundled it all into one container.
Hard to fix issues when you don't have any support from anyone here.

NickM-27 commented 5 months ago

There are many tensorrt users so this seems to be a very isolated problem. Like I said before, seg faults are difficult to debug and without being able to reproduce there really isn't any good way to move towards solving the problem because it is not clear what is causing this other than something on the host.

The logic to compile the models is the same as before just done automatically, that is unlikely to be causing this. It could be due to using newer libraries / tensorrt version but that was done to support the latest Nvidia GPUs and also unrelated to frigate building the models automatically.

jdgiddings commented 5 months ago

here's the output from nvidia-smi on the host. I believe these are all supported versions

NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2

I'm experimenting with different models right now to see if any do not cause the error. I will report back

jdgiddings commented 5 months ago

yolov7-320 does not throw the segfault

NickM-27 commented 5 months ago

which model did you use that did?

jdgiddings commented 5 months ago

yolov7x-640 and yolov7x-320 were both throwing the error on my machine

jdgiddings commented 5 months ago

I did some more testing. Any model larger than yolov7-320 throws the same segfault error

jpreston84 commented 5 months ago

I just wanted to add another voice here -- I am able to run yolov7x-320, but if I attempt to run yolov7x-640, I get a segfault (the same as the OP). I'm on a GTX 1650 Super. My setup is a bit odd:

My system is running TrueNAS SCALE.
Inside TrueNAS, I have set up a VM (because the Docker implementation of SCALE sucks).
The VM is running Ubuntu 22.04.
I'm also running CasaOS (but I don't think that matters, because I loaded Frigate via docker compose CLI).
My test camera is streaming via RTSP through go2rtc, with WebRTC and MPE working correctly.

Let me know if I can do anything to help debug this.

[edit] I previously said my 1650 is an LHR. This is incorrect. My 3060 is LHR, and I confused the two.

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

SlavikCA commented 3 months ago

I have same issue with nVidia Quadro P5000 16GB RAM

Driver Version: 555.42.02
CUDA Version: 12.5 OS: Ubuntu 22

evelant commented 3 months ago

My first time trying to use frigate and I'm running into this. nvidia 1070ti on unraid. The yolov4-416.trt model was in the container by default and it segfaults if I try to use it.

evelant commented 3 months ago

I tried the 320 model and that seems to work. There appears to be a common thread here where nvidia+unraid+model larger than 320 fails.

NickM-27 commented 3 months ago

It's more likely to be something related to driver and the GPU that is used. Unraid and 3050 works on any model for me

usafle commented 3 months ago

Although I've given up on this, I am still actively watching this thread. It's still the same reply / response from the only Collaborator that decides to look at this thread. I am sure we all appriecieate the time you are taking to reply NickM-27. However, your answer doesn't really hold water at this point. Multiple different GPUs, multiple different Driver versions, even one individual was here on a different OS than UnRaid.

Fromnboobto commented 3 months ago

For me i had issues with segmentation issues.

What I found out and had fogotten about that if i have the yolo7x-640.trt I had to have teh With and height variables under model at the same IE: Model: path: /yourpathhere Input_tensor: nchw input_pixel_format: rgb width: 640 height: 640

for me the width and height had to match the model size and once i did that i didnt have segmentation problems anymore.

Also, have you created the models under /config/model_cache/tensorrt/? I couldnt see the that in ur docker launch that it had anything about building the models, i might be wrong and blind if so : nothing to see here.

SlavikCA commented 3 months ago

Thank you! Got yolo7x-640 working without segmentation fault now on my nVidia P5000.

docker-compose:

services:
  frigate:
    container_name: frigate
    privileged: true 
    restart: unless-stopped
    image: ghcr.io/blakeblackshear/frigate:0.14.0-beta2-tensorrt
    shm_size: "256mb"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./config:/config
      - /mnt/hdd/frigate:/media/frigate
      - type: tmpfs
        target: /tmp/cache
        tmpfs:
          size: 1000000000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
      - "5000:5000"
      - "8554:8554" # RTSP feeds
      - "8555:8555/tcp" # WebRTC over tcp
      - "8555:8555/udp" # WebRTC over udp
    environment:
      FRIGATE_RTSP_PASSWORD: "password"
      YOLO_MODELS: yolov7x-640
      USE_FP16: false

config.yaml:

ffmpeg:
  hwaccel_args: preset-nvidia-h264

detectors:
  tensorrt:
    type: tensorrt
model:
  path: /config/model_cache/tensorrt/yolov7x-640.trt
  input_tensor: nchw
  input_pixel_format: rgb
  width: 640
  height: 640
...

I see 1.6GB of VRAM used.

slavik@ub22gpu:~$ nvidia-smi
Thu Jun  6 04:02:27 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro P5000                   Off |   00000000:0B:00.0 Off |                  Off |
| 38%   61C    P0            157W /  180W |    1525MiB /  16384MiB |     19%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      8767      C   frigate.detector.tensorrt                     678MiB |
|    0   N/A  N/A      8849      C   ffmpeg                                        278MiB |
|    0   N/A  N/A      8850      C   ffmpeg                                        127MiB |
|    0   N/A  N/A      8857      C   ffmpeg                                        133MiB |
|    0   N/A  N/A      8875      C   ffmpeg                                        147MiB |
|    0   N/A  N/A      8876      C   ffmpeg                                        158MiB |
+-----------------------------------------------------------------------------------------+

P.S. After running it for about a day, I see it crashing every couple of hours:

[2024-06-07 19:53:03] detector.tensorrt              INFO    : Exited detection process...
[2024-06-07 19:53:03] detector.tensorrt              INFO    : Starting detection process: 507144
[2024-06-07 19:53:07] frigate.detectors.plugins.tensorrt INFO    : Loaded engine size: 392 MiB
[2024-06-07 19:53:08] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +8, now: CPU 758, GPU 1354 (MiB)
[2024-06-07 19:53:08] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 760, GPU 1364 (MiB)
[2024-06-07 19:53:08] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +394, now: CPU 0, GPU 394 (MiB)
[2024-06-07 19:53:08] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 368, GPU 1360 (MiB)
[2024-06-07 19:53:08] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 368, GPU 1368 (MiB)
[2024-06-07 19:53:08] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +145, now: CPU 0, GPU 539 (MiB)
[2024-06-07 19:57:38] detector.tensorrt              INFO    : Signal to exit detection process...
[2024-06-07 19:57:39] detector.tensorrt              INFO    : Exited detection process...
[2024-06-07 19:57:39] detector.tensorrt              INFO    : Starting detection process: 509270
[2024-06-07 19:57:43] frigate.detectors.plugins.tensorrt INFO    : Loaded engine size: 392 MiB
[2024-06-07 19:57:44] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +8, now: CPU 753, GPU 1354 (MiB)
[2024-06-07 19:57:44] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 755, GPU 1364 (MiB)
[2024-06-07 19:57:44] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +394, now: CPU 0, GPU 394 (MiB)
[2024-06-07 19:57:44] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 363, GPU 1360 (MiB)
[2024-06-07 19:57:44] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 363, GPU 1368 (MiB)
[2024-06-07 19:57:44] frigate.detectors.plugins.tensorrt INFO    : [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +145, now: CPU 0, GPU 539 (MiB)
[2024-06-07 19:58:57] detector.tensorrt              INFO    : Signal to exit detection process...
[2024-06-07 19:58:57] detector.tensorrt              INFO    : Exited detection process...

But may be that was because I was simultaneously running other processes on the same card

blakeblackshear / frigate