Closed: hamishfagg closed this issue 7 months ago
We've not seen any other reports of this; the only way this would happen is if the model were returning incorrect or unexpected values. Can you try regenerating the model, or perhaps using a different one, and see if it still occurs?
Hi,
I've tried re-generating the yolov4-tiny-288, yolov4-tiny-416, yolov7-tiny-288, and yolov7-tiny-416 models several times and haven't had any success. I've also tried using USE_FP16=false, but that gives me KeyErrors:
2024-02-08 20:29:18.559414197 Process camera_processor:back_yard:
2024-02-08 20:29:18.559939998 Traceback (most recent call last):
2024-02-08 20:29:18.559953569 File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
2024-02-08 20:29:18.559954229 self.run()
2024-02-08 20:29:18.559954829 File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
2024-02-08 20:29:18.559955249 self._target(*self._args, **self._kwargs)
2024-02-08 20:29:18.559958519 File "/opt/frigate/frigate/video.py", line 436, in track_camera
2024-02-08 20:29:18.559958959 process_frames(
2024-02-08 20:29:18.559959369 File "/opt/frigate/frigate/video.py", line 689, in process_frames
2024-02-08 20:29:18.559959729 detect(
2024-02-08 20:29:18.559961909 File "/opt/frigate/frigate/video.py", line 474, in detect
2024-02-08 20:29:18.559962339 region_detections = object_detector.detect(tensor_input)
2024-02-08 20:29:18.559962799 File "/opt/frigate/frigate/object_detection.py", line 225, in detect
2024-02-08 20:29:18.559963219 (self.labels[int(d[0])], float(d[1]), (d[2], d[3], d[4], d[5]))
2024-02-08 20:29:18.559971419 KeyError: -11
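For context on where the KeyError comes from: the last frame of the traceback shows Frigate looking up the class id returned by the engine in its label map, so a nonsense class id like -11 simply has no entry. A minimal Python sketch of the failing pattern, with made-up label names rather than Frigate's actual variables:

labels = {0: "person", 1: "bicycle", 2: "car"}  # label map: class id -> name

# Each detection d is laid out as (class_id, score, then four box coordinates).
d_good = (0.0, 0.87, 0.1, 0.2, 0.6, 0.7)
print(labels[int(d_good[0])])            # "person"

# A misbehaving engine can emit garbage values, e.g. a class id of -11:
d_bad = (-11.0, 0.0, 0.0, 0.0, 0.0, 0.0)
try:
    labels[int(d_bad[0])]
except KeyError as err:
    print("KeyError:", err)              # KeyError: -11, matching the traceback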
Can you show the output of nvidia-smi?
Sure thing, here it is:
I seem to have the same issue as #8329: I get KeyErrors if I use USE_FP16=false, which apparently I need to use for my card.
There is one person with an apparent solution at the end of that issue, but I don't know enough about TensorRT to recreate what they did without more info.
I too am having the same issue. I have a Quadro K1200 and am also using USE_FP16=false.
So, I believe I followed all the update requirements correctly. I updated from 0.12 to 0.13, read all the breaking changes, and prepared for them. I run Frigate on a Proxmox host in an LXC with Docker, dedicated to it.
The docker-compose file was updated correctly, including the environment vars for USE_FP16 and YOLO_MODELS.
Generated a couple of models, no issues there.
The first run migrated the DB and created the models.
The cameras appeared to be working, but then the live feed stopped and the logs look like the ones in @hamishfagg's post above.
I tried different combinations, and the thing that gets me the furthest is using the yolov4-tiny-416 model instead of the yolov7-tiny-416 that I was using in 0.12. However, with this one it appears it doesn't detect anything, although the debug view and bounding boxes appear. When using this model I get the "Received nan values from distance function" warning, so my guess is that something in the model generation is failing.
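As an illustration of that warning (a toy example, not Frigate's actual tracker code): the tracker compares new detections against existing tracked objects with a distance function, and if a broken engine hands it boxes containing NaN coordinates, every distance it computes comes out as NaN, which is presumably what triggers the message:

import math

def centroid_distance(box_a, box_b):
    # Toy distance between the centers of two (x_min, y_min, x_max, y_max) boxes.
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return math.hypot(ax - bx, ay - by)

tracked = (100, 120, 180, 260)
detection_ok = (102, 118, 184, 262)
detection_bad = (float("nan"), 118, 184, 262)  # what a broken engine can produce

print(centroid_distance(detection_ok, tracked))   # small, finite number
print(centroid_distance(detection_bad, tracked))  # nan -> "nan values from distance function"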
I also tried using the 0.12 tensorRT.sh to create the models, but with the 23.03 image, and still no luck :(
I'm kind of lost here...
Any help @NickM-27 ??
Thanks
I don't know why this would be happening. There were no reports during the beta / RC and it is not clear why something would be doing this unless the model was returning incorrect coordinates. Seeing your config would be a good first step. Maybe @NateMeyer has an idea
Sure, here's the config:
logger:
  # Optional: default log level (default: shown below)
  default: info

mqtt:
  enabled: true
  host: homeassistant
  user: mqtt
  password: REDACTED

birdseye:
  enabled: True
  mode: continuous
  restream: True
  quality: 15

detectors:
  tensorrt:
    type: tensorrt
    device: 0 # This is the default, select the first GPU

model:
  path: /config/model_cache/tensorrt/yolov7-tiny-416.trt
  input_tensor: nchw
  input_pixel_format: rgb
  width: 416
  height: 416

ffmpeg:
  hwaccel_args: preset-nvidia-h264
  output_args:
    record: preset-record-generic-audio-aac
  input_args: preset-rtsp-restream

go2rtc:
  streams:
    adelaidecam:
      - rtsp://REDACTED@192.168.0.122:554/stream1 # <- stream which supports video & aac audio
      - "ffmpeg:adelaidecam#audio=aac" # <- copy of the stream which transcodes audio to the missing codec (usually will be opus)
    adelaidecam_sub:
      - rtsp://REDACTED@192.168.0.122:554/stream2
  webrtc:
    candidates:
      - 192.168.0.153:8555
      - stun:8555

cameras:
  adelaidecam:
    ffmpeg:
      inputs:
        - path: rtsp://127.0.0.1:8554/adelaidecam
          roles:
            - record
        - path: rtsp://127.0.0.1:8554/adelaidecam_sub
          roles:
            - detect
    objects:
      track:
        - person
    snapshots:
      enabled: True
      clean_copy: True
      timestamp: false
      bounding_box: True
      crop: False
    record:
      enabled: True
      retain:
        days: 3
        mode: motion
      events:
        retain:
          default: 30
          mode: motion
I have 5 cameras, but at the moment I'm testing version 0.13 on an LXC cloned from the 0.12 one, using only 1 camera and minimal settings to focus on the current issue.
This is the docker-compose:
version: "3.9"
services:
frigate:
container_name: frigate
privileged: true
restart: unless-stopped
image: ghcr.io/blakeblackshear/frigate:stable-tensorrt
runtime: nvidia
deploy: # <------------- Add this section
resources:
reservations:
devices:
- driver: nvidia
device_ids: ['0'] # this is only needed when using multiple GPUs
capabilities: [gpu]
shm_size: "256mb"
volumes:
- /etc/localtime:/etc/localtime:ro
- /frigate/config/:/config/
- /frigateData:/media/frigate
- type: tmpfs # 1GB of memory
target: /tmp/cache
tmpfs:
size: 1000000000
ports:
- "5000:5000" # Port used by the Web UI
- "8554:8554" # RTSP feeds
- "8555:8555/tcp" # WebRTC over tcp
- "8555:8555/udp" # WebRTC over udp
- "1984:1984"
environment:
FRIGATE_RTSP_PASSWORD: "useyourownpassword!"
USE_FP16: False
YOLO_MODELS: yolov7x-320,yolov7-320,yolov7-tiny-416,yolov7-tiny-288,yolov4-tiny-416,yolov4-tiny-288
nvidia-smi on the LXC:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02 Driver Version: 535.146.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro K1200 Off | 00000000:01:00.0 Off | N/A |
| 61% 75C P0 2W / 35W | 137MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
nvidia-smi on host:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02 Driver Version: 535.146.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Quadro K1200 On | 00000000:01:00.0 Off | N/A |
| 62% 75C P0 2W / 35W | 137MiB / 4096MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 990023 C frigate.detector.tensorrt 94MiB |
| 0 N/A N/A 990043 C ffmpeg 38MiB |
+---------------------------------------------------------------------------------------+
The log with the error, when using yolov7-tiny-416.trt:
2024-02-13 08:51:19.863620690 [INFO] Preparing Frigate...
2024-02-13 08:51:19.893291142 [INFO] Starting Frigate...
2024-02-13 08:51:23.825680984 [2024-02-13 08:51:23] frigate.app INFO : Starting Frigate (0.13.1-34fb1c2)
2024-02-13 08:51:26.298183165 [2024-02-13 08:51:26] peewee_migrate.logs INFO : Starting migrations
2024-02-13 08:51:26.328795915 [2024-02-13 08:51:26] peewee_migrate.logs INFO : There is nothing to migrate
2024-02-13 08:51:26.338584236 [2024-02-13 08:51:26] frigate.app INFO : Recording process started: 442
2024-02-13 08:51:26.341439016 [2024-02-13 08:51:26] frigate.app INFO : go2rtc process pid: 109
2024-02-13 08:51:26.370399689 [2024-02-13 08:51:26] frigate.app INFO : Output process started: 453
2024-02-13 08:51:26.400168785 [2024-02-13 08:51:26] frigate.app INFO : Camera processor started for adelaidecam: 460
2024-02-13 08:51:26.400171835 [2024-02-13 08:51:26] frigate.app INFO : Capture process started for adelaidecam: 462
2024-02-13 08:51:26.436859532 [2024-02-13 08:51:26] detector.tensorrt INFO : Starting detection process: 452
2024-02-13 08:51:26.459985714 [2024-02-13 08:51:26] frigate.detectors.plugins.tensorrt INFO : Loaded engine size: 34 MiB
2024-02-13 08:51:26.575882563 [2024-02-13 08:51:26] frigate.detectors.plugins.tensorrt INFO : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +8, now: CPU 150, GPU 72 (MiB)
2024-02-13 08:51:26.580074488 [2024-02-13 08:51:26] frigate.detectors.plugins.tensorrt INFO : [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 151, GPU 82 (MiB)
2024-02-13 08:51:26.593988153 [2024-02-13 08:51:26] frigate.detectors.plugins.tensorrt INFO : [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +34, now: CPU 0, GPU 34 (MiB)
2024-02-13 08:51:26.595868736 [2024-02-13 08:51:26] frigate.detectors.plugins.tensorrt INFO : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 117, GPU 74 (MiB)
2024-02-13 08:51:26.596107894 [2024-02-13 08:51:26] frigate.detectors.plugins.tensorrt INFO : [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 117, GPU 82 (MiB)
2024-02-13 08:51:26.596242602 [2024-02-13 08:51:26] frigate.detectors.plugins.tensorrt INFO : [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +13, now: CPU 0, GPU 47 (MiB)
2024-02-13 08:51:31.575526010 Process camera_processor:adelaidecam:
2024-02-13 08:51:31.603301451 Traceback (most recent call last):
2024-02-13 08:51:31.603304441 File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
2024-02-13 08:51:31.603305491 self.run()
2024-02-13 08:51:31.603306612 File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
2024-02-13 08:51:31.603308485 self._target(*self._args, **self._kwargs)
2024-02-13 08:51:31.603309540 File "/opt/frigate/frigate/video.py", line 436, in track_camera
2024-02-13 08:51:31.603330489 process_frames(
2024-02-13 08:51:31.603331714 File "/opt/frigate/frigate/video.py", line 689, in process_frames
2024-02-13 08:51:31.603333010 detect(
2024-02-13 08:51:31.603334100 File "/opt/frigate/frigate/video.py", line 474, in detect
2024-02-13 08:51:31.603335161 region_detections = object_detector.detect(tensor_input)
2024-02-13 08:51:31.603336200 File "/opt/frigate/frigate/object_detection.py", line 225, in detect
2024-02-13 08:51:31.603354979 (self.labels[int(d[0])], float(d[1]), (d[2], d[3], d[4], d[5]))
2024-02-13 08:51:31.603376856 KeyError: -15
2024-02-13 08:52:41.351035847 [2024-02-13 08:52:41] frigate.record.maintainer WARNING : Unable to keep up with recording segments in cache for adelaidecam. Keeping the 6 most recent segments out of 7 and discarding the rest...
2024-02-13 08:52:51.351495453 [2024-02-13 08:52:51] frigate.record.maintainer WARNING : Unable to keep up with recording segments in cache for adelaidecam. Keeping the 6 most recent segments out of 7 and discarding the rest...
...
With yolov4-tiny-416 everything seems to work, however it doesn't seem to detect anything, and if I add the rest of the cameras I start getting the "Received nan values from distance function" warnings.
Does this info help? Also asking @NateMeyer for help :)
What is your driver version? This is a typical error that means the card has an out-of-date driver.
Thanks for the reply. It's in the nvidia-smi output above: Driver Version: 535.146.02. Should I update?
I think that driver version is ok.
It's weird we're just seeing these issues with the Kx2 "Maxwell" Quadro cards. I wonder if there are issues with the compute 5.0 cards in this version of TensorRT? We might have to post something on the NVidia forums.
That seems to be the consensus; compute 5.0 cards specifically are having issues.
TensorRT 8.5.3 claims to support compute 5.0, but I don't have one of those cards to test with. Do we know of anyone with a Maxwell GPU that is running 0.13 successfully?
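If it helps narrow things down, the compute capability can be confirmed programmatically. A small diagnostic sketch assuming pycuda is installed (pip install pycuda; it is not necessarily present in the Frigate image, so treat this as a standalone check):

import pycuda.driver as cuda

cuda.init()
dev = cuda.Device(0)
major, minor = dev.compute_capability()
print(f"{dev.name()}: compute capability {major}.{minor}")
# The Quadro K620 / K1200 / K2200 discussed here are first-generation Maxwell, compute 5.0.

Newer drivers should also report this directly with nvidia-smi --query-gpu=compute_cap --format=csv.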
I was going through other issues here and I think the one mentioned above is very much related: https://github.com/blakeblackshear/frigate/issues/8329#issuecomment-1807249026
The issue was marked stale, but I think we should revive it. What do you guys think? Maybe @kdill00 or @qubex22 can help?
Any luck on this? :(
I ended up buying a second hand P600. The only sure thing is that there's a problem with Maxwell cards.
Well, I can't afford another graphics card, so I ended up installing CodeProject.AI for this and it works. Here is some detail: I have a Quadro P1200 that was working fine with Frigate TensorRT 0.12, running in an LXC on Proxmox. I installed CodeProject.AI in another LXC (Ubuntu) with the latest NVIDIA driver, but with cuda-toolkit 11.7. Now I'm using Frigate 0.13 with CodeProject.AI as the detector and a YOLOv5 tiny model. My inference time went up from 10ms to 33ms. I haven't optimized anything yet, so I guess it is a good tradeoff for now. Hopefully someone will figure out what is wrong with the TensorRT images for these GPUs and fix it.
I also gave in and bought a new GPU, a T1000. Everything is working fine with it, so I won't be able to test any fixes here. But I'll leave this open
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I am having the exact same issue with a Maxwell GPU (K2200). Just like the issue lays out, yolov4-tiny seems to be the only model that doesn't crash, but it only shows motion and no object detection. Other models will throw KeyErrors, divide-by-zero errors, "nan values from distance function" errors, etc. I was wondering what the consensus is with the devs on this issue before I troubleshoot further. Should compute 5.0 Maxwell GPU owners look to upgrade hardware at this point, or is it worth looking into? I have 2 K2200s in my machine and would be willing to send one to a dev if it is worth looking into at all.
this should be fixed in the next version, based on the linked PR above
Oh awesome! I didn't see that. I will look out for that build and test it out.
Also, I wrote up a workaround for 0.13 if you want to give that a shot: https://gist.github.com/NateMeyer/a689b4462e57b3de0ebcc40e6538fc03
Describe the problem you are having
I upgraded from 0.12.x to 0.13.1 and followed the steps in the release notes. I'm using TensorRT (I recreated the model after the upgrade) with a Quadro K620. I'm getting the errors below in the logs, and the cameras initially display an image, then one by one drop off to show "no frames have been received, check error logs".
Version
0.13.1-34fb1c2
Frigate config file
docker-compose file or Docker CLI command
Relevant log output
Operating system
UNRAID
Install method
Docker CLI
Coral version
Other
Any other information that may be helpful