blakeblackshear / frigate

NVR with realtime local object detection for IP cameras
MIT License

[Config Support]: Whole Machine Crashing - looking for some tips #8470

Open madasus opened 10 months ago

madasus commented 10 months ago

Describe the problem you are having

I have two Docker hosts, and both have a Coral. Frigate seems to cause the whole host to freeze completely (the console is unresponsive) at frequent intervals: right now I would say every 48 hours on average, but it's not consistent. I've moved the Docker container to the other host and cleared out all the other containers, and the freeze follows Frigate.

It's likely that Frigate is pushing the hosts much harder than any other container and is perhaps exposing a bug somewhere in the hardware or OS. The hosts are Beelink devices running the latest Ubuntu.

Looking for some advice - has anyone seen this sort of behavior and identified the cause?

This has been happening for many months, so it is not related to the Frigate beta or to any particular Frigate version (and it is likely NOT a Frigate bug).


Version

0.13 Beta 3

Frigate config file

  path: /db/frigate.db

  user: mqtt
  password: xxx

#  hwaccel_args: -c:v h264_qsv
#  hwaccel_args: preset-intel-qsv-h264
  hwaccel_args: preset-vaapi

  # Optional: default log level (default: shown below)
  default: warning
  # Optional: module by module log level configuration
    frigate.mqtt: error

    type: edgetpu
    device: usb

  # Optional: The threshold passed to cv2.threshold to determine if a pixel is different enough to be counted as motion. (default: shown below)
  # Increasing this value will make motion detection less sensitive and decreasing it will make motion detection more sensitive.
  # The value should be between 1 and 255.
  threshold: 40
  contour_area: 20
  lightning_threshold: 0.7

  max_disappeared: 500
  width: 1280
  # Optional: height of the frame for the input with the detect role (default: shown below)
  height: 720

  # Optional: Position of the timestamp (default: shown below)
  #           "tl" (top left), "tr" (top right), "bl" (bottom left), "br" (bottom right)
  position: tl
  # Optional: Format specifier conform to the Python package "datetime" (default: shown below)
  #           Additional Examples:
  #             german: "%d.%m.%Y %H:%M:%S"
  format: '%m/%d/%Y %H:%M:%S'
  # Optional: Color of font
    # All Required when color is specified (default: shown below)
    red: 255
    green: 255
    blue: 255
  # Optional: Line thickness of font (default: shown below)
  thickness: 1
  # Optional: Effect of lettering (default: shown below)
  #           None (No effect),
  #           "solid" (solid background in inverse color of font)
  #           "shadow" (shadow for font)
  effect: solid

  # Optional: Enable birdseye view (default: shown below)
  enabled: true
  # Optional: Width of the output resolution (default: shown below)
  width: 1280
  # Optional: Height of the output resolution (default: shown below)
  height: 720
  # Optional: Encoding quality of the mpeg1 feed (default: shown below)
  # 1 is the highest quality, and 31 is the lowest. Lower quality feeds utilize less CPU resources.
  quality: 8
  # Optional: Mode of the view. Available options are: objects, motion, and continuous
  #   objects - cameras are included if they have had a tracked object within the last 30 seconds
  #   motion - cameras are included if motion was detected in the last 30 seconds
  #   continuous - all cameras are included always
  mode: objects
  restream: true

  - person
  - cat

  enabled: true
      default: 10
      mode: active_objects
    pre_capture: 5
    post_capture: 15

  sync_on_startup: true
  expire_interval: 60

# Optional: Configuration for the jpg snapshots written to the clips directory for each event
# NOTE: Can be overridden at the camera level
  # Optional: Enable writing jpg snapshot to /media/frigate/clips (default: shown below)
  enabled: true
  # Optional: save a clean PNG copy of the snapshot image (default: shown below)
  clean_copy: true
  # Optional: print a timestamp on the snapshots (default: shown below)
  timestamp: false
  # Optional: draw bounding box on the snapshots (default: shown below)
  bounding_box: false
  # Optional: crop the snapshot (default: shown below)
  crop: false
  # Optional: height to resize the snapshot to (default: original size)
  height: 175
  # Optional: Restrict snapshots to objects that entered any of the listed zones (default: no required zones)
  required_zones: []
  # Optional: Camera override for retention settings (default: global values)
    # Required: Default retention days (default: shown below)
    default: 10
    # Optional: Per object retention days
      person: 15
  # Optional: quality of the encoded jpeg, 0-100 (default: shown below)
  quality: 70

  # Optional: Set the default live mode for cameras in the UI (default: shown below)
  live_mode: mse
  # Optional: Set a timezone to use in the UI (default: use browser local time)
  timezone: America/New_York
  # Optional: Use an experimental recordings / camera view UI (default: shown below)
  use_experimental: false
  # Optional: Set the time format used.
  # Options are browser, 12hour, or 24hour (default: shown below)
  time_format: 12hour
  # Optional: Set the date style for a specified length.
  # Options are: full, long, medium, short
  # Examples:
  #    short: 2/11/23
  #    medium: Feb 11, 2023
  #    full: Saturday, February 11, 2023
  # (default: shown below).
  date_style: full
  # Optional: Set the time style for a specified length.
  # Options are: full, long, medium, short
  # Examples:
  #    short: 8:14 PM
  #    medium: 8:15:22 PM
  #    full: 8:15:22 PM Mountain Standard Time
  # (default: shown below).
  time_style: medium
  # Optional: Ability to manually override the date / time styling to use strftime format
  # possible values are shown above (default: not set)
  strftime_fmt: '%Y/%m/%d %H:%M'

  # Optional: Enabled network interfaces for bandwidth stats monitoring (default: shown below)
  #  - eth
  #  - enp
  #  - eno
  #  - ens
  #  - wl
  #  - lo
  # Optional: Configure system stats
    # Enable AMD GPU stats (default: shown below)
   # amd_gpu_stats: True
    # Enable Intel GPU stats (default: shown below)
    intel_gpu_stats: true
    # Enable network bandwidth stats monitoring for camera ffmpeg processes, go2rtc, and object detectors. (default: shown below)
    network_bandwidth: false
  # Optional: Enable the latest version outbound check (default: shown below)
  # NOTE: If you use the HomeAssistant integration, disabling this will prevent it from reporting new versions
  version_check: true


REMOVED - but I have about 15

I also wanted to include my docker compose for ideas

version: "3"
#    image:
    shm_size: "2048mb"
    container_name: frigate
    privileged: true
      - /dev/dri:/dev/dri
      - /disk1/docker/frigate/config:/config
#      - /disk1/docker/frigate/db:/db
#      - /disk1/docker/frigate/media:/media/frigate
      - /etc/localtime:/etc/localtime:ro
      - /dev/bus/usb:/dev/bus/usb
     - PUID=0
     - PGID=0
     - TZ=America/New_York
     - PLUS_API_KEY=xxx
    restart: unless-stopped

Relevant log output

None that I can find relevant.

Frigate stats

No response

Operating system


Install method

Docker Compose

Coral version


Any other information that may be helpful

No response

NickM-27 commented 10 months ago

There's no info provided here so there is nothing to go off of. You first need to figure out why the machine is actually freezing (is it memory issue, kernel panic, etc.)

madasus commented 10 months ago

Nothing is shown on the console. I'll check whether there is anything in syslog. The last time I checked there was nothing; the whole machine was just frozen.

NickM-27 commented 10 months ago

It can happen for many different reasons. If no information can be provided, there's not really much that can be done on the Frigate side. There are plenty of solutions, like having logs written to a file so the cause can be seen after restarting the machine.

Also, you can try putting a memory limit on the frigate container

blakeblackshear commented 10 months ago

The next steps would be to back down frigate to a bare minimum config and slowly add parts back until you can see what is causing the issue.

madasus commented 10 months ago

Thanks - I'm following the other thread also. I also added some more debugging to Ubuntu to see if I can capture anything in the logs before the freeze.

Do you have any suggestions on where to start removing parts of the config? Is the hwaccel param a place to start?

ffmpeg: hwaccel_args: preset-vaapi

antipesto93 commented 10 months ago

This comment is not very helpful but I had the exact issue running the containers on kubernetes (microk8s on ubuntu). Host would crash, have to power cycle. No useful information in logs or kernel log.

I ended up removing my Coral (M.2) and switching to CPU/VAAPI detection for now. It's been a few weeks without issue.

It's a long shot, but it could be worth trying the same to rule it out. I have not gone back to the Coral, as I have only 1 camera doing detection and CPU usage is not high.

ggidofalvy-tc commented 10 months ago

I have a similar issue, running an i5-6500T, no external accelerator, and so far I've been able to ascertain the following:

Here's my config using three random camera feeds from the Internet that I use for debugging, currently the hardware acceleration for decoding/encoding is commented out:

  enabled: false

#    test_camera_1_main:
#      - rtsp://admin:xxxxxxxxxxx@
#    test_camera_1_sub:
#      - rtsp://admin:xxxxxxxxxxx@
   #  - ffmpeg:
     - ffmpeg:

   #  - ffmpeg:
     - ffmpeg:
   #  - ffmpeg:
     - ffmpeg:

#  test_camera_1: # <------ Name the camera
#    ffmpeg:
#      output_args:
#        record: preset-record-generic-audio-copy
#      inputs:
#        - path: rtsp:// # <----- The stream you want to use for detection
#          input_args: preset-rtsp-restream
#          hwaccel_args: preset-vaapi
#          roles:
#            - detect
#        - path: rtsp:// # <----- The stream you want to use for recording
#          input_args: preset-rtsp-restream
#          hwaccel_args: preset-vaapi
#          roles:
#            - record
#    record:
#      enabled: True
#    detect:
#      enabled: True # <---- disable detection until you have a working camera feed
#      width: 640 # <---- update for your camera's resolution
#      height: 480 # <---- update for your camera's resolution
#    live:
#      stream_name: test_camera_1_main
  test1: # <------ Name the camera
      - path: rtsp://   # <----- The stream you want to use for detection
        input_args: preset-rtsp-restream
#        hwaccel_args: preset-vaapi
        - detect
      - path: rtsp://   # <----- The stream you want to use for recording
        input_args: preset-rtsp-restream
#        hwaccel_args: preset-vaapi
        - record
      enabled: true
      enabled: true # <---- disable detection until you have a working camera feed
      width: 1280 # <---- update for your camera's resolution
      height: 720 # <---- update for your camera's resolution
      stream_name: test1
      - person
      - car
  test2: # <------ Name the camera
      - path: rtsp://   # <----- The stream you want to use for detection
        input_args: preset-rtsp-restream
        hwaccel_args: preset-vaapi
        - detect
      - path: rtsp://   # <----- The stream you want to use for recording
        input_args: preset-rtsp-restream
#        hwaccel_args: preset-vaapi
        - record
      enabled: true
      enabled: true # <---- disable detection until you have a working camera feed
      width: 1280 # <---- update for your camera's resolution
      height: 720 # <---- update for your camera's resolution
      stream_name: test2
      - person
  test3: # <------ Name the camera
      - path: rtsp://   # <----- The stream you want to use for detection
        input_args: preset-rtsp-restream
#        hwaccel_args: preset-vaapi
        - detect
      - path: rtsp://   # <----- The stream you want to use for recording
        input_args: preset-rtsp-restream
#        hwaccel_args: preset-vaapi
        - record
      enabled: true
      enabled: true # <---- disable detection until you have a working camera feed
      width: 1920 # <---- update for your camera's resolution
      height: 1080 # <---- update for your camera's resolution
      stream_name: test3
      - person

      - 716,0,723,359,126,378,129,0
      - 1920,0,1920,0,1920,731,1869,783,1804,823,1722,860,1604,855,1480,838,1314,778,1299,729,1188,676,1123,667,1061,683,1010,642,978,598,961,516,850,496,755,464,674,447,603,306,582,0
    days: 0
    mode: all
      default: 14
      mode: motion
        person: 30

    type: openvino
    device: AUTO
      path: /openvino-model/ssdlite_mobilenet_v2.xml

  width: 300
  height: 300
  input_tensor: nhwc
  input_pixel_format: bgr
  labelmap_path: /openvino-model/coco_91cl_bkgr.txt

# Include all cameras by default in Birdseye view
  enabled: true

I tried to grab kernel crashdump via kdump, and also tried out kernel netconsole (dmesg) logging to another server running on the same network, but neither resulted in any output, which makes me think it's a driver issue that affects the CPU itself, not even a kernel crash.

Running the beta2 image in docker-compose, the beta3 image has an issue with go2rtc failing to parse the camera feed URLs.

If you have any ideas for any further troubleshooting I could do, please do let me know.

Pingbo commented 9 months ago

@ggidofalvy-tc Have the same issue on unraid:

After roughly 12-24 hours the host crashes when using OpenVINO.

I tried different drivers on the host, but it didn't help.

Currently thinking about buying a Coral...

audiophonicz commented 9 months ago

I have same issue on K3s on Debian with i3-6100U. VAAPI HW encode/decode + OpenVINO setup in config.

With object detection turned off it's rock solid. If I turn on object detection for a single object on a single camera, the whole node hangs within 3 days.

Those of us using the official Helm chart can't update go2rtc or ffmpeg with custom versions.

ggidofalvy-tc commented 9 months ago

Adding onto my previous comment:

Running Ubuntu 22.04, tried both the GA (5.15) and HWE (6.2) kernels, both exhibited the same crash behaviour.

NickM-27 commented 9 months ago

may be relevant with a couple suggestions (and other linked issue)

kevin-david commented 9 months ago

@ggidofalvy-tc @madasus especially if your frigate machine is headless, I would recommend removing the often-default quiet kernel parameter/command-line-argument and adding debug. that's what helped in my case linked above to at least narrow down the issue to the GPU, but I have made limited progress above as NickM has linked. my errors only showed up on the physical console, due to the hang.

It's certainly suspicious that what I reported in #8338 is also using an i7-6600(U) / Skylake GPU - same generation as you both - wondering if there is a driver bug / hardware quirk that other generations don't have that the i915 driver isn't handling

madasus commented 9 months ago

@kevin-david my host is headless, so I'll give this a try. Will the debug output then be written to syslog? How are you grabbing it?

Can you point me in the direction of where you made this change in your Linux distro? (I'm using Ubuntu.)

I'm glad I opened this thread, as it appears this is not an isolated problem. While it is probably not a Frigate issue, it is likely something that Frigate exposes in the underlying hardware/software due to load.



kevin-david commented 9 months ago

@madasus sure - I am using proxmox, so it should be similar. In my case the message never appeared in syslog, only on the physically connected screen - I guess because the machine was hung, it wasn't able to be written to syslog. this might mean you need to temporarily connect a monitor to the machine.

To do what I was talking about, you'll want to change GRUB_CMDLINE_LINUX_DEFAULT in the /etc/default/grub file and run update-grub to regenerate the configuration file, and reboot.

This describes it a little more: Again in my case, I removed quiet which resulted in messages logged to the console, and added debug (which I'm not sure makes a huge difference, but isn't super noisy either)
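The edit described above can be sketched as a small Python helper. This is a minimal sketch, not a tested tool: the function name `rewrite_grub_cmdline` is made up for illustration, the sample value `quiet splash` is only an example, and the helper just transforms the text of one `GRUB_CMDLINE_LINUX_DEFAULT` line. Writing the result back to /etc/default/grub, running update-grub, and rebooting remain manual steps.

```python
import shlex


def rewrite_grub_cmdline(line: str) -> str:
    """Drop `quiet` and append `debug` in a GRUB_CMDLINE_LINUX_DEFAULT line.

    Hypothetical helper: it only rewrites the text of one line and keeps
    every other kernel argument in its original order. Lines for other
    GRUB settings are returned unchanged.
    """
    key, _, value = line.strip().partition("=")
    if key != "GRUB_CMDLINE_LINUX_DEFAULT":
        return line
    # shlex.split strips the surrounding quotes; str.split then separates
    # the individual kernel arguments on whitespace.
    args = [arg for part in shlex.split(value) for arg in part.split()]
    args = [arg for arg in args if arg != "quiet"]
    if "debug" not in args:
        args.append("debug")
    joined = " ".join(args)
    return f'{key}="{joined}"'


# Example value only; check what your own /etc/default/grub contains.
print(rewrite_grub_cmdline('GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"'))
# prints: GRUB_CMDLINE_LINUX_DEFAULT="splash debug"
```

Keeping the transformation as a pure function makes it easy to check the new line by eye before touching the real file.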

ggidofalvy-tc commented 9 months ago

I gave echo 0 | sudo tee /sys/class/drm/card0/engine/rcs0/preempt_timeout_ms a spin, but no luck in preventing/prolonging the crash.

This is what I got in dmesg:

[525594.184400] [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110

I'll keep a look out for more messages in the netconsole destination now that I rebooted again and set the debug kernel commandline flag, I won't be applying the "fix" this time around.
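For anyone collecting netconsole or dmesg output in a file, a short Python filter can pull out lines like the one above. This is a sketch under assumptions: the helper names are invented for illustration, and the "i915 plus *ERROR*" test is simply what matches the single line quoted in this thread, not a complete list of GPU failure messages. The `-110` in that line corresponds to ETIMEDOUT on Linux.

```python
import re

# Matches the bracketed uptime stamp that dmesg prints, e.g. "[525594.184400] ".
DMESG_LINE = re.compile(r"^\[\s*(?P<seconds>\d+\.\d+)\]\s+(?P<message>.*)$")


def parse_dmesg_line(line: str):
    """Return (seconds_since_boot, message), or None if the line has no stamp."""
    match = DMESG_LINE.match(line)
    if match is None:
        return None
    return float(match.group("seconds")), match.group("message")


def is_i915_error(message: str) -> bool:
    """True for messages that mention the i915 driver and are flagged *ERROR*."""
    return "i915" in message and "*ERROR*" in message


# The line quoted in the comment above.
line = "[525594.184400] [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110"
seconds, message = parse_dmesg_line(line)
print(f"{seconds / 86400:.1f} days after boot: i915 error = {is_i915_error(message)}")
# prints: 6.1 days after boot: i915 error = True
```

Converting the uptime stamp to days makes it easier to compare the time of the error with how long the host usually survives between hangs.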

Pingbo commented 9 months ago

@ggidofalvy-tc @madasus

I probably found a solution... I have been running a yolov8s model for some days, and it is currently stable for >48h without any crash. Perhaps you can try this as well?

ggidofalvy-tc commented 9 months ago

@Pingbo can you share your model and detector config.yml snippets? Sorry for the mild derail, I would like to see if this might be a model-specific issue, not an OpenVINO-related one. Running the beta2 branch, since beta3 has issues with go2rtc with my config.

I've been trying to get yolov8n/yolov8s running on my setup based on the notebook linked in this comment:

But I keep getting an error when the detector starts up:

2023-11-20 12:55:12.267683201  Traceback (most recent call last):
2023-11-20 12:55:12.267747530    File "/usr/lib/python3.9/multiprocessing/", line 315, in _bootstrap
2023-11-20 12:55:12.267749391
2023-11-20 12:55:12.267806449    File "/usr/lib/python3.9/multiprocessing/", line 108, in run
2023-11-20 12:55:12.267808281      self._target(*self._args, **self._kwargs)
2023-11-20 12:55:12.267855642    File "/opt/frigate/frigate/", line 102, in run_detector
2023-11-20 12:55:12.267857527      object_detector = LocalObjectDetector(detector_config=detector_config)
2023-11-20 12:55:12.267898126    File "/opt/frigate/frigate/", line 53, in __init__
2023-11-20 12:55:12.267899858      self.detect_api = create_detector(detector_config)
2023-11-20 12:55:12.267941613    File "/opt/frigate/frigate/detectors/", line 18, in create_detector
2023-11-20 12:55:12.267943162      return api(detector_config)
2023-11-20 12:55:12.267986312    File "/opt/frigate/frigate/detectors/plugins/", line 26, in __init__
2023-11-20 12:55:12.267988059      self.ov_model = self.ov_core.read_model(detector_config.model.path)
2023-11-20 12:55:12.268047582  RuntimeError: Check 'false' failed at src/frontends/common/src/frontend.cpp:53:
2023-11-20 12:55:12.268049096  Converting input model
2023-11-20 12:55:12.268050592  Cannot create Interpolate layer /model.10/Resize id:164 from unsupported opset: opset11

My config.yml bits, attempting to run the yolov8n model:

    type: openvino
    device: AUTO
      path: /config/openvino-model/yolov8n.xml

  width: 416
  height: 416
  input_tensor: nhwc
  input_pixel_format: bgr
  model_type: yolov8
  labelmap_path: /openvino-model/coco_91cl_bkgr.txt

(all 3 output files are mounted /config/openvino-model, I'm reusing the labelmap from the original mobileSSD model used)
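The "unsupported opset: opset11" message in the traceback above suggests that the exported model uses an operation set newer than the OpenVINO runtime reading it understands; that interpretation is an assumption, not something confirmed in this thread. One way to check what a model file asks for is to list the opsets its layers declare. The sketch below assumes the usual OpenVINO IR layout, where each `<layer>` element carries a `version` attribute such as `opset11`; the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def opsets_in_ir(xml_text: str) -> Counter:
    """Count layers per opset in an OpenVINO IR .xml document.

    Assumes each <layer> element stores its opset in a `version`
    attribute; layers without it are counted under "unknown".
    """
    root = ET.fromstring(xml_text)
    return Counter(layer.get("version", "unknown") for layer in root.iter("layer"))


def layers_using(xml_text: str, opset: str):
    """Return (id, name, type) for every layer that declares the given opset."""
    root = ET.fromstring(xml_text)
    return [
        (layer.get("id"), layer.get("name"), layer.get("type"))
        for layer in root.iter("layer")
        if layer.get("version") == opset
    ]


# Usage against a real file (path taken from the config above):
#   from pathlib import Path
#   text = Path("/config/openvino-model/yolov8n.xml").read_text()
#   print(opsets_in_ir(text))
#   print(layers_using(text, "opset11"))
```

If the second call lists the Interpolate layer named in the traceback, re-exporting the model with an OpenVINO version closer to the one inside the container is one thing to try.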

madasus commented 9 months ago

when mine freezes I managed to check the console this time and there were no messages at all being written to the console before the crash.

@Pingbo can you elaborate on how to use the model you are suggesting? is this being used instead of the Coral?

Pingbo commented 9 months ago


That's how I have done it:

  1. Generate a yolo model with
  2. Download the following labelmap:
  3. My Config:
    type: openvino
    device: GPU
      # path: /openvino-model/ssdlite_mobilenet_v2.xml
      path: /config/yolov8s/yolov8s.xml


  # width: 300
  # height: 300
  width: 416
  height: 416
  input_tensor: nchw # nhwc
  input_pixel_format: bgr
  model_type: yolov8
  labelmap_path: /config/coco_80cl.txt #/openvino-model/coco_91cl_bkgr.txt

Yes, this is using OpenVINO as the detector and not the Coral. As far as I know, you cannot use the Coral and YOLO models together.

audiophonicz commented 9 months ago

Thank you for the detail @Pingbo

The only thing I would clarify for others is 1. you want to put all 3 files in the .zip file from the yolo model generation in the model folder, and 2. the files that were generated for me were yolo8n.xml, so make sure your file path is correct. Hopefully this is the fix.

Edit: 2 weeks running with solid person detection using the yolov8n model on a single camera. Looks like CPU usage dropped significantly for me. Enabling it on the rest of my cameras now.

ggidofalvy-tc commented 9 months ago

@Pingbo Thank you for the help and the detailed instructions! I've been using yolov8n for nearly two weeks now without any crashing on beta2.

I think the issue might indeed be caused by the combination of the bundled ssdlite_mobilenet_v2 model and Skylake-gen OpenVINO -- is this perhaps worth documenting somewhere?

FeatherKing commented 9 months ago

Wanted to chime in here. I'm a new Frigate user as of about two weeks ago. My hardware is an i7-7700 (Kaby Lake). I am running Frigate and wyze-bridge together. Wyze-bridge is correctly using Intel QSV with ffmpeg, and Frigate uses it fine with ffmpeg as well. However, if I tried to use any OpenVINO detector, it would crash the container every time. If I set the detector to cpu (not OpenVINO CPU), the container would start and detect fine.

Today I followed these steps by @Pingbo, and finally my OpenVINO detector starts with GPU selected. My inference speed went from 45ms (CPU) to 15ms (OpenVINO GPU).

The only error I could make out from the container was:

RuntimeError: The input blob size is not equal to the network input size: got 307200 expecting 270000

I tried spinning up various Python OpenVINO demos inside the container and got similar errors, such as:

Resulting shape '{1,3,300,3}' after preprocessing is not aligned with original parameter's shape: {1,300,300,3}, input parameter: image_tensor.

This led me to believe that something with the included Frigate OpenVINO model and Kaby Lake was not working out.
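The two numbers in the "input blob size" error can be checked with simple arithmetic. The sketch below shows that 270000 is exactly a 1x300x300x3 tensor, matching the shape in the second error, and that 307200 is consistent with at least two different frame shapes. Which of those, if either, was actually passed in is not stated in the thread, so the candidate shapes are assumptions for illustration.

```python
from math import prod


def element_count(shape) -> int:
    """Number of values in a tensor with the given dimensions."""
    return prod(shape)


# NHWC input shape quoted in the second error message.
expected = element_count((1, 300, 300, 3))
print(expected)  # prints: 270000

# Two shapes that both contain the 307200 values reported as "got".
# These are guesses that fit the number, not facts from the logs.
candidates = {
    "320x320 frame, 3 channels": (1, 320, 320, 3),
    "640x480 frame, 1 channel": (1, 480, 640, 1),
}
for label, shape in candidates.items():
    print(label, element_count(shape))  # 307200 for both
```

Either way, the mismatch points at the frame handed to the detector not having the 300x300x3 layout the model expects, which fits the observation that switching to a model with a matching width, height, and input_tensor setting avoided the error.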

Anyway, the yolov8 model from the above comment seems to have resolved my issue for now. I've been stable for a few hours (where previously I was unable to even start the containers). I will continue to monitor. (Thanks @Pingbo!!)

edit: i am on frigate version 0.12.1-367D724

bean72 commented 8 months ago

Followed the advice of @Pingbo for using the yolov8 model as well. Been running for a couple weeks without any issues. My detections are more reliable as well, so that is an added bonus. Thanks @Pingbo

Strux-DK commented 7 months ago

@Pingbo I'm trying to generate a YOLO model via the link, but I have little to no clue what I'm doing. The scripts are giving me errors. Is it possible for you to help me?

Example from first script: ImportError: cannot import name 'is_exact_shape_match' from 'pandas.core.indexers' (/usr/local/lib/python3.10/dist-packages/pandas/core/indexers/

Example from second script: ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. ipython 7.34.0 requires jedi>=0.16, which is not installed. lida 0.0.10 requires fastapi, which is not installed. lida 0.0.10 requires kaleido, which is not installed. lida 0.0.10 requires python-multipart, which is not installed. lida 0.0.10 requires uvicorn, which is not installed. llmx 0.0.15a0 requires cohere, which is not installed. llmx 0.0.15a0 requires openai, which is not installed. llmx 0.0.15a0 requires tiktoken, which is not installed.

(with some other lines and then ending with..)

WARNING: The following packages were previously imported in this runtime: [PIL,_distutils_hack,certifi,dateutil,defusedxml,google,numpy,pkg_resources,setuptools,six] You must restart the runtime in order to use newly installed versions.

By that i understand that i just have to refresh the page and try again, but it produces the same error.

audiophonicz commented 7 months ago

@Strux-DK it's not you, you just picked a bad day.

I literally just ran this yesterday to generate an 8s model to upgrade from the 8n I was using and it worked flawlessly. Today, I'm getting the same errors you are. (tried 3x)

It looks like @aeozyalcin 's awesome colab may have broken?

That said, since I just generated it yesterday, maybe give this one a try?

leccelecce commented 7 months ago

I have Frigate 0.13 running on a Dell Optiplex 3070 Micro i3-9100t, with a single USB coral. It runs Debian Bookworm and Frigate only. I've had probably one hang per month on average, but recently I made quite a few config changes (mainly around go2rtc streams), and it now seems to crash around once a week. I'm in the process of experimenting with kernel crash dumps to see if I can catch any more info.

As has already been said, this is highly unlikely to be a Frigate bug. I suspect the issue is either in Linux drivers, or (particularly in the case of cheap no-name mini PCs, or even micro form factor desktops like my Optiplex) hardware - not necessarily faulty hardware, but hardware that isn't designed to run full-tilt 24/7 on the CPU/iGPU while also supplying 1A power to a Coral (or two).

I never had any problems on my Dell R220 with Xeon E3s and ECC RAM, and I don't think I've seen many examples of people reporting crashes on server-class hardware. I think Frigate is actually quite an interesting test case of a demanding 24/7 application that often runs on cheap consumer gear. The only other demanding tasks most people might assign to this gear, transcoding video or gaming, don't run 24/7.

BeFygo commented 7 months ago

I had the same issues with a PowerEdge T20 (Xeon E3-1225 v3 CPU) with hardware acceleration and a USB Coral active. I was unable to log anything, and many system checks passed successfully.

Therefore I upgraded to a Dell PowerEdge T320 without hardware acceleration and with the USB Coral. No crashes with this system, and it has been running for 4 months.

noisymime commented 7 months ago

Another +1 here with similar config, Skylake (i3-6100) generation CPU with OpenVINO (No Coral) in case it helps as a further reference point.

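The combination of mobilenet_v2 and vaapi for hwaccel causes hard system crashes within 24 hours every time. The workarounds I've found that work are:

* ✅ Disabling vaapi or

* ✅ Using yolov8n

And the things I tried in isolation that do NOT work (including some that other issues referenced as possible solutions):

* ❌ Upgrading the kernel to 6.2 (From 5.15)

* ❌ Changing shm values. Tried huge ones, makes no difference

* ❌ Switching between iHD and i965 drivers

* ❌ Different Frigate versions (Tried v0.11, 0.12 and the current v0.13.1)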

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Pingbo commented 6 months ago

Issue still exists.

yayitazale commented 5 months ago

Same problem on unRAID with 0.13.2-6476f8a. I have set up syslog to try to catch what is happening when the server gets stuck.

BrandonG777 commented 5 months ago

> Same problem on unRAID with 0.13.2-6476f8a. I have set up syslog to try to catch what is happening when the server gets stuck.

Having this issue as well on unRAID... Attaching unRAID diagnostics (includes syslog and other logs/info) in case it helps track this down. Observed out of memory errors and the reaper process killing ffmpeg within the syslog. Sometimes the system runs for weeks other times it dies in a matter of days but works great otherwise. Usually, the system will hard lock and I'm unable to get any further diagnostic information but one of my plugins detected the OOM problems and this diagnostic zip was captured at that time.

and full props to yayitazale for helping me get this far over on the unRAID forums.
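When a syslog does survive the hang, the out-of-memory kills mentioned above can be tallied with a few lines of Python. This is a sketch, not part of the attached diagnostics: the function name is invented, and the pattern only relies on the "Killed process <pid> (<name>)" fragment that the Linux OOM killer writes, since the text around it varies between kernel versions.

```python
import re
from collections import Counter

# Only the stable fragment of the OOM-killer message is matched; the text
# before and after it differs between kernel versions.
KILLED = re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def oom_victims(lines) -> Counter:
    """Count OOM-killer victims by process name across syslog lines."""
    victims = Counter()
    for line in lines:
        match = KILLED.search(line)
        if match:
            victims[match.group("name")] += 1
    return victims


# Usage against a real log (the path is an assumption and differs per distro):
#   with open("/var/log/syslog", errors="replace") as handle:
#       print(oom_victims(handle).most_common())
```

A count dominated by ffmpeg would support the idea of capping the container's memory, so that a runaway process is killed inside the container before the host itself runs out.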

NickM-27 commented 5 months ago

I had this happen on unraid and it was due to their network implementation. Switching to ipvlan fixed the issue and hasn't happened to me in months

yayitazale commented 5 months ago

> I had this happen on unraid and it was due to their network implementation. Switching to ipvlan fixed the issue and hasn't happened to me in months

I have switched to ipvlan and I'm still having crashes.

BrandonG777 commented 5 months ago

> I had this happen on unraid and it was due to their network implementation. Switching to ipvlan fixed the issue and hasn't happened to me in months

Confirmed I am already on ipvlan and having this issue. Fought that battle a couple years ago.

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

roldengarm commented 4 months ago

> Another +1 here with similar config, Skylake (i3-6100) generation CPU with OpenVINO (No Coral) in case it helps as a further reference point.
>
> The combination of mobilenet_v2 and vaapi for hwaccel causes hard system crashes within 24 hours every time. The workarounds I've found that work are:
>
> * ✅ Disabling vaapi or
>
> * ✅ Using yolov8n
>
> And the things I tried in isolation that do NOT work (including some that other issues referenced as possible solutions):
>
> * ❌ Upgrading the kernel to 6.2 (From 5.15)
>
> * ❌ Changing shm values. Tried huge ones, makes no difference
>
> * ❌ Switching between iHD and i965 drivers
>
> * ❌ Different Frigate versions (Tried v0.11, 0.12 and the current v0.13.1)

I can confirm that using yolov8n has most likely fixed the issue on my HP ProDesk G3. Before that, the GPU would crash every ~2 weeks, causing the whole PC to hang.

BrandonG777 commented 4 months ago

Switched from CUDA (Quadro RTX 4000) to Coral/Intel 12th Gen QSV for processing. It consumes about 100 watts less power and generates a lot less heat, but it still crashes. If there's an improvement in uptime, I haven't noticed it. Limiting the memory of this container to 4 GB seems to keep it in check, though.

yayitazale commented 4 months ago

In my case, everything up to date and running with no issues for 1 month.

djcrawleravp commented 4 months ago

Same here, using a Raspberry Pi 4 8GB model with a USB Coral TPU.

Frigate version 0.13.2, Docker container on Raspberry Pi OS.

The whole machine crashes. There is no access from any source (web, SSH), but strangely ping works (with lots of lag). The only way to restore everything is by restarting the Pi.

4 cameras (3 detecting outside office hours), Google Coral TPU over USB.

UPDATE: Made some changes to the config files, but no luck so far.


    container_name: Frigate
    privileged: true
    restart: unless-stopped
    shm_size: 128mb <---------------- increased to 1024mb
      - /dev/bus/usb/002/004:/dev/bus/usb/002/004 # passes the USB Coral, needs to be modified for other versions
      - /dev/video11:/dev/video11 # Extracted from Frigate documentation for RPi
      - /etc/localtime:/etc/localtime:ro
      - /home/djcrawleravp/docker/frigate:/config
      - /home/djcrawleravp/docker/frigate/media:/media/frigate
    network_mode: host
      FRIGATE_RTSP_PASSWORD: "Password123"

Config file: (recently disabled Birdseye, not much improvement)

  user: mqtt
  password: Password123
  topic_prefix: frigate

 # hwaccel_args: preset-rpi-64-h264  <---------------- hwaccel disabled

    type: edgetpu
    device: usb

  enabled: false
  mode: continuous

      order: 2
      - path: rtsp://arquitec1:pass**@
        - record
      - path: rtsp://arquitec1:pass**
        - detect
      - person
      - mouse
      enabled: true
        default: 30
          person: 30
      quality: 100
      enabled: true
        days: 1
        mode: active_objects
        pre_capture: 2
        post_capture: 2 
          default: 15
          mode: active_objects
            person: 15
            mouse: 15
      enabled: true
      fps: 5
      width: 1280
      height: 720
      - 0,296,440,172,864,113,918,164,989,187,988,253,1079,275,1102,84,1280,76,1280,0,0,0

      order: 1
      - path: rtsp://arquitec1:pass**@
        - record
      - path: rtsp://arquitec1:pass**@
        - detect
      - person
      - cat
      - mouse
      enabled: true
        default: 30
          person: 30
      quality: 100
      enabled: true
        days: 1
        mode: active_objects
        pre_capture: 2
        post_capture: 2
          default: 15
          mode: active_objects
            person: 15
            cat: 15
            mouse: 15
      enabled: true
      fps: 5
      width: 1280
      height: 720

      - 158,406,0,720,0,0,1280,0,1280,39,1280,720,1013,720,1067,364,657,283,398,234

      order: 3
      - path: rtsp://arquitec1:pass**@
        - record
      - path: rtsp://arquitec1:pass**@
        - detect
      - person
      - cat
      - mouse
      - car
      enabled: true
        default: 30
          person: 30
      quality: 100
      enabled: true
        days: 1
        mode: active_objects
        pre_capture: 2
        post_capture: 2
          default: 15
          mode: active_objects
            person: 15
            cat: 15
            mouse: 15
      enabled: true
      fps: 5
      width: 1280
      height: 720
      - 0,185,0,0,33,0,1280,0,1280,75,771,83,134,158

      order: 4
      - path: rtsp://arquitec1:pass**@
        - record
      - path: rtsp://arquitec1:pass**@
        - detect
      - person
      - cat
      - mouse
      enabled: true
        default: 30
          person: 30
      quality: 100
      enabled: true
        days: 1
        mode: active_objects
        pre_capture: 2
        post_capture: 2
          default: 15
          mode: active_objects
            person: 15
            cat: 15
            mouse: 15
      enabled: true
      fps: 5
      width: 1280
      height: 720
      - 912,217,478,336,0,337,0,0,1280,0,1280,194

Htop while detecting (usually below 20% when not detecting):

Screen Shot 2024-05-07 at 12 51 43

Frigate Side:

Screen Shot 2024-05-07 at 12 51 19 Screen Shot 2024-05-07 at 12 51 28

eohlde commented 3 months ago

I have the same thing happening, running Ubuntu on an SFF PC with a PCI Coral. I get maybe 3 days before it goes unresponsive. I have removed all recording and the Coral from the config to get down to a barebones setup, and will add things back as I get stability.

My setup is 5 UniFi cameras, go2rtc, and QSV for decode.

arrikhan commented 3 months ago

I have the same problem: a hard hang with no access at the console or remotely.

I've moved from an RPi 4 with 4 GB to a Mac mini (2012) with 16 GB on Ubuntu, and both have the same issue. It normally hangs within 2-3 days, but sometimes a couple of times a day. There are no logs, and it doesn't seem related to excess activity. I have 2 Coral USB TPU devices but have moved one to another server to limit this server to just Frigate.

I am running 4 cams.

djcrawleravp commented 3 months ago

Disabling hardware accel and increasing shm_size to 1024 did the trick for me… running perfectly for 3 days now
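As a rough sketch of the shared-memory change described above, the relevant part of a Compose service could look like this. The service name is hypothetical; only the `shm_size` value comes from the post.

```yaml
services:
  frigate:                 # hypothetical service name
    shm_size: "1024mb"     # raised from 128mb, per the update above
```

Frigate keeps decoded frames in `/dev/shm`, so an undersized value tends to matter more as camera count and detect resolution go up. Whether 1024mb is needed for four 1280x720 streams is not established in this thread; it is simply the value that worked here.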

eohlde commented 3 months ago

I increased shm_size to 1024m and have not had an issue with just camera decode (no detection or storage). Uptime is almost 7 days now. I'm going to add detection back in tonight.

eohlde commented 3 months ago

I think for me this was due to having the media/frigate folder mounted on a Samba share. I've put the physical disk into the same host machine that is running Frigate, and will try to use an NFS share to let Home Assistant access the files over the network.
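A sketch of what moving recordings onto a local disk might look like in Compose, assuming the disk is mounted on the host at `/mnt/frigate-media` (a hypothetical path, as is the service name). The container-side path `/media/frigate` is the one used elsewhere in this thread.

```yaml
services:
  frigate:                                  # hypothetical service name
    volumes:
      # Local disk on the Frigate host instead of a Samba-backed mount
      - /mnt/frigate-media:/media/frigate   # host path is an assumption
```

Other machines can then reach the files through a share exported from this host, rather than Frigate writing its recordings across the network.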

jdeath commented 3 months ago

I have had memory/CPU spike issues for the past couple of weeks, which crash my Home Assistant every few hours. My configuration has worked for over a year without issue, so I think HASSOS must have made a change recently that caused this. I first had the issue on my i5-6500T with 8 GB RAM. I upgraded to a new system with an i5-8500 and 16 GB RAM and still have the same issue.

I isolated the problem to Frigate (I had to disable all integrations and add-ons to isolate it). From reading this issue, I figured out that the vaapi hardware acceleration was causing the spikes. Disabling vaapi with: hwaccel_args: " "

prevents the memory spikes, and I no longer have crashes. With vaapi enabled, it crashes within an hour or so. I use a Coral TPU for detection, so the increased CPU usage is not a big deal, especially with the upgraded CPU.
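For context, the setting above sits under the `ffmpeg` key of the Frigate config. A minimal sketch of the global form, assuming no per-camera override is in place:

```yaml
ffmpeg:
  # A blank value turns hardware acceleration off, so decoding falls back to the CPU
  hwaccel_args: " "
```

The trade-off is the one described above: higher CPU usage for decode in exchange for avoiding the memory spikes seen with vaapi on this hardware.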

Thought I would post this so others would know. I have used 13.2 and 14.0-beta2 (both regular and full access), and all of them crash if I use vaapi. So strange.

NickM-27 commented 3 months ago

you could try using qsv and see if it behaves any differently

jdeath commented 3 months ago

@NickM-27 The documentation says qsv is only for >= 10th gen. I am only on an 8th gen now.

NickM-27 commented 3 months ago

it is a bit ambiguous because some 8th gen CPUs do support it, might be worth a try

jdeath commented 3 months ago

Thanks. It appears to work; the UI says the intel-qsv GPU is working. I have two identical cameras and see 18% vs. 6% CPU use when I put one on qsv. I will see if it is stable and works on my other cameras.

Edit: I still got a memory spike/crash using qsv. I will just go back to no hardware acceleration.
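For anyone repeating the one-camera comparison described above, a per-camera override could be sketched as follows. The camera name and RTSP address are placeholders, not values from this thread; `preset-intel-qsv-h264` is the preset name that appears in the original poster's config.

```yaml
cameras:
  camera_a:                                   # hypothetical camera name
    ffmpeg:
      hwaccel_args: preset-intel-qsv-h264     # overrides the global setting for this camera only
      inputs:
        - path: rtsp://192.0.2.10:554/stream  # placeholder address
          roles:
            - detect
```

Cameras without their own `hwaccel_args` keep using the global `ffmpeg` setting, which makes it possible to compare CPU usage between two otherwise identical cameras.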