blakeblackshear / frigate

NVR with realtime local object detection for IP cameras
https://frigate.video
MIT License

[Support]: Unable to keep up with recording segments #5818

Closed ccutrer closed 1 year ago

ccutrer commented 1 year ago

Describe the problem you are having

I get the log message "Unable to keep up with recording segments in cache for <camera>. Keeping the 5 most recent segments out of <x> and discarding the rest..." a LOT. I've enabled debug logging for recordings, but I don't see any noticeable slowdown in file copies (generally 0.2s or less per file) out of the cache. Watching iotop, the DISK write is mostly in the range of 50-100 K/s, with occasional spurts of 20-50 M/s. iftop indicates I'm doing ~205 Mbps of incoming bandwidth. My recording volume is a RAID-0 of 6x Western Digital Purple drives (though in an unusual configuration, so it's highly likely only two spindles will be used at any given time). Even if it were a single drive, the sustained throughput is rated at 145MB/s, so I shouldn't be coming anywhere near the actual storage throughput. Snapshots and database all go to the main OS volume, which is a 250 GB SSD. All hard drives are connected via SATA (4 of them via an external eSATA enclosure connected to a PCI Express eSATA card).

Analyzing my log file, of the last 14,502 segments that were copied, the slowest one took 0.84s, and on average they take 0.17s.

CPU usage normally sits at 50-60%.

I have 64GB of RAM. Usually about ~7GB is in use, and the rest is in buffers/cache.

It might be that my storage is too slow, but it sure seems like either Frigate is being too impatient with a large number of cameras, or something outside of the actual copy process is what's too slow.

Version

0.12.0-27A31E7

Frigate config file

detectors:
  coral:
    type: edgetpu
    device: usb

ffmpeg:
  hwaccel_args: preset-vaapi
  input_args: preset-rtsp-restream-low-latency
  output_args:
    record: preset-record-generic-audio-aac

ui:
  live_mode: webrtc
  use_experimental: false

birdseye:
  mode: objects
  width: 1920
  height: 1080
  restream: true

logger:
  logs:
    frigate.record: debug

go2rtc:
  <elided>

detect:
  width: 1280
  height: 720
  fps: 6
  max_disappeared: 18
  stationary:
    interval: 12

objects:
  track:
    - person
    - dog
    - cat

record:
  enabled: true
  retain:
    days: 14
    mode: all
  events:
    pre_capture: 2
    post_capture: 2

snapshots:
  enabled: true
  retain:
    default: 14

cameras:
  <33 cameras>

Relevant log output

See https://gist.github.com/ccutrer/224fc652c70661d91f56811a1cd88f3f (log is too long to include inline)

FFprobe output from your camera

N/A

Frigate stats

{"back_yard_south":{"camera_fps":6.0,"capture_pid":477,"detection_enabled":1,"detection_fps":0.2,"ffmpeg_pid":482,"pid":427,"process_fps":6.0,"skipped_fps":0.0},"basketball":{"camera_fps":6.1,"capture_pid":481,"detection_enabled":1,"detection_fps":2.6,"ffmpeg_pid":489,"pid":429,"process_fps":5.7,"skipped_fps":0.0},"bunk":{"camera_fps":6.1,"capture_pid":483,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":495,"pid":430,"process_fps":6.1,"skipped_fps":0.0},"cpu_usages":{"%Cpu(s):":{"cpu":"id,","mem":"0.6"},"1":{"cpu":"0.0","mem":"0.0"},"104":{"cpu":"0.0","mem":"0.0"},"128":{"cpu":"0.0","mem":"0.0"},"129":{"cpu":"0.0","mem":"0.0"},"130":{"cpu":"0.0","mem":"0.0"},"131":{"cpu":"0.0","mem":"0.0"},"134":{"cpu":"0.0","mem":"0.0"},"145":{"cpu":"0.0","mem":"0.0"},"15":{"cpu":"0.0","mem":"0.0"},"16":{"cpu":"0.0","mem":"0.0"},"166":{"cpu":"0.0","mem":"0.0"},"211":{"cpu":"0.0","mem":"0.0"},"2393":{"cpu":"3.0","mem":"0.1"},"2395":{"cpu":"2.6","mem":"0.0"},"24":{"cpu":"0.0","mem":"0.0"},"2403":{"cpu":"2.0","mem":"0.0"},"25":{"cpu":"0.0","mem":"0.0"},"26":{"cpu":"0.0","mem":"0.0"},"27":{"cpu":"0.0","mem":"0.0"},"28":{"cpu":"0.0","mem":"0.0"},"29":{"cpu":"0.0","mem":"0.0"},"30":{"cpu":"0.0","mem":"0.0"},"31":{"cpu":"0.0","mem":"0.0"},"40":{"cpu":"0.0","mem":"0.0"},"41":{"cpu":"0.0","mem":"0.0"},"412":{"cpu":"0.0","mem":"0.1"},"418":{"cpu":"2.0","mem":"0.0"},"419":{"cpu":"28.1","mem":"0.2"},"420":{"cpu":"54.3","mem":"0.2"},"427":{"cpu":"2.3","mem":"0.2"},"428":{"cpu":"0.0","mem":"0.0"},"429":{"cpu":"3.0","mem":"0.2"},"430":{"cpu":"0.7","mem":"0.2"},"431":{"cpu":"0.0","mem":"0.0"},"432":{"cpu":"1.0","mem":"0.2"},"433":{"cpu":"2.6","mem":"0.2"},"434":{"cpu":"0.0","mem":"0.0"},"435":{"cpu":"14.2","mem":"0.2"},"436":{"cpu":"16.9","mem":"0.2"},"437":{"cpu":"0.0","mem":"0.0"},"438":{"cpu":"1.7","mem":"0.2"},"439":{"cpu":"0.7","mem":"0.2"},"440":{"cpu":"0.0","mem":"0.0"},"441":{"cpu":"1.7","mem":"0.2"},"442":{"cpu":"0.0","mem":"0.0"},"443":{"cpu":"0.3","mem":"0.2"},"444":{"cpu":"0.0","mem":"0.0"},"445":{"cpu":"0.0","mem":"0.0"},"446":{"cpu":"1.0","mem":"0.2"},"447":{"cpu":"1.0","mem":"0.2"},"448":{"cpu":"0.7","mem":"0.2"},"449":{"cpu":"1.0","mem":"0.2"},"450":{"cpu":"1.0","mem":"0.2"},"451":{"cpu":"0.7","mem":"0.2"},"452":{"cpu":"0.7","mem":"0.2"},"453":{"cpu":"1.0","mem":"0.2"},"454":{"cpu":"13.6","mem":"0.2"},"455":{"cpu":"0.0","mem":"0.0"},"456":{"cpu":"0.3","mem":"0.2"},"457":{"cpu":"0.7","mem":"0.2"},"458":{"cpu":"0.0","mem":"0.0"},"459":{"cpu":"0.0","mem":"0.0"},"460":{"cpu":"2.6","mem":"0.2"},"461":{"cpu":"16.2","mem":"0.2"},"462":{"cpu":"0.0","mem":"0.0"},"463":{"cpu":"1.0","mem":"0.2"},"464":{"cpu":"0.0","mem":"0.0"},"465":{"cpu":"0.3","mem":"0.2"},"466":{"cpu":"14.9","mem":"0.2"},"467":{"cpu":"0.0","mem":"0.0"},"468":{"cpu":"16.9","mem":"0.2"},"469":{"cpu":"0.7","mem":"0.2"},"470":{"cpu":"0.0","mem":"0.0"},"471":{"cpu":"0.7","mem":"0.2"},"472":{"cpu":"0.7","mem":"0.2"},"473":{"cpu":"0.0","mem":"0.0"},"47312":{"cpu":"5.0","mem":"0.0"},"474":{"cpu":"14.6","mem":"0.2"},"475":{"cpu":"2.3","mem":"0.2"},"476":{"cpu":"0.0","mem":"0.0"},"477":{"cpu":"2.0","mem":"0.2"},"478":{"cpu":"0.0","mem":"0.0"},"481":{"cpu":"2.0","mem":"0.2"},"482":{"cpu":"2.3","mem":"0.1"},"483":{"cpu":"1.7","mem":"0.2"},"484":{"cpu":"0.0","mem":"0.0"},"489":{"cpu":"2.6","mem":"0.1"},"490":{"cpu":"2.3","mem":"0.2"},"493":{"cpu":"2.6","mem":"0.0"},"494":{"cpu":"0.0","mem":"0.0"},"495":{"cpu":"3.0","mem":"0.1"},"497":{"cpu":"2.3","mem":"0.2"},"502":{"cpu":"0.0","mem":"0.0"},"503":{"cpu":"2.6","mem":"0.1"},"503573":{"cpu":"3.3","mem":"0.
1"},"503575":{"cpu":"1.3","mem":"0.0"},"504":{"cpu":"1.7","mem":"0.2"},"508":{"cpu":"1.0","mem":"0.0"},"509":{"cpu":"3.0","mem":"0.1"},"510":{"cpu":"2.3","mem":"0.2"},"511":{"cpu":"2.3","mem":"0.0"},"513":{"cpu":"0.0","mem":"0.0"},"518":{"cpu":"2.3","mem":"0.0"},"520":{"cpu":"3.6","mem":"0.1"},"521":{"cpu":"3.0","mem":"0.0"},"525":{"cpu":"2.3","mem":"0.0"},"526":{"cpu":"2.6","mem":"0.1"},"527":{"cpu":"1.7","mem":"0.2"},"528":{"cpu":"0.0","mem":"0.0"},"531":{"cpu":"2.0","mem":"0.2"},"536":{"cpu":"3.0","mem":"0.0"},"538":{"cpu":"2.0","mem":"0.2"},"541":{"cpu":"2.6","mem":"0.1"},"542":{"cpu":"0.0","mem":"0.0"},"544":{"cpu":"2.0","mem":"0.0"},"545":{"cpu":"1.7","mem":"0.2"},"546":{"cpu":"2.6","mem":"0.1"},"547528":{"cpu":"0.3","mem":"0.0"},"547543":{"cpu":"1.3","mem":"0.1"},"549":{"cpu":"0.0","mem":"0.0"},"550":{"cpu":"1.7","mem":"0.0"},"555":{"cpu":"2.6","mem":"0.1"},"558":{"cpu":"2.6","mem":"0.1"},"560":{"cpu":"3.0","mem":"0.0"},"561":{"cpu":"2.0","mem":"0.0"},"563":{"cpu":"2.3","mem":"0.0"},"564":{"cpu":"0.0","mem":"0.0"},"565":{"cpu":"2.3","mem":"0.2"},"576":{"cpu":"1.3","mem":"0.0"},"577":{"cpu":"2.3","mem":"0.2"},"578":{"cpu":"3.0","mem":"0.0"},"581":{"cpu":"0.0","mem":"0.0"},"582":{"cpu":"2.3","mem":"0.1"},"583":{"cpu":"2.0","mem":"0.0"},"584":{"cpu":"2.0","mem":"0.2"},"589":{"cpu":"2.3","mem":"0.2"},"590":{"cpu":"3.3","mem":"0.1"},"591":{"cpu":"0.0","mem":"0.0"},"592":{"cpu":"3.0","mem":"0.0"},"594":{"cpu":"2.0","mem":"0.0"},"598":{"cpu":"2.0","mem":"0.2"},"599":{"cpu":"2.6","mem":"0.1"},"602":{"cpu":"0.0","mem":"0.0"},"603":{"cpu":"2.0","mem":"0.2"},"605":{"cpu":"2.0","mem":"0.0"},"607":{"cpu":"3.0","mem":"0.0"},"615":{"cpu":"2.3","mem":"0.2"},"616":{"cpu":"3.0","mem":"0.1"},"618":{"cpu":"2.6","mem":"0.0"},"619":{"cpu":"2.3","mem":"0.0"},"620":{"cpu":"2.0","mem":"0.2"},"623":{"cpu":"0.0","mem":"0.0"},"624":{"cpu":"2.6","mem":"0.1"},"626":{"cpu":"3.0","mem":"0.1"},"627":{"cpu":"2.3","mem":"0.0"},"629":{"cpu":"2.0","mem":"0.2"},"630":{"cpu":"3.0","mem":"0.0"},"635":{"cpu":"0.0","mem":"0.0"},"636":{"cpu":"1.7","mem":"0.2"},"641":{"cpu":"2.6","mem":"0.0"},"642":{"cpu":"3.3","mem":"0.0"},"643":{"cpu":"1.7","mem":"0.2"},"644":{"cpu":"2.6","mem":"0.0"},"647":{"cpu":"2.6","mem":"0.1"},"650":{"cpu":"0.0","mem":"0.0"},"651":{"cpu":"2.3","mem":"0.2"},"653":{"cpu":"3.3","mem":"0.1"},"654":{"cpu":"1.7","mem":"0.2"},"658":{"cpu":"2.0","mem":"0.0"},"659":{"cpu":"3.0","mem":"0.1"},"662":{"cpu":"0.0","mem":"0.0"},"664":{"cpu":"1.7","mem":"0.2"},"668":{"cpu":"2.6","mem":"0.1"},"669":{"cpu":"2.6","mem":"0.1"},"671":{"cpu":"1.7","mem":"0.2"},"673":{"cpu":"1.7","mem":"0.0"},"675":{"cpu":"2.0","mem":"0.0"},"676":{"cpu":"13.2","mem":"0.1"},"677":{"cpu":"0.7","mem":"0.2"},"681":{"cpu":"3.0","mem":"0.0"},"687":{"cpu":"2.6","mem":"0.0"},"689":{"cpu":"1.0","mem":"0.0"},"690":{"cpu":"3.0","mem":"0.1"},"691":{"cpu":"2.0","mem":"0.0"},"698":{"cpu":"2.3","mem":"0.0"},"700":{"cpu":"4.0","mem":"0.2"},"706":{"cpu":"1.7","mem":"0.2"},"707":{"cpu":"3.0","mem":"0.1"},"717":{"cpu":"2.0","mem":"0.1"},"718":{"cpu":"1.7","mem":"0.0"},"729":{"cpu":"1.7","mem":"0.2"},"735":{"cpu":"5.6","mem":"0.1"},"749":{"cpu":"1.7","mem":"0.2"},"755":{"cpu":"2.6","mem":"0.1"},"756":{"cpu":"3.0","mem":"0.0"},"761":{"cpu":"1.7","mem":"0.2"},"763":{"cpu":"2.3","mem":"0.0"},"765":{"cpu":"2.6","mem":"0.0"},"766":{"cpu":"3.0","mem":"0.1"},"768":{"cpu":"2.0","mem":"0.0"},"769":{"cpu":"2.0","mem":"0.2"},"770":{"cpu":"3.3","mem":"0.0"},"772":{"cpu":"1.0","mem":"0.0"},"775":{"cpu":"2.3","mem":"0.0"},"776":{"cpu":"2.6","mem":"0.1"},"78":{"cpu":"0.0","me
m":"0.0"},"783":{"cpu":"3.0","mem":"0.1"},"790":{"cpu":"1.0","mem":"0.0"},"793":{"cpu":"3.0","mem":"0.0"},"794":{"cpu":"2.0","mem":"0.0"},"795":{"cpu":"3.0","mem":"0.1"},"798":{"cpu":"3.0","mem":"0.0"},"799":{"cpu":"1.3","mem":"0.0"},"80":{"cpu":"0.0","mem":"0.0"},"800":{"cpu":"2.3","mem":"0.0"},"803":{"cpu":"1.3","mem":"0.0"},"804":{"cpu":"2.0","mem":"0.0"},"805":{"cpu":"1.7","mem":"0.0"},"81":{"cpu":"0.0","mem":"0.0"},"876":{"cpu":"1.7","mem":"0.0"},"88":{"cpu":"66.2","mem":"0.0"},"93":{"cpu":"33.8","mem":"0.8"},"993":{"cpu":"1.7","mem":"0.0"},"MiB":{"cpu":"53267.2","mem":"avail"},"PID":{"cpu":"%CPU","mem":"%MEM"},"Tasks:":{"cpu":"stopped,","mem":"0"},"top":{"cpu":"users,","mem":"load"}},"deck":{"camera_fps":6.0,"capture_pid":490,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":503,"pid":432,"process_fps":6.0,"skipped_fps":0.0},"detection_fps":44.7,"detectors":{"coral":{"detection_start":1679595253.171513,"inference_speed":18.52,"pid":419}},"dining":{"camera_fps":6.0,"capture_pid":497,"detection_enabled":1,"detection_fps":0.5,"ffmpeg_pid":509,"pid":433,"process_fps":6.0,"skipped_fps":0.0},"doorbell":{"camera_fps":6.0,"capture_pid":504,"detection_enabled":1,"detection_fps":4.8,"ffmpeg_pid":520,"pid":435,"process_fps":5.6,"skipped_fps":0.0},"driveway":{"camera_fps":6.0,"capture_pid":510,"detection_enabled":1,"detection_fps":5.7,"ffmpeg_pid":526,"pid":436,"process_fps":3.6,"skipped_fps":0.0},"entry":{"camera_fps":6.1,"capture_pid":527,"detection_enabled":1,"detection_fps":0.5,"ffmpeg_pid":541,"pid":438,"process_fps":6.1,"skipped_fps":0.0},"family":{"camera_fps":6.1,"capture_pid":531,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":546,"pid":439,"process_fps":6.1,"skipped_fps":0.0},"fire_pit":{"camera_fps":6.1,"capture_pid":538,"detection_enabled":1,"detection_fps":0.9,"ffmpeg_pid":555,"pid":441,"process_fps":6.0,"skipped_fps":0.0},"gaming":{"camera_fps":6.0,"capture_pid":545,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":558,"pid":443,"process_fps":6.0,"skipped_fps":0.0},"garage":{"camera_fps":6.1,"capture_pid":565,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":582,"pid":446,"process_fps":6.1,"skipped_fps":0.0},"gpu_usages":{"intel-vaapi":{"gpu":"1.0 %","mem":"- 
%"}},"great":{"camera_fps":6.1,"capture_pid":577,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":590,"pid":447,"process_fps":6.1,"skipped_fps":0.0},"hot_tub":{"camera_fps":6.1,"capture_pid":584,"detection_enabled":1,"detection_fps":0.1,"ffmpeg_pid":599,"pid":448,"process_fps":6.1,"skipped_fps":0.0},"john_deere":{"camera_fps":6.1,"capture_pid":589,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":2393,"pid":449,"process_fps":6.1,"skipped_fps":0.0},"laundry":{"camera_fps":6.0,"capture_pid":598,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":616,"pid":450,"process_fps":6.0,"skipped_fps":0.0},"loft":{"camera_fps":6.0,"capture_pid":603,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":624,"pid":451,"process_fps":6.0,"skipped_fps":0.0},"loft_door":{"camera_fps":6.1,"capture_pid":615,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":626,"pid":452,"process_fps":6.1,"skipped_fps":0.0},"man_door":{"camera_fps":6.0,"capture_pid":620,"detection_enabled":1,"detection_fps":0.3,"ffmpeg_pid":653,"pid":453,"process_fps":6.2,"skipped_fps":0.0},"mud":{"camera_fps":6.1,"capture_pid":629,"detection_enabled":1,"detection_fps":2.8,"ffmpeg_pid":647,"pid":454,"process_fps":5.8,"skipped_fps":0.0},"pantry":{"camera_fps":6.1,"capture_pid":636,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":659,"pid":456,"process_fps":6.1,"skipped_fps":0.0},"patio":{"camera_fps":6.1,"capture_pid":643,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":668,"pid":457,"process_fps":6.1,"skipped_fps":0.0},"playset":{"camera_fps":6.1,"capture_pid":651,"detection_enabled":1,"detection_fps":0.1,"ffmpeg_pid":669,"pid":460,"process_fps":6.1,"skipped_fps":0.0},"porch":{"camera_fps":6.0,"capture_pid":654,"detection_enabled":1,"detection_fps":5.1,"ffmpeg_pid":690,"pid":461,"process_fps":4.1,"skipped_fps":0.0},"service":{"last_updated":1679595257,"latest_version":"0.11.1","storage":{"/dev/shm":{"free":455.7,"mount_type":"tmpfs","total":536.9,"used":81.1},"/media/frigate/clips":{"free":119362.6,"mount_type":"ext4","total":241770.0,"used":110051.7},"/media/frigate/recordings":{"free":8300459.7,"mount_type":"ext4","total":27892949.5,"used":18192199.2},"/tmp/cache":{"free":1052.3,"mount_type":"tmpfs","total":2147.5,"used":1095.2}},"temperatures":{},"uptime":85107,"version":"0.12.0-27a31e7"},"sewing":{"camera_fps":6.0,"capture_pid":664,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":503573,"pid":463,"process_fps":6.0,"skipped_fps":0.0},"solar_panels":{"camera_fps":6.1,"capture_pid":671,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":707,"pid":465,"process_fps":6.1,"skipped_fps":0.0},"south_side":{"camera_fps":6.1,"capture_pid":677,"detection_enabled":1,"detection_fps":5.6,"ffmpeg_pid":717,"pid":466,"process_fps":5.6,"skipped_fps":0.0},"street":{"camera_fps":12.1,"capture_pid":700,"detection_enabled":1,"detection_fps":5.6,"ffmpeg_pid":735,"pid":468,"process_fps":3.7,"skipped_fps":0.0},"theater_back":{"camera_fps":6.1,"capture_pid":706,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":755,"pid":469,"process_fps":6.1,"skipped_fps":0.0},"theater_front":{"camera_fps":6.0,"capture_pid":729,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":766,"pid":471,"process_fps":6.0,"skipped_fps":0.0},"toy":{"camera_fps":6.0,"capture_pid":749,"detection_enabled":1,"detection_fps":0.0,"ffmpeg_pid":776,"pid":472,"process_fps":6.0,"skipped_fps":0.0},"trampoline":{"camera_fps":6.2,"capture_pid":761,"detection_enabled":1,"detection_fps":5.7,"ffmpeg_pid":783,"pid":474,"process_fps":2.4,"skipped_fps":0.0},"volleyball
":{"camera_fps":6.0,"capture_pid":769,"detection_enabled":1,"detection_fps":4.2,"ffmpeg_pid":795,"pid":475,"process_fps":3.9,"skipped_fps":0.0}}

Operating system

HassOS

Install method

HassOS Addon

Coral version

USB

Network connection

Wired

Camera make and model

Mostly Hikvisions of various models. One UniFi doorbell cam. All cameras are >= 4MP

Any other information that may be helpful

No response

NickM-27 commented 1 year ago

Curious what @blakeblackshear thinks, but my understanding is that, while the storage times are reasonably fast, they are not fast enough for 33 cameras, so some cameras end up with too many segments sitting idle in memory.

but it sure seems like either Frigate is being too impatient with a large number of cameras

If this is a case of too many cameras for the storage to handle, I would disagree with "being too impatient". With more than 5 segments per camera sitting in cache, restarting Frigate or the host would mean at least a minute of the most recent footage being lost. This of course also covers cases like power loss, someone purposely unplugging the computer, etc.

If Frigate did not limit the cache to recent segments, the list could grow increasingly long, creating the potential for much worse footage loss.

As far as ways to improve the scenario, I would suggest that at least some cameras could be moved to a retain mode of motion instead of all, since the only segments not kept would be ones where nothing happened (no motion). This would reduce the number of segments that need to be moved for that camera, reducing the pressure on other cameras.
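For illustration, a per-camera override along these lines would keep only motion segments for that camera while the global record settings stay as they are (just a sketch; "driveway" is used purely as an example camera name taken from the stats above):

cameras:
  driveway:
    record:
      retain:
        days: 14
        mode: motion   # keep only segments that overlap detected motion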

ccutrer commented 1 year ago

You make good points about being more "patient" actually causing problems.

One thing I noticed in the log is that copies seem to happen in alphabetical order, meaning all 5 segments for cameras A through Y are copied before the oldest segment of camera Z is even attempted. I think what I was getting at with "impatient" is that perhaps this behavior makes cameras near the end of the list more likely to lose segments. Though as I think about it, you're right that if the disk transfer rate can't keep up, segments will be lost no matter what, and it's just a matter of which cameras lose them. And I can't think of a reason why it should prefer to drop the "oldest" segments from all cameras vs. more segments from certain cameras.

while the storage times are reasonably fast, they are not fast enough for 33 cameras

With 33 cameras, and thus 33 segments every 10s, I need to copy 3.3 segments per second on average, or one segment every 0.3s. My average copy time of 0.17s stays well under that, but my max of 0.84s does not, so it's definitely in the realm of possibility that when the disk is slow, it simply can't keep up. This seems like an argument for making the number of cached segments configurable, so one can knowingly trade a longer latency getting segments to permanent storage for the additional RAM use and the larger potential loss when unexpectedly interrupted.

Additionally, I stopped Frigate, and ran a disk speed check:

$ dd if=/dev/zero of=frigate/storage/recordings/test1.img bs=1G count=20 oflag=dsync
20+0 records in
20+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 227.014 s, 94.6 MB/s

Definitely seems like the raw disk throughput is more than enough to handle ~30MB/s of recording data, and if Frigate is unable to achieve somewhere near that, a ~3x overhead seems a bit extreme.

NickM-27 commented 1 year ago

You make good points about being more "patient" actually causing problems.

One thing I noticed in the log is that copies seem to happen in alphabetical order, meaning all 5 segments for cameras A through Y are copied before the oldest segment of camera Z is even attempted. I think what I was getting at with "impatient" is that perhaps this behavior makes cameras near the end of the list more likely to lose segments. Though as I think about it, you're right that if the disk transfer rate can't keep up, segments will be lost no matter what, and it's just a matter of which cameras lose them. And I can't think of a reason why it should prefer to drop the "oldest" segments from all cameras vs. more segments from certain cameras.

To be clear, the limit of 5 is per camera. Frigate also runs through ALL cameras before it loops back around to move more segments from cameras it has already looked at. The list of recordings is also not changed while Frigate is moving existing segments, so I am not sure this is the case.

while the storage times are reasonably fast, they are not fast enough for 33 cameras

With 33 cameras, and thus 33 segments every 10s, I need to copy 3.3 segments per second on average, or one segment every 0.3s. My average copy time of 0.17s stays well under that, but my max of 0.84s does not, so it's definitely in the realm of possibility that when the disk is slow, it simply can't keep up. This seems like an argument for making the number of cached segments configurable, so one can knowingly trade a longer latency getting segments to permanent storage for the additional RAM use and the larger potential loss when unexpectedly interrupted.

Additionally, I stopped Frigate, and ran a disk speed check:

$ dd if=/dev/zero of=frigate/storage/recordings/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.34069 s, 201 MB/s

Definitely seems like the raw disk throughput is more than enough to handle ~30MB/s of recording data, and if Frigate is unable to achieve somewhere near that, a 6-7x overhead seems a bit extreme.

That's talking about sequential write of a single file though which is very different from random write of small files. Same reason why transferring a bunch of jpgs is slower than transferring one large mp4 of the same total size.

https://superuser.com/a/1611851

The option for customizing the number of segments kept in cache has been discussed and may come in the future, but it likely won't be a part of 0.12 https://github.com/blakeblackshear/frigate/pull/5419#issuecomment-1421665760

ccutrer commented 1 year ago

That's talking about sequential write of a single file though which is very different from random write of small files. Same reason why transferring a bunch of jpgs is slower than transferring one large mp4 of the same size.

Indeed, there are differences. But I definitely would not characterize Frigate as "random writes of small files". Frigate's writes are neither random (which, when talking about disk I/O, usually refers to having to do lots of reads at the same time as writes, or reading from files physically located all over the platter; Frigate is 99% writes, without really modifying anything else, which translates to a really long contiguous write of sectors on disk), nor small (my segments average almost exactly 5MB; I used to work on a petabyte-scale distributed storage system, and while we were definitely aware of throughput problems with small files, we didn't consider a file small unless it was 50KB or less).

Anyhow, I ran a test copying 1.7GB worth of actual segments. I would expect this to be slower than Frigate could accomplish, because I'm reading from the same volume I'm writing to (so both read time and seek time). It moved 360 files (1.7GB) in 22.5s at ~75MB/s (0.06s per file). Again pointing to Frigate having some hidden unnecessary overhead.

NickM-27 commented 1 year ago

Anyhow, I ran a test copying 1.7GB worth of actual segments. I would expect this to be slower than Frigate could accomplish, because I'm reading from the same volume I'm writing to (so both read time and seek time). It moved 360 files (1.7GB) in 22.5s at ~75MB/s (0.06s per file). Again pointing to Frigate having some hidden unnecessary overhead.

Calling it unnecessary seems a bit presumptuous. One thing I remember from looking at the code is that it is not a direct copy: when a segment is moved from cache to storage, Frigate uses ffmpeg to move the moov atom to the beginning of the mp4 file (it is typically at the end) to aid faster metadata reading by nginx.

https://github.com/blakeblackshear/frigate/blob/1bf3b83ef38d25238d5d6ef1bcbbf28dc9386815/frigate/record.py#L303-L331

ccutrer commented 1 year ago

😂 you're right, that was a bit presumptuous. I'll correct that to "...overhead, possibly unnecessary...". Running it through ffmpeg seems like a pretty useful step to take.

Now to test how much that adds:

time for f in mud/*; do ffmpeg -y -i $f -c copy -movflags +faststart test/$(basename $f); done

Took 47s, 36MB/s, or 0.13s each. That accounts for the majority of the extra time. While doing that, neither I/O nor CPU was showing any stress at all. Now I'm worried that no matter how much faster I make my disk (by adding more disks in a proper RAID-0), the time will usually be dominated by ffmpeg processing the segments :(.

NickM-27 commented 1 year ago

Potentially. Like I said, setting a few cameras to motion retain mode should help. Another thing that would help is having an SSD as a write cache.

Perhaps in the future frigate could use asyncio to run this logic in multiple threads.

ccutrer commented 1 year ago

I'm trying out motion retain mode for all cameras. This is a big leap of trust in Frigate to not miss anything!

another thing that would help is having an SSD as a write cache.

Under the theory that it would speed up Frigate's synchronous writing flow, which is the critical process, and that, since the SSD cache isn't stymied by the backend disk transfer rate, it would have no trouble keeping inflow matched to outflow?

Perhaps in the future frigate could use asyncio to run this logic in multiple threads.

Oooh yeah, good idea.

for f in mud/*; do (ffmpeg -y -i $f -c copy -movflags +faststart test/$(basename $f)) & done; wait

gets me down to ~7s, or ~244MB/s (effectively ~0.02s per file)! Seems like a massive win! (Again, you're correct that these results won't be 100% predictive of Frigate doing a similar thing internally.) Frigate would likely limit itself to one thread per CPU core, and maybe not even bother if there are fewer than 10 segments to archive. And if it's really the disk slowing down significantly on occasion, parallelism will just make that problem more obvious, rather than actually fix it.
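To illustrate the idea, the same parallel faststart remux could be expressed in Python with a thread pool; each ffmpeg invocation is an external process, so the threads spend their time waiting on subprocesses rather than holding the GIL. This is just a sketch, not Frigate's actual code, and the paths and worker count are made up:

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CACHE = Path("/tmp/cache")                 # hypothetical source of cached segments
DEST = Path("/media/frigate/recordings")   # hypothetical destination directory

def remux(segment: Path) -> None:
    # Same remux as the shell loop above: copy streams and move the moov
    # atom to the front of the file so metadata can be read quickly.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(segment), "-c", "copy",
         "-movflags", "+faststart", str(DEST / segment.name)],
        check=True, capture_output=True,
    )

with ThreadPoolExecutor(max_workers=4) as pool:  # e.g. one worker per core
    list(pool.map(remux, sorted(CACHE.glob("*.mp4"))))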

blakeblackshear commented 1 year ago

It's not just the recordings-related activities. The thread that manages recordings is one of many threads. It shares a single CPU with lots of other parts of Frigate's processing. Many parts of Frigate are in dedicated processes, but this isn't one of them. My guess is that there just isn't enough time to go around at some point when there are a lot of other things happening with lots of cameras.

We are already talking about some architecture changes that would eliminate this contention problem.

It doesn't look like you have configured a db location. You may want to move your sqlite db to a faster drive if possible. https://deploy-preview-4055--frigate-docs.netlify.app/frigate/installation#storage
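For example, the database location can be pointed at a faster drive in the config. This is just a sketch, assuming a /db volume mounted from the SSD:

database:
  path: /db/frigate.db   # assumes /db is a volume mounted from a fast (SSD) drive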

ccutrer commented 1 year ago

The thread that manages recordings is one of many threads. It shares a single CPU with lots of other parts of Frigate's processing.

But different threads can execute on different CPUs, no? Or is this because of the infamous Python GIL? (I'm not much of a Python guy, so I just know that a GIL exists, not the intimate details of what is constrained by it and what's not).

We are already talking about some architecture changes that would eliminate this contention problem.

👍

It doesn't look like you have configured a db location. You may want to move your sqlite db to a faster drive if possible.

Only my recordings directory is mounted from my big-slow-RAID-array. The rest of the storage directory is on my root volume, which is an SSD. I shouldn't need to configure anything to change that, no?

blakeblackshear commented 1 year ago

But different threads can execute on different CPUs, no?

Because of the infamous GIL, only one Python thread executes at a time, so effectively they all share a single CPU.

The rest of the storage directory is on my root volume, which is an SSD. I shouldn't need to configure anything to change that, no?

Are you actually running the Addon in HassOS? If not, can you provide your compose file? The db is stored at /media/frigate/ by default, so yea it needs to be changed if you want it somewhere else.

ccutrer commented 1 year ago

Docker Compose on Ubuntu host.

docker-compose.yml (located at /home/cody/docker/docker-compose.yml on the host):

version: "3.9"
services:
  frigate:
    container_name: frigate
    privileged: true
    restart: unless-stopped
    image: ghcr.io/blakeblackshear/frigate:0.12.0-beta8
    shm_size: "512mb"
    devices:
      - /dev/bus/usb:/dev/bus/usb
      - /dev/dri/renderD128
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./frigate/config.yml:/config/config.yml
      - ./frigate/storage:/media/frigate
      - type: tmpfs
        target: /tmp/cache
        tmpfs:
          size: 2G
    ports:
      - 5000:5000
      - 8554:8554 # RTSP feeds
      - 8555:8555/tcp # WebRTC over tcp
      - 8555:8555/udp # WebRTC over udp

/etc/fstab:

# <file system> <mount point>   <type>  <options>       <dump>  <pass>
/dev/disk/by-id/dm-uuid-LVM-SGiok68jVsOcFj99cQNc1yf5Z7X7244l6OZdIgwvyQTBFdRFCvP3CBCB56J8B6y1 / ext4 defaults 0 1
/dev/disk/by-uuid/726a8f95-8dbf-4f7c-9be9-9d37429f5eea /boot ext4 defaults 0 1
/dev/disk/by-uuid/117B-B46D /boot/efi vfat defaults 0 1
/swap.img   none    swap    sw  0   0
/dev/mapper/nvr-recordings /home/cody/docker/frigate/storage/recordings ext4 defaults 0 0

(nvr-recordings is my LVM volume composed of 6x WD Purple drives of varying sizes in a RAID-0 configuration, but due to their non-uniform size the volume was built two at a time)

blakeblackshear commented 1 year ago

I would recommend trying to put your database somewhere other than nvr-recordings to see if that helps. You should be able to follow the examples in the docs I linked to change the location. You stop the frigate container, move the existing .db file to the new location and then start back up.

ccutrer commented 1 year ago

??

My database is not on the nvr-recordings volume. Unless there's something about docker I don't understand. nvr-recordings only maps to storage/recordings inside docker; just storage is on the root volume.

blakeblackshear commented 1 year ago

You are right. I misread your fstab.

ccutrer commented 1 year ago

Just to prove it to myself even more:

from outside the container:

cody@alabama:~/docker/frigate$ df -h storage/frigate.db
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv  226G  112G  103G  53% /
cody@alabama:~/docker/frigate$ df -h storage/recordings/
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/nvr-recordings   26T   17T  7.3T  70% /home/cody/docker/frigate/storage/recordings

from inside the container:

root@f7f0f02be3d9:/media/frigate# df -h frigate.db
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv  226G  112G  103G  53% /media/frigate
root@f7f0f02be3d9:/media/frigate# df -h recordings/
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/nvr-recordings   26T   17T  7.3T  70% /media/frigate/recordings

blakeblackshear commented 1 year ago

I think you are just bumping up against the limits of what's possible within the current architecture at those copy speeds. All my copy times are <0.09s. There are enough other things to do with >30 cameras in the shared processes that it just can't get it done.

Here are the workarounds I can think of until we get to re-architect some things:

LaurenceGough commented 1 year ago

I wish we could just have an option to increase the keep count for recording segments. I think I, like others using a "slower" NAS, could then avoid having to purchase a whole new PC to replace my Raspberry Pi.

I don't get it: my dedicated Frigate NAS is plenty fast, transferring files at around 100MBps whenever I test it, but Frigate chokes up like mad and starts discarding a load of recordings at various random times many times a day, so I end up missing vital clips, the most important feature of CCTV... This should never happen, in my opinion.

CPU load is quite low, never above 60% on the pi4 cores. Nothing else uses the NAS and the only thing running on the Pi4 is home assistant which hardly uses any resources and is the most basic setup.

I don't want to write to a SSD for recordings and cause high wear.

I guess I'll have to either pay up for a new mini computer or I might have a go at building my own Frigate version with this small one line change which will probably take me many hours (read days) as I'm useless at these things.

What's interesting is this issue seems to be getting worse every week which makes me suspect it's something to do with the database files (database is stored on local SSD). I might have a go at clearing that and starting fresh.

Only recordings is mounted on the NAS.

NickM-27 commented 1 year ago

No one ever said an option in the config wouldn't happen, but we're talking about ways to actually fix the issue as opposed to a bandaid that makes it take longer to appear. Especially in the OP's case, where the storage is fast and there simply isn't enough CPU time, alongside all the other work, to keep up with all the segments.

blakeblackshear commented 1 year ago

Can one of you run a custom build with the limit increased to 20? I'm not convinced that will even help. Over time, you will most likely just hit the new limit anyway.

LaurenceGough commented 1 year ago

Can one of you run a custom build with the limit increased to 20? I'm not convinced that will even help. Over time, you will most likely just hit the new limit anyway.

I'm going to try and do it tonight / over the next few days as it'll probably take me longer 😭

Due to the gap between them I believe it'll recover but let's test.

NickM-27 commented 1 year ago

I have created a build at crzynik/frigate:rec-seg-20 which sets the segment count to 20 and also adds a log line showing how many segments are currently in the cache.

ccutrer commented 1 year ago

I've had motion recording enabled for all cameras for about a week now, and have only seen this issue with a single camera. That camera is my only camera on WiFi, so I chalk up any issues unique to it to poor connectivity. I haven't had any issues (yet) with missing a recording of something I wish I had. 🤞. I'm just a personal user, so needing recordings to prove a lack of activity is only a minor factor for me.

LaurenceGough commented 1 year ago

I have created a build at crzynik/frigate:rec-seg-20 which sets the segment count to 20 and also adds a log line showing how many segments are currently in the cache.

Hi Nick,

I just wanted to say thanks so much for being so kind and creating that build for me, you sure saved me many hours. The additional logging was really useful too.

I made detailed notes on the issue, but unfortunately, like a fool, I didn't save them before my PC hard-crashed due to an nvidia driver issue :(, so I have fewer details and logs, but I will try my best to explain it.

The segment drop issue showed up as stretches of elevated segment counts lasting 15-25 minutes, at completely random times, 4-6 times a day. The build that you kindly created was better, but during these stretches I recall the count going up to 50-60 at one point!! Other times it was around 30-40.

I checked the full syslogs on my Raspberry Pi and NAS and any other logs, any cron jobs that I could find. There was absolutely nothing apart from Frigate segment drop messages. CPU usage never above 50% on either device. RAM usage was high on the NAS but kind of expected as it only has 256MB RAM and is probably caching a lot. Nothing really obvious using the RAM, SMB wasn't using much.

Anyway, I thought I'd change the NAS to use NFS rather than SMB/CIFS/Samba. The difference is night and day, with exactly the same setup otherwise. Netgear decided to make the share path quite different when using NFS, so that caught me out for a little bit... PS for anyone reading: use showmount -e to check what NFS path is being advertised; it's not always what you set. The nfs-common package needs to be installed to act as an NFS client and to run showmount.
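For anyone following along, the check looks something like this on Debian/Ubuntu-based systems (the NAS hostname is hypothetical; nfs-common provides showmount):

sudo apt install nfs-common   # NFS client tools, including showmount
showmount -e nas.local        # list the export paths the NAS is actually advertising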

However, there were still about 3 times it went to a high of 6-8 segments for a few minutes since I performed this fix ~20 hours ago. The duration is clearly lower and it went no higher than a max of 8 segments at one time, a big improvement. Best of all, no recording losses for the first time in ages!!!!

Perhaps the issue was with SMB/Samba oplocks, although I believe those are requested by the client, which would be either Frigate, Docker, or Portainer; I am not quite sure what would be considered the client in this case. The version of NFS I use doesn't support oplocks.

Hopefully this helps at least one other person.

Do you think changing this max segment count to 10 or so would be reasonable for everyone, or would it have a large negative effect for some users with many cameras?

It looks like I can avoid purchasing new hardware for now. I was looking at 3.5" USB enclosures but none of them here are designed for 24/7 use, a new NAS would be £100s and the only alternative would be to find some kind of device with a SATA port that is fairly efficient with some HDD cooling unless I hacked something together with a fan and my 3D printer :).

Are you guys using 2.5" or 3.5" HDDs for recordings?

Thanks again

NickM-27 commented 1 year ago

Thanks for the update, that is good to know for sure. I believe, as Blake said, the acceptable solution would be making the maximum segment count configurable, not a hard-coded increase. At the end of the day, that will always be a bandaid fix compared to the architecture changes that will come in the future.

I use 3.5" HDDs along with an SSD write cache pool

YBonline commented 1 year ago

I've started to run into this as well, as I increased from 30 to 35 cameras, even with it configured for motion recording only. The issue didn't happen when I was at 30, but at 35 it happens on all my cameras when there is a lot of motion activity at once (usually the issue doesn't appear until 20 cameras have motion at the same time). Increasing the segment count from 5 to 20 made the issue go away for the first 5 or 6 hours, but after 24 hours of running, it's back with the segment issues, and my /tmp/cache is filling up again (putting it back to 5 stopped that).

My storage is SATA, 2x Samsung 870 EVO 2TB drives in a striped (RAID-0) ZFS setup. Would I get performance improvements by switching to ext4 and storing half the cameras on one drive and half on the other? Would that cause issues with the auto-expiration logic? I was thinking about adding another drive or two since I'm adding cameras; is adding to the existing ZFS array the better choice?

tv21 commented 1 year ago

Was this fix put into version 0.12.1-367d724? I ask because we are seeing this issue with only four cameras, and the recordings are being saved to a standard hard drive that is part of the computer that runs Frigate, not some network location.

NickM-27 commented 1 year ago

Was this fix put into version 0.12.1-367d724? I ask because we are seeing this issue with only four cameras, and the recordings are being saved to a standard hard drive that is part of the computer that runs Frigate, not some network location.

It is not mentioned in the release notes of 0.12.1 so it was not included

no2chem commented 1 year ago

Resuming the conversation from #6458

My setup has 14 cameras and I was seeing this issue, which seems to have gone away after changing keep_count to 20 for now. The cameras are VBR (with a pretty large band, from a target of 5120kbps up to a max of 10240kbps), recording to NFS storage on a single HDD, so the bandwidth is extremely variable. Looking at the logs, sometimes it takes 0.2s to copy a segment, sometimes it takes 1s. There are other workloads on the device as well, which makes the copy rate a bit variable too.

While I'm happy to submit a PR to make keep_count configurable, I'm not sure this is the best approach. We should only start deleting segments if it looks like we're really about to run out of cache. Perhaps checking the space usage of all the recording segments and triggering some sort of cleanup would be the best option.

NickM-27 commented 1 year ago

We should only start deleting segments if it looks like we're really about to run out of cache. Perhaps checking the space usage of all the recording segments and triggering some sort of cleanup would be the best option.

I am not sure I agree with that necessarily, at least not in all cases. For example, if there is a sudden power outage or frigate is restarted, all of the segments in the cache are just gone. It also means that when an event happens, it is possible that there could be a considerable delay before those clips are available.

Imagine a hypothetical case where a user's house was getting broken into and, for whatever reason, the system has a large queue of segments: if Frigate is still working on moving old segments to storage, then the new segments containing the footage of the break-in could never have been copied by the time the computer was unplugged. Obviously very hypothetical, but something to consider.

At the same time though, I can see that the preference for new segments over old segments isn't always going to be the right choice either.

NickM-27 commented 1 year ago

That being said, I definitely can see how different users would have different preferences between keeping all segments regardless of how large the cache gets vs. favoring recent segments, so perhaps an option to configure a hard keep_count or a maximum cache usage would make sense.

no2chem commented 1 year ago

I think the fundamental problem here is that some people have systems with variable bandwidth, and keep_count unnecessarily deletes recordings when there are transient spikes. Transient spikes can happen for all sorts of reasons: bandwidth isn't necessarily fixed in real systems, heat can cause power throttling, and traffic spikes in real networks can cause network shares to get backed up. Frigate should not silently drop frames because of that.

For me personally, this really sucked: we had an incident this morning, and because there were more than 5 segments in the cache, those recordings were dropped, so we literally don't have the footage anymore. I used to have a bunch of rock-solid ffmpeg containers that just pulled recordings from the camera to disk, so Frigate should at minimum be able to do that.

The power outage example is a really weird edge case. For what you described to become a problem the system would have to be experiencing a transient load spike at the exact time the event was happening, and some sequence of events where the first 5 segments are not relevant but the rest of them are would have to occur. More than likely the first 5 segments are highly relevant.

I think the general expectation should be that frigate shouldn't lose recording clips, and it should be considered a system failure if it did. So actually, now after typing all of that, I think that we should delete lines 113-124 in maintainer.py, and frigate should really just crash if it runs out of cache space. This accomplishes two things:

First, if the system really cannot keep up with the incoming camera bandwidth, Frigate fails (probably with ENOSPC / -13 in the log), so the user knows they need to adjust their camera recording settings or add a faster disk. I can't really think of a reason someone would want random subsets of recordings.

Second, if there is a transient spike frigate will actually try to use as much of the cache as it possibly can to absorb it. If it cannot absorb the spike presumably it will fail and then docker-compose will restart it.

Perhaps a little better than just failing would be to copy out all segments if ENOSPC occurs, so nothing already in the cache is lost.

NickM-27 commented 1 year ago

I think that we should delete lines 113-124 in maintainer.py, and frigate should really just crash if it runs out of cache space.

Just deleting those lines would simply mean that the ffmpeg process would fail with an out-of-space error, Frigate would continue working as-is (including moving segments from the cache), and ffmpeg would keep restarting until it was able to sustain itself without running out of space.

Even in the case that Frigate did crash, that would mean all future recordings would be lost until the user noticed and fixed / restarted Frigate (I don't think it is safe to assume everyone sets their addons or Docker containers to restart on failure, since it is not the default option). This is a similar problem to before, when Frigate would fail if the user's host storage was full. That is IMO a regression, and Frigate should never crash / just stop in these circumstances.

All of that being said, I do understand the use case where a user would want all recordings to be kept unless their system resources truly did not allow for it. @blakeblackshear what do you think makes sense for this case?

no2chem commented 1 year ago

Well, you're right, perhaps that is a little drastic. But I think that not losing data by default is how Frigate should work; this is how nearly every other piece of software behaves. Postgres doesn't start dropping inserts if it thinks too many rows are being inserted at once, and I'm pretty sure it just crashes if the disk runs out of space (what can it do?). So I don't think that having Frigate crash is bad in this case (recordings would stop, but what can Frigate do? This would prompt the user to fix the issue!). Part of this is converting a transient (potentially undetected) error into a fail-stop error that is much easier to debug and diagnose.

Another option would be to just catch ENOSPC and purge the cache in an attempt to free space. But I see this as what @blakeblackshear was getting at earlier in the thread: if you are getting ENOSPC, you're probably hosed anyway because your disks are too slow, and what you really need to do is reduce your camera bitrate / framerate!

Perhaps a compromise would be to have Frigate purge the cache in case it gets near ENOSPC, and warn the user very prominently that it had to flush the cache due to lack of resources.

I think this is significantly better than setting some sort of limit and having the user figure out how much they need to keep their system stable. And there could be something in the readme along the lines of "I got a cache flush error, what do I do?"

So I think we should catch when ffmpeg crashes (presumably it does that when /tmp/cache runs out of space), flush the entire cache by copying out whatever segments were recorded, and display a warning somewhere in the frontend and via MQTT that there was a cache flush.

Now, for the concern that we'd drop some recent frames that way: sure, but I don't see how that's worse than dropping old frames; either way you pick something to drop! And in some ways it's better, because at least you get a contiguous history. And remember, that's only in the really unlikely case you hit ENOSPC in the first place!

blakeblackshear commented 1 year ago

The challenge is that ffmpeg crashes for all sorts of reasons unrelated to ENOSPC. Also, usage of /tmp/cache is a little unpredictable since the clip.mp4 endpoint files are also written there.

Postgres isn't really an appropriate comparison. This part of Frigate is more like a stream processor. Data is streaming in and you need to inspect it a bit before passing it along. There is only so much memory available for caching incoming data, so if you get too far behind, what do you do? You can't simply start rejecting requests like a database because the cameras won't stop sending data and all that data will be lost. If you crash, that fixes nothing and you just start back up with the same problem and all the data sent during the restart is lost.

If the goal is to minimize data loss, I think we already have the right approach. Go as fast as you can and log to the user when it isn't able to keep up, but drop some segments to keep things running. This results in the minimum amount of data lost. The question is really about how to best manage the cache, and it probably makes sense to let the limit scale with the available cache size in hopes that maybe it will catch back up, warning users that they are behind. Users that really want to be sure they don't lose anything can mount a persistent disk at /tmp/cache, but that will slow things down. We don't want to purge this because Frigate will try and recover those segments on startup and the goal is to minimize data loss.

I also used to run a rock solid ffmpeg process for years to just write segments directly to disk, but lots of users don't want their disks spinning constantly, which is why the cache was introduced.

I think there is still a lot to explore to get the recordings maintainer to never be more than 5 segments (50 seconds) behind. Now that it's in a dedicated process, we should be able to give that process priority. I still think there are lots of options to optimize the maintainer too. In theory, as long as segments can be moved out of the cache as fast as they come in, it should be possible to keep up. We just need to keep the time spent moving each round of segments under the 10s segment length, which should be possible.
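As a rough illustration of the cache-scaled limit idea mentioned above (purely a sketch, not Frigate's code; the ~5 MB average segment size comes from earlier in this thread):

import shutil

MIN_KEEP = 5                          # current hard floor per camera
AVG_SEGMENT_BYTES = 5 * 1024 * 1024   # ~5 MB per 10 s segment, per this thread

def max_segments_to_keep(cache_path: str = "/tmp/cache") -> int:
    # Scale the per-camera limit with free cache space, keeping half of the
    # free space as headroom for things like clip.mp4 exports. A real
    # implementation would likely also account for the number of cameras.
    free = shutil.disk_usage(cache_path).free
    return max(MIN_KEEP, int(free // 2 // AVG_SEGMENT_BYTES))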

no2chem commented 1 year ago

Well, I'm sure we could have a very long discussion about what's best, but I'll just say that the introduction of this cache has changed Frigate from being rock solid to losing data, which I would classify as a major regression, and the fix for my setup is to just let it use more of the cache so it can absorb the hit.

If the goal is to minimize data loss, I think we already have the right approach. Go as fast as you can and log to the user when it isn't able to keep up, but drop some segments to keep things running. This results in the minimum amount of data lost.

But it doesn't: without the record "maintainer", my setup would never have lost any data. And neither would @ccutrer's.

We don't want to purge this because Frigate will try and recover those segments on startup and the goal is to minimize data loss.

So, always exhaust all available space in the cache and don't use the maintainer. The whole problem is that the maintainer causes data loss by calling unlink. And the only way to know that is to read the log and know what a very specific log message means.

ccutrer commented 1 year ago

I’m not sure that’s true. How long have you been using Frigate? 0.12, and I believe 0.11 (about when I started) both had the cache. The maintainer (which is new, and actually helped improve the archiving speed) is not what caused me to start dropping segments… I was (unknowingly) on the brink of running out of CPU resources, and a performance regression in the early days of the 0.13 dev branch pushed me over the edge. That has been rectified, and I’ve since upgraded my CPU anyway (I fixed a bug that the stats page didn’t show when frames had been dropped, and I realized I was still dropping frames occasionally anyway). I haven’t had a problem since.

Now, I agree that the five segment limit can be low, especially if I have plenty of available RAM to willingly dedicate to the occasional IO slow down, or CPU spike slowing down archiving.

NickM-27 commented 1 year ago

I think maintainer is being used to describe a lot of different things here.

The maintainer, as far as Frigate is concerned, is the class that handles the logic of moving segments from the cache to disk (when the retention config specifies it) and also inserting the recording info into the db.

The maintainer has been around for many releases now (at least since 0.10, when I started contributing).

In 0.13, the recordings cleanup and maintainer were broken out of the main process into their own process, along with other multi-threading improvements.

The cache is important for users that don't want to record every single second and only keep motion or object recordings. In this case writing and then deleting directly to disk is wasteful and wears the storage much faster than writing to cache.

I've introduced a PR to make the segment keep count dynamic based on the size of the cache. https://github.com/blakeblackshear/frigate/pull/7265

blakeblackshear commented 1 year ago

You are misunderstanding the sequence of changes:

  1. The cache and code to move segments from cache to disk has been around for many versions. Frigate used to just crash and not recover when the cache filled up.
  2. A limit was implemented to prevent crashing in 0.11. The thinking was that if you were more than a minute behind realtime, then something is really wrong.
  3. In 0.12, a log message was added to warn users that segments were being dropped.
  4. In 0.13, the maintainer was moved to a dedicated process rather than a thread in the main program to avoid sharing resources with other processing on the GIL.
  5. An unrelated change in 0.13 dev builds caused the CPU usage to rise generally to the point that even a dedicated process couldn't keep up for some users.

Clearly, there is still a lot of room for improvement here as it should be able to keep up if copy times are under the segment time of 10s.

Just to clarify, are you running recent 0.13 development builds?

Redsandro commented 1 year ago

Would it help to mount /tmp/cache on an SSD when Frigate currently uses HDDs and the cache is not mounted separately yet because you don't have enough memory for tmpfs (a RAM drive)?

How long are segments stored in the cache before they are moved to storage? During an event, or only when the event is finished? The documented tmpfs recommendation is 1GB, but that seems pretty large, so I am probably misunderstanding how it works.

NickM-27 commented 1 year ago

Would it help to mount /tmp/cache on an SSD when Frigate currently uses HDDs and the cache is not mounted separately yet because you don't have enough memory for tmpfs (a RAM drive)?

That would only slow down the recording segment management, making the problem worse

How long are segments stored in the cache before they are moved to storage? During an event, or only when the event is finished? The documented tmpfs recommendation is 1GB, but that seems pretty large, so I am probably misunderstanding how it works.

The segments are moved as soon as possible if they fit the recording retention config. The reason the /tmp/cache is large is because when a user downloads an mp4 clip for an event, it is assembled from segments in /tmp/cache

Redsandro commented 1 year ago

The segments are moved as soon as possible if they fit the recording retention config. The reason the /tmp/cache is large is because when a user downloads an mp4 clip for an event, it is assembled from segments in /tmp/cache

This makes sense! Thank you.

That would only slow down the recording segment management, making the problem worse

Is the opposite true, or is it more nuanced than that? E.g.: If you use the docker image, and docker lib (including volumes) lives on a SSD, but /media/frigate is mounted to a HDD for large storage purposes, does it make sense to explicitly mount /tmp/cache to that same (slower) HDD because Frigate will actually perform faster?

NickM-27 commented 1 year ago

Is the opposite true, or is it more nuanced than that? E.g.: If you use the docker image, and docker lib (including volumes) lives on a SSD, but /media/frigate is mounted to a HDD for large storage purposes, does it make sense to explicitly mount /tmp/cache to that same (slower) HDD because Frigate will actually perform faster?

No it doesn't, because then the same HDD is being used for reading and writing at the same time which will be much slower.

RAM is always recommended because it will always be faster than the SSD and especially HDD, meaning Frigate running segment_time metadata reading, optimizing the segment, etc. will all be done faster on RAM. Also because using RAM means there is no unnecessary wear due to writing segments on the SSD / HDD that are not retained

Redsandro commented 1 year ago

No it doesn't, because then the same HDD is being used for reading and writing at the same time which will be much slower.

But this was the case in my initial question :smile: I think I phrased my thoughts badly. I'm trying to find out what the best alternative is when memory is too scarce.

Initially I thought mounting the same HDD (same partition) would just move (as in rename) the segments, and that would be much faster than copying from SSD to HDD. But I understand Frigate also remuxes and concatenates segments in the cache. So it should just be on whichever drive is fastest, regardless of where media is stored?

When /media/frigate is on    /tmp/cache should be on
HDD                          tmpfs preferred, alternatively SSD
SSD                          tmpfs preferred, alternatively SSD

I could probably get away with a small tmpfs of 96 MB, as long as I don't download any long events. Then Frigate will be able to keep up with recording segments in cache.

NickM-27 commented 1 year ago

Right, it is never a move so the data will always be re-written and that will be slower.

tmpfs should always be set up for /tmp/cache, because you want the benefits of reduced wear and increased speed for recording segment management. When you increase /tmp/cache it is not pre-allocated, meaning you are just setting a limit; if Frigate is not currently using that space, it can still be used by other services on the system.

In 0.13 there are quite a few improvements for recording management as well as a new recording exporting feature that the blueprint will hopefully be able to migrate to, and won't have this /tmp/cache issue

Redsandro commented 1 year ago

Thank you. I think slightly amended documentation may be helpful. Your extra explanation is helpful to me, but to keep the docs from unnecessary verbosity, we can link to tmpfs on Wikipedia, which explains the same concept. I've created a PR for your convenience, but feel free to alter or reject it.