Curious what @blakeblackshear thinks but my understanding is that, while the storage times are reasonably fast, they are not fast enough for 33 cameras and there are some cameras which end up with too many segments sitting idle in memory.
but it sure seems like either Frigate is being too impatient with a large number of cameras
If this is a case of too many cameras for the storage to handle, I would disagree with "being too impatient". With more than 5 segments for some cameras sitting in cache, restarting frigate / the host would mean at least a minute of the most recent footage was lost. This of course also covers cases like power loss, someone purposefully unplugging the computer, etc.
If frigate did not limit itself to recent segments, then the list could keep getting longer, with the potential for much worse footage loss.
As far as ways to improve the scenario, I would suggest that at least some cameras could be moved to a retain mode of motion instead of all, since the only segments not kept would be ones where nothing happened (no motion). This would reduce the number of segments needing to be moved for that camera, reducing pressure on the other cameras.
You make good points about being more "patient" actually causing problems.
One thing I noticed in the log is that copies seem to happen in alphabetical order, meaning all 5 segments for cameras A through Y are copied before the oldest segment of camera Z is even attempted. What I'm getting at with "impatient" is that perhaps this behavior makes cameras near the end of the list more likely to lose segments. Though as I think about it, you're right: if the disk transfer rate can't keep up, segments will be lost no matter what, and it's just a matter of which cameras lose them. And I can't think of a reason why it should prefer to drop the "oldest" segments from all cameras vs. more segments from certain cameras.
while the storage times are reasonably fast, they are not fast enough for 33 cameras
With 33 cameras, and thus 33 new segments every 10s, I need to copy 3.3 segments per second on average, or one segment every 0.3s. My average copy time of 0.17s stays comfortably under that, but my max of 0.84s does not, so it's definitely possible that when things are slow, it simply can't keep up. This seems to be an argument for making the number of cached segments configurable, letting one knowingly trade longer latency before segments reach permanent storage for extra RAM usage and a larger potential loss if the system is unexpectedly interrupted.
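To spell out that arithmetic (a quick back-of-the-envelope script, using the numbers from my setup):

# Rough budget for moving segments out of cache with N cameras.
cameras = 33
segment_seconds = 10                   # one new segment per camera every 10s
avg_copy_s, max_copy_s = 0.17, 0.84    # measured copy times from my logs

budget = segment_seconds / cameras     # ~0.30s available per segment on average
print(f"per-segment budget: {budget:.2f}s")
print(f"average copy fits:  {avg_copy_s <= budget}")   # True
print(f"worst case fits:    {max_copy_s <= budget}")   # False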
Additionally, I stopped Frigate, and ran a disk speed check:
$ dd if=/dev/zero of=frigate/storage/recordings/test1.img bs=1G count=20 oflag=dsync
20+0 records in
20+0 records out
21474836480 bytes (21 GB, 20 GiB) copied, 227.014 s, 94.6 MB/s
Definitely seems like the raw disk throughput is more than enough to handle ~30MB/s of recording data, and if Frigate is unable to achieve somewhere near that, a ~3x overhead seems a bit extreme.
You make good points about being more "patient" actually causing problems.
One thing I noticed in the log is that copies seem to happen in alphabetical order, meaning all 5 segments for cameras A through Y are copied before the oldest segment of camera Z is even attempted. What I'm getting at with "impatient" is that perhaps this behavior makes cameras near the end of the list more likely to lose segments. Though as I think about it, you're right: if the disk transfer rate can't keep up, segments will be lost no matter what, and it's just a matter of which cameras lose them. And I can't think of a reason why it should prefer to drop the "oldest" segments from all cameras vs. more segments from certain cameras.
To be clear the limit of 5 is per camera. Frigate also runs through ALL cameras before it loops back around to move segments from cameras that it already looked at. The list of recordings is also not changed while frigate is moving existing segments so I am not sure this is the case.
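For reference, the behavior being described boils down to roughly this per-camera pruning (an illustrative sketch, not Frigate's actual maintainer code; the names and tuple layout are made up):

import os
from collections import defaultdict

KEEP_PER_CAMERA = 5   # the per-camera limit being discussed

def prune_cache(cached_segments):
    """cached_segments: list of (camera, start_time, path) for files found in /tmp/cache."""
    by_camera = defaultdict(list)
    for camera, start_time, path in cached_segments:
        by_camera[camera].append((start_time, path))
    for camera, items in by_camera.items():
        items.sort(reverse=True)                 # newest first
        for _, path in items[KEEP_PER_CAMERA:]:  # everything older than the newest 5
            os.unlink(path)                      # is discarded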
while the storage times are reasonably fast, they are not fast enough for 33 cameras
With 33 cameras, and thus 33 new segments every 10s, I need to copy 3.3 segments per second on average, or one segment every 0.3s. My average copy time of 0.17s stays comfortably under that, but my max of 0.84s does not, so it's definitely possible that when things are slow, it simply can't keep up. This seems to be an argument for making the number of cached segments configurable, letting one knowingly trade longer latency before segments reach permanent storage for extra RAM usage and a larger potential loss if the system is unexpectedly interrupted.
Additionally, I stopped Frigate, and ran a disk speed check:
$ dd if=/dev/zero of=frigate/storage/recordings/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 5.34069 s, 201 MB/s
Definitely seems like the raw disk throughput is more than enough to handle ~30MB/s of recording data, and if Frigate is unable to achieve somewhere near that, a 6-7x overhead seems a bit extreme.
That's talking about sequential write of a single file though which is very different from random write of small files. Same reason why transferring a bunch of jpgs is slower than transferring one large mp4 of the same total size.
https://superuser.com/a/1611851
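If you want to put a number on that difference on your own hardware, a rough micro-benchmark might look like this (hypothetical paths, fsync per file to approximate oflag=dsync; absolute numbers will vary a lot by disk and filesystem):

import os, time

def timed_write(dirpath, n_files, bytes_per_file):
    """Write n_files of bytes_per_file each, fsync'ing every file."""
    data = os.urandom(bytes_per_file)
    start = time.monotonic()
    for i in range(n_files):
        with open(os.path.join(dirpath, f"bench_{n_files}_{i}.bin"), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
    return time.monotonic() - start

target = "frigate/storage/recordings"        # point this at the disk under test
mib = 1024 * 1024
big = timed_write(target, 1, 500 * mib)      # one 500 MiB sequential write
small = timed_write(target, 100, 5 * mib)    # 100 files of ~5 MiB, like recording segments
print(f"1 x 500MiB: {big:.1f}s   100 x 5MiB: {small:.1f}s")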
The option for customizing the number of segments kept in cache has been discussed and may come in the future, but it likely won't be a part of 0.12 https://github.com/blakeblackshear/frigate/pull/5419#issuecomment-1421665760
That's talking about sequential write of a single file though which is very different from random write of small files. Same reason why transferring a bunch of jpgs is slower than transferring one large mp4 of the same size.
Indeed, there are differences. But I definitely would not characterize Frigate's workload as "random writes of small files". Frigate's writes are neither random (which, when talking about disk I/O, usually refers to doing lots of reads at the same time as writes, or reading from files physically scattered all over the platter; Frigate is 99% writes, without really modifying anything else, which translates to a long contiguous write of sectors on disk), nor small (my segments average almost exactly 5MB; I used to work on a petabyte-scale distributed storage system, and while we were definitely aware of throughput problems with small files, we didn't consider a file small unless it was 50KB or less).
Anyhow, I ran a test of copying 1.7GB worth of actual segments. I would expect this to be slower than Frigate could accomplish, because I'm reading from the same volume I'm writing to (so both read time and seek time). It moved 360 files (1.7GB) in 22.5s at ~75MB/s (0.06s per file). Again pointing to Frigate having some hidden unnecessary overhead.
Anyhow, I ran a test of copying 1.7GB worth of actual segments. I would expect this to be slower than Frigate could accomplish, because I'm reading from the same volume I'm writing to (so both read time and seek time). It moved 360 files (1.7GB) in 22.5s at ~75MB/s (0.06s per file). Again pointing to Frigate having some hidden unnecessary overhead.
Calling it unnecessary seems a bit presumptuous. One thing I remember from looking at the code is that it is not a direct copy: when a segment is moved from cache to storage, frigate uses ffmpeg to move the moov atom to the beginning of the mp4 file (it is typically at the end) to aid faster metadata reading by nginx.
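In other words, the cache-to-storage step is a remux rather than a plain file copy, conceptually something like this (a simplified sketch, not the actual code):

import subprocess

def remux_to_storage(cache_path, storage_path):
    # -c copy avoids re-encoding; +faststart rewrites the file with the
    # moov atom at the front so the metadata can be read quickly.
    subprocess.run(
        ["ffmpeg", "-y", "-i", cache_path, "-c", "copy",
         "-movflags", "+faststart", storage_path],
        check=True, capture_output=True,
    )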
😂 you're right, that was a bit presumptuous. I'll correct that to "...overhead, possibly unnecessary...". Running it through ffmpeg seems like a pretty useful step to take.
Now to test how much that adds:
time for f in mud/*; do ffmpeg -y -i "$f" -c copy -movflags +faststart "test/$(basename "$f")"; done
Took 47s, 36MB/s, or 0.13s each. That accounts for the majority of the extra time. While doing that, neither I/O nor CPU was showing any stress at all. Now I'm worried that no matter how much faster I make my disk (by adding more disks in a proper RAID-0), the time will usually be dominated by ffmpeg processing the segments :(.
Potentially. Like I said, setting a few cameras to motion retain mode should help. Another thing that would help is having an SSD as a write cache.
Perhaps in the future frigate could use asyncio to run this logic in multiple threads.
I'm trying out motion retain mode for all cameras. This is a big leap of trust in Frigate to not miss anything!
another thing that would help is having an SSD as a write cache.
Under the theory that it would speed up the synchronous writing flow of Frigate, which is the critical process, and that flow, no longer stymied by the backend disk transfer rate, would have no trouble keeping up inflow-to-outflow?
Perhaps in the future frigate could use asyncio to run this logic in multiple threads.
Oooh yeah, good idea.
for f in mud/*; do (ffmpeg -y -i "$f" -c copy -movflags +faststart "test/$(basename "$f")") & done; wait
gets me down to ~7s, or ~244MB/s (~0.02s per file)! Seems like a massive win! (Again, you're correct that these results won't be fully predictive of what would happen if Frigate did something similar internally.) Frigate would likely limit itself to one thread per CPU core, and maybe not even bother if there are fewer than 10 segments to archive. And if it's really the disk occasionally slowing down significantly, this will just make that problem more obvious rather than actually fix it.
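For what it's worth, the equivalent of that shell experiment in Python could use a thread pool (a sketch only, reusing my mud/ and test/ directories; since ffmpeg runs as a child process, the Python threads mostly just wait on it, so the GIL isn't really a bottleneck here):

import glob, os, subprocess
from concurrent.futures import ThreadPoolExecutor

def remux(src, dst_dir="test"):
    dst = os.path.join(dst_dir, os.path.basename(src))
    subprocess.run(["ffmpeg", "-y", "-i", src, "-c", "copy",
                    "-movflags", "+faststart", dst],
                   check=True, capture_output=True)

segments = sorted(glob.glob("mud/*"))
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    list(pool.map(remux, segments))   # list() forces completion and surfaces errors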
It's not just the recordings-related activities. The thread that manages recordings is one of many threads. It shares a single CPU with lots of other parts of Frigate's processing. Many parts of Frigate are in dedicated processes, but this isn't one of them. My guess is that there just isn't enough time to go around when there are a lot of other things happening with lots of cameras.
We are already talking about some architecture changes that would eliminate this contention problem.
It doesn't look like you have configured a db location. You may want to move your sqlite db to a faster drive if possible. https://deploy-preview-4055--frigate-docs.netlify.app/frigate/installation#storage
The thread that manages recordings is one of many threads. It shares a single CPU with lots of other parts of Frigate's processing.
But different threads can execute on different CPUs, no? Or is this because of the infamous Python GIL? (I'm not much of a Python guy, so I just know that a GIL exists, not the intimate details of what is constrained by it and what's not).
We are already talking about some architecture changes that would eliminate this contention problem.
👍
It doesn't look like you have configured a db location. You may want to move your sqlite db to a faster drive if possible.
Only my recordings directory is mounted from my big-slow-RAID-array. The rest of the storage directory is on my root volume, which is an SSD. I shouldn't need to configure anything to change that, no?
But different threads can execute on different CPUs, no?
Python threads can be scheduled on different CPUs, but only one of them executes Python bytecode at a time. It's the infamous GIL.
The rest of the storage directory is on my root volume, which is an SSD. I shouldn't need to configure anything to change that, no?
Are you actually running the Addon in HassOS? If not, can you provide your compose file? The db is stored at /media/frigate/ by default, so yea it needs to be changed if you want it somewhere else.
Docker Compose on Ubuntu host. docker-compose.yml (located at /home/cody/docker/docker-compose.yml on the host):
version: "3.9"
services:
  frigate:
    container_name: frigate
    privileged: true
    restart: unless-stopped
    image: ghcr.io/blakeblackshear/frigate:0.12.0-beta8
    shm_size: "512mb"
    devices:
      - /dev/bus/usb:/dev/bus/usb
      - /dev/dri/renderD128
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - ./frigate/config.yml:/config/config.yml
      - ./frigate/storage:/media/frigate
      - type: tmpfs
        target: /tmp/cache
        tmpfs:
          size: 2G
    ports:
      - 5000:5000
      - 8554:8554 # RTSP feeds
      - 8555:8555/tcp # WebRTC over tcp
      - 8555:8555/udp # WebRTC over udp
/etc/fstab:
# <file system> <mount point> <type> <options> <dump> <pass>
/dev/disk/by-id/dm-uuid-LVM-SGiok68jVsOcFj99cQNc1yf5Z7X7244l6OZdIgwvyQTBFdRFCvP3CBCB56J8B6y1 / ext4 defaults 0 1
/dev/disk/by-uuid/726a8f95-8dbf-4f7c-9be9-9d37429f5eea /boot ext4 defaults 0 1
/dev/disk/by-uuid/117B-B46D /boot/efi vfat defaults 0 1
/swap.img none swap sw 0 0
/dev/mapper/nvr-recordings /home/cody/docker/frigate/storage/recordings ext4 defaults 0 0
(nvr-recordings is my LVM volume composed of 6x WD Purple drives of varying sizes in a RAID-0 configuration, but due to their non-uniform size the volume was built two at a time)
I would recommend trying to put your database somewhere other than nvr-recordings to see if that helps. You should be able to follow the examples in the docs I linked to change the location. You stop the frigate container, move the existing .db file to the new location, and then start back up.
??
My database is not on the nvr-recordings volume, unless there's something about docker I don't understand. nvr-recordings only maps to storage/recordings inside docker; just storage is on the root volume.
You are right. I misread your fstab.
Just to prove it to myself even more:
from outside the container:
cody@alabama:~/docker/frigate$ df -h storage/frigate.db
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 226G 112G 103G 53% /
cody@alabama:~/docker/frigate$ df -h storage/recordings/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/nvr-recordings 26T 17T 7.3T 70% /home/cody/docker/frigate/storage/recordings
from inside the container:
root@f7f0f02be3d9:/media/frigate# df -h frigate.db
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv 226G 112G 103G 53% /media/frigate
root@f7f0f02be3d9:/media/frigate# df -h recordings/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/nvr-recordings 26T 17T 7.3T 70% /media/frigate/recordings
I think you are just bumping up against the limits of what's possible within the current architecture at those copy speeds. All my copy times are <0.09s. With >30 cameras, there are enough other things to do in the shared processes that it just can't get it done.
Here are the workarounds I can think of until we get to re-architect some things: use a retain mode of motion on all/some of the cameras, and make sure your motion settings are sensitive enough not to miss any footage you care about while still reducing the number of segments that need to be copied. I do this and it's the best way to be able to scrub a day's worth of footage quickly.

I wish we could just have an option to increase the keep count for recording segments. I think I, like others using a "slower" NAS, could avoid having to purchase a whole new PC to replace my Raspberry Pi.
I don't get it: my Frigate-dedicated NAS is plenty fast, transferring files at around 100MB/s whenever I test it, but Frigate chokes up like mad and starts discarding a load of recordings at various random times many times a day, so I end up missing vital clips, the most important feature of CCTV... This should never happen in my opinion.
CPU load is quite low, never above 60% on the pi4 cores. Nothing else uses the NAS and the only thing running on the Pi4 is home assistant which hardly uses any resources and is the most basic setup.
I don't want to write to a SSD for recordings and cause high wear.
I guess I'll either have to pay up for a new mini computer, or I might have a go at building my own Frigate version with this small one-line change, which will probably take me many hours (read: days) as I'm useless at these things.
What's interesting is this issue seems to be getting worse every week which makes me suspect it's something to do with the database files (database is stored on local SSD). I might have a go at clearing that and starting fresh.
Only recordings is mounted on the NAS.
No one ever said an option in the config wouldn't happen, but we're talking about ways to actually fix the issue as opposed to a bandaid that just makes it take longer to appear. Especially in the OP's case, where the storage is fast and it's simply that there isn't enough CPU time, alongside the other work, to keep up with all the segments.
Can one of you run a custom build with the limit increased to 20? I'm not convinced that will even help. Over time, you will most likely just hit the new limit anyway.
Can one of you run a custom build with the limit increased to 20? I'm not convinced that will even help. Over time, you will most likely just hit the new limit anyway.
I'm going to try and do it tonight / over the next few days as it'll probably take me longer 😭
Due to the gap between them I believe it'll recover but let's test.
I have created a build at crzynik/frigate:rec-seg-20 which sets the segment count to 20 and also adds a log line reporting how many segments are currently stored.
I've changed to motion recording for all cameras for about a week now, and have only seen this issue with a single camera. That camera is my only one on WiFi, so I just chalk up any issues unique to it to poor connectivity. I haven't (yet) had any issues with missing a recording for something I wish I had. 🤞 I'm just a personal user, so the need to have recordings proving a lack of activity is only a minor factor for me.
I have created a build at crzynik/frigate:rec-seg-20 which sets the segment count to 20 and also adds a log line reporting how many segments are currently stored.
Hi Nick,
I just wanted to say thanks so much for being so kind and creating that build for me, you sure saved me many hours. The additional logging was really useful too.
I made detailed notes on the issue, but unfortunately, like a fool, I didn't save them before my PC hard-crashed due to an nvidia driver issue :(, so I have fewer details and logs, but I will try my best to explain.
The segment drop issue was, on average, producing 15-25 minutes of elevated segment counts at completely random times, 4-6 times a day. The build that you kindly created was better, but during these periods I recall the count still going up to 50-60 at one point!! Other times it was around 30-40.
I checked the full syslogs on my Raspberry Pi and NAS and any other logs, any cron jobs that I could find. There was absolutely nothing apart from Frigate segment drop messages. CPU usage never above 50% on either device. RAM usage was high on the NAS but kind of expected as it only has 256MB RAM and is probably caching a lot. Nothing really obvious using the RAM, SMB wasn't using much.
Anyway, I thought I'd change the NAS to use NFS rather than SMB/CIFS/Samba. The difference is night and day, with exactly the same setup otherwise. Netgear decided to make the share path quite different when using NFS, so that caught me out for a little bit... PS for anyone reading: use showmount -e to check what NFS path is being advertised; it's not always what you set. The nfs-common package needs to be installed to act as an NFS client and to run showmount.
However, there have still been about 3 times it went as high as 6-8 segments for a few minutes since I made this change ~20 hours ago. The duration is clearly lower and it went no higher than a maximum of 8 segments at a time, a big improvement. Best of all, no recording losses for the first time in ages!!!!
Perhaps the issue was with SMB/Samba oplocks, although I believe those are requested by the client, which would be either Frigate, Docker or Portainer; I am not quite sure what would be considered the client in this case. The version of NFS I use doesn't support oplocks.
Hopefully this helps at least one other person.
Do you think changing this max segment count to 10 or so would be reasonable for everyone, or would it have a large negative effect for some users with many cameras?
It looks like I can avoid purchasing new hardware for now. I was looking at 3.5" USB enclosures, but none of them here are designed for 24/7 use; a new NAS would be £100s, and the only alternative would be to find some kind of device with a SATA port that is fairly efficient and has some HDD cooling, unless I hacked something together with a fan and my 3D printer :).
Are you guys using 2.5" or 3.5" HDDs for recordings?
Thanks again
Thanks for the update, that is good to know for sure. I believe, as Blake said, the acceptable solution would be making the maximum number of segments configurable, not a hard-coded increase. At the end of the day that will always be a bandaid fix compared to the architecture changes that will come in the future.
I use 3.5" HDDs along with an SSD write cache pool
I've started to run into this as well as I increased from 30 to 35 cameras, even with them configured for motion recording only. The issue didn't happen when I was at 30, but at 35 it happens on all my cameras when there is a lot of motion activity at once (usually the issue doesn't appear until 20 cameras have motion at the same time). Increasing the segment count from 5 to 20 made the issue go away for the first 5 or 6 hours, but after 24 hours of running it's back, along with /tmp/cache filling up again (putting it back to 5 stopped that).
My storage is SATA: 2x Samsung 870 EVO 2TB in a ZFS striped RAID setup. Would I get performance improvements by switching to ext4 and putting half the cameras on one drive and half on the other? Would that cause issues with the logic for auto expiration? I was thinking about adding another drive or two since I'm adding cameras; is adding to the existing ZFS array a better choice?
Was this fix put into version 0.12.1-367d724? I ask because we are seeing this issue with only four cameras, and the recordings are being saved to a standard hard drive that is part of the computer that runs Frigate, not some network location.
Was this fix put into version 0.12.1-367d724? I ask because we are seeing this issue with only four cameras, and the recordings are being saved to a standard hard drive that is part of the computer that runs Frigate, not some network location.
It is not mentioned in the release notes of 0.12.1 so it was not included
Resuming the conversation from #6458
My setup has 14 cameras and I was seeing this issue, which seems to have gone away after changing keep_count to 20 for now. The cameras are VBR (with a pretty large band: max 10240kbps / target 5120kbps), and recordings go to NFS storage on a single HDD, so the bandwidth is extremely variable. Looking at the logs, sometimes it takes 0.2s to copy a segment, sometimes it takes 1s. There are other workloads on the device as well, which makes the copy rate a bit variable too.
While I'm happy to submit a PR to make keep_count configurable, I'm not sure this is the best approach. We should only start deleting segments if it looks like we're really about to run out of cache. Perhaps checking the space usage of all the recording segments and triggering some sort of cleanup would be the best option.
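Something like this, for example (illustrative only; the directory and threshold are made-up knobs):

import shutil

CACHE_DIR = "/tmp/cache"
TRIGGER_FRACTION = 0.80    # only start discarding once the cache is ~80% full

def cache_nearly_full():
    usage = shutil.disk_usage(CACHE_DIR)
    return usage.used / usage.total >= TRIGGER_FRACTION

# The maintainer would then only apply its "keep the N most recent segments"
# pruning when cache_nearly_full() returns True, instead of unconditionally.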
We should only start deleting segments if it looks like we're really about to run out of cache. Perhaps checking the space usage of all the recording segments and triggering some sort of cleanup would be the best option.
I am not sure I agree with that necessarily, at least not in all cases. For example, if there is a sudden power outage or frigate is restarted, all of the segments in the cache are just gone. It also means that when an event happens, there could be a considerable delay before those clips are available.
Imagine a hypothetical case where a user's house is getting broken into and, for whatever reason, the system has a large queue of segments. If frigate is still working on moving old segments to storage, the new segments containing the footage of the break-in could hypothetically never have been copied by the time the computer was unplugged. Obviously very hypothetical, but something to consider.
At the same time though, I can see that the preference for new segments over old segments isn't always going to be the right choice either.
That being said, I can definitely see how different users would have different preferences between keeping all segments regardless of how large the cache gets and favoring recent segments, so perhaps an option to configure a hard keep_count or a max cache usage would make sense.
I think the fundamental problem here is that some people have systems with variable bandwidth, and keep_count unnecessarily deletes recordings when there are transient spikes. Transient spikes can happen for all sorts of reasons: bandwidth isn't necessarily fixed in real systems, heat can cause power throttling, and traffic spikes in real networks can cause network shares to get backed up. Frigate should not silently drop footage because of that.
For me personally, this really sucked because we had an incident this morning and because there were more than 5 elements in the cache, those recordings were dropped, so we literally don't have the footage anymore. I used to have a bunch of rock solid ffmpeg containers that just pulled recordings from the camera to disk, so frigate should at minimum be able to do that.
The power outage example is a really weird edge case. For what you described to become a problem, the system would have to be experiencing a transient load spike at the exact time the event was happening, and some sequence of events would have to occur where the first 5 segments are not relevant but the rest of them are. More than likely the first 5 segments are highly relevant.
I think the general expectation should be that frigate doesn't lose recording clips, and it should be considered a system failure if it does. So actually, now after typing all of that, I think that we should delete lines 113-124 in maintainer.py, and frigate should really just crash if it runs out of cache space. This accomplishes two things:
First, if the system really cannot keep up with the incoming camera bandwidth, frigate fails (probably with an ENOSPC / out-of-space error in the log), so the user knows they need to adjust camera recording settings or add faster disks. I can't really think of a reason that someone would want random subsets of recordings.
Second, if there is a transient spike frigate will actually try to use as much of the cache as it possibly can to absorb it. If it cannot absorb the spike presumably it will fail and then docker-compose will restart it.
Perhaps a little better than just failing would be to copy out all segments if ENOSPC occurs, so there is no loss of the data already in the cache.
I think that we should delete lines 113-124 in maintainer.py, and frigate should really just crash if it runs out of cache space.
Just deleting those lines would simply mean that the ffmpeg process would fail with an out-of-space error; Frigate would continue working as-is (including moving segments from the cache) and ffmpeg would keep restarting until it was able to sustain itself without running out of space.
Even in the case that Frigate did crash, that would mean all future recordings would be lost until the user noticed and fixed / restarted frigate (I don't think it is safe to assume everyone sets their addons or docker to restart on failure, since it is not the default option). This is similar to the old behavior where frigate would fail if the user's host storage was full; going back to that would IMO be a regression, and frigate should never crash / just stop in these circumstances.
All of that being said, I do understand the use case where a user would want all recordings to be kept unless their system resources truly did not allow for it. @blakeblackshear what do you think makes sense for this case?
Well, you're right, perhaps that is a little drastic. But I think that not losing data by default is how Frigate should work; this is how nearly every other piece of software behaves. Postgres doesn't start dropping inserts if it thinks too many rows are being inserted at once, and I'm pretty sure it just crashes if the disk runs out of space (what else can it do?). So I don't think that having Frigate crash is bad in this case (recordings would stop, but what can Frigate do? This would prompt the user to fix the issue!). Part of this is converting a transient (potentially undetected) error into a fail-stop error that is much easier to debug and diagnose.
Another option would be to just catch ENOSPC and purge the cache in an attempt to free space. But I see this as what @blakeblackshear was getting at earlier in the thread: if you are getting ENOSPC, you're probably hosed anyway because your disks are too slow, and what you really need to do is reduce your camera bitrate / framerate!
Perhaps a compromise would be to have Frigate purge the cache in case it gets near ENOSPC, and warn the user very prominently that it had to flush the cache due to lack of resources.
I think this is significantly better than setting some sort of limit and having the user figure out how much they need to keep their system stable. And there could be something in the readme along the lines of "I got a cache flush error, what do I do?"
So I think we should catch when ffmpeg crashes (presumably it does that when /tmp/cache runs out of space), flush the entire cache by copying out whatever segments were recorded, and display a warning somewhere in the frontend and via MQTT that there was a cache flush.
Now, for the concern that we'd drop some recent footage that way: sure, but I don't see how that's worse than dropping old footage; you have to pick something to drop! And in some ways it's better, because at least you get a contiguous history. And remember, that's only in the really unlikely case that you hit ENOSPC in the first place!
The challenge is that ffmpeg crashes for all sorts of reasons unrelated to ENOSPC. Also, usage of /tmp/cache is a little unpredictable since the clip.mp4 endpoint files are also written there.
Postgres isn't really an appropriate comparison. This part of Frigate is more like a stream processor. Data is streaming in and you need to inspect it a bit before passing it along. There is only so much memory available for caching incoming data, so if you get too far behind, what do you do? You can't simply start rejecting requests like a database because the cameras won't stop sending data and all that data will be lost. If you crash, that fixes nothing and you just start back up with the same problem and all the data sent during the restart is lost.
If the goal is to minimize data loss, I think we already have the right approach. Go as fast as you can and log to the user when it isn't able to keep up, but drop some segments to keep things running. This results in the minimum amount of data lost. The question is really about how to best manage the cache, and it probably makes sense to let the limit scale with the available cache size in hopes that maybe it will catch back up, warning users that they are behind. Users that really want to be sure they don't lose anything can mount a persistent disk at /tmp/cache, but that will slow things down. We don't want to purge this because Frigate will try and recover those segments on startup and the goal is to minimize data loss.
I also used to run a rock solid ffmpeg process for years to just write segments directly to disk, but lots of users don't want their disks spinning constantly, which is why the cache was introduced.
I think there is still a lot to explore to get the recordings maintainer to never be more than 5 segments (50 seconds) behind. Now that it's in a dedicated process, we should be able to give that process priority. I still think there are lots of options to optimize the maintainer too. In theory, as long as the segments can be moved out of the cache as fast as they come in, then it should be possible for it to keep up. We just need to keep the average copy time for a 10s segment under 10s, which should be possible.
Well, I'm sure we could have a very long discussion on what's best, but I'll just say that the introduction of this cache has changed Frigate from being rock solid to losing data, which I would classify as a major regression, and the fix for my setup is to just increase the usage of the cache so it can absorb the hit.
If the goal is to minimize data loss, I think we already have the right approach. Go as fast as you can and log to the user when it isn't able to keep up, but drop some segments to keep things running. This results in the minimum amount of data lost.
But it doesn't: without the record "maintainer", my setup would never have lost any data. And neither would @ccutrer's.
We don't want to purge this because Frigate will try and recover those segments on startup and the goal is to minimize data loss.
So, always exhaust all available space in the cache and don't use the maintainer. The whole problem is that the maintainer causes data loss by calling unlink, and the only way to know that is to read the log and know what a very specific log message means.
I’m not sure that’s true. How long have you been using Frigate? 0.12, and I believe 0.11 (about when I started) both had the cache. The maintainer (which is new, and actually helped improve the archiving speed) is not what caused me to start dropping segments… I was (unknowingly) on the brink of running out of CPU resources, and a performance regression in the early days of the 0.13 dev branch pushed me over the edge. That has been rectified, and I’ve since upgraded my CPU anyway (I fixed a bug that the stats page didn’t show when frames had been dropped, and I realized I was still dropping frames occasionally anyway). I haven’t had a problem since.
Now, I agree that the five segment limit can be low, especially if I have plenty of available RAM I'm willing to dedicate to riding out the occasional I/O slowdown, or a CPU spike slowing down archiving.
I think "maintainer" is being used to describe a lot of different things here.
The maintainer, as far as frigate is concerned, is the class that handles the logic of moving segments from the cache and storing them on disk (when the retention config specifies it), and also inserting the recordings info into the db.
The maintainer has been around for many releases now (at least since 0.10, when I started contributing).
In 0.13 the recordings cleanup and maintainer were broken out into their own process, separate from the main process, along with other multithreading improvements.
The cache is important for users that don't want to record every single second and only keep motion or object recordings. In that case, writing directly to disk and then deleting is wasteful and wears the storage much faster than writing to cache.
I've introduced a PR to make the segment keep count dynamic based on the size of the cache. https://github.com/blakeblackshear/frigate/pull/7265
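For anyone curious, a dynamic limit along those lines could be derived roughly like this (purely illustrative; the PR's actual implementation may differ, and the ~5 MB average segment size is just an assumption):

import shutil

CACHE_DIR = "/tmp/cache"
AVG_SEGMENT_BYTES = 5 * 1024 * 1024   # assumed average size of a 10s segment
MIN_KEEP = 5                           # never go below the old hard-coded limit

def dynamic_keep_count(camera_count, cache_fraction=0.5):
    """Allow queued segments to use up to cache_fraction of the cache, split across cameras."""
    total = shutil.disk_usage(CACHE_DIR).total
    per_camera = (total * cache_fraction) / (AVG_SEGMENT_BYTES * camera_count)
    return max(MIN_KEEP, int(per_camera))

# e.g. a 2 GiB /tmp/cache and 33 cameras works out to about 6 segments per camera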
You are misunderstanding the sequence of changes:
Clearly, there is still a lot of room for improvement here as it should be able to keep up if copy times are under the segment time of 10s.
Just to clarify, are you running recent 0.13 development builds?
Would it help to mount /tmp/cache to an SSD when Frigate currently uses HDDs and the cache is not mounted separately yet because you don't have enough memory for tmpfs (a memory-backed drive)?
How long are segments stored in the cache before they are moved to storage? During an event or when the event is finished? The documented tmpfs recommendation is 1GB, but that seems pretty large, so I am probably misunderstanding how it works.
Would it help to mount /tmp/cache to an SSD when Frigate currently uses HDDs and the cache is not mounted separately yet because you don't have enough memory for tmpfs (a memory-backed drive)?
That would only slow down the recording segment management, making the problem worse
How long are segments stored in the cache before they are moved to storage? During an event or when the event is finished? The documented tmpfs recommendation is 1GB, but that seems pretty large, so I am probably misunderstanding how it works.
The segments are moved as soon as possible if they fit the recording retention config. The reason the /tmp/cache is large is because when a user downloads an mp4 clip for an event, it is assembled from segments in /tmp/cache
The segments are moved as soon as possible if they fit the recording retention config. The reason the /tmp/cache is large is because when a user downloads an mp4 clip for an event, it is assembled from segments in /tmp/cache
This makes sense! Thank you.
That would only slow down the recording segment management, making the problem worse
Is the opposite true, or is it more nuanced than that? E.g.: If you use the docker image, and docker lib (including volumes) lives on a SSD, but /media/frigate is mounted to a HDD for large storage purposes, does it make sense to explicitly mount /tmp/cache to that same (slower) HDD because Frigate will actually perform faster?
Is the opposite true, or is it more nuanced than that? E.g.: If you use the docker image, and docker lib (including volumes) lives on a SSD, but /media/frigate is mounted to a HDD for large storage purposes, does it make sense to explicitly mount /tmp/cache to that same (slower) HDD because Frigate will actually perform faster?
No it doesn't, because then the same HDD is being used for reading and writing at the same time which will be much slower.
RAM is always recommended because it will always be faster than an SSD and especially an HDD, meaning Frigate's segment metadata reading, segment optimization, etc. will all be done faster in RAM. Using RAM also means there is no unnecessary wear from writing segments to the SSD / HDD that end up not being retained.
No it doesn't, because then the same HDD is being used for reading and writing at the same time which will be much slower.
But this was the case in my initial question :smile: I think I phrased my thoughts badly. I'm trying to find out what the best alternative is when memory is too scarce.

Initially I thought mounting the same HDD (same partition) would just move (as in rename) the segments, and that would be much faster than copying from SSD to HDD. But I understand Frigate also remuxes and concatenates segments in cache. So the cache should just be on whichever drive is fastest, regardless of where media is stored?
| When /media/frigate is on | /tmp/cache should be on |
| --- | --- |
| HDD | tmpfs preferred, alternatively SSD |
| SSD | tmpfs preferred, alternatively SSD |
I could probably get away with a small tmpfs of 96 MB, as long as I don't download any long events. Then Frigate will be able to keep up with recording segments in cache.
Right, it is never a move, so the data will always be re-written and that will be slower.
tmpfs should always be set up for /tmp/cache, because you want the benefit of reducing wear and increasing speed for recording segment management. When you increase /tmp/cache it is not pre-allocated, meaning you are setting a limit, but if frigate is not currently using that cache it can still be used by other services on the system.
In 0.13 there are quite a few improvements for recording management, as well as a new recording export feature that the blueprint will hopefully be able to migrate to, which won't have this /tmp/cache issue.
Thank you. I think slightly amended documentation may be helpful. Your extra explanation is helpful to me, but to keep the docs from becoming unnecessarily verbose, we can link to tmpfs on Wikipedia, which explains the same concept. I've created a PR for your convenience, but feel free to alter or reject it.
Describe the problem you are having
I get the log message
Unable to keep up with recording segments in cache for <camera>. Keeping the 5 most recent segments out of <x> and discarding the rest...
in my log a LOT. I've enabled debug logging for recordings, but I don't see any noticeable slowdown in file copies (generally 0.2s or less per file) out of the cache. Watching iotop, the DISK write is mostly in the range of 50-100 K/s, with occasional spurts of 20-50 M/s. iftop indicates I'm doing ~205 Mbps of incoming bandwidth. My recording volume is a RAID-0 of 6x Western Digital Purple drives (though in an unusual configuration, so it's highly likely only two spindles will be used at any given time). Even if it were a single drive, the sustained throughput is rated at 145MB/s, so I shouldn't be coming anywhere near the actual storage throughput. Snapshots and the database all go to the main OS volume, which is a 250 GB SSD. All hard drives are connected via SATA (4 of them via an external eSATA enclosure connected to a PCI Express eSATA card).

Analyzing my log file, of the last 14,502 segments that were copied, the slowest one took 0.84s, and on average they take 0.17s.
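For reference, converting those numbers (simple arithmetic using the figures above):

incoming_mbps = 205                 # from iftop, megabits per second across all cameras
incoming_MBps = incoming_mbps / 8   # ~25.6 MB/s that eventually has to hit disk
single_drive_MBps = 145             # rated sustained throughput of one WD Purple
print(f"need ~{incoming_MBps:.1f} MB/s vs {single_drive_MBps} MB/s for even a single drive")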
CPU usage normally sits at 50-60%.
I have 64GB of RAM. Usually about ~7GB is in use, and the rest is in buffers/cache.
It might be that my storage is too slow, but it sure seems like either Frigate is being too impatient with a large number of cameras, or something else is happening outside of the actual copy process that is being too slow.
Version
0.12.0-27A31E7
Frigate config file
Relevant log output
FFprobe output from your camera
Frigate stats
Operating system
HassOS
Install method
HassOS Addon
Coral version
USB
Network connection
Wired
Camera make and model
Mostly Hikvisions of various models. One UniFi doorbell cam. All cameras are >= 4MP
Any other information that may be helpful
No response