blakeblackshear / frigate

NVR with realtime local object detection for IP cameras
https://frigate.video
MIT License

Optimize threading+latency of ffmpeg configuration #5459

Open charlesmunger opened 1 year ago

charlesmunger commented 1 year ago

I've been playing around with the new presets (specifically preset-http-reolink) and diffed the result against the configuration I was using on version 0.11.

Resulting detect command lines (preset one first):

ffmpeg -hide_banner -loglevel warning -user_agent FFmpeg Frigate/0.12.0-27a31e7 -avoid_negative_ts make_zero -fflags +genpts+discardcorrupt -flags low_delay -strict experimental -analyzeduration 1000M -probesize 1000M -rw_timeout 5000000 -i http://10.0.0.71/flv?port=1935&app=bcs&stream=channel0_sub.bcs&user=frigate&password=<redacted> -r 7 -s 640x480 -f rawvideo -pix_fmt yuv420p pipe:
ffmpeg -hide_banner -loglevel warning -flags low_delay -threads 1 -fflags nobuffer+genpts -strict experimental -rw_timeout 5000000 -f live_flv -i http://10.0.0.71/flv?port=1935&app=bcs&stream=channel0_sub.bcs&user=frigate&password=<redacted> -r 7 -s 640x480 -avioflags direct -threads 1 -fps_mode vfr -muxdelay 0 -flags low_delay -flush_packets 1 -f rawvideo -pix_fmt yuv420p pipe:

I configured the same camera twice (motion config, etc. elided):

  chimney:
    ffmpeg:
      input_args: preset-http-reolink
      inputs:
        - path: http://10.0.0.71/flv?port=1935&app=bcs&stream=channel0_main.bcs&user=frigate&password={FRIGATE_CAM_PASSWORD}
          roles:
            - record
        - path: http://10.0.0.71/flv?port=1935&app=bcs&stream=channel0_sub.bcs&user=frigate&password={FRIGATE_CAM_PASSWORD}
          roles:
            - detect

  chimney2:
    ffmpeg:
      inputs:
        - path: http://10.0.0.71/flv?port=1935&app=bcs&stream=channel0_main.bcs&user=frigate&password={FRIGATE_CAM_PASSWORD}
          roles:
            - record 
          input_args:
            - -flags
            - low_delay+genpts
            - -analyzeduration
            - "1"
            - -probesize
            - "32"
            - -threads
            - "1"
            - -strict
            - experimental
            - -rw_timeout
            - '5000000'
            - -f
            - live_flv
        - path: http://10.0.0.71/flv?port=1935&app=bcs&stream=channel0_sub.bcs&user=frigate&password={FRIGATE_CAM_PASSWORD}
          roles:
            - detect
          input_args:
            - -flags
            - low_delay
            - -threads
            - "1"
            - -fflags
            - nobuffer+genpts
            - -strict
            - experimental
            - -rw_timeout
            - '5000000'
            - -f
            - live_flv
      output_args:
        detect: -avioflags direct -threads 1 -fps_mode vfr -muxdelay 0 -flags low_delay -flush_packets 1 -f rawvideo -pix_fmt yuv420p

In both the birdseye view and in detected events, chimney2 is quicker. It also appears to use slightly less CPU. Here's the timestamp difference for the detected event (taken from the snapshot filenames):

chimney-1676170460.172301-xkz9st.jpg
chimney2-1676170460.02309-bhv0co.jpg

So approx. a 149 ms improvement, which is very close to one frame at 7 fps (1/7 s ≈ 143 ms). That implies that 5 fps substreams might see a larger improvement.

Threads configuration

ffmpeg will create thread pools for parallel processing that are not actually beneficial for Frigate's use case. I think it's very unlikely that Frigate users will both have fewer cameras than cores and put enough decode load on any one detect stream to saturate a single core. Inspecting /proc shows that chimney's ffmpeg detect process has 13 threads, while chimney2's has 1.
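As a quick way to reproduce that observation (a generic sketch using procfs and pgrep, not anything Frigate-specific):

for pid in $(pgrep -x ffmpeg); do
  # /proc/<pid>/status includes a "Threads:" line with the thread count
  echo "$pid: $(grep Threads /proc/$pid/status)"
done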

vsync/fps_mode

By default, ffmpeg will choose between CFR (duplicate/drop) and VFR (drop or pass through). The -r option will either duplicate or drop frames to keep the rate constant, but there is no value in ever duplicating a frame on the detect stream. vfr mode will drop extra frames or pass them through, but never duplicate. It would be worth looking into -fpsmax rather than -r for reducing the framerate of detect streams, since many cameras offer low-fps substreams anyway.
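For illustration, the difference on a detect output might look like this (input URL elided; -fpsmax requires a reasonably recent ffmpeg build, and these flags are a sketch rather than a tested Frigate preset):

# -r holds a constant 7 fps by duplicating or dropping frames
ffmpeg -i <input> -r 7 -s 640x480 -f rawvideo -pix_fmt yuv420p pipe:
# -fpsmax with vfr only caps the rate: extra frames are dropped, none are duplicated
ffmpeg -i <input> -fpsmax 7 -fps_mode vfr -s 640x480 -f rawvideo -pix_fmt yuv420p pipe: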

Buffering/delay options

I just went through the docs and picked literally everything that promised to reduce delay. For detect streams, timestamps don't matter, stuttering doesn't matter, all that matters (to me!) is minimizing delay as much as possible for previews and events. Since we're writing to a pipe, I'm not concerned about any small buffer issues.

These in particular might be worthwhile to add to the birdseye stream.
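For reference, here's an annotated breakdown of what those options do (comments paraphrase the ffmpeg documentation; the command is a pared-down sketch of the chimney2 detect pipeline above):

# -fflags nobuffer    skip input buffering during initial stream analysis
# -fflags +genpts     generate missing presentation timestamps
# -flags low_delay    force the codec's low-delay path (no reordering delay)
# -avioflags direct   reduce buffering in the I/O layer
# -muxdelay 0         set the muxer's buffering delay to zero
# -flush_packets 1    flush the output after each packet
ffmpeg -fflags nobuffer+genpts -flags low_delay -f live_flv -i <input> \
  -avioflags direct -muxdelay 0 -flush_packets 1 \
  -f rawvideo -pix_fmt yuv420p pipe: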

genpts/live_flv

See discussion on #1467

NickM-27 commented 1 year ago

A few thoughts after looking through this:

  1. If a camera has incorrect timestamps or is known to send generally bad data, then it is recommended to try it with a go2rtc restream as opposed to connecting to it directly, as go2rtc will clean up the timestamps and provide a more stable stream (see the sketch after this list). This is what I use with my http reolink cams and it has been lower latency and more stable. I'd be curious how that compares.
  2. I tried the threads configuration on my setup, and at least in my testing the CPU usage was identical. I can see some cases where multiple threads don't help, but for a user that records and runs detect on the same stream a single thread could be suboptimal, so more thought is needed here.
  3. Based on the ffmpeg documentation I am not sure there is a benefit in not using -r. I don't see users setting the fps higher than their sub stream's.
  4. As far as buffer and delay go, we offer a preset for that: https://github.com/blakeblackshear/frigate/blob/c74c9ff16161a8539e1cc41b76a5bdea953ce71b/frigate/ffmpeg_presets.py#L280-L290. But in beta 7 a number of users saw issues with those args, so it definitely doesn't work as a default config.
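For point 1, a minimal sketch of the go2rtc restream approach as it works in 0.12 (the stream name is illustrative; check the restream docs for your version):

go2rtc:
  streams:
    chimney_sub: http://10.0.0.71/flv?port=1935&app=bcs&stream=channel0_sub.bcs&user=frigate&password={FRIGATE_CAM_PASSWORD}

cameras:
  chimney:
    ffmpeg:
      inputs:
        # Frigate consumes the cleaned-up restream from go2rtc's local RTSP server
        - path: rtsp://127.0.0.1:8554/chimney_sub
          input_args: preset-rtsp-restream
          roles:
            - detect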
charlesmunger commented 1 year ago

Thank you for your contributions to this latest version, and I also apologize in advance for talking like I know more about FFmpeg than I actually do.

If a camera has incorrect timestamps or is known to send generally bad data, then it is recommended to try it with a go2rtc restream as opposed to connecting to it directly, as go2rtc will clean up the timestamps and provide a more stable stream. This is what I use with my http reolink cams and it has been lower latency and more stable. I'd be curious how that compares.

But for the detect stream, timestamps truly don't matter at all - don't we just want to push frames as quickly as we can, dropping frames if they come in too fast? The rawvideo muxer doesn't have any concept of timestamps. Why does passing the stream through go2rtc result in lower latency? As far as stability goes, I have not had any issues running this same configuration for weeks of uptime (I only restart to reboot or update for unrelated reasons).

I tried the threads configuration on my setup, and at least in my testing the CPU usage was identical. I can see some cases where multiple threads don't help, but for a user that records and runs detect on the same stream a single thread could be suboptimal, so more thought is needed here.

I also did not see significantly different CPU time; since there is no meaningful parallelism to be had, it seems like just a bad default for a live-decode, multi-camera use case. I tested a detect+record setup that produced this command line:

ffmpeg -hide_banner -loglevel warning -flags low_delay -threads 1 -fflags nobuffer+genpts -strict experimental -rw_timeout 5000000 -f live_flv -i http://10.0.0.71/flv?port=1935&app=bcs&stream=channel0_sub.bcs&user=frigate&password=<redacted> -f segment -segment_time 10 -segment_format mp4 -reset_timestamps 1 -strftime 1 -c copy -an /tmp/cache/chimney2-%Y%m%d%H%M%S.mp4 -r 7 -s 640x480 -avioflags direct -threads 1 -fps_mode vfr -muxdelay 0 -flags low_delay -flush_packets 1 -f rawvideo -pix_fmt yuv420p pipe:

The threads param is a codec param; if the given stream's codec does not support multithreading, it has no effect. The 13-vs-1 difference on my 12-core CPU indicates that only one CPU-sized pool is created in the default config, for the software H.264 decoder. So specifying -threads 1 configures the specific codec used for an input/output and, as I understand it, does not affect I/O parallelism. Since the record output uses the copy codec to write to a shared memory region, there is not likely to be any meaningful parallelism available from threads there either. If a more expensive record pipeline were used (like transcoding H.265 to H.264), I believe the threads option applied to the filter graph or encoder would provide whatever parallelism is required.

Based on this (admittedly very old) ffmpeg documentation, when running multiple encoders, all encoders proceed in lockstep, and whether the different encoders run concurrently with each other is determined by the encoder implementation. If we want parallelism between the decode and encode, or between different outputs, we should use a fifo or tee, respectively. But in an environment where we are encoding live, and we already have at least three concurrent processes per camera (ffmpeg, motion detection, and object detection), it's hard for me to imagine why using the full CPU count of threads for every camera process makes sense.
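As an illustration of the fifo idea, ffmpeg's fifo muxer wraps an output in its own queue and thread, decoupling a slow output from the decoder (a sketch based on the documented fifo muxer options, not something Frigate does today):

# wrap the segment muxer in a fifo so slow disk writes don't stall decoding
ffmpeg -f live_flv -i <input> \
  -c copy -an -f fifo -fifo_format segment \
  -format_opts segment_time=10:segment_format=mp4:reset_timestamps=1:strftime=1 \
  /tmp/cache/chimney2-%Y%m%d%H%M%S.mp4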

Lastly, there's this bit of the FFmpeg docs:

thread_type flags (decoding/encoding, video)

Select which multithreading methods to use. Use of ‘frame’ will increase decoding delay by one frame per thread, so clients which cannot provide future frames should not use it.

Possible values:

‘slice’ - Decode more than one part of a single frame at once. Multithreading using slices works only when the video was encoded with slices.

‘frame’ - Decode more than one frame at once.

Default value is ‘slice+frame’.

Since slice-based threading is not likely to work on camera streams, and frame parallelism costs a frame of latency (which is significant at 5 fps), disabling multithreading in codecs is probably the lowest-risk config option I've proposed.
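If disabling codec threading entirely feels too blunt, the codec-level thread_type option can instead restrict decoding to slice threading only, avoiding the added frame of latency (placement is illustrative; as a decoder option it goes before -i):

# allow only slice threading on the decoder (no frame-delay cost)
ffmpeg -thread_type slice -i <input> ...
# or, as proposed above, disable codec threading outright
ffmpeg -threads 1 -i <input> ...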

Based on the ffmpeg documentation I am not sure there is a benefit in not using -r. I don't see users setting the fps higher than their sub stream's.

I have occasionally seen my cameras reporting "6.8 fps" while having "7" configured in Frigate. My thinking was just that if the default behavior is to automatically choose between two options, and one of them never makes sense for our use case, it's better to be explicit about the behavior we want.

As far as buffer and delay go, we offer a preset for that, but in beta 7 a number of users saw issues with those args, so it definitely doesn't work as a default config.

Can you link to some of those? I've gone pretty far down the rabbit hole on these reolink timestamp issues and I'm interested to see if I can figure out what went wrong. My experience is that we want all these wild nobuffer and low_delay options on detect streams, where timestamps are irrelevant and we're just pushing frames. We don't want them for clips or recording: as long as pre_capture and post_capture are larger than the latency difference between streams, the recorded clip will still include the detected event, and rewriting timestamps results in weird motion effects where the video slows down and speeds up during clips.
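For context, that padding lives in the record config; as long as these values exceed the latency difference between the detect and record streams, the clip still covers the event (values shown are just the 0.12 defaults as I understand them):

record:
  enabled: true
  events:
    pre_capture: 5   # seconds kept before the event starts
    post_capture: 5  # seconds kept after the event ends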

NickM-27 commented 1 year ago

To be clear, I don't disagree that lower latency is better and we should make improvements where possible. But this release and the previous one have been (for me, anyway) a reminder that a config can work optimally in my setup and still be broken for many other users. Care needs to be put in to ensure continued stability; also, if lots of args change at the same time, it can be difficult to know which one is causing a problem.

Just because rawvideo doesn't care about timestamps or buffering doesn't mean ffmpeg will work seamlessly without them in all cases.

Many users have had problems with those args because when a stream crashed, ffmpeg would not be able to reconnect to it. A few examples:

https://github.com/blakeblackshear/frigate/issues/5375
https://github.com/blakeblackshear/frigate/issues/5365

There's also more in the beta 7 discussion. It was a non-negligible number of users.

NickM-27 commented 1 year ago

I have done some more testing, and so far the input_args (threading, nobuffer, etc.) seem to have minimal effect, if any at all. However, when applying the detect output args I do see CPU usage reduced for each ffmpeg process by ~25% (meaning, as an example, it went from ~5% to ~3.8% of one core), and I also notice a ~100 ms improvement in latency.

blakeblackshear commented 1 year ago

I have been thinking that in the future we may be better off running separate ffmpeg processes for separate roles rather than multiple outputs from the same input. That way we could use something a little more aggressive for detect without worrying that it might impact the ffmpeg process that writes the recordings. This is all great research, and the kind of thing that takes a lot of time to gather from the ffmpeg docs. I think there are improvements to be made in the future, but I definitely don't want to introduce most of these changes so late in the release cycle.

NickM-27 commented 1 year ago

I have done some more testing, and so far the input_args (threading, nobuffer, etc.) seem to have minimal effect, if any at all. However, when applying the detect output args I do see CPU usage reduced for each ffmpeg process by ~25% (meaning, as an example, it went from ~5% to ~3.8% of one core), and I also notice a ~100 ms improvement in latency.

I will say, after trying to run these for a bit, they have been much less stable with my cameras.