bml1g12 / benchmarking_video_reading_python

Comparing speed of different implementations of reading video into numpy arrays
MIT License

I've DOUBLED the FFmpeg Speed! #1

Closed roninpawn closed 2 years ago

roninpawn commented 2 years ago

Say! Remember me?

Once upon a time we exchanged code on a thread about how OpenCV doesn't... ya know, work... for video.

Then, over a year later, I started dipping my toes back into Python video processing / software rendering, and pretty quickly stumbled onto this article: Lightning Fast Video Reading in Python, written by the guy I traded code with back in the day. Which is YOU! So I have a peek into the GitHub repo linked in the article, and sure enough! That's my little FFmpeg script lurking behind the polished graphs of completed tests!

So I immediately file suit for copyright infringement in the 11th circuit court of... No, j/k.

So I immediately reach out to tell him that, after revisiting my methodology (removing head from betwixt buttock), I've just released a new script that currently ingests raw 1080p frames, unblocked, at about 220fps on my system (50+ frames faster than I've gotten reproducing your OpenCV read() baseline test), crunches through 720x480 frames at 1734fps (~400 frames faster), and that I'm currently playing back full-frame 1080p video to screen at 76.5fps through PyGame using it.

My new script is here: FFmpeg VideoStream

Given how disgustingly my old solution stacked up against the others, and the chance that this method potentially bests all competitors (I genuinely don't know), I hope you'll have a look. I haven't pushed it through your complete benchmarking mill myself, and would love to know what figures you see out of it. And maybe there's fodder for another article here, if you're still writing.

Why was it so slow?

The old script I'd written was more task-specific than I'd even recognized when I shared it with you. I was forcing the output format of the video to a BGR 24-bit pixel format that was fast and convenient to interact with once the video's frames were in my hands, but ill-advised for getting it INTO my hands. It was reshaping the bytestream into a numpy array in-line to hand back a ready-formatted frame -- as though it shouldn't be up to the user to do what they want with the raw output. And the call I was placing to FFmpeg was invoking the slowest 'seek' method available.

In other words: I wrote it to do what I was doing at the time. And didn't really know what I was doing with FFmpeg while I was doing it! When I saw that the speeds I got were better than the multiprocessed OpenCV method I'd written (and that the FFmpeg script handed out the correct frame, unlike OpenCV), I was happy and left it there. Then you came along and I happily passed my little script along, to save you the headache I'd already suffered trying to figure out what the ffmpeg-python library wanted of me just to hand back some frames.

Why is it faster now?

My FFmpeg VideoStream script now defaults to the YUV420p pixel format. This is the format of virtually ALL video circulating in the modern world. MP4s, WebMs, DVDs, Blu-ray discs... These all use a YUV 4:2:0 pixel format of one variant or another. The reason is that this format packs full-color pixel data into a space of just 12 bits per pixel. AKA: 1.5 bytes. AKA: This many 1s and 0s: '010101010101'.

RGB, BGR, and many other YUV formats package that same pixel's worth of data into 24 bits. AKA: 3 bytes. AKA: This many 1s and 0s: '010101010101010101010101'.

It's literally double the data. And while YUV420p technically loses color information by packing the pixels up so tightly, the loss is nothing that 99.9% of the world notices -- as evidenced by the mass proliferation of the format across all mediums.

So, by having FFmpeg deliver YUV420p data, two major gains are achieved. First, odds are that the video being accessed is already stored as YUV420p, so there's no conversion from one pixel space to another to eat clock cycles. Second, our bytestream is literally half the size it was, so moving the raw binary data into Python through a 'stdout' pipe is theoretically twice as fast.

In fact, I've found that even where you require the final frame to be in an RGB / BGR format for processing, it is FASTER to ingest the raw YUV data and convert the frame in Python using OpenCV's .cvtColor() method. Which is a little shocking! That FFmpeg's multi-core processing engine can't unpack 12-bit YUV into a 24-bit RGB format fast enough to overcome the simple fact that there is twice as much binary data to push through the pipe is an unintuitive truth to land on.
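For a sense of scale, here's the rough arithmetic for a single 1080p frame (illustrative numbers, not from either repo):

    w, h = 1920, 1080
    yuv420p_bytes = w * h * 3 // 2  # 12 bits/pixel -> 3,110,400 bytes through the pipe
    bgr24_bytes = w * h * 3         # 24 bits/pixel -> 6,220,800 bytes through the pipe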

What else?

I've also managed to shoe-horn the 'ss:' and 'to:' input properties into ffmpeg-python's call constructor. This means being able to 'seek' near-instantaneously to any point in the video requested, where the old method sent FFmpeg sloshing, one frame at a time, through the entire video to find the starting position.
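Roughly, the trick amounts to something like this in ffmpeg-python terms (the filename and timestamps here are made up; the real call is constructed inside the VideoStream script):

    import ffmpeg  # the ffmpeg-python package

    # Passing ss/to as *input* options lets FFmpeg keyframe-seek before decoding,
    # instead of decoding its way through the file to reach the start point.
    process = (
        ffmpeg
        .input("video.mkv", ss="00:01:00", to="00:01:30")
        .output("pipe:", format="rawvideo", pix_fmt="yuv420p")
        .run_async(pipe_stdout=True)
    )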

'Showinfo' data is now available. With an optional configuration of showinfo=True when .open_stream() is called, FFmpeg's per-frame information is liberated. This contains the current frame number, presentation time stamp, byte position in the file, mean channel values, etc. But 'showinfo' comes at an access-speed penalty that, depending on the situation, can either be invisibly negligible when the process is blocked by a heavy action like rendering frames to screen, or a significantly meaningful slowdown to raw frame access when there's nothing to block it at all.

Wrap-up

There's a lot of little tweaks in the new code, brief as it is. But the main takeaway here is the access speed. And knowing that YUV420p is the fastest pixel format to ingest gets me thinking about all kinds of ways to do the kind of frame-matching and analysis that I was pulling raw video into Python for before. But now using BLAZINGLY FAST methods that operate directly on unconverted, array-reshaped, YUV frames.

Incidentally, 'reshaping' with Numpy adds ZERO overhead. It's the conversion to other color spaces that eats clock cycles (a 10-20% loss depending on the size of frame). And knowing THAT gets me excited to locate a non-OpenCV library that crunches YUV to BGR as fast as possible. It's also got me looking into the SDL2 library for access to hardware-acceleration methods that might render raw YUV 4:2:0 to the screen space without even touching the CPU.
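A tiny demonstration of that claim -- np.frombuffer and reshape both return views, so no bytes are copied:

    import numpy as np

    raw = bytes(720 * 480 * 3 // 2)     # one 720x480 YUV420p frame's worth of bytes
    arr = np.frombuffer(raw, np.uint8)  # zero-copy view over the buffer
    yuv = arr.reshape(720, 720)         # reshaping is also a view: no bytes move
    assert yuv.base is not None         # still sharing memory with the original buffer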

The potential for Python to become a video-processing powerhouse by way of C-compiled execution is wild!

Oh yeah... "Wrap-up," I said. Right!

I'm glad you were able to give FFmpeg a chance in 2020 using my poorly-considered little script. That said, it leaves me feeling a bit guilty for FFmpeg performing so poorly in your tests. I hope you'll take the time to push this new version through the test bed you fashioned here, and share the results you get. And if you do test it, make sure to acquire the latest versions of FFmpeg and FFprobe. I didn't think to do that until I was completely done testing, tweaking, and developing this first release. When I did, I suddenly got 20-30+ more fps at 1080p and 200-300+ more fps at lower resolutions.

One last time, here's the new script: FFmpeg VideoStream

bml1g12 commented 2 years ago

@roninpawn sounds very exciting and looks great! Thanks for sharing your hard work here; with your permission, I'll add it to the benchmarks presented (with credit and links to your repo). It might be a few weeks until I get the time for this though, so apologies about that.

Thanks also for sharing your original FFmpeg script. One thing I noticed is that the type of video and the use case (the amount of CPU usage at the time of reading) have a huge effect on the speed of any given reader, so it's tougher than I initially imagined to choose a "best" video reader in Python.

Getting video reading and writing in Python working nicely has been something of a hobby of mine, so I'm excited to see a new methodology!

bml1g12 commented 2 years ago

It might also be worth us investigating numpy converters from YUV (https://gist.github.com/Quasimondo/c3590226c924a06b276d606f4f189639) and whether parallelising the colorspace conversion is much help here.

ill-advised for getting it INTO my hands

Are you aware of any use cases whereby a Python user might want to obtain a YUV-format frame from a video file rather than an RGB/BGR one? (Aside from the case of simply viewing the video unmodified -- but in that case I guess they wouldn't be loading it into Python at all, as they could use a standard viewer.) One use case I can imagine, maybe similar to your ongoing one with PyGame, would be if the user wishes to pass the video data into another piece of Python software in a frame-dependent manner.

Your code's showinfo=True sounds very intriguing. I wonder if it can reveal .mkv fragment tags, as I could imagine this being used as a way of analysing e.g. Kinesis Video Streams with metadata attached. For example, if a video is annotated with events at certain frames, one could use Python to extract those frames without needing to decode them into RGB/BGR.

bml1g12 commented 2 years ago

@roninpawn so I had a go running the code but ran into some issues.

I am using

Python 3.8.5
ffmpeg-python==0.2.0
ffmpeg version 4.4.1-0ubuntu1~20.04.sav0
ffprobe version 4.4.1-0ubuntu1~20.04.sav0

With the following code, using the video from this repo:

    import cv2
    import numpy as np

    # VideoStream below is the class from roninpawn's FFmpeg VideoStream script
    cap = VideoStream("/data/benchmarking_video_reading/assets/video_720x480.mkv")
    cap.config(output_resolution=[720, 480])
    cap.open_stream()
    print(cap.__dict__)
    while True:
        eof, img = cap.read()
        arr = np.frombuffer(img, np.uint8).reshape(int(cap._shape[1]*1.5), cap._shape[0])
        bgr = cv2.cvtColor(arr, cv2.COLOR_YUV2BGR_I420)
        if eof:
            break
        cv2.imshow("img", bgr)
        k = cv2.waitKey(1)
        if ord("q") == k:
            break

This creates a 720x720 output video and displays fine, but the video should be 720x480. I have tried with and without specifying the output_resolution parameter, but get the same result. How do I produce an output at the same resolution as the input?

A minor (easy for us to fix) bug I spotted was:

            out = [w, h]
            if out != self._shape:

This comparison will always evaluate as True if the shape is provided as a tuple rather than a list; casting self._shape to a list, for example, would fix it.
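(For anyone following along: Python never treats a list and a tuple as equal, even when the elements match.)

    >>> [720, 480] == (720, 480)
    False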

roninpawn commented 2 years ago

I'm thrilled that you're excited to test and tinker at this!

I'm definitely keen to look into the threading and multiprocessing possibilities on the backend of this thing. I've wondered similarly about the colorspace conversion, as well as the multiprocess/hyperthreading potential of the 'showinfo' accesses. For 'showinfo,' there are two pipes that need to be read from: 'stdout' for the frames and 'stderr' for showinfo's text. If the read of that relatively tiny bytestream of 'showinfo' text could be initiated DURING the 'stdout' pull instead of waiting for it to finish (just with threading -- no multiprocessing even), I wonder if it would win back some of the frame rate lost when 'showinfo' is active.
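A minimal sketch of what I mean -- a daemon thread draining 'stderr' while the main thread keeps pulling frames off 'stdout' ('proc' here stands in for the Popen-style object behind the stream):

    import threading

    def drain_showinfo(proc, lines):
        # Pull showinfo's text off stderr as it arrives, so it never backs up
        # while the main thread is busy reading raw frames from stdout.
        for line in iter(proc.stderr.readline, b""):
            lines.append(line)

    showinfo_lines = []
    threading.Thread(target=drain_showinfo, args=(proc, showinfo_lines),
                     daemon=True).start()
    # ...the main thread keeps calling proc.stdout.read(frame_byte_count) as before...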

I'll try, in future, not to write a dictionary volume at a time -- but while thinking about your suggestions and working out the bugs you found, I've already gone and done it again. Sorry. Deal with me. ;)

Potential Use cases for YUV over RGB / BGR

First, if we can find a library that allows a raw YUV420 bytestream, or raw YUV420 frames to be pushed to hardware decoding for render to screen, that could enable ~200fps playback of 1080p video in Python, while only taxing the CPU for FFmpeg's decoding process.

But programmatically, the reason I started developing Python video stuff in the first place was for the comparison of images to detect the similarity of a held image to each frame within a video. In my use case I see no reason why the methods I use for comparison would need to be in RGB over YUV. And by staying in this native, half-the-size format I would just be conforming the handful of comparison images to the YUV 4:2:0 standard at the initialization of the process, instead of conforming every single frame in the video to the RGB encoding of the comparison frames.

Jumping off from that general premise, I imagine a LOT of image analysis and even some editing / color / contrast adjustments could easily be done in the YUV frame space. The 'Y' or 'luminance' channel of this format contains a square, easily acquired, and highly detailed grayscale representation of the frame right at the top of the bytestream. A lot of analysis, comparison, and even some basic image manipulations could be done exclusively on this grayscale portion of the frame. Then color-based operations that are only possible / faster / have more library support in a 24-bit pixel space could be invoked EXCLUSIVELY when they are needed. Allowing the lion's share of access and analysis to be performed on what I think breaks down to half OF HALF the RGB / BGR bytes.
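To make that concrete, here's a sketch of how the planes break out of one raw planar YUV 4:2:0 (I420) frame -- 'proc' again standing in for the open pipe:

    import numpy as np

    w, h = 1920, 1080
    frame = proc.stdout.read(w * h * 3 // 2)  # one raw I420 frame off the pipe
    buf = np.frombuffer(frame, np.uint8)
    y = buf[: w * h].reshape(h, w)            # full-res grayscale at the top of the stream
    u = buf[w * h : w * h + w * h // 4].reshape(h // 2, w // 2)
    v = buf[w * h + w * h // 4 :].reshape(h // 2, w // 2)
    # Most comparison / analysis work can run on 'y' alone, untouched by any conversion.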

Showinfo data

I first went down the 'showinfo' rabbit hole in the hope that I could get the 'presentation time stamps' of the ORIGINAL frames alongside the output frames FFmpeg delivers. I was thinking about how to accurately measure 'variable frame rate' video sources while receiving 'constant frame rate' video out of FFmpeg. But the data FFmpeg sends out when the 'showinfo' filter is in place seems to be exclusively related to its output stream. There's a lot of potentially useful stuff in there, including key-frame indicators and the mean/average value of each channel in the frame. But the only thing returned in reference to the input seems to be the 'pos' value: the byte position in the source file at which the frame begins.

But I too see the possibility of building something like frame-tables from the per-frame data without bothering to access the frames. I've messed around with that via the FFprobe side of things. With the right arguments forced into a call to FFprobe (no FFmpeg required), I think what you get back is a bytestream of just text information about each frame in the file.
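As a hedged sketch of the kind of FFprobe call I mean (the field names are FFprobe's own; the filename is made up):

    import json
    import subprocess

    # Per-frame metadata straight from FFprobe, as text -- no pixel data pulled into Python.
    cmd = ["ffprobe", "-v", "error", "-select_streams", "v:0",
           "-show_entries", "frame=pts_time,pict_type,key_frame",
           "-of", "json", "video.mkv"]
    frames = json.loads(subprocess.run(cmd, capture_output=True).stdout)["frames"]
    keyframe_times = [f["pts_time"] for f in frames if f.get("key_frame") == 1]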

720 x 720 Bug

Running the code you provided on the 720x480 video in the assets folder gave me the expected results.

My best guess for a difference in output would be something between Linux and Windows... rather than Python versions, anyway. I developed in a Python 3.7 environment with whatever versions of Numpy and OpenCV I had in my last virtual environment, then upgraded everything to Python 3.10 and the latest supported libraries before I put together the release. It's very odd that such a simple set of calls would come out different.

Stepping through it in my head... If you can read the images at all, the correct bytes were definitely delivered from the stream. Because if the length of the bytestream were wrong, Numpy would error trying to reshape to an array size that isn't mathematically possible. Then OpenCV takes over, and if /it/ didn't have the right shape, it'd probably spit out a frame with bad colors and weird 'permanent' artifacts throughout playback.

The arr object should be 720x720 after the numpy call, and video rendered from /it/ would appear in grayscale and be segmented in the YUV frame format. But the bgr object that OpenCV returns after the .cvtColor() operation should be back to a 720x480 resolution in full color. If it's somehow still 720x720 at the end of the process, something would have to be going wrong inside OpenCV's cvtColor call.

even_test() Bug

I KNEW that the tuple / list thing would come back to haunt me. I kept looking at whether crop or output_resolution would have a problem as a tuple (because I tend to code in lists), like I knew there was a hole somewhere in it. But I never sorted out the cause behind the recurring, vague 'something's off there.'

I propose casting both '_crop' and '_shape' to tuples throughout the script: at .init, build the objects as () instead of []; in .config, set ._crop = tuple(crop_rect), ._shape = (crop_rect[...), and ._shape = tuple(output_resolution); and in ._even_test, use out = (w, h) for the comparison.

...Figure if we're casting, it might as well be to the theoretically faster-access object. And since these are private, protected variables anyway, casting them to read-only tuples feels right. Does that sound reasonable to you?

bml1g12 commented 2 years ago

Regarding the even_test() bug, casting all to tuples makes sense.

The arr object should be 720x720 after the numpy call, and video rendered from /it/ would appear in grayscale and be segmented in the YUV frame format. But the bgr object that OpenCV returns after the .cvtColor() operation should be back to a 720x480 resolution in full color. If it's somehow still 720x720 at the end of the process, something would have to be going wrong inside OpenCV's cvtColor call.

I looked into this more carefully and realised I made a mistake, so apologies for this; it does indeed create the correct shape. I was just thrown off because I did not understand that cvtColor is actually expected to change the shape from the square greyscale buffer to colour: (720, 720) before, (480, 720, 3) after.

Thanks for your background explanations and context; next I'll actually benchmark it.

bml1g12 commented 2 years ago

So I've made a draft PR with the class here: https://github.com/bml1g12/benchmarking_video_reading_python/blob/d46c6f4ee4ba96473f2326ee9931bf9168f78dc0/video_reading_benchmarks/benchmarks.py#L541

Does this benchmark seem appropriate to you? As with all the others, it assumes the user wishes to convert to RGB space, so it includes the OpenCV cvtColor call.

(I understand that if we remove that conversion and assume the user works in YUV space we'll get a significant speedup, so I could add that as a separate benchmark; but as it is a separate use case I can't quite compare it like-for-like.)

The results are here, with the new benchmark called "ffmpeg_upgraded_benchmark":

official_benchmark_timings_cpulimited_video_720x480.csv
official_benchmark_timings_iolimited_video_720x480.csv
official_benchmark_timings_unblocked_video_720x480.csv

As per the article, there are three use cases I consider:

CPU limited - CPU work (multiplication) done between reading frames
IO limited - sleeping between reading frames
unblocked - no sleeping or work done in between reading frames

On the 720p video:

cpulimited - the latest version of your code is still slower than the OpenCV baseline
iolimited - the latest version of your code outperforms the OpenCV baseline but is slower than the older version of the code
unblocked - both versions are outperformed by the OpenCV baseline

Unfortunately, it seems the performance in RGB colorspace is not beating OpenCV.

I appreciate this benchmark may be missing the point, and that your point was that in YUV space we can get a speedup? If that's the case, I'll move on to adding a benchmark in YUV space and put it in a separate section of the repo.

new script that currently ingests raw 1080p frames, unblocked, at about 220fps on my system (50+ frames faster than I've gotten reproducing your OpenCV read() baseline test)

By the way, have you tried the imutils or camgears solutions for your use case? I appreciate you might not need to work in RGB space, but these both seem to come out a lot faster than the baseline OpenCV reader, so it would be worth using them as the baseline for your comparisons if you want to know whether FFmpeg is faster than the alternatives. Regarding camgears, I believe the latest version of their code base is "camgears_with_queue", so you can just use the official repo's version rather than my modified version (as they incorporated an optimisation from this repo: camgears_benchmark --> camgears_with_queue_benchmark).

roninpawn commented 2 years ago

I pushed the fix for that _even_test issue. And more importantly, I managed to get your benchmarking app running on my system.

I had to discard the multiproc_test. It locked everything up - CPU went inactive - but never failed outright. And in the final 1080p pass I also disabled the simple ffmpeg_benchmark... Got 7 frames per second on my first pass. Wasn't going to wait it out again. ;)

So what I did was, I switched the numpy & opencv conversion part of the ffmpeg_upgraded test in and out of the if config["show_img"]: bit. I presume that just means that it was completely inactive on the tests where it was nested, but it was convenient enough to copy / paste and then ctrl + z / ctrl + shift + z between runs.

My system is an AMD Ryzen 5 3600 - 6 cores (12 logical) with 16GB RAM -- but I don't remember the memory timings, and CPU-Z doesn't know either. :( You'll also see that some of the tests end at decord. I ran a few tests before finding where I could comment out the multiproc_results so they wouldn't interrupt the final print.

w/decode = With the numpy + opencv operations. no decode = Without 'em.

My results:

-- 480x270 --
        w/ decode:
            ffmpeg_upgraded_benchmark: time_for_all_frames: = 6.937486840666668 +/- 0.0086 or FPS = 144.14441756316234
            max_possible_fps: time_for_all_frames: = 6.994203522999999 +/- 0.0008 or FPS = 142.97553634399728
            baseline_benchmark: time_for_all_frames: = 7.239374510666664 +/- 0.0628 or FPS = 138.13348080370432
            ffmpeg_benchmark: time_for_all_frames: = 8.955316108333335 +/- 0.0780 or FPS = 111.66551664987614
            pyav_benchmark: time_for_all_frames: = 7.33190845466667 +/- 0.0298 or FPS = 136.39013719047628
            decord_sequential_cpu_benchmark: time_for_all_frames: = 7.1608947740000035 +/- 0.0261 or FPS = 139.6473529580173
            decord_batch_cpu_benchmark: time_for_all_frames: = 7.897736816666661 +/- 0.0126 or FPS = 126.61855202489042
            imutils_benchmark: time_for_all_frames: = 6.643801982333334 +/- 0.0132 or FPS = 150.5162258988332
            camgears_benchmark: time_for_all_frames: = 9.832924599000004 +/- 0.0070 or FPS = 101.69914250147903
            camgears_with_queue_benchmark: time_for_all_frames: = 6.6424368413333355 +/- 0.0231 or FPS = 150.54715970762172
            camgears_with_queue_official_benchmark: time_for_all_frames: = 9.843388265666666 +/- 0.0703 or FPS = 101.59103481551762
        no decode:
            ffmpeg_upgraded_benchmark: time_for_all_frames: = 6.671300379333332 +/- 0.0176 or FPS = 149.89581388028142
            max_possible_fps: time_for_all_frames: = 7.0549601273333336 +/- 0.1400 or FPS = 141.74424546010647
            baseline_benchmark: time_for_all_frames: = 7.102134722000002 +/- 0.0558 or FPS = 140.80273595801265
            ffmpeg_benchmark: time_for_all_frames: = 8.967317983333329 +/- 0.0222 or FPS = 111.51606331554224
            pyav_benchmark: time_for_all_frames: = 7.33880490833333 +/- 0.0193 or FPS = 136.26196805756263
        no decode 2:
            ffmpeg_upgraded_benchmark: time_for_all_frames: = 6.688214703 +/- 0.0086 or FPS = 149.51673120652808
            max_possible_fps: time_for_all_frames: = 6.971790921000001 +/- 0.0100 or FPS = 143.43516771104842
            baseline_benchmark: time_for_all_frames: = 7.155153937333334 +/- 0.0143 or FPS = 139.75939703859
            ffmpeg_benchmark: time_for_all_frames: = 8.963709237666668 +/- 0.0342 or FPS = 111.56095913931148
            pyav_benchmark: time_for_all_frames: = 7.230827660999992 +/- 0.0225 or FPS = 138.29675479524627
            decord_sequential_cpu_benchmark: time_for_all_frames: = 7.171205605000002 +/- 0.0146 or FPS = 139.44656660001033
            decord_batch_cpu_benchmark: time_for_all_frames: = 7.8344372213333315 +/- 0.0177 or FPS = 127.64158697665988
            imutils_benchmark: time_for_all_frames: = 6.728909493666663 +/- 0.0333 or FPS = 148.612490767072
            camgears_benchmark: time_for_all_frames: = 9.817301291333337 +/- 0.0392 or FPS = 101.8609870803084
            camgears_with_queue_benchmark: time_for_all_frames: = 6.6782355253333305 +/- 0.0203 or FPS = 149.74015160240802
            camgears_with_queue_official_benchmark: time_for_all_frames: = 9.857607298333335 +/- 0.0114 or FPS = 101.44449557947739
-- 720x480.mkv --
        w/ decode:
            ffmpeg_upgraded_benchmark: time_for_all_frames: = 7.448014086666667 +/- 0.0870 or FPS = 134.26397807036727
            max_possible_fps: time_for_all_frames: = 7.058907673333337 +/- 0.0406 or FPS = 141.6649779650373
            baseline_benchmark: time_for_all_frames: = 7.5759822823333325 +/- 0.0359 or FPS = 131.99608482875297
            ffmpeg_benchmark: time_for_all_frames: = 13.863986507666672 +/- 0.0373 or FPS = 72.12932582176188
            pyav_benchmark: time_for_all_frames: = 7.497036354666662 +/- 0.0253 or FPS = 133.3860411891337
        w/ decode 2: (skewed? out of window during portion of test)
            ffmpeg_upgraded_benchmark: time_for_all_frames: = 7.372446637333333 +/- 0.0207 or FPS = 135.640181501769
            max_possible_fps: time_for_all_frames: = 7.034509088333333 +/- 0.0585 or FPS = 142.1563306611531
            baseline_benchmark: time_for_all_frames: = 7.438972271666667 +/- 0.0431 or FPS = 134.4271713189159
            ffmpeg_benchmark: time_for_all_frames: = 13.723683434000003 +/- 0.1198 or FPS = 72.866734707865
            pyav_benchmark: time_for_all_frames: = 7.51430263266667 +/- 0.0273 or FPS = 133.07954828073258
            decord_sequential_cpu_benchmark: time_for_all_frames: = 7.272698299000003 +/- 0.0055 or FPS = 137.50054778671353
            decord_batch_cpu_benchmark: time_for_all_frames: = 9.609287539999988 +/- 0.0283 or FPS = 104.06598780995591
            imutils_benchmark: time_for_all_frames: = 6.764138994666676 +/- 0.0786 or FPS = 147.83847593736178
            camgears_benchmark: time_for_all_frames: = 9.805945192333326 +/- 0.0332 or FPS = 101.97895056377017
            camgears_with_queue_benchmark: time_for_all_frames: = 6.6389276196666644 +/- 0.0303 or FPS = 150.62673631772614
            camgears_with_queue_official_benchmark: time_for_all_frames: = 9.817076076333327 +/- 0.0157 or FPS = 101.86332388833839
        no decode:
            ffmpeg_upgraded_benchmark: time_for_all_frames: = 7.0138285730000005 +/- 0.0046 or FPS = 142.5754835026248
            max_possible_fps: time_for_all_frames: = 6.973705503333336 +/- 0.0278 or FPS = 143.39578858356057
            baseline_benchmark: time_for_all_frames: = 7.573080191333332 +/- 0.0676 or FPS = 132.04666723909838
            ffmpeg_benchmark: time_for_all_frames: = 13.797400321333328 +/- 0.1587 or FPS = 72.47742159469095
            pyav_benchmark: time_for_all_frames: = 7.653672300666661 +/- 0.0243 or FPS = 130.65623412082806
        no decode 2:
            ffmpeg_upgraded_benchmark: time_for_all_frames: = 7.042113056666667 +/- 0.0097 or FPS = 142.00283238186788
            max_possible_fps: time_for_all_frames: = 7.004227027666666 +/- 0.0610 or FPS = 142.77092904756006
            baseline_benchmark: time_for_all_frames: = 7.781756177333333 +/- 0.1204 or FPS = 128.50569681337433
            ffmpeg_benchmark: time_for_all_frames: = 13.803133848000002 +/- 0.1408 or FPS = 72.44731602344741
            pyav_benchmark: time_for_all_frames: = 7.534158322333326 +/- 0.0141 or FPS = 132.72882745717246
            decord_sequential_cpu_benchmark: time_for_all_frames: = 7.366615333333338 +/- 0.1517 or FPS = 135.7475522680112
            decord_batch_cpu_benchmark: time_for_all_frames: = 9.563080596 +/- 0.0136 or FPS = 104.56881440675876
            imutils_benchmark: time_for_all_frames: = 6.748190342666675 +/- 0.0282 or FPS = 148.1878769301032
            camgears_benchmark: time_for_all_frames: = 9.837831484333341 +/- 0.0023 or FPS = 101.64841729526381
            camgears_with_queue_benchmark: time_for_all_frames: = 6.708463831666667 +/- 0.0108 or FPS = 149.06542318669065
            camgears_with_queue_official_benchmark: time_for_all_frames: = 9.803564256333354 +/- 0.0078 or FPS = 102.00371761259935
-- 1920x1080 --
        w/ decode:
            ffmpeg_upgraded_benchmark: time_for_all_frames: = 13.376433224333333 +/- 0.0481 or FPS = 74.75834426331829
            max_possible_fps: time_for_all_frames: = 7.021421739999998 +/- 0.0741 or FPS = 142.42129828253277
            baseline_benchmark: time_for_all_frames: = 13.614240776333334 +/- 0.0689 or FPS = 73.45249848514327
            ffmpeg_benchmark: time_for_all_frames: = 145.05310507366667 +/- 3.8000 or FPS = 6.894026842735562
            pyav_benchmark: time_for_all_frames: = 11.252520618333355 +/- 0.4434 or FPS = 88.86897735345923
            decord_sequential_cpu_benchmark: time_for_all_frames: = 14.550943369666621 +/- 0.3865 or FPS = 68.72406651548333
            decord_batch_cpu_benchmark: time_for_all_frames: = 48.644605762333335 +/- 0.1026 or FPS = 20.55726394177756
            imutils_benchmark: time_for_all_frames: = 10.025844253000022 +/- 0.0346 or FPS = 99.74222367365931
            camgears_benchmark: time_for_all_frames: = 15.025020987333354 +/- 0.0280 or FPS = 66.55564746585291
            camgears_with_queue_benchmark: time_for_all_frames: = 9.959157355666699 +/- 0.0184 or FPS = 100.41010140591926
            camgears_with_queue_official_benchmark: time_for_all_frames: = 15.041391945666646 +/- 0.0277 or FPS = 66.48320870915775
        no decode: (no ffmpeg_benchmark b/c my god... 7fps?! I got places to be, man!)
            ffmpeg_upgraded_benchmark: time_for_all_frames: = 11.23083394666667 +/- 0.0189 or FPS = 89.04058280523341
            max_possible_fps: time_for_all_frames: = 6.976269982000001 +/- 0.0256 or FPS = 143.3430762542412
            baseline_benchmark: time_for_all_frames: = 13.67885171133333 +/- 0.0123 or FPS = 73.10555162839222
            pyav_benchmark: time_for_all_frames: = 10.850826804333328 +/- 0.0254 or FPS = 92.15887581955002
            decord_sequential_cpu_benchmark: time_for_all_frames: = 13.954143310333345 +/- 0.2679 or FPS = 71.6633029889752
            decord_batch_cpu_benchmark: time_for_all_frames: = 48.77683590600001 +/- 0.3685 or FPS = 20.501534825406555
            imutils_benchmark: time_for_all_frames: = 9.931557595333325 +/- 0.0433 or FPS = 100.6891406912732
            camgears_benchmark: time_for_all_frames: = 15.023964940666664 +/- 0.0230 or FPS = 66.56032571623045
            camgears_with_queue_benchmark: time_for_all_frames: = 9.891995607333342 +/- 0.0200 or FPS = 101.09183623764035
            camgears_with_queue_official_benchmark: time_for_all_frames: = 15.059191112333318 +/- 0.0774 or FPS = 66.40462907606044

All that said... When I run the little "Example Usage" code I've got at the bottom of my script on the file from assets I get:

Read 3488 frames at (1920, 1080) resolution from 'C:\Users\Roninpawn\PycharmProjects\benchmarking_video_reading_python-feature-ffmmpeg_optimisation\assets\video_1920x1080.mkv' in 16.235 seconds.
Effective read rate of 215 frames per second.

And when I add in the numpy + OpenCV color space conversion to that code I get:

Read 3488 frames at (1920, 1080) resolution from 'C:\Users\Roninpawn\PycharmProjects\benchmarking_video_reading_python-feature-ffmmpeg_optimisation\assets\video_1920x1080.mkv' in 24.157 seconds.
Effective read rate of 144 frames per second.

Which is weird to me because the output of your tests says that full access happened in 11.23s/it = 89fps. I know you're testing various blocking processes, but all three iterations reported the same 11.22/23 seconds per iteration. So the time is faster than the 16 seconds my little Example script produces, but 3488 frames divided by 11.23 seconds works out to ~310fps, not 89. Still, my script was definitely beat by pyav and the two queue'd methods in the 1080 test.

So, simply, I clearly don't understand what magics you've done here. ;)

That said, I'm super happy to see that my script for FFmpeg now holds its own across the non-1080p tests. Even with in-line Numpy & OpenCV conversion, in the 720x480 test FFmpeg only loses 15fps to the camgears_with_queue test.

Which necessarily makes me even more curious to ask than I already was: What is the nature of the 'queue' method you helped implement? And... might it help empty the FFmpeg pipe faster as well?

There's more to be said, but this is a lot already.

bml1g12 commented 2 years ago

Great stuff, I'll take a look in detail as soon as I can.

Which is weird to me because the output of your tests says that full access happened in 11.23s/it = 89fps. I know you're testing various blocking processes, but all three iterations reported the same 11.22/23 seconds per iteration. So the time is faster than the 16 seconds my little Example script produces, but 3488 frames divided by 11.23 seconds works out to ~310fps, not 89. Still, my script was definitely beat by pyav and the two queue'd methods in the 1080 test.

Each benchmark can be run in three ways:

        echo "Running unblocked (no consumer bottleneck) benchmark"
        python video_reading_benchmarks/main.py --isiolimited --duration 0 --inputvideo $filename
        echo "Running IO limited benchmark"
        python video_reading_benchmarks/main.py --isiolimited --inputvideo $filename
        echo "Running CPU limited benchmark"
        python video_reading_benchmarks/main.py --inputvideo $filename 

i.e. the command line flags correspond to this key in the config dictionary

        "consumer_blocking_config": {"io_limited": False,
                                     "duration": args.duration},

What parameters did you use for your above benchmarks (which use case)? If the duration is 0 and io_limited is False, then it basically means just doing the video reading as fast as possible (the "unblocked" case), and I would expect it to give timings very similar to your little "example usage" code, assuming it's the same video and output resolution etc.

Which necessarily makes me even more curious to ask than I already was: What is the nature of the 'queue' method you helped implement? And... might it help empty the FFmpeg pipe faster as well?

I can quickly answer this one. The imutils and camgears_with_queue are very similar implementations: they both just use the OpenCV VideoCapture class, but have one thread that reads frames onto a queue, and then when the master thread wants to process a frame, it can simply take it from that queue. This allows us to read in frames at the same time as processing them.

i.e. they are simple:
1) A thread that puts RGB image arrays into queue 1
2) A master thread that pulls them off queue 1 and processes them

In your FFMPEG version, I'd have to think on it more carefully, but I suspect we can do the same for a speedup - just putting the raw arrays onto a queue and processing them on the master. Interestingly, given your script has an extra couple of processing steps (mandatory if we work in RGB), it might be possible to go one step further and have (see the sketch after this list):

1) A thread putting raw arrays onto queue 1
2) A thread pulling from queue 1, processing the frames, and putting the RGB frames onto queue 2
3) A master thread that reads from queue 2

So there's likely scope for optimisation.
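A minimal sketch of that three-stage idea (the names and queue sizes are illustrative, and cap is assumed to be your VideoStream object with the read() API shown earlier):

    import queue
    import threading

    import cv2
    import numpy as np

    raw_q = queue.Queue(maxsize=64)  # stage 1 -> stage 2
    bgr_q = queue.Queue(maxsize=64)  # stage 2 -> master

    def reader(cap):
        """Stage 1: pull raw YUV frames off the FFmpeg pipe as fast as it fills."""
        while True:
            eof, img = cap.read()
            raw_q.put(None if eof else img)
            if eof:
                break

    def converter(w, h):
        """Stage 2: reshape and colour-convert off the master thread."""
        while True:
            img = raw_q.get()
            if img is None:
                bgr_q.put(None)  # forward the end-of-stream marker
                break
            arr = np.frombuffer(img, np.uint8).reshape(int(h * 1.5), w)
            bgr_q.put(cv2.cvtColor(arr, cv2.COLOR_YUV2BGR_I420))

    threading.Thread(target=reader, args=(cap,), daemon=True).start()
    threading.Thread(target=converter, args=(720, 480), daemon=True).start()
    # The master thread then just loops on bgr_q.get() until it sees None.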

roninpawn commented 2 years ago

I couldn't execute 'run.sh', so I just invoked main.py directly, one run at a time, changing the file name in the default configuration when I finished each resolution. So whatever the PARSER does by default -- all my tests were that.

I wanted to add that before I got your benchmark suite going, this was my 'quick and dirty' version of what I thought I was seeing in your baseline test:

import cv2
from time import time

cap = cv2.VideoCapture(r"C:\Users\Roninpawn\PycharmProjects\benchmarking_video_reading_python-feature-ffmmpeg_optimisation\assets\video_1920x1080.mkv")
raw_frames_read = 0

print("Reading file...")
timer = time()
while True:
    ret, img = cap.read()
    if not ret:
        break
    raw_frames_read += 1
timer = time() - timer

print(round(timer, 3), raw_frames_read, round(raw_frames_read / timer, 3))

And it gives me a processing time of ~21 seconds at a rate of ~167fps, while your benchmark's baseline - in the figures above - finishes in 13.6 seconds and reports a framerate of 73fps. Can you give me a quick idea of how the test works? Specifically, how it finishes an iteration at nearly twice the speed of the simple code above but reports less than half the frame rate?

I'll be working on other things today. So I hope to chat with you more about this stuff soon.

roninpawn commented 2 years ago

I just integrated a simple queued threading, taking bits and pieces from the code you've got in the camgear_queue benchmark. And even doing the color space conversion in line (without its own third thread), I see a bump from 148fps to 157fps on my example test with the mp4 I've used for testing throughout development.

I really like the idea of threading it all out, like you suggest, including the showinfo data. And I've started to imagine a setup where the raw bytes from the stream are made available alongside an optional color conversion. I guess I'm thinking of some kind of managed queue exchange. Where pulling a frame empties all standing queues into a per-frame access space. current_frame = (color, raw, showinfo), but with a cleaner access method than that.

Of course, all methods would still be throttled by the slowest invoked, but it makes my brain happy to conceptualize it this way. Seems to make clear what would need to happen. I'll let you know if I go on to thread out the OpenCV conversion and what kind of results I see.

bml1g12 commented 2 years ago

And it gives me a processing time of ~21 seconds at a rate of ~167fps. While your benchmark's baseline - in the figures above - finishes in 13.6 seconds and reports with a framerate of 73fps. Can you give me a quick idea of how the test works? Specifically, how it finishes an iteration at nearly twice the speed of the simple code above, but reports less than half the frame rate?

By default my benchmark reads only the first 1000 frames, which would explain the faster runtime. As for the difference in FPS, I cannot reproduce that. I tried running the following code in the main of benchmarks.py:


if __name__ == "__main__":

    import cv2
    import time

    cap = cv2.VideoCapture(str(Path(video_reading_benchmarks.__file__).parent.parent.joinpath(
                "assets/video_1920x1080.mkv")))

    raw_frames_read = 0
    print("Reading file...")
    t0 = time.time()
    while True:
        ret, img = cap.read()
        if not ret:
            break
        raw_frames_read += 1
        if raw_frames_read == 1000:
            break
    time_taken = time.time() - t0
    print("Time Taken", round(time_taken, 3), \
          "Frames Read:", raw_frames_read, \
          "FPS", round(raw_frames_read / time_taken, 3))

    from video_reading_benchmarks.shared import get_timings

    CONFIG = {
        "video_path":
            str(Path(video_reading_benchmarks.__file__).parent.parent.joinpath(
                "assets/video_1920x1080.mkv")),
        "n_frames": 1000,
        "repeats": 3,
        "resize_shape": False,  # (320, 240),
        "show_img": False,
        "downsample": 1,
        "consumer_blocking_config": {"io_limited": False,
                                     "duration": 0},
    }
    baseline_benchmark(CONFIG)
    metagroupname = "video_reading_benchmarks.benchmarks"
    timings = get_timings("__main__", "baseline_benchmark",
                          times_calculated_over_n_frames=CONFIG["n_frames"])
    print(timings)

Which gives

python video_reading_benchmarks/benchmarks.py
__main__
Reading file...
Time Taken 3.746 Frames Read: 1000 FPS 266.929
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 276.33it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 280.25it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 297.54it/s]
3it [00:10,  3.54s/it]█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌    | 977/1000 [00:03<00:00, 296.19it/s]
__main__
baseline_benchmark: time_for_all_frames: = 3.544231232003464 +/- 0.1128 or FPS = 282.1486338053416
{'groupname': 'baseline_benchmark', 'time_per_frame': '0.0035', 'time_for_all_frames': 3.544231232003464, 'stddev_for_all_frames': '0.1128', 'fps': '282.1486338053416'}

i.e. about 280 FPS and 3.5 seconds in both cases.

My best guess is that maybe your machine ran low on RAM at some point during the 13.6 seconds, and that dropped the FPS?

bml1g12 commented 2 years ago

Ah, one other possible explanation is that after the first 1000 frames there are some frames that decode slower (due to more activity/detail in the video) - so by limiting both benchmarks to 1000 frames we get similar results.

bml1g12 commented 2 years ago

Closed as I was unable to reproduce in https://github.com/bml1g12/benchmarking_video_reading_python/pull/2 https://colab.research.google.com/drive/1V9nv3Jn1_rfIcvvsyRucZA3IC3wzUSN-?usp=sharing

AstonishedByTheLackOfCake commented 1 year ago

It might also be worth us investigating numpy converters from YUV (https://gist.github.com/Quasimondo/c3590226c924a06b276d606f4f189639) and whether parallelising the colorspace conversion is much help here.

ill-advised for getting it INTO my hands

Are you aware of any use cases whereby a Python user might want to obtain a YUV-format frame from a video file rather than an RGB/BGR one? (Aside from the case of simply viewing the video unmodified -- but in that case I guess they wouldn't be loading it into Python at all, as they could use a standard viewer.) One use case I can imagine, maybe similar to your ongoing one with PyGame, would be if the user wishes to pass the video data into another piece of Python software in a frame-dependent manner.

Your code's showinfo=True sounds very intriguing. I wonder if it can reveal .mkv fragment tags, as I could imagine this being used as a way of analysing e.g. Kinesis Video Streams with metadata attached. For example, if a video is annotated with events at certain frames, one could use Python to extract those frames without needing to decode them into RGB/BGR.

I have a use case where I'm extracting I-frames from videos and turning them into hashes for later analysis. I have absolutely no need for RGB/BGR conversions, or even for un-correcting gamma; I'd much prefer to just grab the raw 8 bits of the gamma-corrected luma component Y' into an array and generate my perceptual hashes from that.
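For what it's worth, a sketch of the sort of thing I mean: a plain average hash (aHash) computed straight off the raw Y' plane, assuming y_plane is the full-resolution luma array pulled from the bytestream:

    import cv2
    import numpy as np

    def ahash_from_luma(y_plane):
        """Average-hash a raw Y' plane: shrink to 8x8, threshold at the mean, pack to 64 bits."""
        small = cv2.resize(y_plane, (8, 8), interpolation=cv2.INTER_AREA)
        bits = (small > small.mean()).flatten()
        return int.from_bytes(np.packbits(bits).tobytes(), "big")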