Closed NevermindNilas closed 16 hours ago
Here are some of my internal tests on pure decode. decord, at the very least, topped all results, performing nearly as well as plain FFMPEG with no extra tasks.
Try this:

```python
import video_reader
import cv2
from tqdm import tqdm

videoname = "/home/nilas/Downloads/videoplayback.mp4"
video_shape = video_reader.get_shape(videoname)

for i in tqdm(range(video_shape[0])):
    frame = video_reader.get_batch(videoname, [i])
    # frame shape will be (1, H, W, C), so squeeze to drop the first dim
    cv2.imshow("frame", frame.squeeze())
    cv2.waitKey(1)
```
Note: to improve performance you should probably fetch multiple frames at once; you just need to find an acceptable chunk size. You could start with 64 frames, maybe? That way you fully use the `get_batch` method to get 64 frames at once.
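To experiment with chunk sizes, the index lists passed to `get_batch` can come from a small helper; this is a minimal pure-Python sketch (`chunked_indices` is a hypothetical name, not part of video_reader):

```python
def chunked_indices(total_frames, chunk_size):
    """Yield lists of consecutive frame indices, each at most chunk_size long."""
    for start in range(0, total_frames, chunk_size):
        yield list(range(start, min(start + chunk_size, total_frames)))

# Example: 10 frames in chunks of 4
chunks = list(chunked_indices(10, 4))
```

Each yielded list can then be handed straight to `get_batch` in place of the single-index `[i]` above.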
```text
$ python test.py
  0%|          | 0/3649 [00:00<?, ?it/s]thread '<unnamed>' panicked at src/video_io.rs:479:21:
No Reducer to get the frames
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
  0%|          | 0/3649 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/nilas/Downloads/test/test.py", line 10, in <module>
    frame = video_reader.get_batch(videoname, [i])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: No Reducer to get the frames
```
Works fine here. Maybe you need to `git pull` again if you cloned the repo before the last commits?
All up to date on my side :thinking:
Ok, I see. It looks like video_reader is falling back to `get_batch_safe` because it did not like what was found in the metadata of the video. Can you share the video? Meanwhile you can replace `get_batch` in the code snippet above with `get_batch_nofb`.
So:

```python
import video_reader
import cv2
from tqdm import tqdm

videoname = "/home/nilas/Downloads/videoplayback.mp4"
video_shape = video_reader.get_shape(videoname)

for i in tqdm(range(video_shape[0])):
    frame = video_reader.get_batch_nofb(videoname, [i])
    # frame shape will be (1, H, W, C), so squeeze to drop the first dim
    cv2.imshow("frame", frame.squeeze())
    cv2.waitKey(1)
```
It was a random video I got from YouTube; sure, I can.
GitHub gave me a flat NO, so I uploaded it here; hopefully it doesn't alter the video metadata, otherwise I can try to do it through Discord. https://gofile.io/d/dgiBAP
So I've pushed some changes: the `get_batch` method no longer falls back to decoding without seeking unless you explicitly ask for it with `get_batch(filename, indices, with_fallback=True)`. Thus I removed the `get_batch_nofb` method, as it is redundant.
Coming back to the previous example, you can do something like this to be a bit more efficient and get chunks of frames instead of going frame by frame:
```python
import sys
from time import time

import cv2
import video_reader
from tqdm import tqdm

videoname = sys.argv[1]
chunk_size = int(sys.argv[2])

video_shape = video_reader.get_shape(videoname)
print("Video shape:", video_shape)

start = time()
for i in tqdm(range(0, video_shape[0], chunk_size)):
    indices = list(range(i, min(i + chunk_size, video_shape[0])))
    frames = video_reader.get_batch(videoname, indices)
    for j in range(len(indices)):
        cv2.imshow("frame", frames[j])
        cv2.waitKey(1)
print(f"Done in {time() - start} seconds")
```
and then run it with `python script.py /path/to/your/video.mp4 200` if you want chunks of 200 frames. Memory consumption remains very low in my experience, even with bigger chunk sizes.
After some small modifications for simplicity's sake:
```python
import video_reader
from tqdm import tqdm

videoname = "/home/nilas/Downloads/videoplayback.mp4"
chunk_size = 500

video_shape = video_reader.get_shape(videoname)
print("Video shape:", video_shape)

for i in tqdm(range(0, video_shape[0], chunk_size)):
    indices = list(range(i, min(i + chunk_size, video_shape[0])))
    frames = video_reader.get_batch(videoname, indices)
```
I get the following it/s at a batch size of 500:

```text
100%|█████████████████████████████████████████████████████████████████████████████████| 8/8 [00:17<00:00,  2.16s/it]
```
So video_reader-rs got approximately 250 FPS using the script you've shared, with very small modifications on my side. (I did test a batch size of 1000 as well but saw no major difference.)
Using the command:

```shell
ffmpeg -i videoplayback.mp4 -f null -
```

I get the following:

```text
frame= 3649 fps=1475 q=-0.0 Lsize=N/A time=00:02:01.63 bitrate=N/A speed=49.2x
```
(Of course this is just to know the theoretical maximum; it doesn't account for numpy and other conversions.)
Benchmarking with a similar setup and decord:

```python
import decord
from tqdm import tqdm

videoname = "/home/nilas/Downloads/videoplayback.mp4"
chunk_size = 500

vr = decord.VideoReader(videoname)
video_length = len(vr)
print("Video length:", video_length)

for i in tqdm(range(0, video_length, chunk_size)):
    end = min(i + chunk_size, video_length)
    frames = vr.get_batch(range(i, end)).asnumpy()  # turn it into a numpy array for fairness
```
```text
$ python test.py
Video length: 3649
100%|███████████████████████████████████████████████████████| 8/8 [00:07<00:00,  1.06it/s]
```
This should equate to roughly 500 FPS, which is still somewhat lower than my theoretical maximum of 590-600, but it suffices for now.
Of course I tested your memory claims as well and saw significantly better memory usage with video_reader-rs. I didn't go out of my way to log it, but I can confirm that claim 100%.
I did a very quick test using FFMPEG + subprocess:

```python
import subprocess

import cv2
import numpy as np
from tqdm import tqdm

video_name = "/home/nilas/Downloads/videoplayback.mp4"

ffmpeg_command = [
    'ffmpeg',
    '-i', video_name,
    '-f', 'image2pipe',
    '-pix_fmt', 'rgb24',
    '-vcodec', 'rawvideo',
    '-',
]
process = subprocess.Popen(
    ffmpeg_command, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, bufsize=10**8
)

cap = cv2.VideoCapture(video_name)
total_frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
cap.release()

# hardcoded for this specific 1080p video
frame_width = 1920
frame_height = 1080
bytes_per_frame = frame_width * frame_height * 3  # rgb24: 3 bytes per pixel

for _ in tqdm(range(int(total_frame_count))):
    raw_frame = process.stdout.read(bytes_per_frame)
    if not raw_frame:
        break
    frame = np.frombuffer(raw_frame, dtype=np.uint8).reshape((frame_height, frame_width, 3))

process.stdout.close()
process.wait()
```
The reported performance seems to be:

```text
$ python test.py
100%|████████████████████████████████████████████████████████████████████████████| 3649/3649 [00:05<00:00, 640.02it/s]
```
640 FPS is weird, because I am only getting approximately 440 FPS on Windows. Interesting...
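For what it's worth, the rawvideo pipe math above can be sanity-checked without running ffmpeg: with `-pix_fmt rgb24` every frame is exactly width × height × 3 bytes, so a buffer of that size reshapes cleanly. A minimal sketch with a tiny fake resolution standing in for 1920x1080:

```python
import numpy as np

frame_width, frame_height = 4, 2  # tiny stand-in for 1920x1080
bytes_per_frame = frame_width * frame_height * 3  # rgb24: 3 bytes per pixel

raw_frame = bytes(range(bytes_per_frame))  # fake bytes as read from process.stdout
frame = np.frombuffer(raw_frame, dtype=np.uint8).reshape((frame_height, frame_width, 3))
```

If the read ever returns fewer bytes than `bytes_per_frame`, the reshape fails, which is why the loop above breaks on a short read.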
Thanks for the benchmark. Indeed, this use case is not exactly what I had in mind when developing video_reader. I think there are 2 differences here with decord:

1. `video_reader.get_batch()` will instantiate the VideoReader struct (on the Rust side) each time it is called, so it will get all the metadata from the video each time, including the key-frame information. That has a small cost, but might still have an impact in the end.
2. `video_reader.get_batch()` is single-threaded at the moment, because it was easier to implement and the performance difference on my use case was almost negligible: getting a small sub-clip of the video, like 32 or 64 frames, at a resolution of 256 to 512. So it is still nice that it is only 2x slower than decord on a big video, while decord is multi-threaded AFAIK.

I could try to solve the first point by creating a VideoReader Python class similar to the decord one. Regarding the second point, it will probably take more time.
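The first point could look something like this: a thin Python wrapper that fetches the metadata once and reuses it across batch calls. This is a hypothetical sketch, not the actual API; the backend functions are injected so the example stays self-contained (the fakes below stand in for the Rust-side calls):

```python
class CachedVideoReader:
    """Hypothetical sketch: read metadata once, reuse it for every batch call."""

    def __init__(self, filename, get_shape, get_batch):
        self.filename = filename
        self.shape = get_shape(filename)  # metadata fetched a single time
        self._get_batch = get_batch

    def __len__(self):
        return self.shape[0]

    def get_batch(self, indices):
        # metadata is already cached; only decoding happens per call
        return self._get_batch(self.filename, indices)


# Fake backend standing in for the Rust functions, to make the sketch runnable
calls = {"shape": 0}

def fake_get_shape(name):
    calls["shape"] += 1
    return (8, 1080, 1920)

def fake_get_batch(name, indices):
    return [f"frame-{i}" for i in indices]

vr = CachedVideoReader("video.mp4", fake_get_shape, fake_get_batch)
batches = [vr.get_batch([0, 1]), vr.get_batch([2, 3])]
```

The point of the design is visible in the counter: two `get_batch` calls, but the shape/metadata lookup ran only once.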
Also, what is your use case exactly that requires processing frames at 500 FPS? Any ML model that can do inference this fast? :smile:
Hi, don't sweat it: first achieve the goal you've envisioned, and then maybe consider edge cases like mine.
I am developing a little 'infrastructure/arch' for super-fast ML/AI-related workloads such as upscaling, interpolation, and depth-map extraction.
https://github.com/NevermindNilas/TheAnimeScripter/
Currently it has gotten a false DMCA report from someone, so sadly the repository is down for the time being. To elaborate, the only 2 tasks that reach such insane numbers are interpolation with RIFE (specifically TensorRT) and a couple of algorithms such as SSIM-CUDA / MSE.
Here's a benchmark a friend of mine ran with a 7950X/4090:
Honestly, decode is still fast enough to not be a real bottleneck, but as time goes on and I fix a couple of things I will get near the decode/encode limits (I am already heavily limited by encode speeds, honestly).
But to put it simply, if it can get faster, why not?
Oh yes, I forgot to mention: that's a relatively old version of my script/benchmark.
Nowadays SSIM-CUDA can reach as much as 350 FPS on my 3090, and SSIM I think peaked at 450(?). I am still ironing things out, but eventually I hope to outperform more robust libraries like VapourSynth.
As a matter of fact, the reason I can't achieve even higher numbers for TensorRT and CUDA specifically is that transferring data from CPU to GPU takes long enough to cap my performance. I need to figure out a way to transfer data without blocking, while also making sure the data is 'safe' and not riddled with visual errors/artifacts. It's a pain.
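One common pattern for hiding that transfer cost is a small producer/consumer pipeline: a background thread stages the next chunk while the current one is being processed. Here is a minimal pure-Python sketch of the idea; the "transfer" and "inference" steps are placeholders, not real CUDA calls:

```python
import queue
import threading

def producer(frames, q):
    # stands in for the host->device copy of each chunk
    for frame in frames:
        q.put(frame)
    q.put(None)  # sentinel: no more frames

frames = list(range(6))
q = queue.Queue(maxsize=2)  # small buffer: producer stays at most 2 chunks ahead

t = threading.Thread(target=producer, args=(frames, q))
t.start()

processed = []
while True:
    item = q.get()
    if item is None:
        break
    processed.append(item * 10)  # stands in for GPU inference on the chunk

t.join()
```

The bounded queue is what keeps the data "safe": the producer can never overwrite a chunk the consumer hasn't taken yet, at the cost of blocking when it gets too far ahead.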
Deduplication using MSE and no encode + FFMPEG Subprocess speed
Just pure decode benchmarking ( deduplication logic is commented out since my script doesn't yet have a direct benchmark method for decode )
You can see why I am shooting for the stars 🤣
Something like this instantly causes a memory overflow, which is understandable, since basically all frames are stored inside one huge N-dimensional numpy array before any processing is done.
The problem arises when the workflow requires, let's say, a movie to be processed, which may have hundreds of thousands of frames to store. This would make the decode function virtually useless, since unless you went out of your way to download RAM off of Google, you won't be able to process it reliably on most machines.
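The blow-up can be avoided by streaming chunks through a generator instead of materializing every frame up front, so only `chunk_size` frames are alive at a time. A minimal sketch with a fake decoder (`decode_chunk` is a stand-in, not a real video_reader function):

```python
def decode_chunk(indices):
    # stand-in for a real batched decode call
    return [f"frame-{i}" for i in indices]

def stream_frames(total_frames, chunk_size):
    """Yield decoded chunks lazily; at most chunk_size frames are held at once."""
    for start in range(0, total_frames, chunk_size):
        indices = range(start, min(start + chunk_size, total_frames))
        yield decode_chunk(indices)

seen = 0
for chunk in stream_frames(10, 4):
    assert len(chunk) <= 4  # memory is bounded by the chunk size
    seen += len(chunk)
```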
Having a `.read` or `.get_frame` function that simply returns a numpy array of shape [1, h, w, 3] would suffice; something like that in theory could work just fine. It would be slightly harder, but with subprocesses, at least in Python + ffmpeg pipes, it can be done, and it is relatively speedy (13700K, 1080p, ffmpeg 7 results are about 440 FPS +- for pure decode).
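Such a `.get_frame`-style API could be sketched as a thin wrapper around any binary stream that yields one fixed-size frame per call. This is a hypothetical sketch, using `io.BytesIO` in place of an actual ffmpeg pipe and a toy 2x2 "resolution"; `PipeFrameReader` is an invented name, not part of any library:

```python
import io

class PipeFrameReader:
    """Hypothetical sketch: read one raw rgb24 frame per call from a byte stream."""

    def __init__(self, stream, width, height):
        self.stream = stream
        self.bytes_per_frame = width * height * 3  # rgb24

    def get_frame(self):
        raw = self.stream.read(self.bytes_per_frame)
        if len(raw) < self.bytes_per_frame:
            return None  # end of stream
        # in practice: np.frombuffer(raw, np.uint8).reshape((1, h, w, 3))
        return raw

# Toy stream holding exactly two 2x2 frames (2 * 2 * 3 = 12 bytes each)
stream = io.BytesIO(bytes(24))
reader = PipeFrameReader(stream, width=2, height=2)

frames = []
while (f := reader.get_frame()) is not None:
    frames.append(f)
```

Swapping `io.BytesIO` for `process.stdout` from the earlier ffmpeg subprocess snippet would give the frame-by-frame interface described above, with memory bounded to one frame.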