Closed NevermindNilas closed 16 hours ago
Here are some of my internal tests on pure decode. decord, at the very least, topped all results, performing nearly as well as plain FFMPEG with no extra tasks.
Try this:

```python
import video_reader
import cv2
from tqdm import tqdm

videoname = "/home/nilas/Downloads/videoplayback.mp4"
video_shape = video_reader.get_shape(videoname)

for i in tqdm(range(video_shape[0])):
    frame = video_reader.get_batch(videoname, [i])
    # frame shape will be (1, H, W, C), so squeeze to drop the first dim
    cv2.imshow("frame", frame.squeeze())
    cv2.waitKey(1)
```
Note: to improve performance you should probably fetch multiple frames at once; you just need to find an acceptable chunk size. You could start with 64 frames, maybe? That way you fully use the `get_batch` method to get 64 frames at once.
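To experiment with chunk sizes, the index lists passed to `get_batch` can come from a small helper; this is a minimal pure-Python sketch (`chunked_indices` is a hypothetical name, not part of video_reader):

```python
def chunked_indices(total_frames, chunk_size):
    """Yield lists of consecutive frame indices, each at most chunk_size long."""
    for start in range(0, total_frames, chunk_size):
        yield list(range(start, min(start + chunk_size, total_frames)))

# Example: 10 frames in chunks of 4
chunks = list(chunked_indices(10, 4))
```

Each yielded list can then be handed straight to `get_batch` in place of the single-index `[i]` above.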
```text
$ python test.py
  0%|          | 0/3649 [00:00<?, ?it/s]thread '<unnamed>' panicked at src/video_io.rs:479:21:
No Reducer to get the frames
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
  0%|          | 0/3649 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/nilas/Downloads/test/test.py", line 10, in <module>
    frame = video_reader.get_batch(videoname, [i])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: No Reducer to get the frames
```
Works fine here. Maybe you need to `git pull` again if you cloned the repo before the last commits?
All up to date on my side :thinking:
Ok, I see. It looks like video_reader is falling back to `get_batch_safe` because it did not like what was found in the metadata of the video. Can you share the video? Meanwhile you can replace `get_batch` in the code snippet above with `get_batch_nofb`.
So:

```python
import video_reader
import cv2
from tqdm import tqdm

videoname = "/home/nilas/Downloads/videoplayback.mp4"
video_shape = video_reader.get_shape(videoname)

for i in tqdm(range(video_shape[0])):
    frame = video_reader.get_batch_nofb(videoname, [i])
    # frame shape will be (1, H, W, C), so squeeze to drop the first dim
    cv2.imshow("frame", frame.squeeze())
    cv2.waitKey(1)
```
It was a random video I got from YouTube; sure, I can.
GitHub gave me a flat NO, so I uploaded it here; hopefully it doesn't alter the video metadata, otherwise I can try to do it through Discord. https://gofile.io/d/dgiBAP
So I've pushed some changes: the `get_batch` method no longer falls back to decoding without seeking unless you explicitly ask for it with `get_batch(filename, indices, with_fallback=True)`. Thus I removed the `get_batch_nofb` method, as it is redundant.
Coming back to the previous example, you can do something like this to be a bit more efficient and get chunks of frames instead of going frame by frame:
```python
import sys
from time import time

import cv2
import video_reader
from tqdm import tqdm

videoname = sys.argv[1]
chunk_size = int(sys.argv[2])

video_shape = video_reader.get_shape(videoname)
print("Video shape:", video_shape)

start = time()
for i in tqdm(range(0, video_shape[0], chunk_size)):
    indices = list(range(i, min(i + chunk_size, video_shape[0])))
    frames = video_reader.get_batch(videoname, indices)
    for j in range(len(indices)):
        cv2.imshow("frame", frames[j])
        cv2.waitKey(1)
print(f"Done in {time() - start} seconds")
```
and then run it with `python script.py /path/to/your/video.mp4 200` if you want chunks of 200 frames. Memory consumption remains very low in my experience, even with bigger chunk sizes.
After some small modifications for simplicity's sake:
```python
import video_reader
from tqdm import tqdm

videoname = "/home/nilas/Downloads/videoplayback.mp4"
chunk_size = 500

video_shape = video_reader.get_shape(videoname)
print("Video shape:", video_shape)

for i in tqdm(range(0, video_shape[0], chunk_size)):
    indices = list(range(i, min(i + chunk_size, video_shape[0])))
    frames = video_reader.get_batch(videoname, indices)
```
I get the following it/s at a batch size of 500:

```text
100%|█████████████████████████████████████████████████████████████████████████████████| 8/8 [00:17<00:00,  2.16s/it]
```
So video_reader-rs got approximately 250 FPS using the script you've shared, with very small modifications on my side. (I did test a batch size of 1000 as well but saw no major difference.)
Using the command:

```shell
ffmpeg -i videoplayback.mp4 -f null -
```

I get the following:

```text
frame= 3649 fps=1475 q=-0.0 Lsize=N/A time=00:02:01.63 bitrate=N/A speed=49.2x
```
(Of course this is just to know the theoretical maximum; it doesn't account for numpy and other conversions.)
Benchmarking with a similar setup and decord:

```python
import decord
from tqdm import tqdm

videoname = "/home/nilas/Downloads/videoplayback.mp4"
chunk_size = 500

vr = decord.VideoReader(videoname)
video_length = len(vr)
print("Video length:", video_length)

for i in tqdm(range(0, video_length, chunk_size)):
    end = min(i + chunk_size, video_length)
    frames = vr.get_batch(range(i, end)).asnumpy()  # turn it into a numpy array for fairness
```
```text
$ python test.py
Video length: 3649
100%|███████████████████████████████████████████████████████| 8/8 [00:07<00:00,  1.06it/s]
```
This should equate to roughly 500 FPS, which is still somewhat lower than my theoretical maximum of 590-600, but it suffices for now.
Of course I tested your memory claims as well and saw significantly better memory usage with video_reader-rs. I didn't go out of my way to log it, but I can confirm that claim 100%.
I did a very quick test using FFMPEG + subprocess:

```python
import subprocess

import cv2
import numpy as np
from tqdm import tqdm

video_name = "/home/nilas/Downloads/videoplayback.mp4"

ffmpeg_command = [
    'ffmpeg',
    '-i', video_name,
    '-f', 'image2pipe',
    '-pix_fmt', 'rgb24',
    '-vcodec', 'rawvideo',
    '-',
]
process = subprocess.Popen(
    ffmpeg_command, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, bufsize=10**8
)

cap = cv2.VideoCapture(video_name)
total_frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
cap.release()

# hardcoded for this specific 1080p video
frame_width = 1920
frame_height = 1080
bytes_per_frame = frame_width * frame_height * 3  # rgb24: 3 bytes per pixel

for _ in tqdm(range(int(total_frame_count))):
    raw_frame = process.stdout.read(bytes_per_frame)
    if not raw_frame:
        break
    frame = np.frombuffer(raw_frame, dtype=np.uint8).reshape((frame_height, frame_width, 3))

process.stdout.close()
process.wait()
```
The reported performance seems to be:

```text
$ python test.py
100%|████████████████████████████████████████████████████████████████████████████| 3649/3649 [00:05<00:00, 640.02it/s]
```
640 FPS is weird, because I am only getting approximately 440 FPS on Windows. Interesting...
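For what it's worth, the rawvideo pipe math above can be sanity-checked without running ffmpeg: with `-pix_fmt rgb24` every frame is exactly width × height × 3 bytes, so a buffer of that size reshapes cleanly. A minimal sketch with a tiny fake resolution standing in for 1920x1080:

```python
import numpy as np

frame_width, frame_height = 4, 2  # tiny stand-in for 1920x1080
bytes_per_frame = frame_width * frame_height * 3  # rgb24: 3 bytes per pixel

raw_frame = bytes(range(bytes_per_frame))  # fake bytes as read from process.stdout
frame = np.frombuffer(raw_frame, dtype=np.uint8).reshape((frame_height, frame_width, 3))
```

If the read ever returns fewer bytes than `bytes_per_frame`, the reshape fails, which is why the loop above breaks on a short read.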
Thanks for the benchmark. Indeed, this use case is not exactly what I had in mind when developing video_reader. I think there are 2 differences here with decord:

1. `video_reader.get_batch()` will instantiate the VideoReader struct (on the Rust side) each time it is called, so it will get all the metadata from the video each time, including the key-frame information. That has a small cost, but might still have an impact in the end.
2. `video_reader.get_batch()` is single-threaded at the moment, because it was easier to implement and the performance difference on my use case was almost negligible: getting a small sub-clip of the video, like 32 or 64 frames, at a resolution of 256 to 512. So it is still nice that it is only 2x slower than decord on a big video, while decord is multi-threaded AFAIK.

I could try to solve the first point by creating a VideoReader Python class similar to the decord one. Regarding the second point, it will probably take more time.
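The first point could look something like this: a thin Python wrapper that fetches the metadata once and reuses it across batch calls. This is a hypothetical sketch, not the actual API; the backend functions are injected so the example stays self-contained (the fakes below stand in for the Rust-side calls):

```python
class CachedVideoReader:
    """Hypothetical sketch: read metadata once, reuse it for every batch call."""

    def __init__(self, filename, get_shape, get_batch):
        self.filename = filename
        self.shape = get_shape(filename)  # metadata fetched a single time
        self._get_batch = get_batch

    def __len__(self):
        return self.shape[0]

    def get_batch(self, indices):
        # metadata is already cached; only decoding happens per call
        return self._get_batch(self.filename, indices)


# Fake backend standing in for the Rust functions, to make the sketch runnable
calls = {"shape": 0}

def fake_get_shape(name):
    calls["shape"] += 1
    return (8, 1080, 1920)

def fake_get_batch(name, indices):
    return [f"frame-{i}" for i in indices]

vr = CachedVideoReader("video.mp4", fake_get_shape, fake_get_batch)
batches = [vr.get_batch([0, 1]), vr.get_batch([2, 3])]
```

The point of the design is visible in the counter: two `get_batch` calls, but the shape/metadata lookup ran only once.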
Also, what is your use case exactly that requires processing frames at 500 FPS? Any ML model that can do inference this fast? :smile:
Hi, don't sweat it: first achieve the goal you've envisioned, and then maybe consider edge cases like mine.
I am developing a little 'infrastructure/arch' for super-fast ML/AI-related workloads such as upscaling, interpolation, and depth-map extraction.
https://github.com/NevermindNilas/TheAnimeScripter/
Currently it has gotten a false DMCA report from someone, so sadly the repository is down for the time being. To elaborate, the only 2 tasks that reach such insane numbers are interpolation with RIFE (specifically TensorRT) and a couple of algorithms such as SSIM-CUDA / MSE.
Here's a benchmark a friend of mine ran with a 7950X/4090:
Honestly, decode is still fast enough to not be a real bottleneck, but as time goes on and I fix a couple of things I will get near the decode/encode limits (I am already heavily limited by encode speeds, honestly).
But to put it simply, if it can get faster, why not?
Oh yes, I forgot to mention: that's a relatively old version of my script/benchmark.
Nowadays SSIM-CUDA can reach as much as 350 FPS on my 3090, and SSIM I think peaked at 450(?). I am still ironing things out, but eventually I hope to outperform more robust libraries like VapourSynth.
As a matter of fact, the reason I can't achieve even higher numbers for TensorRT and CUDA specifically is that transferring data from CPU to GPU takes long enough to cap my performance. I need to figure out a way to transfer data without blocking, while also making sure the data is 'safe' and not riddled with visual errors/artifacts. It's a pain.
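One common pattern for hiding that transfer cost is a small producer/consumer pipeline: a background thread stages the next chunk while the current one is being processed. Here is a minimal pure-Python sketch of the idea; the "transfer" and "inference" steps are placeholders, not real CUDA calls:

```python
import queue
import threading

def producer(frames, q):
    # stands in for the host->device copy of each chunk
    for frame in frames:
        q.put(frame)
    q.put(None)  # sentinel: no more frames

frames = list(range(6))
q = queue.Queue(maxsize=2)  # small buffer: producer stays at most 2 chunks ahead

t = threading.Thread(target=producer, args=(frames, q))
t.start()

processed = []
while True:
    item = q.get()
    if item is None:
        break
    processed.append(item * 10)  # stands in for GPU inference on the chunk

t.join()
```

The bounded queue is what keeps the data "safe": the producer can never overwrite a chunk the consumer hasn't taken yet, at the cost of blocking when it gets too far ahead.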
Deduplication using MSE and no encode + FFMPEG Subprocess speed
Just pure decode benchmarking ( deduplication logic is commented out since my script doesn't yet have a direct benchmark method for decode )
You can see why I am shooting for the stars 🤣
Something like this instantly causes a memory overflow, which is understandable, since basically all frames are stored inside one huge N-dimensional numpy array before any processing is done.
The problem arises when the workflow requires, let's say, a movie to be processed, which may have hundreds of thousands of frames to store. This would make the decode function virtually useless, since unless you went out of your way to download RAM off of Google, you won't be able to process it reliably on most machines.
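The blow-up can be avoided by streaming chunks through a generator instead of materializing every frame up front, so only `chunk_size` frames are alive at a time. A minimal sketch with a fake decoder (`decode_chunk` is a stand-in, not a real video_reader function):

```python
def decode_chunk(indices):
    # stand-in for a real batched decode call
    return [f"frame-{i}" for i in indices]

def stream_frames(total_frames, chunk_size):
    """Yield decoded chunks lazily; at most chunk_size frames are held at once."""
    for start in range(0, total_frames, chunk_size):
        indices = range(start, min(start + chunk_size, total_frames))
        yield decode_chunk(indices)

seen = 0
for chunk in stream_frames(10, 4):
    assert len(chunk) <= 4  # memory is bounded by the chunk size
    seen += len(chunk)
```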
Having a `.read` or `.get_frame` function that simply returns a numpy array of shape [1, h, w, 3] would suffice; something like that in theory could work just fine. It would be slightly harder, but with subprocesses, at least in Python + ffmpeg pipes, it can be done, and it is relatively speedy (13700K, 1080p, ffmpeg 7 results are about 440 FPS +- for pure decode).
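Such a `.get_frame`-style API could be sketched as a thin wrapper around any binary stream that yields one fixed-size frame per call. This is a hypothetical sketch, using `io.BytesIO` in place of an actual ffmpeg pipe and a toy 2x2 "resolution"; `PipeFrameReader` is an invented name, not part of any library:

```python
import io

class PipeFrameReader:
    """Hypothetical sketch: read one raw rgb24 frame per call from a byte stream."""

    def __init__(self, stream, width, height):
        self.stream = stream
        self.bytes_per_frame = width * height * 3  # rgb24

    def get_frame(self):
        raw = self.stream.read(self.bytes_per_frame)
        if len(raw) < self.bytes_per_frame:
            return None  # end of stream
        # in practice: np.frombuffer(raw, np.uint8).reshape((1, h, w, 3))
        return raw

# Toy stream holding exactly two 2x2 frames (2 * 2 * 3 = 12 bytes each)
stream = io.BytesIO(bytes(24))
reader = PipeFrameReader(stream, width=2, height=2)

frames = []
while (f := reader.get_frame()) is not None:
    frames.append(f)
```

Swapping `io.BytesIO` for `process.stdout` from the earlier ffmpeg subprocess snippet would give the frame-by-frame interface described above, with memory bounded to one frame.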