`video_reader-rs`

A python module to decode videos based on rust ffmpeg-next, with a focus on ML use cases.

:bulb: Why yet another library based on ffmpeg ?

When training ML models on videos, it is usefull to load small sub-clips of videos. So decoding the entire video is not necessary.

The great decord library seems to be unmaintained, while having a few issues. The main one (for us) is bad memory management, which makes it crash on large videos. Indeed it allocates memory for the whole video when instantiating a VideoReader object. While in fact you might want to only get a few frames from this video.

So we took great inspiration from this library to rewrite the get_batch function using ffmpeg-next rust bindings. We also added the decode function which is usefull for decoding the entire video or for temporally reducing it using a compression_factor. Option to resize the video while decoding is also added.

NOTE: other functionalities of decord are not implemented (yet?).

Benchmark indicates that video_reader-rs is performing equally or better than decord, while using less memory. At least on the intended ML uses cases where video resolution remains reasonable, eg not 4K videos.

:hammer_and_wrench: Installation

Install via pip

pip install video-reader-rs

Should work with python >= 3.8 on recent linux x86_64, macos and windows.

Manual installation

You need to have ffmpeg installed on your system. Install maturin:

pip install maturin

Activate a virtual-env where you want to use the video_reader library and build the library as follows:

maturin develop --release

maturin develop builds the crate and installs it as a python module directly in the current virtualenv. the --release flag ensures the Rust part of the code is compiled in release mode, which enables compiler optimizations.

:warning: If you are using a version of ffmpeg >= 6.0 you need to enable the ffmpeg_6_0 feature:

maturin develop --release --features ffmpeg_6_0

:computer: Usage

Decoding a video is as simple as:

import video_reader as vr
frames = vr.decode(filename, resize, compression_factor, threads, start_frame, end_frame)

filename: path to the video file to decode
resize: optional resizing for the video.
compression_factor: temporal sampling, eg if 0.25, take 25% of the frames, evenly spaced.
threads: number of CPU cores to use for ffmpeg decoding, 0 means auto (let ffmpeg pick the optimal number).
start_frame - Start decoding from this frame index
end_frame - Stop decoding at this frame index

Returns a numpy array of shape (N, H, W, C).

We can do the same thing if we want grayscale frames, and it will retun an array of shape (N, H, W).

frames = vr.decode_gray(filename, resize, compression_factor, threads, start_frame, end_frame)

If we only need a sub-clip of the video we can use the get_batch function:

frames = vr.get_batch(filename, indices, threads=0, resize_shorter_side=None, with_fallback=False)

filename: path to the video file to decode
indices: list of indices of the frames to get
threads: number of CPU cores to use for ffmpeg decoding, currently has no effect as get_batch does not support multithreading. (NOTE: it is still as fast as decord from our benchmarking)
resize_shorter_side: optional resizing for the video.
with_fallback: False by default, if True will fallback to decoding without seeking (ie slower) if suspicious metadata is detected in the video, eg multiple key frames have pts <= 0, first key frames duration <= 0, etc. This might be usefull if your application requires you to be 100% sure you get the exact frames you asked for.

We can also get the shape of the raw video

(n, h, w) = vr.get_shape(filename)

Or get a dict with information about the video, returned as Dict[str, str]

info_dict = vr.get_info(filename)
print(info_dict["fps"])

We can encode the video with h264 codec

vr.save_video(frames, "video.mp4", fps=15, codec="h264")

:rocket: Performance comparison

Decoding a video with shape (2004, 1472, 1472, 3). Tested on a laptop (12 cores Intel i7-9750H CPU @ 2.60GHz), 15Gb of RAM with Ubuntu 22.04.

Options:

f: compression factor
r: resize shorter side
g: grayscale

Options	OpenCV	decord*	video_reader
f 0.5	33.96s	14.6s	26.76s
f 0.25	7.16s	14.03s	6.73s
f 0.25, r 512	6.49s	13.33s	3.92s
f 0.25, g	20.24s	25.67s	14.11s

* decord was tested on a machine with more RAM and CPU cores because it was crashing on the laptop with only 15Gb. See below.

:boom: Crash test

Tested on a laptop with 15Gb of RAM, with ubuntu 22.04 and python 3.10. Run this script:

import video_reader as vr
from time import time

def bench_video_decode(filename, compress_factor, resize):
    start =  time()
    vid = vr.decode(filename, resize_shorter_side=resize, compression_factor=compress_factor, threads=0)
    duration = time() - start
    print(f"Duration {duration:.2f}sec")
    return vid

vid = bench_video_decode("sample.mp4", 0.25)
print("video shape:", vid.shape)

# Terminal output:
# Duration 4.81sec
# video shape: (501, 1472, 1472, 3)

And then run this script:

from decord import VideoReader

vr = VideoReader("sample.mp4")

# Terminal output:
# terminate called after throwing an instance of 'std::bad_alloc'
#  what():  std::bad_alloc
# [1]    9636 IOT instruction (core dumped)

:stars: Credits

decord for showing how to get_batch efficiently.
ffmpeg-next for the Rust bindings to ffmpeg.
video-rs for the nice high level api which makes it easy to encode videos and for the code snippet to convert ffmpeg frames to ndarray ;-)

gcanat / video_reader-rs

readme