dmlc / decord

An efficient video loader for deep learning with smart shuffling that's super easy to digest
Apache License 2.0

VideoReader with gpu ctx is slower than cpu ctx #106

Open SunDoge opened 3 years ago

SunDoge commented 3 years ago

I ran tests/benchmark/bench_decord.py on a TITAN Xp. The GPU version gets stuck at the third benchmark and is much slower than the CPU version. Is this expected?

python tests/bench_decord.py --file data/ActivityNet256/anet_1.3_video_train/0acEl97ZBME.mp4 --width -1 --height -1
2569  frames, elapsed time for sequential read:  0.6075031757354736
300  frames, elapsed time for random access(not accurate):  1.3444850444793701
300  frames, elapsed time for random access(accurate):  45.86395478248596
python tests/bench_decord.py --file data/ActivityNet256/anet_1.3_video_train/0acEl97ZBME.mp4 --width -1 --height -1 --gpu 0
2569  frames, elapsed time for sequential read:  2.0776472091674805
300  frames, elapsed time for random access(not accurate):  6.072170257568359

The video can be downloaded here: https://www.youtube.com/watch?v=0acEl97ZBME. I resized it to 342x256 at 25 fps.

zhreshold commented 3 years ago

This can happen if the GPU isn't warmed up; I didn't include a warm-up in the bench script.
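For reference, here is a minimal warm-up pattern in plain Python (independent of decord): run the function once untimed before measuring, so one-time costs such as CUDA context creation and the first host-to-device copy are not counted against the steady-state number. The `bench` helper below is my own sketch, not part of the repo's bench script.

```python
import time

def bench(fn, repeats=3):
    """Time fn(), discarding the first (warm-up) call so one-time setup
    costs do not skew the average of the timed runs."""
    fn()  # warm-up: not timed
    t0 = time.time()
    for _ in range(repeats):
        fn()
    return (time.time() - t0) / repeats
```

With this pattern, the GPU reader's one-off initialization would no longer dominate a short benchmark, though it also would not help if the steady-state decode itself is slower.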

miraodasilva commented 3 years ago

Hi, I am experiencing something similar, loading a batch of 20 videos using the CPU takes 0.20 seconds and loading them using the GPU takes 0.45 seconds. The GPU is an RTX 3080 btw.

@zhreshold Sorry, I didn't really understand your comment. What would it take to properly "warm up" the GPU? And with this warm-up, what speedup could one expect? Is the GPU expected to be faster on VideoReader sequential reads?

Thanks a lot in advance.

miraodasilva commented 3 years ago

Btw, my GPU was hovering between 10% and 50% utilization, at around 60 degrees Celsius.

miraodasilva commented 3 years ago

Actually, here is the concrete benchmark I just ran with precise timings:

import os
import time

import cv2
from tqdm import tqdm

from decord import VideoReader, cpu, gpu

def get_video_frames_cv2(video_path, frame_index, num_frames):
    cap = cv2.VideoCapture(video_path)
    video = []
    for _ in range(frame_index):
        cap.read()
    for _ in range(num_frames):
        _, frame = cap.read()
        video.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes to BGR; convert to RGB
    cap.release()

    return video

folder = "/home/miraodasilva/datasets/21Uxsk56VDQ"

def test1():
    t0 = time.time()
    for f in tqdm(os.listdir(folder)):
        vr = VideoReader(os.path.join(folder, f), ctx=gpu(0))
        vr.get_batch([i for i in range(100)]).shape
    t = time.time() - t0
    return t

def test2():
    t0 = time.time()
    for f in tqdm(os.listdir(folder)):
        vr = VideoReader(os.path.join(folder, f), ctx=cpu(0))
        vr.get_batch([i for i in range(100)]).shape
    t = time.time() - t0
    return t

def test3():
    t0 = time.time()
    for f in tqdm(os.listdir(folder)):
        get_video_frames_cv2(os.path.join(folder, f), 0, 100)
    t = time.time() - t0
    return t

print(test1())  # 0.6917195320129395 seconds
print(test2())  # 0.21443819999694824 seconds
print(test3())  # 0.18276548385620117 seconds

Looks like OpenCV is faster than decord on CPU, which in turn is much faster than decord on GPU. Am I missing something?

Specs are RTX 3080, Ryzen 5600x. Thanks.

SunDoge commented 3 years ago

Hi @miraodasilva, the first copy from CPU to GPU takes extra time, so the timing of a second run is more accurate. Still, the VideoReader with GPU context being slower than with CPU context is a little strange.

miraodasilva commented 3 years ago

Hi @SunDoge, thanks for the reply. I'm not sure what you mean by "the first time", but if I run these experiments separately the timings are the same. If you mean that getting the same batch of frames a second time from the same reader is faster, that's fine, but this will basically never happen in a deep learning training run.

Don't really know what's going on since the benchmarks on decord seemed very impressive. Currently I'm just left with using opencv for my data loader.

miraodasilva commented 3 years ago

BTW, I have replicated these results on another machine. I have also tried installing decord via pip rather than from source, and the CPU VideoReader is still slower than OpenCV.

miraodasilva commented 3 years ago

@zhreshold Any thoughts on this? I suppose maybe something is wrong with my config but it seems everything is installed correctly. Would appreciate any feedback. Thanks.

zhreshold commented 3 years ago

@miraodasilva

From my test environment (i7 8700K, no GPU) with Kinetics-400 test videos (in the examples folder), I got:

print(test2())  # 0.42748308181762695 seconds
print(test3())  # 0.5103216171264648 seconds

Performance can vary depending on the decoder version (bundled with FFmpeg), the build type (debug vs. release), and the number of cores utilized by decord/OpenCV (in decord you can set the number of threads; in OpenCV I believe there's an environment variable as well).
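For what it's worth, a hedged sketch of that thread tuning: decord's VideoReader accepts a `num_threads` argument, and OpenCV exposes `cv2.setNumThreads()`. The helper below (`decoder_threads` is my own name, not part of either API) just picks a per-reader thread count that avoids oversubscribing the CPU when several dataloader workers decode in parallel:

```python
import os

# Hypothetical helper (not part of decord or OpenCV): choose a decoder
# thread count so that workers x threads does not exceed the core count.
def decoder_threads(num_workers, total_cores=None):
    total_cores = total_cores or os.cpu_count() or 1
    return max(1, total_cores // max(1, num_workers))

# Example: 4 dataloader workers on a 12-core machine -> 3 threads each.
# vr = VideoReader(path, ctx=cpu(0), num_threads=decoder_threads(4, 12))
# cv2.setNumThreads(decoder_threads(4, 12))  # same idea for OpenCV
```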