SunDoge opened this issue 3 years ago
This could happen if gpu isn't warmed up, I didn't include that in the bench script.
Hi, I am experiencing something similar, loading a batch of 20 videos using the CPU takes 0.20 seconds and loading them using the GPU takes 0.45 seconds. The GPU is an RTX 3080 btw.
@zhreshold Sorry I didn't really understand your comment, what would it take to properly "warm up" the GPU. And with this warm up, what speedup could one expect? Is it expected that the GPU will be faster on VideoReader sequential reads?
Thanks a lot in advance.
Btw, my GPU was hovering between 10% and 50% utilization, around 60 degrees celsius.
Actually, here is the concrete benchmark I just ran with precise timings:
import os
import time

import cv2
from tqdm import tqdm
from decord import VideoReader, cpu, gpu


def get_video_frames_cv2(video_path, frame_index, num_frames):
    cap = cv2.VideoCapture(video_path)
    video = []
    # Seek to the starting frame by reading and discarding frames.
    for _ in range(frame_index):
        cap.read()
    for _ in range(num_frames):
        _, frame = cap.read()
        # OpenCV decodes to BGR; convert to RGB to match decord's output
        # (the original used the mislabeled COLOR_RGB2BGR constant).
        video += [cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)]
    cap.release()
    return video
folder = "/home/miraodasilva/datasets/21Uxsk56VDQ"
def test1():
    t0 = time.time()
    for f in tqdm(os.listdir(folder)):
        vr = VideoReader(os.path.join(folder, f), ctx=gpu(0))
        vr.get_batch(list(range(100))).shape
    return time.time() - t0


def test2():
    t0 = time.time()
    for f in tqdm(os.listdir(folder)):
        vr = VideoReader(os.path.join(folder, f), ctx=cpu(0))
        vr.get_batch(list(range(100))).shape
    return time.time() - t0


def test3():
    t0 = time.time()
    for f in tqdm(os.listdir(folder)):
        get_video_frames_cv2(os.path.join(folder, f), 0, 100)
    return time.time() - t0
print(test1()) # 0.6917195320129395 seconds
print(test2()) # 0.21443819999694824 seconds
print(test3()) # 0.18276548385620117 seconds
Looks like OpenCV is slightly faster than decord on CPU, and both are much faster than decord on GPU. Am I missing something?
Specs are RTX 3080, Ryzen 5600x. Thanks.
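Single end-to-end timings like these can also be noisy. A repeat-and-take-the-minimum harness gives a fairer comparison, since the first run often absorbs one-time costs (GPU context creation, cold file cache). This is a standard-library sketch; `bench` and the stand-in workload are my own names, not part of decord:

```python
import timeit

def bench(fn, repeat=5, number=1):
    """Run fn `repeat` times and keep the best wall-clock time.
    The first run often pays one-time costs, so the minimum is a
    fairer basis for comparing implementations."""
    return min(timeit.repeat(fn, repeat=repeat, number=number))

# Cheap stand-in workload; substitute test1/test2/test3 from above.
best = bench(lambda: sum(range(100_000)))
```

Reporting the minimum (rather than the mean) is the convention `timeit` itself recommends, because system noise only ever makes a run slower.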
Hi @miraodasilva, the first run pays for copying memory from CPU to GPU, so the second run's time is more accurate. That said, the VideoReader with a GPU context being slower than the CPU context is a little strange.
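A toy illustration of that first-run effect, in pure Python: `FakeGpuReader` is a made-up stand-in for a GPU-backed reader, and the 50 ms sleep is an invented one-time "context setup" cost, not a measured decord number.

```python
import time

class FakeGpuReader:
    """Stand-in for a GPU-backed reader: the first get_batch() pays a
    one-time setup cost, later calls do not."""
    _ctx_ready = False

    def get_batch(self, indices):
        if not FakeGpuReader._ctx_ready:
            time.sleep(0.05)  # simulated context / memory setup cost
            FakeGpuReader._ctx_ready = True
        return list(indices)  # pretend-decoded frames

reader = FakeGpuReader()
t0 = time.time(); reader.get_batch(range(100)); cold = time.time() - t0
t0 = time.time(); reader.get_batch(range(100)); warm = time.time() - t0
# cold includes the one-time setup; warm does not, so cold > warm.
```

If the benchmark above pays a similar setup cost once per `VideoReader`, that may account for part of the GPU gap, though it would not explain a steady-state slowdown.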
Hi @SunDoge, thanks for the reply. I'm not sure what you mean by "the first time" — if I run these experiments separately, the timings are the same. If you mean that getting the batch of frames a second time from the same reader is faster, fine, but that will basically never happen in a deep learning training run.
Don't really know what's going on since the benchmarks on decord seemed very impressive. Currently I'm just left with using opencv for my data loader.
BTW, I have replicated these results on another machine. I have also tried installing decord via pip rather than from source, and the CPU VideoReader is still slower than OpenCV.
@zhreshold Any thoughts on this? I suppose maybe something is wrong with my config but it seems everything is installed correctly. Would appreciate any feedback. Thanks.
@miraodasilva
From my test environment (i7 8700k, no GPU) with the kinetics400 test videos (in the examples folder), I got:
print(test2()) # 0.42748308181762695 seconds
print(test3()) # 0.5103216171264648 seconds
The performance can vary depending on the decoder version (included in ffmpeg), the build type (debug vs. release), and the number of cores utilized by decord/opencv (in decord you can set the number of threads; in opencv I believe there's an environment variable as well).
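For reference, a sketch of the thread knobs being referred to. The parameter and function names (`num_threads`, `cv2.setNumThreads`) are as I understand the decord and OpenCV APIs — verify against your installed versions; the halving policy for `n` is purely illustrative.

```python
import os

# Pick a thread count; halving cpu_count leaves headroom for the rest
# of a data-loading pipeline (illustrative policy, not a decord rule).
n = max(1, (os.cpu_count() or 1) // 2)

# decord: pass num_threads to the reader (0 lets decord decide), e.g.
#   vr = VideoReader(path, ctx=cpu(0), num_threads=n)
# OpenCV: cv2.setNumThreads(n) caps OpenCV's internal thread pool.
```

Pinning both libraries to the same thread count makes a CPU-vs-CPU benchmark comparison more meaningful.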
I ran tests/benchmark/bench_decord.py on a TITAN Xp. The GPU version gets stuck at the third benchmark and is much slower than the CPU version. Is this expected?
The video can be downloaded here: https://www.youtube.com/watch?v=0acEl97ZBME. I resized it to 342x256, 25 fps.