Open TANGnlp0711 opened 1 year ago
Hello, the bottleneck is not `write_video`; it is the for-loop and the `iter_frames` function.
Yeah, look into `ffmpeg_writer.py`: in the `write_frame` function, `img_array.tobytes()` takes about 90% of the total running time of `write_videofile`. This is the bottleneck.
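One way to check a claim like this on your own machine is to time `tobytes()` in isolation. The sketch below is a minimal measurement, assuming a 1080p RGB `uint8` frame; `time_tobytes` is a hypothetical helper name, not part of MoviePy, and the figure it prints will vary by machine.

```python
import time

import numpy as np


def time_tobytes(height=1080, width=1920, repeats=20):
    """Return (average seconds per tobytes() call, payload size in bytes)."""
    # Assumption: frames are HxWx3 uint8 arrays, as in a typical RGB clip.
    frame = np.zeros((height, width, 3), dtype=np.uint8)
    start = time.perf_counter()
    for _ in range(repeats):
        payload = frame.tobytes()  # copies the whole frame into a bytes object
    elapsed = time.perf_counter() - start
    return elapsed / repeats, len(payload)


avg_seconds, n_bytes = time_tobytes()
print(f"tobytes: {avg_seconds * 1e3:.3f} ms per frame, {n_bytes} bytes")
```

Comparing this number with the average time per frame of the whole export gives a rough idea of how much of the run is spent serializing frames rather than generating them.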
You can use torch to accelerate this: in the file `moviepy/video/tools/drawing.py`, change `blit` to `blit_gpu`, as follows:
    import numpy as np
    import torch


    def blit_gpu(im1, im2, pos=None, mask=None, ismask=False):
        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

        if pos is None:
            pos = [0, 0]

        # Clip the region of im1 that actually lands inside im2.
        xp, yp = pos
        x1 = max(0, -xp)
        y1 = max(0, -yp)
        h1, w1 = im1.shape[:2]
        h2, w2 = im2.shape[:2]
        xp2 = min(w2, xp + w1)
        yp2 = min(h2, yp + h1)
        x2 = min(w1, w2 - xp)
        y2 = min(h1, h2 - yp)
        xp1 = max(0, xp)
        yp1 = max(0, yp)

        if (xp1 >= xp2) or (yp1 >= yp2):
            return im2

        if not isinstance(im1, torch.Tensor):  # 5.43 ms per loop / 100 loops
            im1 = torch.tensor(im1, device=device)
        if not isinstance(im2, torch.Tensor):
            im2 = torch.tensor(im2, device=device)

        blitted = im1[y1:y2, x1:x2]
        new_im2 = im2.clone()

        if mask is None:
            new_im2[yp1:yp2, xp1:xp2] = blitted
        else:
            if not isinstance(mask, torch.Tensor):  # 2.71 ms per loop / 10 loops
                mask = torch.tensor(mask[y1:y2, x1:x2], device=device)  # 1.45 ms / 100 loops
            else:
                mask = mask[y1:y2, x1:x2]
            if len(im1.shape) == 3:
                mask = mask.unsqueeze(-1).repeat(1, 1, 3)
            blit_region = new_im2[yp1:yp2, xp1:xp2]
            new_im2[yp1:yp2, xp1:xp2] = mask * blitted + (1 - mask) * blit_region

        # return new_im2.cpu().numpy().astype("uint8") if not ismask else new_im2.cpu().numpy()  # 6.13 ms / 100 loops
        return new_im2 if not ismask else new_im2
Then modify line 565 of `moviepy/video/VideoClip.py` to `return blit_gpu(img, picture, pos, mask=mask, ismask=self.ismask)`.

This speeds things up a lot, provided that you have a GPU.
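For anyone who wants to sanity-check the clipping and blending arithmetic without a GPU, here is a CPU-only sketch of the same logic in plain NumPy. It is an illustration under assumptions, not MoviePy's own `blit`: the name `blit_numpy` is hypothetical, masks are assumed to be float arrays in [0, 1], and the result is cast back to the background's dtype.

```python
import numpy as np


def blit_numpy(im1, im2, pos=(0, 0), mask=None):
    """Paste im1 onto a copy of im2 at pos=(x, y), optionally alpha-blended."""
    xp, yp = pos
    h1, w1 = im1.shape[:2]
    h2, w2 = im2.shape[:2]

    # Source window inside im1 (x1:x2, y1:y2) and target window inside im2.
    x1, y1 = max(0, -xp), max(0, -yp)
    x2, y2 = min(w1, w2 - xp), min(h1, h2 - yp)
    xp1, yp1 = max(0, xp), max(0, yp)
    xp2, yp2 = min(w2, xp + w1), min(h2, yp + h1)

    if xp1 >= xp2 or yp1 >= yp2:
        return im2  # im1 lies entirely outside im2

    out = im2.astype(float)
    patch = im1[y1:y2, x1:x2].astype(float)
    if mask is None:
        out[yp1:yp2, xp1:xp2] = patch
    else:
        m = mask[y1:y2, x1:x2]
        if im1.ndim == 3:
            m = m[..., None]  # broadcast the mask over the colour channels
        region = out[yp1:yp2, xp1:xp2]
        out[yp1:yp2, xp1:xp2] = m * patch + (1 - m) * region
    return out.astype(im2.dtype)


background = np.zeros((4, 4, 3), dtype=np.uint8)
overlay = np.full((2, 2, 3), 200, dtype=np.uint8)
half = np.full((2, 2), 0.5)
print(blit_numpy(overlay, background, pos=(3, 3), mask=half)[3, 3])
```

Running the GPU version and this reference on the same inputs and comparing the results is a cheap way to confirm the port did not change the output.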
Please always include your specs like we ask for in our issue templates – MoviePy version, platform used etc.
Code samples and logs should be code-formatted for better readability.
> This speeds things up a lot, provided that you have a GPU.

This is giving me the following error with my RTX 3070:

      File "C:\Python311\Lib\site-packages\moviepy\Clip.py", line 474, in iter_frames
        frame = frame.astype(dtype)
                ^^^^^^^^^^^^
    AttributeError: 'Tensor' object has no attribute 'astype'. Did you mean: 'dtype'?

I really do appreciate the thought being put into this though. Being able to utilize a GPU to help mitigate this bottleneck would be massive.
You can use this return statement instead:

    return new_im2.cpu().numpy().astype("uint8") if not ismask else new_im2.cpu().numpy()  # 6.13 ms / 100 loops

It works for me! 3 times faster.
@JasonChoate As for this error, just modify the `iter_frames` function in `Clip.py` as follows:

    if (dtype is not None) and (frame.dtype != dtype):
        # frame = frame.astype(dtype)
        frame = frame.cpu().numpy().astype(dtype)
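If some frames may still arrive as NumPy arrays (for example clips that never go through the GPU blit), calling `.cpu()` unconditionally would fail on them. A small duck-typed helper can handle both cases. This is only a sketch: `to_numpy_frame` is a hypothetical name, and it assumes tensor-like objects expose `detach()`, `cpu()` and `numpy()` the way `torch.Tensor` does.

```python
import numpy as np


def to_numpy_frame(frame, dtype=None):
    """Return frame as a NumPy array, copying it off the GPU if needed."""
    # Assumption: anything with a .cpu attribute behaves like torch.Tensor.
    if hasattr(frame, "cpu"):
        frame = frame.detach().cpu().numpy()
    frame = np.asarray(frame)
    if dtype is not None and frame.dtype != dtype:
        frame = frame.astype(dtype)
    return frame


# A plain float array is converted directly, with no tensor involved.
print(to_numpy_frame(np.ones((2, 2, 3)), dtype="uint8").dtype)
```

Because it does not import torch, the helper also works in environments where torch is not installed.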
@sixyang Is this fully utilizing the NVENC of 40-series GPUs?
In my case, I used `VizTracer` for measurements and found that `iter_frames` averages 500 ms per frame, while `write_frame` averages 2 ms per frame.
    from viztracer import VizTracer

    with VizTracer(ignore_frozen=True, ignore_c_function=True) as _:
        final_clip.write_videofile(
            f"{fn}.mp4",
            # threads=16,  # ffmpeg is not the bottleneck
            codec='h264_nvenc',  # 2 ms per frame, not the bottleneck
            write_logfile=f"{fn}.log",
        )
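If installing a tracer is not an option, a rough version of the same measurement can be made with the standard library by timing how long each frame takes to come out of the generator. The sketch below assumes nothing about MoviePy itself; `timed_iter` and `fake_frames` are hypothetical names, and the fake generator stands in for something like `clip.iter_frames()`.

```python
import time


def timed_iter(iterable, stats):
    """Yield items from iterable, adding the time spent producing them to stats."""
    it = iter(iterable)
    while True:
        start = time.perf_counter()
        try:
            item = next(it)
        except StopIteration:
            return
        stats["seconds"] += time.perf_counter() - start
        stats["frames"] += 1
        yield item


def fake_frames(n):
    # Stand-in for a real frame generator; each "frame" takes about 1 ms to make.
    for i in range(n):
        time.sleep(0.001)
        yield i


stats = {"seconds": 0.0, "frames": 0}
for frame in timed_iter(fake_frames(5), stats):
    pass  # the writer call would go here
print(f"{stats['seconds'] / stats['frames'] * 1e3:.2f} ms per frame to generate")
```

Only the time spent inside `next()` is counted, so whatever the loop body does (such as writing the frame) is excluded from the generator's total.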
With the `blit_gpu` change above, the number of iterations processed per second has tripled, and I am now maintaining over 90% GPU usage with faster results. Thank you.
Hello, when I tried your method on my 3080 Ti, I got the following error:

    TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

How can I solve it?
@tburrows13 @mgaitan

I rewrote the file `ffmpeg_writer` to add `-hwaccel nvdec` (line 97):

    cmd = [
        FFMPEG_BINARY,
        "-hwaccel", "nvdec",
        "-y",
        "-loglevel", "error" if logfile == sp.PIPE else "info",
        "-f", "rawvideo",
        "-vcodec", "rawvideo",
        "-s", "%dx%d" % (size[0], size[1]),
        "-pix_fmt", pix_fmt,
        "-r", "%.02f" % fps,
        "-an",
        "-i", "-",
    ]
    if audiofile is not None:
        cmd.extend(["-i", audiofile, "-acodec", "copy"])
    cmd.extend(["-vcodec", codec, "-preset", preset])
    if ffmpeg_params is not None:
        cmd.extend(ffmpeg_params)
    if bitrate is not None:
        cmd.extend(["-b", bitrate])

The GPU memory is being occupied, but the GPU utilization is almost negligible. As a result, the time taken to write the video does not show any significant improvement.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
    | N/A   37C    P0    34W /  70W |    216MiB / 15360MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A   3427015      C   /usr/local/bin/ffmpeg             211MiB |
    +-----------------------------------------------------------------------------+
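A possible explanation for the idle GPU: `-hwaccel` is an ffmpeg input option that selects hardware-accelerated decoding, and the input here is raw video piped over stdin, so there is nothing to decode. The encoder is chosen by the output `-vcodec` option instead. The sketch below builds a comparable command that requests the NVENC H.264 encoder; `build_nvenc_cmd` is a hypothetical helper, and it assumes an ffmpeg build with `h264_nvenc` enabled and RGB frames in `rgb24`.

```python
def build_nvenc_cmd(size, fps, out_path, ffmpeg_binary="ffmpeg", pix_fmt="rgb24"):
    """Build an ffmpeg command that encodes raw frames from stdin with NVENC."""
    width, height = size
    return [
        ffmpeg_binary,
        "-y",
        "-loglevel", "error",
        # Input side: raw frames piped through stdin, so no decoder is involved.
        "-f", "rawvideo",
        "-vcodec", "rawvideo",
        "-s", f"{width}x{height}",
        "-pix_fmt", pix_fmt,
        "-r", f"{fps:.2f}",
        "-an",
        "-i", "-",
        # Output side: this is where the GPU encoder is selected.
        "-vcodec", "h264_nvenc",
        out_path,
    ]


print(" ".join(build_nvenc_cmd((1920, 1080), 24, "out.mp4")))
```

In MoviePy the same effect is normally reached by passing `codec='h264_nvenc'` to `write_videofile`, as in the VizTracer snippet earlier in this thread. Note that the measurements reported there suggest encoding is only a small share of the total time, so a faster encoder alone may not shorten the export much.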