Zulko / moviepy

Video editing with Python
https://zulko.github.io/moviepy/
MIT License
12.42k stars 1.55k forks

how can I use GPU in write_videofile #2011

Open TANGnlp0711 opened 1 year ago

TANGnlp0711 commented 1 year ago

@tburrows13 @mgaitan

I modified ffmpeg_writer.py to add -hwaccel nvdec to the command (around line 97):

cmd = [
    FFMPEG_BINARY,
    "-hwaccel", "nvdec",
    "-y",
    "-loglevel", "error" if logfile == sp.PIPE else "info",
    "-f", "rawvideo",
    "-vcodec", "rawvideo",
    "-s", "%dx%d" % (size[0], size[1]),
    "-pix_fmt", pix_fmt,
    "-r", "%.02f" % fps,
    "-an",
    "-i", "-",
]
if audiofile is not None:
    cmd.extend(["-i", audiofile, "-acodec", "copy"])
cmd.extend(["-vcodec", codec, "-preset", preset])
if ffmpeg_params is not None:
    cmd.extend(ffmpeg_params)
if bitrate is not None:
    cmd.extend(["-b", bitrate])

I then write the video with:

video_clips.write_videofile(file_name, temp_audiofile=file_name.replace(VIDEO_EXT_NAME, '.mp3'),
                            fps=24, codec='h264_nvenc')

GPU memory is being occupied, but GPU utilization is almost zero. As a result, writing the video is not noticeably faster.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   37C    P0    34W /  70W |    216MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3427015      C   /usr/local/bin/ffmpeg             211MiB |
+-----------------------------------------------------------------------------+
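
A note on the -hwaccel change in this post: as far as the ffmpeg documentation describes it, -hwaccel is an input option that selects hardware-accelerated decoding. Here the input is raw frames piped over stdin, so there is nothing for NVDEC to decode, and only the encoder (h264_nvenc) can run on the GPU. The sketch below builds an equivalent command without -hwaccel. The function name build_rawvideo_nvenc_cmd and its parameters are hypothetical and not part of MoviePy; it only constructs the argument list and does not run ffmpeg.

```python
def build_rawvideo_nvenc_cmd(width, height, fps, out_path, ffmpeg_binary="ffmpeg"):
    """Build an ffmpeg argument list that reads raw RGB frames from stdin
    and encodes them with NVENC. Hypothetical helper, not a MoviePy API."""
    return [
        ffmpeg_binary,
        "-y",                        # overwrite the output file if it exists
        "-f", "rawvideo",            # input: raw frames, no container
        "-vcodec", "rawvideo",
        "-s", f"{width}x{height}",   # frame size of the piped input
        "-pix_fmt", "rgb24",         # assumption: frames are 8-bit RGB
        "-r", f"{fps:.2f}",          # input frame rate
        "-an",                       # no audio in the piped input
        "-i", "-",                   # read frames from stdin
        "-vcodec", "h264_nvenc",     # output: NVIDIA hardware H.264 encoder
        "-pix_fmt", "yuv420p",       # widely supported output pixel format
        out_path,
    ]


if __name__ == "__main__":
    print(" ".join(build_rawvideo_nvenc_cmd(1280, 720, 24, "out.mp4")))
```

Since the writer code shown in this post already passes codec through as -vcodec, calling write_videofile with codec='h264_nvenc' should select the NVENC encoder without patching ffmpeg_writer.py.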

sixyang commented 1 year ago

Hello, the bottleneck is not write_video; it is the for-loop and the iter_frames function.

antsmallant commented 10 months ago

> Hello, the bottleneck is not write_video; it is the for-loop and the iter_frames function.

Yeah, look at ffmpeg_writer.py: in the write_frame function, img_array.tobytes() takes about 90% of the total running time of write_videofile. This is the bottleneck.
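
The share of time spent in tobytes() will depend on the frame size and on how expensive the clip's frames are to compute, so it is worth measuring on your own setup. The following is a minimal sketch that times tobytes() in isolation on a blank frame; the function name time_tobytes and the default 1920x1080 size are assumptions chosen for illustration, not values taken from this thread.

```python
import time

import numpy as np


def time_tobytes(width=1920, height=1080, repeats=20):
    """Return (bytes per frame, mean seconds per tobytes() call)
    for a blank uint8 RGB frame of the given size."""
    frame = np.zeros((height, width, 3), dtype=np.uint8)
    start = time.perf_counter()
    for _ in range(repeats):
        data = frame.tobytes()  # copies the array into a bytes object
    elapsed = time.perf_counter() - start
    return len(data), elapsed / repeats


if __name__ == "__main__":
    nbytes, seconds = time_tobytes()
    print(f"{nbytes} bytes per frame, {seconds * 1e3:.3f} ms per tobytes() call")
```

Comparing this number with the total time per frame of write_videofile gives a rough idea of whether the copy or the frame computation dominates.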

sixyang commented 8 months ago

> Hello, the bottleneck is not write_video; it is the for-loop and the iter_frames function.
>
> Yeah, look at ffmpeg_writer.py: in the write_frame function, img_array.tobytes() takes about 90% of the total running time of write_videofile. This is the bottleneck.

You can use torch to accelerate this: in moviepy/video/tools/drawing.py, replace blit with blit_gpu, as follows:

import numpy as np
import torch

def blit_gpu(im1, im2, pos=None, mask=None, ismask=False):
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    if pos is None:
        pos = [0, 0]

    xp, yp = pos
    x1 = max(0, -xp)
    y1 = max(0, -yp)
    h1, w1 = im1.shape[:2]
    h2, w2 = im2.shape[:2]
    xp2 = min(w2, xp + w1)
    yp2 = min(h2, yp + h1)
    x2 = min(w1, w2 - xp)
    y2 = min(h1, h2 - yp)
    xp1 = max(0, xp)
    yp1 = max(0, yp)

    if (xp1 >= xp2) or (yp1 >= yp2):
        return im2

    if not isinstance(im1, torch.Tensor):               # 5.43 ms per loop / 100 loops
        im1 = torch.tensor(im1, device=device)
    if not isinstance(im2, torch.Tensor):
        im2 = torch.tensor(im2, device=device)

    blitted = im1[y1:y2, x1:x2]

    new_im2 = im2.clone()

    if mask is None:
        new_im2[yp1:yp2, xp1:xp2] = blitted
    else:
        if not isinstance(mask, torch.Tensor):          # 2.71 ms per loop / 10 loops
            mask = torch.tensor(mask[y1:y2, x1:x2], device=device)  # 1.45 ms / 100 loops
        else:
            mask = mask[y1:y2, x1:x2]
        if len(im1.shape) == 3:
            mask = mask.unsqueeze(-1).repeat(1, 1, 3)
        blit_region = new_im2[yp1:yp2, xp1:xp2]
        new_im2[yp1:yp2, xp1:xp2] = mask * blitted + (1 - mask) * blit_region

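    # Note: this returns a torch.Tensor, not a NumPy array, so downstream
    # MoviePy code that expects NumPy needs a conversion.
    # To return a NumPy array directly instead (6.13 ms / 100 loops):
    # return new_im2.cpu().numpy().astype("uint8") if not ismask else new_im2.cpu().numpy()
    return new_im2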

Then modify line 565 of moviepy/video/VideoClip.py to return blit_gpu(img, picture, pos, mask=mask, ismask=self.ismask). This works well, provided that you have a GPU.
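
The coordinate arithmetic at the top of blit_gpu is easier to follow in isolation. The sketch below extracts it into a pure function that returns the source and destination slices, or None when the two images do not overlap. The name blit_slices and the (width, height) size tuples are hypothetical, introduced only for this illustration.

```python
def blit_slices(pos, size1, size2):
    """Compute which part of image 1 lands where in image 2.

    pos is the (x, y) position of image 1's top-left corner in image 2;
    size1 and size2 are (width, height) tuples. Returns a pair
    ((src_rows, src_cols), (dst_rows, dst_cols)) of slices, or None if
    image 1 lies completely outside image 2.
    """
    xp, yp = pos
    w1, h1 = size1
    w2, h2 = size2

    # Region of image 1 that is visible inside image 2.
    x1, y1 = max(0, -xp), max(0, -yp)
    x2, y2 = min(w1, w2 - xp), min(h1, h2 - yp)

    # Matching region of image 2 that gets overwritten.
    xp1, yp1 = max(0, xp), max(0, yp)
    xp2, yp2 = min(w2, xp + w1), min(h2, yp + h1)

    if xp1 >= xp2 or yp1 >= yp2:
        return None  # no overlap, nothing to blit

    return (slice(y1, y2), slice(x1, x2)), (slice(yp1, yp2), slice(xp1, xp2))


if __name__ == "__main__":
    # A 4x4 image placed one pixel left of, and near the bottom of, a 10x8 image.
    print(blit_slices((-1, 6), (4, 4), (10, 8)))
```

In the example, only a 3-column by 2-row part of the small image overlaps the large one, and the source and destination regions have the same shape, which is what the slice assignment in blit_gpu relies on.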

keikoro commented 7 months ago

Please always include your specs like we ask for in our issue templates – MoviePy version, platform used etc.

Code samples and logs should be code-formatted for better readability.

JasonChoate commented 7 months ago

> This works well, provided that you have a GPU.

This is giving me the following error with my RTX 3070:

File "C:\Python311\Lib\site-packages\moviepy\Clip.py", line 474, in iter_frames
    frame = frame.astype(dtype)
            ^^^^^^^^^^^^
AttributeError: 'Tensor' object has no attribute 'astype'. Did you mean: 'dtype'?

I really do appreciate the thought being put into this, though. Being able to utilize a GPU to help mitigate this bottleneck would be massive.

zhangdanq commented 6 months ago

> This works well, provided that you have a GPU.
>
> This is giving me the following error with my RTX 3070:
>
> File "C:\Python311\Lib\site-packages\moviepy\Clip.py", line 474, in iter_frames
>     frame = frame.astype(dtype)
>             ^^^^^^^^^^^^
> AttributeError: 'Tensor' object has no attribute 'astype'. Did you mean: 'dtype'?
>
> I really do appreciate the thought being put into this, though. Being able to utilize a GPU to help mitigate this bottleneck would be massive.

You can use the return statement that is commented out at the end of blit_gpu instead, so that the function returns a NumPy array:

return new_im2.cpu().numpy().astype("uint8") if not ismask else new_im2.cpu().numpy() # 6.13 ms / 100 loops
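
A slightly more defensive variant of the same idea is to convert at one place and accept either type. This is a sketch only: to_numpy_frame is a hypothetical helper, not part of MoviePy or torch, and it detects tensors by duck typing (the presence of a cpu method) so that it also works when torch is not installed.

```python
import numpy as np


def to_numpy_frame(frame, dtype="uint8"):
    """Return frame as a NumPy array of the requested dtype.

    Accepts NumPy arrays as well as torch tensors on any device.
    Hypothetical helper for illustration.
    """
    # torch tensors have cpu(); NumPy arrays do not. A CUDA tensor must be
    # copied to host memory with cpu() before numpy() can be called on it.
    if hasattr(frame, "cpu"):
        frame = frame.detach().cpu().numpy()
    frame = np.asarray(frame)
    if dtype is not None and frame.dtype != dtype:
        frame = frame.astype(dtype)
    return frame


if __name__ == "__main__":
    print(to_numpy_frame(np.ones((2, 2, 3), dtype=float)).dtype)
```

Copying every frame back to the host has a cost of its own (the comment in the code quotes 6.13 ms per 100 loops for the author's setup), so the net speed-up depends on how many blits happen per frame.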

icynare commented 5 months ago

It works for me! 3 times faster.

> File "C:\Python311\Lib\site-packages\moviepy\Clip.py", line 474, in iter_frames
>     frame = frame.astype(dtype)
>             ^^^^^^^^^^^^
> AttributeError: 'Tensor' object has no attribute 'astype'. Did you mean: 'dtype'?

@JasonChoate As for this error, just modify the iter_frames function in Clip.py as follows:

if (dtype is not None) and (frame.dtype != dtype):
    # frame = frame.astype(dtype)
    if hasattr(frame, "cpu"):  # torch tensor returned by blit_gpu
        frame = frame.cpu().numpy()
    frame = frame.astype(dtype)

maxin9966 commented 5 months ago

@sixyang Is this fully utilizing the NVENC of 40-series GPUs?

notmmao commented 4 months ago

> Hello, the bottleneck is not write_video; it is the for-loop and the iter_frames function.

In my case, I used VizTracer for measurements and found that iter_frames averages 500 ms per frame, while write_frame averages 2 ms per frame.

from viztracer import VizTracer

with VizTracer(ignore_frozen=True, ignore_c_function=True) as _:
    final_clip.write_videofile(
        f"{fn}.mp4",
        # threads=16,        # ffmpeg is not the bottleneck
        codec='h264_nvenc',  # 2 ms per frame, not the bottleneck
        write_logfile=f"{fn}.log",
    )
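
If VizTracer is not available, the same comparison can be approximated with the standard library. The sketch below times frame generation and frame writing separately; time_stages, make_frame and write_frame are hypothetical stand-ins that you would replace with your clip's frame function and your writer, and the lambdas in the example are dummies rather than real MoviePy calls.

```python
import time


def time_stages(make_frame, write_frame, n_frames):
    """Return (mean seconds to generate a frame, mean seconds to write it)."""
    gen_total = 0.0
    write_total = 0.0
    for i in range(n_frames):
        t0 = time.perf_counter()
        frame = make_frame(i)      # stand-in for computing one frame
        t1 = time.perf_counter()
        write_frame(frame)         # stand-in for sending it to the encoder
        t2 = time.perf_counter()
        gen_total += t1 - t0
        write_total += t2 - t1
    return gen_total / n_frames, write_total / n_frames


if __name__ == "__main__":
    sink = []
    gen, wr = time_stages(lambda i: bytes(1920 * 1080 * 3), sink.append, 10)
    print(f"generate: {gen * 1e3:.3f} ms/frame, write: {wr * 1e3:.3f} ms/frame")
```

If generation dominates, as in the measurement reported here, a faster encoder such as h264_nvenc cannot shorten the total time by much.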

[Images: VizTracer traces for write_frame and iter_frame]

alimusdu commented 1 month ago

The number of iterations processed per second has tripled, and now I am maintaining over 90% GPU usage with faster results. Thank you.

FURYFOR commented 1 month ago

Hello, when I tried your method on my 3080 Ti, I got the following error:

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

How can I solve it?