BabitMF / bmf

Cross-platform, customizable multimedia/video processing framework. With strong GPU acceleration, heterogeneous design, multi-language support, easy to use, multi-framework compatible and high performance, the framework is ideal for transcoding, AI inference, algorithm integration, live video streaming, and more.
https://babitmf.github.io/
Apache License 2.0
773 stars · 65 forks

Yuvalloctor #25

Closed · HuHeng closed this issue 11 months ago

HuHeng commented 1 year ago

An hmp::Frame containing multiple planes now uses a single contiguous internal buffer to store all plane tensors. This buffer is managed through a storage_tensor, so an h2d/d2h on storage_tensor transfers the whole hmp::Frame in one operation. Compatibility with the hmp::Frame interface is preserved; only parts of the interface implementation are modified.
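The layout described above can be sketched in NumPy (shapes and names are illustrative, not the actual hmp API): one contiguous storage buffer, with each plane a zero-copy view into it, so a single copy of the storage moves every plane at once.

```python
import numpy as np

# Hypothetical yuv420p layout for a 1920x1080 frame: one contiguous
# storage buffer (the analogue of storage_tensor) with each plane a
# zero-copy view into it.
w, h = 1920, 1080
shapes = [(h, w), (h // 2, w // 2), (h // 2, w // 2)]      # Y, U, V
sizes = [ph * pw for ph, pw in shapes]
storage = np.empty(sum(sizes), dtype=np.uint8)             # the storage "tensor"

planes, off = [], 0
for size, shape in zip(sizes, shapes):
    planes.append(storage[off:off + size].reshape(shape))  # view, no copy
    off += size

# One copy of `storage` moves every plane at once -- the analogue of a
# single h2d/d2h on storage_tensor instead of one transfer per plane.
snapshot = storage.copy()
```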

Jie-Fang commented 1 year ago

I used Frame.cpu() from cpu_gpu_trans_module.py (https://github.com/BabitMF/bmf/blob/master/bmf/python_modules/cpu_gpu_trans_module/cpu_gpu_trans_module.py#L40) and still see multiple copies:

(screenshot: profiler timeline of the cpu() call)

In the timeline, the gray range marks the execution of cpu() (annotated with NVTX), and the red marks two separate D2H copies.

in_frame = in_pkt.get(bmf.VideoFrame)
torch.cuda.nvtx.range_push("cpu before")
video_frame = in_frame.cpu()
torch.cuda.nvtx.range_pop()

Test script:

def test_gpu_decode():
    input_video_path = "img.mp4"
    output_path = "./gpu_decode_result.yuv"

    (
        bmf.graph()
            .decode({"input_path": input_video_path,
                     "video_params": {
                         "hwaccel": "cuda",
                     }})["video"]
            .module("cpu_gpu_trans_module", {"to_gpu": 0})
            .encode(
                None,
                {
                    "video_params": {
                        "codec": "rawvideo",
                    },
                    "format": "rawvideo",
                    "output_path": output_path
                }
            ).run()
    )
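The two D2H ranges in the profile are consistent with one transfer per plane (a CUDA-decoded frame is typically NV12, i.e. two planes). A toy sketch of the difference between the per-plane path and a unified path over one contiguous buffer, using a mock transfer function (nothing here is the real BMF API):

```python
transfers = []

def d2h(nbytes):
    """Mock device-to-host copy: just record that a transfer happened."""
    transfers.append(nbytes)

w, h = 1920, 1080
plane_sizes = [w * h, w * (h // 2)]   # NV12: Y plane + interleaved UV plane

# Per-plane path: one D2H call per plane (what the profile shows).
transfers.clear()
for size in plane_sizes:
    d2h(size)
per_plane = len(transfers)

# Unified path: a single D2H of the whole contiguous storage buffer.
transfers.clear()
d2h(sum(plane_sizes))
unified = len(transfers)
```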
HuHeng commented 1 year ago

This case may not be supported yet; on some of the AVFrame ↔ hmp::Frame conversion paths the planes are still copied separately. Let me look into this scenario.

HuHeng commented 1 year ago

Judging from the implementation of AVFrame's av_frame_get_buffer, memory for a multi-plane frame is allocated as one large contiguous block; in that case only the first entry of the AVFrame's buf array is set and the rest are NULL. However, for an AVFrame coming out of a decoder or a filter it is hard to tell whether it uses one contiguous block. For example, debugging the following with gdb: ffmpeg -i in.mp4 -c:v libx264 out.mp4

Thread 1 "ffmpeg_g" hit Breakpoint 1, do_video_out (of=0x5596da32d580, ost=0x5596da328f00, next_picture=0x5596db373780) at fftools/ffmpeg.c:1370
1370            ret = avcodec_send_frame(enc, in_picture);
(gdb) p *in_picture
$1 = {data = {0x7f0dfa244040 '\252' <repeats 118 times>, "\253\254\254\255\255", '\256' <repeats 77 times>..., 0x7f0df411bbc0 '~' <repeats 112 times>, "{{{{{{{{", '~' <repeats 31 times>, "}|", '{' <repeats 15 times>, '~' <repeats 32 times>...,
    0x7f0df419b980 '\200' <repeats 16 times>, "~~~~~~~~", '\200' <repeats 176 times>..., 0x0, 0x0, 0x0, 0x0, 0x0}, linesize = {1920, 960, 960, 0, 0, 0, 0, 0}, extended_data = 0x5596db373780, width = 1920, height = 1080, nb_samples = 0, format = 0,
  key_frame = 1, pict_type = AV_PICTURE_TYPE_NONE, sample_aspect_ratio = {num = 1, den = 1}, pts = 0, pkt_pts = 48, pkt_dts = 48, coded_picture_number = 0, display_picture_number = 0, quality = 0, opaque = 0x0, error = {0, 0, 0, 0, 0, 0, 0, 0},
  repeat_pict = 0, interlaced_frame = 0, top_field_first = 0, palette_has_changed = 0, reordered_opaque = -9223372036854775808, sample_rate = 0, channel_layout = 0, buf = {0x7f0dbc2b0f40, 0x7f0dbc2b0f80, 0x7f0dbc2b0fc0, 0x0, 0x0, 0x0, 0x0, 0x0},
  extended_buf = 0x0, nb_extended_buf = 0, side_data = 0x0, nb_side_data = 0, flags = 0, color_range = AVCOL_RANGE_UNSPECIFIED, color_primaries = AVCOL_PRI_UNSPECIFIED, color_trc = AVCOL_TRC_UNSPECIFIED, colorspace = AVCOL_SPC_UNSPECIFIED,
  chroma_location = AVCHROMA_LOC_LEFT, best_effort_timestamp = 48, pkt_pos = 515, pkt_duration = 0, metadata = 0x0, decode_error_flags = 0, channels = 0, pkt_size = 29162, qscale_table = 0x0, qstride = 0, qscale_type = 0, qp_table_buf = 0x0,
  hw_frames_ctx = 0x0, opaque_ref = 0x0, crop_top = 0, crop_bottom = 0, crop_left = 0, crop_right = 0, private_ref = 0x0, crf = 0}
(gdb) p in_picture->data[0]
$2 = (uint8_t *) 0x7f0dfa244040 '\252' <repeats 118 times>, "\253\254\254\255\255", '\256' <repeats 77 times>...
(gdb) p in_picture->data[1]
$3 = (uint8_t *) 0x7f0df411bbc0 '~' <repeats 112 times>, "{{{{{{{{", '~' <repeats 31 times>, "}|", '{' <repeats 15 times>, '~' <repeats 32 times>...
(gdb) p in_picture->data[2]

The YUV planes of this AVFrame are not contiguous, and the API gives no reliable way to tell whether an AVFrame's planes share a single buffer. So for now there is no good way to map an AVFrame onto an hmp::Frame backed by one buffer; when copying multiple planes or doing d2h/h2d, it is safest to handle each plane separately.
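The contiguity question above can be sketched as a pointer check, applied to the values from the gdb dump (illustrative helper, not an FFmpeg or BMF API):

```python
def is_single_buffer(data_ptrs, linesizes, heights):
    """Heuristic: True only if every plane begins exactly where the previous
    plane's rows end.  Padding or alignment gaps between planes defeat this
    check, which is part of why the question is hard to answer reliably
    from the AVFrame API alone."""
    expected = data_ptrs[0]
    for ptr, linesize, height in zip(data_ptrs, linesizes, heights):
        if ptr != expected:
            return False
        expected = ptr + linesize * height
    return True

# Pointers from the gdb dump above: 1920x1080 yuv420p, linesizes 1920/960/960.
dump = is_single_buffer(
    [0x7f0dfa244040, 0x7f0df411bbc0, 0x7f0df419b980],
    [1920, 960, 960],
    [1080, 540, 540],
)   # -> False: the planes live in unrelated allocations
```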

ProjetNice commented 12 months ago

You guys are amazing. I can't even figure out how to get it running or how to develop with it. Could we talk?