Closed · HuHeng closed this issue 11 months ago
I'm using `Frame.cpu()` from cpu_gpu_trans_module.py (https://github.com/BabitMF/bmf/blob/master/bmf/python_modules/cpu_gpu_trans_module/cpu_gpu_trans_module.py#L40) and I still see multiple copies.
In the timeline below, the gray range marks the execution of `cpu()` and the red ranges mark the two D2H copies. The gray range was annotated with NVTX:
in_frame = in_pkt.get(bmf.VideoFrame)
torch.cuda.nvtx.range_push("cpu before")
video_frame = in_frame.cpu()
torch.cuda.nvtx.range_pop()
Test script:
def test_gpu_decode():
    input_video_path = "img.mp4"
    output_path = "./gpu_decode_result.yuv"
    (
        bmf.graph()
        .decode({"input_path": input_video_path,
                 "video_params": {
                     "hwaccel": "cuda",
                 }})["video"]
        .module("cpu_gpu_trans_module", {"to_gpu": 0})
        .encode(
            None,
            {
                "video_params": {
                    "codec": "rawvideo",
                },
                "format": "rawvideo",
                "output_path": output_path
            }
        ).run()
    )
This case may not be supported yet; on some paths of the AVFrame <-> hmp::Frame conversion the planes are still copied separately. I'll take a look at this scenario.
Looking at the implementation of AVFrame's `av_frame_get_buffer` interface, it allocates one large contiguous block of memory for a multi-plane frame; in that case only the first entry of the AVFrame's `buf` array is set and the rest are NULL. However, for an AVFrame that comes out of the decoder or out of a filter, it is hard to tell whether it uses one contiguous block. For example, debugging the following command with gdb: `ffmpeg -i in.mp4 -c:v libx264 out.mp4`
Thread 1 "ffmpeg_g" hit Breakpoint 1, do_video_out (of=0x5596da32d580, ost=0x5596da328f00, next_picture=0x5596db373780) at fftools/ffmpeg.c:1370
1370 ret = avcodec_send_frame(enc, in_picture);
(gdb) p *in_picture
$1 = {data = {0x7f0dfa244040 '\252' <repeats 118 times>, "\253\254\254\255\255", '\256' <repeats 77 times>..., 0x7f0df411bbc0 '~' <repeats 112 times>, "{{{{{{{{", '~' <repeats 31 times>, "}|", '{' <repeats 15 times>, '~' <repeats 32 times>...,
0x7f0df419b980 '\200' <repeats 16 times>, "~~~~~~~~", '\200' <repeats 176 times>..., 0x0, 0x0, 0x0, 0x0, 0x0}, linesize = {1920, 960, 960, 0, 0, 0, 0, 0}, extended_data = 0x5596db373780, width = 1920, height = 1080, nb_samples = 0, format = 0,
key_frame = 1, pict_type = AV_PICTURE_TYPE_NONE, sample_aspect_ratio = {num = 1, den = 1}, pts = 0, pkt_pts = 48, pkt_dts = 48, coded_picture_number = 0, display_picture_number = 0, quality = 0, opaque = 0x0, error = {0, 0, 0, 0, 0, 0, 0, 0},
repeat_pict = 0, interlaced_frame = 0, top_field_first = 0, palette_has_changed = 0, reordered_opaque = -9223372036854775808, sample_rate = 0, channel_layout = 0, buf = {0x7f0dbc2b0f40, 0x7f0dbc2b0f80, 0x7f0dbc2b0fc0, 0x0, 0x0, 0x0, 0x0, 0x0},
extended_buf = 0x0, nb_extended_buf = 0, side_data = 0x0, nb_side_data = 0, flags = 0, color_range = AVCOL_RANGE_UNSPECIFIED, color_primaries = AVCOL_PRI_UNSPECIFIED, color_trc = AVCOL_TRC_UNSPECIFIED, colorspace = AVCOL_SPC_UNSPECIFIED,
chroma_location = AVCHROMA_LOC_LEFT, best_effort_timestamp = 48, pkt_pos = 515, pkt_duration = 0, metadata = 0x0, decode_error_flags = 0, channels = 0, pkt_size = 29162, qscale_table = 0x0, qstride = 0, qscale_type = 0, qp_table_buf = 0x0,
hw_frames_ctx = 0x0, opaque_ref = 0x0, crop_top = 0, crop_bottom = 0, crop_left = 0, crop_right = 0, private_ref = 0x0, crf = 0}
(gdb) p in_picture->data[0]
$2 = (uint8_t *) 0x7f0dfa244040 '\252' <repeats 118 times>, "\253\254\254\255\255", '\256' <repeats 77 times>...
(gdb) p in_picture->data[1]
$3 = (uint8_t *) 0x7f0df411bbc0 '~' <repeats 112 times>, "{{{{{{{{", '~' <repeats 31 times>, "}|", '{' <repeats 15 times>, '~' <repeats 32 times>...
(gdb) p in_picture->data[2]
The YUV planes of this AVFrame are not contiguous, and the API gives no reliable way to tell whether an AVFrame's multiple planes share the same buffer. So for now there is no good way to map an AVFrame to an hmp::Frame backed by a single buffer; when copying multiple planes or doing D2H/H2D transfers, it is safer to handle each plane separately.
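To illustrate why the planes above are not contiguous: if they shared one buffer, each plane pointer would equal the previous pointer plus `linesize * plane_height`. A minimal sketch of that check (the helper name is hypothetical; the addresses, linesizes, and height are taken from the gdb dump above, assuming YUV420P with half-height chroma planes):

```python
def planes_contiguous(data, linesize, height):
    """Return True if three YUV420P plane pointers form one contiguous buffer."""
    # For YUV420P the two chroma planes are half the luma height.
    heights = [height, height // 2, height // 2]
    for i in range(2):
        expected = data[i] + linesize[i] * heights[i]
        if data[i + 1] != expected:
            return False
    return True

# Plane pointers and linesizes from the gdb dump of in_picture above.
data = [0x7f0dfa244040, 0x7f0df411bbc0, 0x7f0df419b980]
linesize = [1920, 960, 960]
print(planes_contiguous(data, linesize, 1080))  # → False: planes are separate
```

Here `data[0] + 1920 * 1080` would be `0x7f0dfa43e440`, far from the actual `data[1]`, confirming the planes live in separate allocations. Note this is only a heuristic for one pixel format; it is not a general way to decide buffer sharing from the AVFrame API.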
You guys are impressive. I don't even know how to get this running or how to develop on it. Could we have a chat?
An hmp::Frame that contains multiple planes uses one contiguous internal buffer to store the plane tensors. This buffer is managed through the `storage_tensor` tensor, so performing H2D/D2H on `storage_tensor` transfers the whole hmp::Frame in a single copy. The hmp::Frame interface stays compatible; only some interface implementations are modified.
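The layout described above can be sketched in a few lines of numpy. This is only an illustration of the design (class and attribute names are hypothetical, not BMF's actual C++ implementation): the planes are views into one contiguous storage array, so a transfer needs a single copy of the storage instead of one copy per plane.

```python
import numpy as np

class Frame:
    """Toy YUV420P frame: plane views over one contiguous storage buffer."""

    def __init__(self, width, height):
        y_size = width * height
        uv_size = (width // 2) * (height // 2)
        # One contiguous buffer backing all three planes (the "storage tensor").
        self.storage = np.empty(y_size + 2 * uv_size, dtype=np.uint8)
        self.planes = [
            self.storage[:y_size].reshape(height, width),
            self.storage[y_size:y_size + uv_size].reshape(height // 2, width // 2),
            self.storage[y_size + uv_size:].reshape(height // 2, width // 2),
        ]

    def copy(self):
        # One copy of the whole storage stands in for a single D2H/H2D
        # transfer; per-plane copies would be three separate operations.
        height, width = self.planes[0].shape
        dst = Frame(width, height)
        np.copyto(dst.storage, self.storage)
        return dst
```

With this layout, any operation that moves `storage` moves every plane at once, which is the property that lets `hmp::Frame` collapse the two D2H copies observed in the NVTX trace into one.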