aihacker111 / Efficient-Live-Portrait

Fast running Live Portrait with TensorRT and ONNX models
MIT License

onnxruntime issue #2

Open warmshao opened 1 month ago

warmshao commented 1 month ago

I see you installed onnxruntime rather than onnxruntime-gpu, so GPU acceleration isn't available. But when I installed onnxruntime-gpu, I got an error: grid_sample doesn't support 5D.
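
A quick way to check which build is installed and whether the CUDA provider is actually available (a minimal sketch using onnxruntime's standard API):

```python
import onnxruntime as ort

# The onnxruntime and onnxruntime-gpu packages are mutually exclusive;
# only the GPU build lists CUDAExecutionProvider here.
print(ort.__version__)
print(ort.get_available_providers())
```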

aihacker111 commented 1 month ago

Got it, I'll update the model for this. Alternatively, you can retry the conversion after upgrading to torch==2.3.1 with CUDA 11.8.

aihacker111 commented 1 month ago

Suggestion @warmshao: install the latest torch==2.3.1+cu118 and run the conversion code in my source again against the original Live-Portrait.
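
For reference, installing that build would look something like this (a hedged sketch; the cu118 index is PyTorch's standard wheel index):

```bash
pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cu118
```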

warmshao commented 1 month ago

> Got it, I'll update the model for this. Alternatively, you can retry the conversion after upgrading to torch==2.3.1 with CUDA 11.8.

Has this been verified to work?

aihacker111 commented 1 month ago

@warmshao yes, my teammate and I are testing this. I'll update the warping model tonight.

aihacker111 commented 1 month ago

@warmshao by the way, you can run warp.onnx on CPU.
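
Forcing the CPU provider is a one-line change when creating the session (a minimal sketch; the file name follows the comment above):

```python
import onnxruntime as ort

# CPU-only execution side-steps the CUDA GridSample 5-D limitation,
# at the cost of speed.
sess = ort.InferenceSession("warp.onnx", providers=["CPUExecutionProvider"])
```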

warmshao commented 1 month ago

> @warmshao by the way, you can run warp.onnx on CPU.

It runs fine on CPU, just rather slowly.

aihacker111 commented 1 month ago

Yeah, wait for me tonight; I'll update the new ONNX model fixed for GPU. We also still plan to publish TensorRT support for maximum speed.

mocmocmoc commented 1 month ago

> Yeah, wait for me tonight; I'll update the new ONNX model fixed for GPU. We also still plan to publish TensorRT support for maximum speed.

While you're at it: I cloned to 'C:\AI\Efficient-Live-Portrait\', but the ONNX models get downloaded into 'C:\live_portrait_onnx_weights\'.

oh boy tensorRT 😍

aihacker111 commented 1 month ago

@mocmocmoc don't worry, it will resolve to the correct path.

Echolink50 commented 1 month ago

I moved the models around following the structure on the code page but still no luck.

aihacker111 commented 1 month ago

Do you have free time right now?

Echolink50 commented 1 month ago

> Do you have free time right now?

Are you asking me?

aihacker111 commented 1 month ago

@warmshao @Echolink50 Hey, the new ONNX model that fixes the onnxruntime-gpu Grid5D inference error has been converted successfully. Please check the new Live-Portrait ONNX model on Hugging Face in about an hour from now.

mocmocmoc commented 1 month ago

> The new ONNX model that fixes the onnxruntime-gpu Grid5D inference error has been converted successfully.

Which onnxruntime-gpu should we use? Currently 1.17.x works, but 1.18.x gives an "Only 4-D tensor is supported" error.

aihacker111 commented 1 month ago

@mocmocmoc My new model update fixes it at inference time. Please install onnxruntime-gpu with the correct CUDA and cuDNN versions as required by the ONNX Runtime documentation: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html
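
On a CUDA 11.8 / cuDNN 8.x stack, that would look roughly like this (a hedged sketch; 1.17.x is the line reported working earlier in this thread, but check the compatibility table at the link above):

```bash
# Remove the CPU-only package first; the two builds conflict.
pip uninstall -y onnxruntime
pip install onnxruntime-gpu==1.17.1
```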

aihacker111 commented 1 month ago

@mocmocmoc Already updated in the repo. Please test it.

aihacker111 commented 1 month ago

@mocmocmoc @warmshao @Echolink50 It's working on GPU now; I've tested every single ONNX model.

[Screenshot: 2024-07-14 14:57:46]

Only 2.8 ms for the warping model. I've also updated two requirements files, for CPU and GPU installs.

aihacker111 commented 1 month ago
[Screenshot: 2024-07-14 15:05:00]

Already done: about 4 minutes for 354 frames, at roughly 1 s per frame.

mocmocmoc commented 1 month ago

I'm running locally on Windows 11: CPU 13600KF, GPU 3090, Python 3.10.11, CUDA 11.8, cuDNN 8.9.x.x.

Getting very bad results. Benchmarking with \source\s2.jpg and \driving\d0.mp4: Official LivePortrait = 16 s, Efficient-Live-Portrait = 1:18 min.

I made sure CUDAExecutionProvider is working.
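
One way to verify that is to ask the session which providers it actually ended up with (a minimal sketch; the path follows the repo's weight layout used later in this thread):

```python
import onnxruntime as ort

sess = ort.InferenceSession(
    "pretrained_weights/onnx_weights/warping_module.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# If CUDA initialized correctly, CUDAExecutionProvider is listed first.
print(sess.get_providers())
```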

aihacker111 commented 1 month ago

The official implementation already uses half precision, which currently works better with CUDA, while the ONNX path runs the whole process in fp32; that is why it is slower. The trade-off is that ONNX lets you run the model anywhere instead of shipping the PyTorch model around, it can be driven from C++ code, and it improves GPU VRAM usage.
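
Closing that gap would mean converting the exported graphs to fp16. A minimal sketch with the separate onnxconverter-common package (an assumption, not part of this repo; keep_io_types keeps the fp32 input/output interface):

```python
import onnx
from onnxconverter_common import float16

# Rewrite an exported fp32 graph's weights and ops to fp16.
model = onnx.load("pretrained_weights/onnx_weights/warping_module.onnx")
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, "pretrained_weights/onnx_weights/warping_module_fp16.onnx")
```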

aihacker111 commented 1 month ago
[Screenshot: 2024-07-14 17:45:15]
aihacker111 commented 1 month ago

@mocmocmoc if you want it as fast as possible, please wait for the TensorRT update.

warmshao commented 1 month ago

> It's working on GPU now; I've tested every single ONNX model. Only 2.8 ms for the warping model. I've also updated two requirements files, for CPU and GPU installs.

Thanks, great job!

warmshao commented 1 month ago

> It's working on GPU now; I've tested every single ONNX model. Only 2.8 ms for the warping model. I've also updated two requirements files, for CPU and GPU installs.

What hardware was this tested on? Honestly, it's a bit hard to believe: I see that even the official implementation, optimized with PyTorch compile + half + triton, takes 5.21 ms on a 4090, and in my experience ONNX can't be twice as fast.

warmshao commented 1 month ago

Also, running your warping model on GPU is quite slow for me, and I get the same warning as in your screenshot. Opening the log, it shows: 2024-07-14 14:31:56.027589591 [I:onnxruntime:, cuda_execution_provider.cc:2517 GetCapability] CUDA kernel not found in registries for Op type: GridSample node name: /GridSample
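
That log line means the GridSample node is being assigned to the CPU. Turning onnxruntime's log level to verbose makes such fallbacks visible (a minimal sketch; the model path is illustrative):

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # VERBOSE: prints per-node provider assignments

sess = ort.InferenceSession(
    "pretrained_weights/onnx_weights/warping_module.onnx",
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```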

westNeighbor commented 1 month ago

Mine: Windows 11, RTX 4080, Python 3.10.11, CUDA 12.4, cuDNN 8.9.2.26.

For \driving\d3.mp4: Official LivePortrait = 35 s (~14 it/s), Efficient-Live-Portrait = 3:30 min (1.7 it/s).

I thought this was a faster version, but it's way slower than the official one.

aihacker111 commented 1 month ago

> It's working on GPU now; I've tested every single ONNX model. Only 2.8 ms for the warping model. I've also updated two requirements files, for CPU and GPU installs.

> What hardware was this tested on? Honestly, it's a bit hard to believe: even the official implementation, optimized with PyTorch compile + half + triton, takes 5.21 ms on a 4090, and in my experience ONNX can't be twice as fast.

@westNeighbor See the above: the official build uses compile + half + triton, while the ONNX path runs fp32 only.

wjwzy commented 1 month ago

> It's working on GPU now; I've tested every single ONNX model. Only 2.8 ms for the warping model. I've also updated two requirements files, for CPU and GPU installs.

The unit of time.time() is seconds. After 10 loops, the inference time on an RTX 3060 12 GB is 320~350 ms. I set the providers when initializing the model.

[Screenshot: 2024-07-15 16:48:29]

aihacker111 commented 1 month ago

@wjwzy the warping model is not the problem. The problem is the SPADE generator: it takes 14 s, and when I check the ONNX load I see some layers fall back to the CPU for processing, which is why it is so slow.

aihacker111 commented 1 month ago

@warmshao @westNeighbor @wjwzy
I've found the problem; the fix is in progress and I'll update the new model within an hour.

aihacker111 commented 1 month ago

@wjwzy please help me check all the models except spade_generator over 10 loops and post the results here. Thank you.

wjwzy commented 1 month ago

> @wjwzy please help me check all the models except spade_generator over 10 loops and post the results here. Thank you.

Apart from spade_generator, I only used these four models on my end. Among them, stitching_retargeting is an ONNX model I exported myself, and I have verified that the inference results of all these models are correct:

- appearance_feature_extractor: 10 ms
- motion_extractor: 8 ms
- stitching_retargeting: <1 ms
- warping_module: 380 ms

[Screenshots: 2024-07-15 17:31:03, 17:31:13, 17:31:19, 17:31:25]

aihacker111 commented 1 month ago

@wjwzy I've updated the new SPADE model; you can download it from my Hugging Face link. Run the full process and please give me all the runtimes. Thank you.

wjwzy commented 1 month ago

> @wjwzy I've updated the new SPADE model; you can download it from my Hugging Face link. Run the full process and please give me all the runtimes. Thank you.

The inference time of the model you provided is 170~180 ms.

[Screenshot: 2024-07-15 17:47:49]

The spade_generator model can use fp16 inference. The version I exported locally takes fp16 input, and its inference time is 60~70 ms, though it sometimes climbs above 90 ms. The first frame's inference time must be ignored here because the GPU needs to warm up, so it is unstable.

[Screenshot: 2024-07-15 17:55:46]
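
A minimal sketch of that kind of fp16 export (the module variable is hypothetical; load it with the repo's own code, and the input shape follows the test script below):

```python
import torch

# `spade` stands in for the already-loaded PyTorch SPADE generator module.
model = spade.half().eval().cuda()
dummy = torch.randn(1, 256, 64, 64, dtype=torch.float16, device="cuda")

torch.onnx.export(
    model, dummy, "spade_generator_fp16.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=16,
)
```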

wjwzy commented 1 month ago

test code:

```python
import time

import onnxruntime as ort
import torch


def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()


def appearance_feature_extractor():
    ort_session = ort.InferenceSession('pretrained_weights/onnx_weights/appearance_feature_extractor.onnx',
                                       providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
    inputs = torch.randn(1, 3, 256, 256)  # Example shape, adjust as necessary
    # Prepare inputs for ONNX runtime
    ort_inputs = {
        'input': to_numpy(inputs),
    }
    print("-------------------------")
    print("appearance_feature_extractor inference:")
    # Run inference
    for i in range(10):
        t = time.time()
        ort_outs = ort_session.run(None, ort_inputs)
        print(time.time() - t)


def motion_extractor():
    ort_session = ort.InferenceSession('pretrained_weights/onnx_weights/motion_extractor.onnx',
                                       providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
    inputs = torch.randn(1, 3, 256, 256)  # Example shape, adjust as necessary
    # Prepare inputs for ONNX runtime
    ort_inputs = {
        'input': to_numpy(inputs),
    }
    print("-------------------------")
    print("motion_extractor inference:")
    # Run inference
    for i in range(10):
        t = time.time()
        ort_outs = ort_session.run(None, ort_inputs)
        print(time.time() - t)


def stitching_retargeting():
    ort_session = ort.InferenceSession('pretrained_weights/onnx_weights/stitching_retargeting.onnx',
                                       providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
    inputs = torch.randn(1, 126)  # Example shape, adjust as necessary
    # Prepare inputs for ONNX runtime
    ort_inputs = {
        'input': to_numpy(inputs),
    }
    print("-------------------------")
    print("stitching_retargeting inference:")
    # Run inference
    for i in range(10):
        t = time.time()
        ort_outs = ort_session.run(None, ort_inputs)
        print(time.time() - t)


def warping_module():
    ort_session = ort.InferenceSession('pretrained_weights/onnx_weights/warping_module.onnx',
                                       providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
    feature_3d = torch.randn(1, 32, 16, 64, 64)  # Example shape, adjust as necessary
    kp_source = torch.randn(1, 21, 3)  # Example shape, adjust as necessary
    kp_driving = torch.randn(1, 21, 3)  # Example shape, adjust as necessary
    # Prepare inputs for ONNX runtime
    ort_inputs = {
        'feature_3d': to_numpy(feature_3d),
        'kp_source': to_numpy(kp_source),
        'kp_driving': to_numpy(kp_driving)
    }
    print("-------------------------")
    print("warping_module inference:")
    # Run inference
    for i in range(10):
        t = time.time()
        ort_outs = ort_session.run(None, ort_inputs)
        print(time.time() - t)


def spade_generator():
    ort_session = ort.InferenceSession('pretrained_weights/onnx_weights/spade_generator.onnx',
                                       providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
    inputs = torch.randn(1, 256, 64, 64)  # Example shape, adjust as necessary
    # Prepare inputs for ONNX runtime
    ort_inputs = {
        'input': to_numpy(inputs)
    }
    print("-------------------------")
    print("spade_generator inference:")
    # Run inference
    for i in range(10):
        t = time.time()
        ort_outs = ort_session.run(None, ort_inputs)
        print(time.time() - t)


if __name__ == '__main__':
    appearance_feature_extractor()
    # motion_extractor()
    # stitching_retargeting()
    # warping_module()
    spade_generator()
```

aihacker111 commented 1 month ago

So when you run the full process to generate a video, tell me how long one 78-frame video takes.

aihacker111 commented 1 month ago

@wjwzy it seems like the first run is very slow.

wjwzy commented 1 month ago

> So when you run the full process to generate a video, tell me how long one 78-frame video takes.

It takes 45 seconds to execute, and the bulk of the time is still spent in warping.onnx; single-frame inference takes more than 300 ms. Can that inference be sped up? I think PyTorch inference takes only about 100 ms.

aihacker111 commented 1 month ago

@wjwzy I found it; that's also why it raises an error when converting to TensorRT.

aihacker111 commented 1 month ago

@wjwzy already updated, please test it.

aihacker111 commented 1 month ago

@wjwzy fp16 cuts the time in half; I'll update it.

aihacker111 commented 1 month ago

@wjwzy TensorRT already works and will be released in an update; the inference speed and GPU VRAM usage are incredible.
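
For reference, a minimal sketch of building an fp16 engine from one of the exported graphs with NVIDIA's trtexec tool (file names illustrative):

```bash
trtexec --onnx=pretrained_weights/onnx_weights/warping_module.onnx \
        --saveEngine=warping_module.plan \
        --fp16
```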

warmshao commented 1 month ago

No more fiddling for me; I've implemented it all here: https://github.com/warmshao/FasterLivePortrait