FunAudioLLM / SenseVoice

Multilingual Voice Understanding Model
https://funaudiollm.github.io/

onnxruntime-gpu inference is slower than CPU #76

Closed aofengdaxia closed 1 month ago

aofengdaxia commented 1 month ago

Notice: In order to resolve issues more efficiently, please raise issues following the template.

🐛 Bug

When I use onnxruntime-gpu for inference, it is slower than the CPU.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

GPU inference takes 2800 ms, but CPU takes only 128 ms.

Code sample

```python
# Import as in the SenseVoice onnx demo (funasr_onnx package)
from funasr_onnx import SenseVoiceSmall
import numpy as np

# use CPU inference
model_dir = "iic/SenseVoiceSmall"
model = SenseVoiceSmall(model_dir, device_id=-1, quantize=True)
res = model(np.array(frames), language="zh", use_itn=True)

# use GPU inference
model_dir = "iic/SenseVoiceSmall"
model = SenseVoiceSmall(model_dir, device_id=0, quantize=True)
res = model(np.array(frames), language="zh", use_itn=True)
```

Expected behavior

Environment

aofengdaxia commented 1 month ago

CUDA 11.8, onnxruntime 1.18.1, cuDNN 8.9.5
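One quick diagnostic (not from the thread, just a common check with this kind of setup): confirm that the installed onnxruntime build actually exposes `CUDAExecutionProvider`. If the CPU-only `onnxruntime` package shadows `onnxruntime-gpu`, a session created with `device_id=0` can silently fall back to the CPU provider.

```python
def available_providers():
    """Return onnxruntime's execution providers, or None if it is not installed.

    If "CUDAExecutionProvider" is absent from the list, GPU sessions will
    silently run on "CPUExecutionProvider" instead.
    """
    try:
        import onnxruntime as ort
    except ImportError:
        return None
    return ort.get_available_providers()

print(available_providers())
```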

LauraGPT commented 1 month ago

It is a bug in onnxruntime, not in the model. To verify it on the GPU, infer the same wav 10 times; you will see inference times like:

1st: long, 2nd: short, 3rd: short, ...

For a new wav input, it behaves like the 1st time again.
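The repeated-inference check suggested above can be sketched as a small timing helper. The lambda workload below is a hypothetical stand-in; with the real model you would time `lambda: model(np.array(frames), language="zh", use_itn=True)` instead:

```python
import time

def time_runs(fn, n=10):
    """Call fn n times and return the per-call wall-clock durations in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

# Stand-in workload for illustration; substitute the real model call here.
durations = time_runs(lambda: sum(i * i for i in range(10_000)))
print(f"1st: {durations[0]:.6f}s  mean of 2nd-10th: "
      f"{sum(durations[1:]) / 9:.6f}s")
```

If the first call dominates and the rest are fast, the slowdown is first-run initialization rather than steady-state GPU throughput.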