FunAudioLLM / SenseVoice

Multilingual Voice Understanding Model
https://funaudiollm.github.io/

onnxruntime-gpu inference is slower than CPU #76

Closed aofengdaxia closed 1 month ago

aofengdaxia commented 1 month ago

Notice: In order to resolve issues more efficiently, please raise issues following the template.

🐛 Bug

When I use onnxruntime-gpu for inference, it is slower than the CPU.

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

GPU inference takes 2800 ms, but CPU takes only 128 ms.

Code sample

```python
# Import as in the SenseVoice onnx demo (funasr_onnx package)
from funasr_onnx import SenseVoiceSmall
import numpy as np

# use CPU inference
model_dir = "iic/SenseVoiceSmall"
model = SenseVoiceSmall(model_dir, device_id=-1, quantize=True)
res = model(np.array(frames), language="zh", use_itn=True)

# use GPU inference
model_dir = "iic/SenseVoiceSmall"
model = SenseVoiceSmall(model_dir, device_id=0, quantize=True)
res = model(np.array(frames), language="zh", use_itn=True)
```

Expected behavior

Environment

aofengdaxia commented 1 month ago

CUDA 11.8, onnxruntime 1.18.1, cuDNN 8.9.5
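One quick diagnostic (not from the thread, just a common check with this kind of setup): confirm that the installed onnxruntime build actually exposes `CUDAExecutionProvider`. If the CPU-only `onnxruntime` package shadows `onnxruntime-gpu`, a session created with `device_id=0` can silently fall back to the CPU provider.

```python
def available_providers():
    """Return onnxruntime's execution providers, or None if it is not installed.

    If "CUDAExecutionProvider" is absent from the list, GPU sessions will
    silently run on "CPUExecutionProvider" instead.
    """
    try:
        import onnxruntime as ort
    except ImportError:
        return None
    return ort.get_available_providers()

print(available_providers())
```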

LauraGPT commented 1 month ago

It is a bug in onnxruntime, not in the model. To verify it on the GPU, infer the same wav 10 times; you will see inference times like:

1st: long, 2nd: short, 3rd: short, ...

For a new wav input, it behaves like the 1st time again.
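The repeated-inference check suggested above can be sketched as a small timing helper. The lambda workload below is a hypothetical stand-in; with the real model you would time `lambda: model(np.array(frames), language="zh", use_itn=True)` instead:

```python
import time

def time_runs(fn, n=10):
    """Call fn n times and return the per-call wall-clock durations in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

# Stand-in workload for illustration; substitute the real model call here.
durations = time_runs(lambda: sum(i * i for i in range(10_000)))
print(f"1st: {durations[0]:.6f}s  mean of 2nd-10th: "
      f"{sum(durations[1:]) / 9:.6f}s")
```

If the first call dominates and the rest are fast, the slowdown is first-run initialization rather than steady-state GPU throughput.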