YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

Question: Half Float Inference? #38

Open IanZ2020 opened 3 months ago

IanZ2020 commented 3 months ago

I found that ltu/src/ltu_as/inference_gradio.py line 60 converts all params to float32:

convert_params_to_float32(model)

Inference in float32 is really slow and costly in GPU memory. Have you tested inference with float16? Does it negatively impact performance?

YuanGongND commented 3 months ago

https://github.com/YuanGongND/ltu/blob/2002aad8305ee5579a2237a85a6e792c1174cda7/src/ltu_as/inference_gradio.py#L29C9-L29C53

I believe this is only done for the LayerNorm (ln) layers, which is common practice.

The model was trained in mixed precision, so I guess if you can find a way to do 16-bit inference, that should be fine.

-Yuan
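
For readers landing on this thread: the pattern discussed above (everything in float16 except LayerNorm, which stays in float32 for numerical stability) can be sketched as below. This is a hypothetical illustration assuming PyTorch; `half_except_layernorm` is a made-up helper name, not a function from the LTU repo, and it has not been validated on the LTU checkpoints.

```python
# Sketch: cast a model to float16 for inference while keeping LayerNorm
# parameters in float32 (a common mixed-precision practice).
# `half_except_layernorm` is an illustrative helper, not part of the LTU repo.
import torch
import torch.nn as nn


def half_except_layernorm(model: nn.Module) -> nn.Module:
    model.half()  # cast all parameters and buffers to float16
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            module.float()  # keep LayerNorm in float32 for numerical stability
    return model


# Example: a tiny block standing in for a real transformer layer
block = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))
half_except_layernorm(block)
print(block[0].weight.dtype)  # linear weights are now float16
print(block[1].weight.dtype)  # LayerNorm weights remain float32
```

Note that with float32 LayerNorms inside an otherwise-float16 model, activations must be cast at the boundaries; in practice, running the forward pass under `torch.autocast` handles this automatically.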