FunAudioLLM / CosyVoice

A multilingual large voice generation model with full-stack inference, training, and deployment capabilities.
https://funaudiollm.github.io/
Apache License 2.0
6.15k stars · 658 forks

Inference speed is very slow #75

Open zhusy09 opened 4 months ago

zhusy09 commented 4 months ago

In my current tests inference is very slow. On an A100, generating 1 minute of audio can sometimes take close to 1 minute of inference time. Are there any ways to optimize this?

aluminumbox commented 4 months ago

Well, you can try fp16 inference. We may try some inference optimization methods later, but for now we are focused on fixing bugs and making this repo easier to use.
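For reference, a minimal sketch of what mixed-precision (fp16/bf16) inference looks like in PyTorch. `TinyModel` is a hypothetical stand-in, not the CosyVoice model; the real repo's entry points differ.

```python
import torch

class TinyModel(torch.nn.Module):
    """Hypothetical stand-in for a TTS model."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 8)

    def forward(self, x):
        return self.proj(x)

model = TinyModel().eval()
x = torch.randn(1, 8)

# autocast runs eligible ops (e.g. Linear) in lower precision per-op,
# without converting the stored fp32 weights.
with torch.inference_mode(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)
```

On a GPU you would use `device_type="cuda"` with `dtype=torch.float16`. Because autocast casts per-op rather than converting the whole model, it avoids the dtype-mismatch errors that a blanket `model.half()` can trigger.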

eshoyuan commented 4 months ago

I think you can split text into small pieces.

xwang0415 commented 4 months ago

> I think you can split text into small pieces.

Splitting the text only speeds up the streaming response; it does not improve overall inference efficiency.

eshoyuan commented 4 months ago

> I think you can split text into small pieces.
>
> Splitting the text only speeds up the streaming response; it does not improve overall inference efficiency.

Transformer attention is O(N^2) in sequence length, so splitting sentences should help.

eshoyuan commented 4 months ago

> I think you can split text into small pieces.
>
> Splitting the text only speeds up the streaming response; it does not improve overall inference efficiency.
>
> Transformer attention is O(N^2) in sequence length, so splitting sentences should help.

Sorry, I was wrong; it doesn't help. On a single RTX 3090, generating '你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?' 10 times takes 36 seconds, while generating the same sentence concatenated 10 times as one request takes 30 seconds.

eshoyuan commented 4 months ago

> I think you can split text into small pieces.
>
> Splitting the text only speeds up the streaming response; it does not improve overall inference efficiency.
>
> Transformer attention is O(N^2) in sequence length, so splitting sentences should help.
>
> Sorry, I was wrong; it doesn't help. On a single RTX 3090, generating '你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?' 10 times takes 36 seconds, while generating the same sentence concatenated 10 times as one request takes 30 seconds.

The reason is that the author already applies a text splitter inside self.frontend.text_normalize.
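A back-of-the-envelope sketch (pure Python, toy cost model) of the O(N^2) argument above, and of why it only pays off if the model actually sees the long sequence: a frontend that already splits the text makes manual splitting a no-op.

```python
def attention_cost(n):
    """Toy cost model: self-attention over a length-n sequence is O(n^2)."""
    return n * n

long_text = 400                                      # one request of 400 tokens
chunks = [100] * 4                                   # the same text pre-split into 4 chunks

cost_long = attention_cost(long_text)                # 400^2 = 160000
cost_split = sum(attention_cost(n) for n in chunks)  # 4 * 100^2 = 40000
print(cost_long, cost_split)
```

In general, k equal chunks cost N^2 / k instead of N^2 — a 4x saving here — but since `text_normalize` already splits internally, both code paths end up running the same short sequences.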

tiger-998jim commented 4 months ago

Can this be deployed on Kaggle and run on a T4 GPU? I wonder what the inference speed would be like.

zhusy09 commented 4 months ago

> Well, you can try fp16 inference. We may try some inference optimization methods later, but for now we are focused on fixing bugs and making this repo easier to use.

I tried it, but it made no difference; I also used torch.compile, but that didn't help either.

aluminumbox commented 4 months ago

> Can this be deployed on Kaggle and run on a T4 GPU? I wonder what the inference speed would be like.

You can try it; it should be OK.

yangcunning1 commented 4 months ago

> Well, you can try fp16 inference. We may try some inference optimization methods later, but for now we are focused on fixing bugs and making this repo easier to use.
>
> I tried it, but it made no difference; I also used torch.compile, but that didn't help either.

How do I switch to fp16 for inference?

chenxu126 commented 3 months ago

Same question.

SuperNodeLibs commented 3 months ago

Same question.

einsqing commented 3 months ago

Same question.

suxuanning commented 3 months ago

Same question.

Dinxin commented 3 months ago

Flash Attention 1/2, Paged Attention, and model quantization may greatly speed up inference.
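As a sketch of the Flash Attention suggestion: recent PyTorch exposes `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a fused FlashAttention kernel on supported GPUs (and a fused math fallback on CPU). Wiring it into the LLM's attention layers is one way to try this; the tensor shapes below are illustrative only.

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) -- illustrative sizes
q = torch.randn(1, 4, 16, 32)
k = torch.randn(1, 4, 16, 32)
v = torch.randn(1, 4, 16, 32)

# On CUDA this can dispatch to a FlashAttention kernel; on CPU it falls
# back to an efficient math implementation. is_causal applies the usual
# autoregressive mask used by LLM decoding.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```

Paged attention and quantization, by contrast, typically require serving the LLM through an engine built for them (e.g. a vLLM-style server) rather than a one-line swap.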

LzyloveRila commented 3 months ago

> Well, you can try fp16 inference. We may try some inference optimization methods later, but for now we are focused on fixing bugs and making this repo easier to use.
>
> I tried it, but it made no difference; I also used torch.compile, but that didn't help either.

I also switched to fp16; GPU memory usage halved, but the speed didn't change...

yangcunning1 commented 3 months ago

How do I make that change? After I modified it, the generated audio came out silent.


LzyloveRila commented 3 months ago

> How do I make that change? After I modified it, the generated audio came out silent.

I used autocast directly; with long text my audio also comes out silent, but short text works.

shanhaidexiamo commented 3 months ago

> Well, you can try fp16 inference. We may try some inference optimization methods later, but for now we are focused on fixing bugs and making this repo easier to use.
>
> I tried it, but it made no difference; I also used torch.compile, but that didn't help either.
>
> I also switched to fp16; GPU memory usage halved, but the speed didn't change...

I also called model.half(); only the flow part speeds up, and the LLM part's speed is unchanged. Have you solved this problem?

abc8350712 commented 3 months ago

> Well, you can try fp16 inference. We may try some inference optimization methods later, but for now we are focused on fixing bugs and making this repo easier to use.
>
> I tried it, but it made no difference; I also used torch.compile, but that didn't help either.
>
> I also switched to fp16; GPU memory usage halved, but the speed didn't change...
>
> I also called model.half(); only the flow part speeds up, and the LLM part's speed is unchanged. Have you solved this problem?

How do you half() the flow part? I keep getting RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half.
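A minimal repro of that error and the usual fix: after `model.half()` the weights are fp16 while some activation or buffer is still fp32, and matmul rejects mixed dtypes. The cure is to make every operand along the path one dtype — cast the inputs down, or cast the un-converted tensor up, as sketched below.

```python
import torch

w = torch.randn(4, 3).half()   # an fp16 weight, as after model.half()
x = torch.randn(2, 4)          # an fp32 activation that was never cast

try:
    x @ w                      # mixed-dtype matmul raises RuntimeError
    err = ""
except RuntimeError as e:
    err = str(e)               # "... must have the same dtype ..."

out = x @ w.float()            # fix: put both operands in one dtype
print(err)
print(out.dtype)
```

In a real model this usually means hunting down the fp32 stragglers (conditioning inputs, cached buffers, or submodules skipped by half()) and casting them explicitly at the module boundary.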

hjj-lmx commented 3 months ago

> In my current tests inference is very slow. On an A100, generating 1 minute of audio can sometimes take close to 1 minute of inference time. Are there any ways to optimize this?

Has a solution been found?

garfieldv5 commented 3 months ago

> In my current tests inference is very slow. On an A100, generating 1 minute of audio can sometimes take close to 1 minute of inference time. Are there any ways to optimize this?

I hit the same problem: even with streaming, a first sentence of 37 characters takes 18 seconds before the audio stream starts.

boomyao commented 3 months ago

Splitting long text into sentences and running inference in parallel across multiple threads can effectively reduce the total time.
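A sketch of that approach, with a dummy `synthesize()` standing in for the real model call; the splitting regex and function names are illustrative, not part of the repo.

```python
from concurrent.futures import ThreadPoolExecutor
import re

def synthesize(sentence):
    # Placeholder for a model call; returns fake audio samples,
    # one per input character.
    return [0.0] * len(sentence)

def tts_parallel(text, workers=4):
    # Split after Chinese/Western sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[。!?.!?])", text) if s]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(synthesize, sentences))  # order-preserving
    audio = []
    for chunk in chunks:
        audio.extend(chunk)                             # concatenate in order
    return audio

audio = tts_parallel("你好。今天天气不错!我们出发吧?")
print(len(audio))
```

Note that threads only help if the model call releases the GIL (PyTorch ops do) and the GPU has spare capacity; on an already-saturated card the chunks just queue up.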

Erickrus commented 3 months ago

@boomyao Do you mean infer("1") + infer("2") + infer("3") <= infer("1"+"2"+"3")? Is it much faster? In my measurements on a T4, GPU utilization was already fully saturated and I could only run one stream.

wang-TJ-20 commented 2 months ago

> Well, you can try fp16 inference. We may try some inference optimization methods later, but for now we are focused on fixing bugs and making this repo easier to use.
>
> I tried it, but it made no difference; I also used torch.compile, but that didn't help either.
>
> I also switched to fp16; GPU memory usage halved, but the speed didn't change...
>
> I also called model.half(); only the flow part speeds up, and the LLM part's speed is unchanged. Have you solved this problem?
>
> How do you half() the flow part? I keep getting RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half.

Hi, has this problem been solved? I've run into it as well.