Post-processing is very slow for the 7B model on a single RTX 3090

Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model. A low-resource Chinese LLaMA + LoRA approach, with a structure modeled on Alpaca.

https://github.com/Facico/Chinese-Vicuna

Apache License 2.0

4.14k stars 425 forks

Open f18298335152h opened 1 year ago

f18298335152h commented 1 year ago

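I deployed the 7B chat model on an RTX 3090. During inference, I measured the model itself at about 0.3 ms, but post-processing the tokens takes up to 2 s per token, which makes responses very slow. I traced the time to the for loop that iteratively calls GenerationMixin, which is where most of it is spent. How can this be fixed?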
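
For reference, a minimal sketch of how the two stages could be timed separately. CUDA kernels launch asynchronously, so a wall-clock reading taken without a synchronization point may only capture the launch and shift the real GPU cost onto whatever code runs next. The helper `timed` and the stand-in workloads are hypothetical and not part of Chinese-Vicuna; the sketch assumes that, with PyTorch on CUDA, `torch.cuda.synchronize` is passed as `sync`.

```python
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T],
          sync: Optional[Callable[[], None]] = None) -> Tuple[T, float]:
    """Run fn() and return (result, elapsed seconds).

    `sync` is an optional barrier called before the clock starts and again
    before it stops. Assumption: for GPU work under PyTorch this would be
    torch.cuda.synchronize; leave it as None for pure CPU code.
    """
    if sync is not None:
        sync()  # drain previously queued work so it is not billed to fn
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait until the work launched by fn has actually finished
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical stand-ins: replace with one model step and one
    # token post-processing step from the generation loop.
    _, forward_s = timed(lambda: sum(i * i for i in range(100_000)))
    _, post_s = timed(lambda: time.sleep(0.01))
    print(f"forward: {forward_s * 1e3:.3f} ms, "
          f"post-processing: {post_s * 1e3:.3f} ms")
```

If the forward-pass time grows substantially once `sync` is supplied, the 2 s per token is more likely GPU compute than token post-processing.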