deepseek-ai / DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Why does running DeepSeek-V2-Lite-Chat (SFT) on an A800 consume 60 GB of GPU memory?! #74

Open juhengzhe opened 2 months ago

juhengzhe commented 2 months ago

The weight files total about 32 GB. Why does the model occupy nearly 60 GB of GPU memory once it is actually loaded?

juhengzhe commented 2 months ago

When loading the model, specifying the dtype as float16 (instead of letting it load in full precision) brings GPU memory usage down below 40 GB.
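For reference, a minimal loading sketch assuming the Hugging Face transformers API. Older transformers versions load weights in float32 unless a torch_dtype is passed, which turns a ~32 GB half-precision checkpoint into ~64 GB in memory and would explain the ~60 GB observed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Pinning the dtype keeps the in-memory weights at roughly the on-disk
# checkpoint size instead of upcasting everything to float32.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,   # or torch.bfloat16
    device_map="auto",           # requires the `accelerate` package
)
```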

liangfang commented 2 months ago

I noticed this warning: "The model has a long context length (163840). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value."
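Capping the context length is the fix the warning suggests. A minimal sketch with the vLLM Python API (8192 is just an example cap, pick a limit that covers your actual prompts; the equivalent server flag is --max-model-len):

```python
from vllm import LLM, SamplingParams

# Capping max_model_len shrinks the pre-allocated KV-cache footprint
# that the warning refers to.
llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",
    trust_remote_code=True,
    max_model_len=8192,          # instead of the model's default 163840
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
```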

But I'd also like to ask: why does a long context length consume so much GPU memory?

beep-bebop commented 1 month ago

> I noticed this warning: "The model has a long context length (163840). This may cause OOM errors during the initial memory profiling phase, or result in low performance due to small KV cache space. Consider setting --max-model-len to a smaller value."
>
> But I'd also like to ask: why does a long context length consume so much GPU memory?

My guess is that GPU memory is pre-allocated for the long context, so memory usage doesn't change much during subsequent inference.
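A back-of-envelope check of why full-length pre-allocation is so expensive: KV-cache size grows linearly with context length. The layer/head/dim numbers below are illustrative placeholders, not DeepSeek-V2-Lite's actual (MLA-compressed) configuration; read the real values from the model's config.json for a proper estimate:

```python
# Illustrative KV-cache sizing for a single sequence at full context.
num_layers   = 27      # placeholder layer count
num_kv_heads = 16      # placeholder KV-head count (the real model uses MLA)
head_dim     = 128     # placeholder per-head dimension
context_len  = 163840  # the length the vLLM warning mentions
dtype_bytes  = 2       # float16

# One K vector and one V vector per token, per layer, per KV head.
kv_bytes = 2 * num_layers * num_kv_heads * head_dim * context_len * dtype_bytes
print(f"~{kv_bytes / 2**30:.0f} GiB for one full-length sequence")
# ~34 GiB with these placeholder numbers -- which is why the engine
# pre-allocates most of the GPU and warns about the long context.
```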