[Question] Model uses 28 GB of GPU memory?

goodnessSZW commented 1 year ago

Required prerequisites

[X] I have read the documentation https://github.com/baichuan-inc/baichuan-7B/blob/HEAD/README.md.
[x] I have searched the Issue Tracker and Discussions that this hasn't already been reported. (+1 or comment there if it has.)
[X] Consider asking first in a Discussion.

Questions

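After downloading the model from Hugging Face and running the official inference code, GPU memory usage is already 28 GB right after loading the model, before any inference. Is anyone else seeing this? I'm not sure where the problem is.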

Checklist

[X] I have provided all relevant and necessary information above.
[X] I have chosen a suitable title for this issue.

dayL-W commented 1 year ago

On my V100 it also takes about 30 GB, with no quantization. That seems fine to me.

formath commented 1 year ago

Isn't that how big a 7B model is supposed to be?

goodnessSZW commented 1 year ago

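Re "Isn't that how big a 7B model is supposed to be?":

If I load the model in bfloat16, its GPU memory usage drops to about 13 GB, which is closer to what I expected, because ChatGLM2-6B takes about 11.6 GB of GPU memory for inference.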
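The gap between the two figures is consistent with bytes per parameter: float32 stores 4 bytes per weight and bfloat16 stores 2. A minimal sketch of that arithmetic follows. The parameter count of exactly 7 billion is an assumed round number, not the real count for Baichuan-7B, and the helper name `weight_memory_gib` is made up for illustration.

```python
GIB = 1024 ** 3  # bytes in one GiB

# Storage size of one parameter for each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params, dtype):
    """Estimate the memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


if __name__ == "__main__":
    num_params = 7_000_000_000  # assumption: round figure for a "7B" model
    for dtype in ("float32", "bfloat16"):
        print(f"{dtype}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```

Under that assumption the weights alone come to roughly 26 GiB in float32 and 13 GiB in bfloat16. The CUDA context and allocator overhead sit on top of this, which may account for the rest of the 28 GB reading.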

Wu-Fisher commented 1 year ago

Could you share the problems you ran into? I've been comparing these two models recently as well.

goodnessSZW commented 1 year ago

Sure, let's discuss it together. I've been trying baichuan and chatglm2 over the past few days and ran into a few small issues as well:

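I mainly used the following three functions to measure GPU memory during inference:

torch.cuda.memory_allocated(): GPU memory currently occupied by tensors
torch.cuda.max_memory_allocated(): peak GPU memory occupied by tensors since the program started
torch.cuda.max_memory_reserved(): peak GPU memory reserved by the caching allocator for the current process (total GPU memory used minus the CUDA context)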
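The three counters above can be read together after each step. This is a minimal sketch, not the code used in this thread. The helper names `to_gib` and `cuda_memory_report` are made up for illustration, and the function returns None when PyTorch or a CUDA device is not available.

```python
def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 1024 ** 3


def cuda_memory_report(device=0):
    """Return the three counters in GiB, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return {
        # Memory currently held by live tensors.
        "allocated": to_gib(torch.cuda.memory_allocated(device)),
        # Peak tensor memory since program start (or the last reset).
        "max_allocated": to_gib(torch.cuda.max_memory_allocated(device)),
        # Peak memory reserved by the caching allocator.
        "max_reserved": to_gib(torch.cuda.max_memory_reserved(device)),
    }


if __name__ == "__main__":
    report = cuda_memory_report()
    if report is None:
        print("CUDA not available")
    else:
        for name, gib in report.items():
            print(f"{name}: {gib:.2f} GiB")
```

Calling `cuda_memory_report()` once right after loading the model and again after each batch makes it easy to see whether the allocated figure returns to the model's own footprint.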

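1) chatglm2 6B: the .bin files total 12 GB; after loading the model, tensors occupy 11.65 GB.
2) baichuan 7B: the .bin files total 14 GB; after loading the model, tensors occupy 13.16 GB.

Casting to bfloat16 roughly halves the GPU memory. device_map="auto" appears in the source code, but with it the model actually runs out of memory (OOM) earlier, and I'm not sure why. Also, without it, torch.cuda.memory_allocated() printed after each batch of inference equals the memory occupied by the model itself. With device_map="auto", the value printed after each batch goes up. Could that be the automatic device placement strategy?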
