Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-Following LLaMA-Based Model (a low-resource Chinese LLaMA + LoRA recipe, with a structure modeled on Alpaca)
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0

CUDA out of memory when running inference with llama-13b-hf #224

Open Bingohong opened 1 year ago

Bingohong commented 1 year ago

I'm trying to run inference with llama-13b-hf. The machine has four 10 GB 3080 cards, and the upstream model should only be about 18 GB, but the following error is raised while loading the LoRA model:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 10.00 GiB total capacity; 8.82 GiB already allocated; 0 bytes free; 8.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

After googling, I set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:24, but it had no effect. What could be causing this? The LoRA model is only 26 MB and I have 40 GB of VRAM in total, so there should be plenty of free memory, yet it still reports out of memory. Thanks!

System: Windows Server 2019
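The log shows everything being allocated on GPU 0, so the usual fix is to shard the base model across all visible GPUs rather than tuning the allocator. Below is a minimal sketch using the Hugging Face transformers + peft APIs, not this repository's own inference script; the model and adapter paths are placeholders:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

BASE_MODEL = "decapoda-research/llama-13b-hf"  # placeholder base-model path
LORA_WEIGHTS = "path/to/lora-adapter"          # placeholder adapter path

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)

# device_map="auto" lets accelerate spread the fp16 weights across all four
# 10 GB cards instead of trying to fit everything on GPU 0.
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
)

# The ~26 MB LoRA adapter is attached on top of the already-sharded base model.
model = PeftModel.from_pretrained(model, LORA_WEIGHTS)
model.eval()
```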

jby20180901 commented 1 year ago

Because this codebase doesn't support multi-GPU inference... I'd also really like to know how to run inference on multiple GPUs.

Facico commented 1 year ago

Multi-GPU inference has been supported since the version from about a month ago.
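If you pull the latest version and still hit the error, a quick sanity check (continuing the sketch above) is to confirm that the layers were actually sharded instead of all landing on GPU 0; hf_device_map is populated by transformers/accelerate whenever a device_map is used:

```python
# Inspect which GPU each block of the model was assigned to.
print(model.hf_device_map)

# Per-GPU memory actually allocated by PyTorch tensors.
for i in range(torch.cuda.device_count()):
    gib = torch.cuda.memory_allocated(i) / 2**30
    print(f"GPU {i}: {gib:.2f} GiB allocated")
```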