Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-Following LLaMA-Based Model (a low-resource Chinese LLaMA + LoRA recipe, with a structure modeled on Alpaca)
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0

CUDA out of memory when running inference with llama-13b-hf #224

Open Bingohong opened 1 year ago

Bingohong commented 1 year ago

I'm trying to run inference with llama-13b-hf. The machine has four 10 GB 3080 cards, and the upstream model should only be about 18 GB, but the following error is raised while loading the LoRA model:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 10.00 GiB total capacity; 8.82 GiB already allocated; 0 bytes free; 8.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

After googling, I set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:24, but it had no effect. What could be causing this? The LoRA model is only 26 MB and I have 40 GB of VRAM in total, so there should be plenty of free memory, yet it still reports out of memory. Thanks!

System: Windows Server 2019
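The log shows everything being allocated on GPU 0, so the usual fix is to shard the base model across all visible GPUs rather than tuning the allocator. Below is a minimal sketch using the Hugging Face transformers + peft APIs, not this repository's own inference script; the model and adapter paths are placeholders:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

BASE_MODEL = "decapoda-research/llama-13b-hf"  # placeholder base-model path
LORA_WEIGHTS = "path/to/lora-adapter"          # placeholder adapter path

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)

# device_map="auto" lets accelerate spread the fp16 weights across all four
# 10 GB cards instead of trying to fit everything on GPU 0.
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
)

# The ~26 MB LoRA adapter is attached on top of the already-sharded base model.
model = PeftModel.from_pretrained(model, LORA_WEIGHTS)
model.eval()
```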

jby20180901 commented 1 year ago

Because this codebase doesn't support multi-GPU inference... I'd also really like to know how to run inference on multiple GPUs.

Facico commented 1 year ago

Multi-GPU inference has been supported since the version from about a month ago.
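If you pull the latest version and still hit the error, a quick sanity check (continuing the sketch above) is to confirm that the layers were actually sharded instead of all landing on GPU 0; hf_device_map is populated by transformers/accelerate whenever a device_map is used:

```python
# Inspect which GPU each block of the model was assigned to.
print(model.hf_device_map)

# Per-GPU memory actually allocated by PyTorch tensors.
for i in range(torch.cuda.device_count()):
    gib = torch.cuda.memory_allocated(i) / 2**30
    print(f"GPU {i}: {gib:.2f} GiB allocated")
```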