Cannot run inference on a single A100 - Githubissues

Huwei-deeplearning commented 10 months ago

Hello, after running

CUDA_VISIBLE_DEVICES=0 python pred.py --model chatglm3-6b-32k

GPU memory blew up. What is the minimum GPU memory requirement for the chatglm3-6b-32k model? Inference does not work on a single A100. The error message is:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 17.95 GiB (GPU 0; 79.35 GiB total capacity; 48.28 GiB already allocated; 11.78 GiB free; 66.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
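To make the figures in this error easier to compare, here is a minimal sketch that pulls them out of the message with regular expressions. The function name `parse_oom` and the dictionary keys are made up for illustration, and the patterns assume the message layout quoted above, which may differ in other PyTorch versions.

```python
import re

# The message text quoted in this issue.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 17.95 GiB "
    "(GPU 0; 79.35 GiB total capacity; 48.28 GiB already allocated; "
    "11.78 GiB free; 66.28 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Return the GiB figures of an OOM message as floats (hypothetical helper)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {key: float(re.search(pat, message).group(1)) for key, pat in patterns.items()}


stats = parse_oom(OOM_MESSAGE)
# Memory held by PyTorch's caching allocator but not used by live tensors.
cached_unused = stats["reserved"] - stats["allocated"]
# How far the single failed request exceeds the memory reported as free.
shortfall = stats["requested"] - stats["free"]
print(f"cached but unused: {cached_unused:.2f} GiB, shortfall: {shortfall:.2f} GiB")
```

Here the failed request is about 6 GiB larger than the free memory, while about 18 GiB is reserved but not allocated. That gap is why the message suggests `max_split_size_mb`. Whether fragmentation is the actual cause cannot be told from these numbers alone.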

bys0318 commented 10 months ago

Hello! Is your PyTorch version 2.0 or above? ChatGLM needs PyTorch 2.0 or later to enable FlashAttention automatically; without FlashAttention it will indeed run out of GPU memory.
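A quick way to check this condition is sketched below. The helper `meets_min_version` is hypothetical. The sketch also assumes that the fused attention path is reached through `torch.nn.functional.scaled_dot_product_attention`, which PyTorch added in 2.0; that is an assumption about the model code, not something stated in this thread.

```python
import re


def meets_min_version(version, minimum=(2, 0)):
    """True if a version string such as '2.0.1+cu118' is at least `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


try:
    import torch
except ImportError:  # the helper above still works without PyTorch installed
    torch = None

if torch is not None:
    print("torch version:", torch.__version__)
    print("version >= 2.0:", meets_min_version(torch.__version__))
    # Assumption: the model relies on this function for fused attention.
    print("has scaled_dot_product_attention:",
          hasattr(torch.nn.functional, "scaled_dot_product_attention"))
```

Run this in the same environment as `pred.py`, because a different virtual environment may have a different PyTorch build.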

IT-five commented 10 months ago

> Hello! Is your PyTorch version 2.0 or above? ChatGLM needs PyTorch 2.0 or later to enable FlashAttention automatically; without FlashAttention it will indeed run out of GPU memory.

When running baichuan-7b-chat, I removed the truncation logic from the code and used NTK-RoPE interpolation. Every run hits OOM when max_seq_len_cached is around 20000. My GPU is an A800, and in principle a length like this should not cause OOM, so why does it happen? I also installed xformers to reduce memory usage, but after installing it CUDA became unavailable. My dependencies are as follows:

bitsandbytes 0.41.1
open-clip-torch 2.20.0
peft 0.5.0
pytorch-lightning 1.7.7
pytorch-metric-learning 2.3.0
pytorch-wavelets 1.3.0
pytorch-wpe 0.0.1
pytorch3d 0.7.4
rotary-embedding-torch 0.3.0
sentencepiece 0.1.99
taming-transformers-rom1504 0.0.6
torch 2.0.1+cu118
torch-complex 0.4.3
torch-scatter 2.1.1
torchaudio 2.0.2+cu118
torchmetrics 0.11.4
torchsummary 1.5.1
torchvision 0.15.2+cu118
transformers 4.34.1
transformers-stream-generator 0.0.4

bys0318 commented 10 months ago

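Hello! The Baichuan modeling code currently appears to use xops.memory_efficient_attention to enable flash-attention. Without flash-attention, memory usage grows quadratically with sequence length, so inference at a length of 20k may indeed OOM. I suggest enabling flash-attention via from xformers import ops as xops.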
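The quadratic growth mentioned above can be illustrated with a rough estimate of the attention score tensor that a non-fused implementation materializes for one layer. The defaults are assumptions for illustration only: 32 attention heads, fp16 (2 bytes per element), and batch size 1. Real peak usage also depends on softmax temporaries, the KV cache, weights, and activations.

```python
GIB = 1024 ** 3


def attention_scores_gib(seq_len, num_heads=32, bytes_per_elem=2, batch=1):
    """Size in GiB of one layer's full (seq_len x seq_len) attention score tensor.

    num_heads=32 and bytes_per_elem=2 (fp16) are assumed values, not taken
    from the thread.
    """
    return batch * num_heads * seq_len * seq_len * bytes_per_elem / GIB


for n in (4_000, 10_000, 20_000):
    print(f"{n:>6} tokens: {attention_scores_gib(n):6.2f} GiB")
```

Under these assumptions the score tensor alone is roughly 24 GiB at 20,000 tokens, and doubling the length quadruples it. Memory-efficient attention kernels avoid materializing this tensor in full.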

THUDM / LongBench

Cannot run inference on a single A100 #44