QwenLM / Qwen2

Qwen2 is the large language model series developed by the Qwen team at Alibaba Cloud.

qwen1.5-7b-chat model runs out of memory on a 4090 #174

Closed · houliangxue closed this 3 months ago

houliangxue commented 6 months ago

Setup: RTX 4090, 24 GB VRAM. Loading code:

```python
self.model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="auto").eval()
self.tokenizer = AutoTokenizer.from_pretrained(model_path)
self.generation_config = GenerationConfig.from_pretrained(model_path, trust_remote_code=True)
```

When the input reaches about 6k tokens, it runs out of memory:

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.14 GiB (GPU 1; 23.65 GiB total capacity; 10.10 GiB already allocated; 276.50 MiB free; 10.58 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
```

I switched to the quantized version and hit the same problem. Is there any way to save memory?
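The traceback itself points at one allocator knob: setting `max_split_size_mb` in `PYTORCH_CUDA_ALLOC_CONF` can reduce fragmentation when reserved memory far exceeds allocated memory. A minimal sketch, assuming the variable is set before the first CUDA allocation (the value 128 is an arbitrary example, not a recommendation from this thread):

```python
import os

# Must be set before torch makes its first CUDA allocation;
# 128 MiB is an arbitrary example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the env var so the caching allocator picks it up

print(torch.cuda.is_available())
```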

bravelll commented 6 months ago

Setting torch_dtype to bfloat16 fixes it.
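A minimal sketch of that suggestion, assuming the public `Qwen/Qwen1.5-7B-Chat` checkpoint stands in for the local model_path:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Qwen/Qwen1.5-7B-Chat"

# Pass an explicit dtype instead of torch_dtype="auto".
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
```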

houliangxue commented 5 months ago

> Setting torch_dtype to bfloat16 fixes it.

Qwen1.5's config.json already defaults to bfloat16, and it still runs out of VRAM when the prompt is long.

vincentliang commented 4 months ago

I have 22 GB of VRAM and tried both 7B Chat and the 7B Chat 4-bit quantized model; both run out of memory once the input exceeds about 8,000 characters. How should I configure this?

jklj077 commented 3 months ago

`transformers` will only use the dtype in config.json if you pass `torch_dtype="auto"` to `from_pretrained`, so the original code should be fine.
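A minimal sketch of that behavior, assuming the `Qwen/Qwen1.5-7B-Chat` checkpoint (whose config.json specifies bfloat16):

```python
from transformers import AutoModelForCausalLM

# torch_dtype="auto" takes the dtype from config.json (bfloat16 for Qwen1.5).
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-7B-Chat", torch_dtype="auto")
print(model.dtype)  # torch.bfloat16

# Omitting torch_dtype falls back to torch.float32,
# roughly doubling the memory needed for the weights.
model_fp32 = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-7B-Chat")
print(model_fp32.dtype)  # torch.float32
```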

> torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.14 GiB (GPU 1; 23.65 GiB total capacity; 10.10 GiB already allocated; 276.50 MiB free; 10.58 GiB reserved in total by PyTorch)

This suggests that other processes were using your GPU memory as well: they held 23.65 - 10.58 = 13.07 GiB of it.
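One way to verify this, assuming a single process under your control: compare the driver's view of the device with PyTorch's own bookkeeping (nvidia-smi will list the other processes by PID).

```python
import torch

# Free vs. total device memory as seen by the CUDA driver (bytes).
free, total = torch.cuda.mem_get_info()
# Memory reserved by this process's caching allocator.
reserved = torch.cuda.memory_reserved()

print(f"total: {total / 2**30:.2f} GiB")
print(f"free: {free / 2**30:.2f} GiB")
print(f"reserved by this process: {reserved / 2**30:.2f} GiB")
# (total - free) - reserved approximates what other processes are holding.
```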

Also see the profiling results at https://qwen.readthedocs.io/en/latest/benchmark/hf_infer.html: Qwen1.5-7B-Chat-GPTQ-Int4 with 6144 input tokens and 2048 generated tokens takes about 16.16 GB of GPU memory.
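For reference, a minimal sketch of loading that quantized checkpoint (assuming the `optimum` and `auto-gptq` packages are installed, which `transformers` needs for GPTQ weights):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# 4-bit GPTQ weights; the KV cache is not quantized and still
# grows with sequence length, so very long prompts can still OOM.
model_id = "Qwen/Qwen1.5-7B-Chat-GPTQ-Int4"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto").eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

That the weights shrink while the KV cache does not is consistent with the reports above: the quantized model still runs out of memory once the prompt gets long enough.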