Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Tracking issue for SPHINX quantization & other memory issues #114

Open linziyi96 opened 8 months ago

linziyi96 commented 8 months ago

We have received several requests (#112, #110, #97) to run SPHINX inference on GPUs with smaller memory. We also believe that fitting the model within a 24GB memory budget benefits a broad range of users who would like to run it locally on commodity GPUs like the 3090 or 4090.

With the latest update #113, NF4 quantization should now run on SPHINX without errors (i.e., resolving #97). Memory usage is a bit less than 23GB, so the model should fit on a single 24GB GPU (3090, 4090 or A5000) even with ECC turned on.
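For readers curious what the NF4 path roughly looks like, here is a minimal sketch using bitsandbytes. This is not the exact code used in this repository; the helper function and recursive replacement below are illustrative only (the weights are actually packed to 4 bits when the module is moved to a CUDA device).

```python
# Minimal sketch of NF4 (4-bit NormalFloat) weight quantization with bitsandbytes.
# Illustrative only -- not the exact code path used by LLaMA2-Accessory.
import torch
import torch.nn as nn
import bitsandbytes as bnb


def quantize_linear_nf4(module: nn.Module) -> nn.Module:
    """Recursively replace nn.Linear layers with NF4-quantized equivalents."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            qlinear = bnb.nn.Linear4bit(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                compute_dtype=torch.float16,  # matmuls still run in fp16
                quant_type="nf4",             # 4-bit NormalFloat weights
            )
            # The weights are packed to 4 bits once the module is moved to CUDA.
            qlinear.weight = bnb.nn.Params4bit(
                child.weight.data, requires_grad=False, quant_type="nf4"
            )
            if child.bias is not None:
                qlinear.bias = child.bias
            setattr(module, name, qlinear)
        else:
            quantize_linear_nf4(child)
    return module


# Rough memory math: a 13B-parameter LM is ~26GB in fp16 but roughly 7-8GB in
# NF4 (4 bits per weight plus per-block scales), which is what brings the whole
# SPHINX model below the 24GB mark.
```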

[screenshot: measured GPU memory usage slightly under 23GB]

We are still running a complete benchmark of the quantized model and will post the latest results under this issue. Meanwhile, any questions are welcome :)

quizD commented 7 months ago

I used 4×16GB VRAM GPUs to run the Multi-GPU inference script from SPHINX/README.md, i.e. torchrun --master_port=1112 --nproc_per_node=2 inference.py, but I still got an OutOfMemoryError, and only 2 GPUs were being used at 100%. Is the multi-GPU launch command correct as it is now? I tried another way but it also failed.

File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 427, in __init__ self.layers = nn.ModuleList([Blip2EncoderLayer(config) for _ in range(config.num_hidden_layers)]) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 427, in <listcomp> self.layers = nn.ModuleList([Blip2EncoderLayer(config) for _ in range(config.num_hidden_layers)]) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 223, in __init__ self.self_attn = Blip2Attention(config) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 139, in __init__ self.qkv = nn.Linear(self.embed_dim, 3 * self.embed_dim, bias=False) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 96, in __init__ self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs)) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/torch/utils/_device.py", line 62, in __torch_function__ return func(*args, **kwargs) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.57 GiB total capacity; 14.47 GiB already allocated; 90.12 MiB free; 14.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Looking forward to your reply!

linziyi96 commented 7 months ago

> I used 4×16GB VRAM GPUs to run the Multi-GPU inference script from SPHINX/README.md (torchrun --master_port=1112 --nproc_per_node=2 inference.py), but I still got an OutOfMemoryError, and only 2 GPUs were being used at 100%. Is the multi-GPU launch command correct as it is now?

File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 427, in __init__ self.layers = nn.ModuleList([Blip2EncoderLayer(config) for _ in range(config.num_hidden_layers)]) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 427, in <listcomp> self.layers = nn.ModuleList([Blip2EncoderLayer(config) for _ in range(config.num_hidden_layers)]) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 223, in __init__ self.self_attn = Blip2Attention(config) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 139, in __init__ self.qkv = nn.Linear(self.embed_dim, 3 * self.embed_dim, bias=False) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 96, in __init__ self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs)) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/torch/utils/_device.py", line 62, in __torch_function__ return func(*args, **kwargs) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.57 GiB total capacity; 14.47 GiB already allocated; 90.12 MiB free; 14.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Looking forward to your reply!

With 16GB memory per GPU, SPHINX will need to run across all 4 GPUs (26GB for LM params, 6GB for visual params, 4GB for kv-cache, and 3GB for SAM, adding up to >32GB). We plan to add this support in the next few days (currently we only support running on 1 or 2 GPUs).
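As a rough sanity check of those numbers, here is a best-case back-of-the-envelope sketch that assumes every component can be split evenly across GPUs; real per-GPU usage is higher because of activations, the CUDA context, and fragmentation.

```python
# Best-case fp16 memory estimate per GPU, using the component sizes quoted above (GB).
lm_params, visual, kv_cache, sam = 26, 6, 4, 3
total = lm_params + visual + kv_cache + sam  # ~39 GB in total

for num_gpus in (1, 2, 4):
    per_gpu = total / num_gpus
    verdict = "may fit" if per_gpu < 16 else "will not fit"
    print(f"{num_gpus} x 16GB GPUs: >= {per_gpu:.1f} GB per GPU -> {verdict}")

# 1 x 16GB GPUs: >= 39.0 GB per GPU -> will not fit
# 2 x 16GB GPUs: >= 19.5 GB per GPU -> will not fit
# 4 x 16GB GPUs: >= 9.8 GB per GPU -> may fit
```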

linziyi96 commented 7 months ago

#116 fixes inference memory usage with image input.

We have now moved inference development to smaller (24/16GB) GPUs so that such errors no longer slip by on the large training GPUs.

linziyi96 commented 7 months ago

Ongoing developments as of 27 Nov:

- FP16 inference memory optimizations

If you have other feature requests about SPHINX inference, please feel free to reply under this issue.