Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

Tracking issue for SPHINX quantization & other memory issues #114

Open linziyi96 opened 8 months ago

linziyi96 commented 8 months ago

We have received several requests (#112, #110, #97) to run SPHINX inference on GPUs with smaller memory. We also believe that fitting the model within a 24GB memory budget benefits a broad range of users who would like to run it locally on commodity GPUs like the 3090 or 4090.

With the latest update #113, NF4 quantization should now run on SPHINX without errors (i.e., resolving #97). Memory usage is a bit less than 23GB, so the model should fit on a single 24GB GPU (3090, 4090 or A5000) even with ECC turned on.
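For readers curious what the NF4 path roughly looks like, here is a minimal sketch using bitsandbytes. This is not the exact code used in this repository; the helper function and recursive replacement below are illustrative only (the weights are actually packed to 4 bits when the module is moved to a CUDA device).

```python
# Minimal sketch of NF4 (4-bit NormalFloat) weight quantization with bitsandbytes.
# Illustrative only -- not the exact code path used by LLaMA2-Accessory.
import torch
import torch.nn as nn
import bitsandbytes as bnb


def quantize_linear_nf4(module: nn.Module) -> nn.Module:
    """Recursively replace nn.Linear layers with NF4-quantized equivalents."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            qlinear = bnb.nn.Linear4bit(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                compute_dtype=torch.float16,  # matmuls still run in fp16
                quant_type="nf4",             # 4-bit NormalFloat weights
            )
            # The weights are packed to 4 bits once the module is moved to CUDA.
            qlinear.weight = bnb.nn.Params4bit(
                child.weight.data, requires_grad=False, quant_type="nf4"
            )
            if child.bias is not None:
                qlinear.bias = child.bias
            setattr(module, name, qlinear)
        else:
            quantize_linear_nf4(child)
    return module


# Rough memory math: a 13B-parameter LM is ~26GB in fp16 but roughly 7-8GB in
# NF4 (4 bits per weight plus per-block scales), which is what brings the whole
# SPHINX model below the 24GB mark.
```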

[screenshot: measured GPU memory usage slightly under 23GB]

We are still running a complete benchmark of the quantized model and will post the latest results under this issue. Meanwhile, any questions are welcome :)

quizD commented 7 months ago

I used 4×16GB VRAM GPUs to run the Multi-GPU inference script from SPHINX/README.md, i.e. torchrun --master_port=1112 --nproc_per_node=2 inference.py, but I still got an OutOfMemoryError, and only 2 GPUs were being used at 100%. Is the multi-GPU launch command correct as it is now? I tried another way but it also failed.

File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 427, in __init__ self.layers = nn.ModuleList([Blip2EncoderLayer(config) for _ in range(config.num_hidden_layers)]) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 427, in <listcomp> self.layers = nn.ModuleList([Blip2EncoderLayer(config) for _ in range(config.num_hidden_layers)]) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 223, in __init__ self.self_attn = Blip2Attention(config) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 139, in __init__ self.qkv = nn.Linear(self.embed_dim, 3 * self.embed_dim, bias=False) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 96, in __init__ self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs)) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/torch/utils/_device.py", line 62, in __torch_function__ return func(*args, **kwargs) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.57 GiB total capacity; 14.47 GiB already allocated; 90.12 MiB free; 14.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Looking forward to your reply!

linziyi96 commented 7 months ago

> I used 4×16GB VRAM GPUs to run the Multi-GPU inference script from SPHINX/README.md (torchrun --master_port=1112 --nproc_per_node=2 inference.py), but I still got an OutOfMemoryError, and only 2 GPUs were being used at 100%. Is the multi-GPU launch command correct as it is now?

File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 427, in __init__ self.layers = nn.ModuleList([Blip2EncoderLayer(config) for _ in range(config.num_hidden_layers)]) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 427, in <listcomp> self.layers = nn.ModuleList([Blip2EncoderLayer(config) for _ in range(config.num_hidden_layers)]) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 223, in __init__ self.self_attn = Blip2Attention(config) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 139, in __init__ self.qkv = nn.Linear(self.embed_dim, 3 * self.embed_dim, bias=False) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 96, in __init__ self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs)) File "/xxxx/miniconda3/envs/accessory/lib/python3.10/site-packages/torch/utils/_device.py", line 62, in __torch_function__ return func(*args, **kwargs) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.57 GiB total capacity; 14.47 GiB already allocated; 90.12 MiB free; 14.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Looking forward to your reply!

With 16GB memory per GPU, SPHINX will need to run across all 4 GPUs (26GB for LM params, 6GB for visual params, 4GB for kv-cache, and 3GB for SAM, adding up to >32GB). We plan to add this support in the next few days (currently we only support running on 1 or 2 GPUs).
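As a rough sanity check of those numbers, here is a best-case back-of-the-envelope sketch that assumes every component can be split evenly across GPUs; real per-GPU usage is higher because of activations, the CUDA context, and fragmentation.

```python
# Best-case fp16 memory estimate per GPU, using the component sizes quoted above (GB).
lm_params, visual, kv_cache, sam = 26, 6, 4, 3
total = lm_params + visual + kv_cache + sam  # ~39 GB in total

for num_gpus in (1, 2, 4):
    per_gpu = total / num_gpus
    verdict = "may fit" if per_gpu < 16 else "will not fit"
    print(f"{num_gpus} x 16GB GPUs: >= {per_gpu:.1f} GB per GPU -> {verdict}")

# 1 x 16GB GPUs: >= 39.0 GB per GPU -> will not fit
# 2 x 16GB GPUs: >= 19.5 GB per GPU -> will not fit
# 4 x 16GB GPUs: >= 9.8 GB per GPU -> may fit
```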

linziyi96 commented 7 months ago

#116 fixes inference memory usage with image input.

We have now moved inference development to smaller (24/16GB) GPUs so that such errors no longer slip by on the large training GPUs.

linziyi96 commented 7 months ago

Ongoing developments as of 27 Nov:

- FP16 inference memory optimizations

If you have other feature requests about SPHINX inference, please feel free to reply under this issue.