I was attempting to run the "facebook/opt-13B" model on two A10 GPUs with 24 GB of memory each and ran into the out-of-memory error below.
Any thoughts or feedback are appreciated.
Using automatic config: tensor parallel config not provided and no custom config registered for the model
Traceback (most recent call last):
File "/home/onsi/jye/GitLab/codellm/test_tensor_parallel.py", line 7, in <module>
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"]) # <- each GPU has half the weights
File "/home/onsi/jye/anaconda3/envs/hf/lib/python3.10/site-packages/tensor_parallel/factory.py", line 61, in tensor_parallel
return TensorParallelPreTrainedModel(
File "/home/onsi/jye/anaconda3/envs/hf/lib/python3.10/site-packages/tensor_parallel/pretrained_model.py", line 57, in __init__
self.wrapped_model = TensorParallel(
File "/home/onsi/jye/anaconda3/envs/hf/lib/python3.10/site-packages/tensor_parallel/tensor_parallel.py", line 82, in __init__
shard, modified_parameters_names = make_shard(
File "/home/onsi/jye/anaconda3/envs/hf/lib/python3.10/site-packages/tensor_parallel/shard.py", line 56, in make_shard
modified_parameter_names = process_state_(shard, source_tensors, config, rank=rank, world_size=world_size)
File "/home/onsi/jye/anaconda3/envs/hf/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/onsi/jye/anaconda3/envs/hf/lib/python3.10/site-packages/tensor_parallel/shard.py", line 160, in process_state_
state.data = new_data.clone().detach().to(state.device).requires_grad_(state.requires_grad)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 5.00 MiB is free. Including non-PyTorch memory, this process has 21.97 GiB memory in use. Of the allocated memory 21.75 GiB is allocated by PyTorch, and 5.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
After increasing the number of A10s from two to three, the problem went away. I'm surprised that a 13B model consumes more than 48 GB (2 × 24 GB) of GPU memory.
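For what it's worth, the numbers may be less surprising once precision is accounted for: if the checkpoint is loaded in the default float32, the weights alone come to roughly 48 GiB, leaving two 24 GB A10s no headroom for activations, CUDA context, or the temporary copies made during sharding. A quick back-of-envelope check (illustrative arithmetic only, not taken from the library):

```python
# Raw weight memory for a 13B-parameter model at different precisions.
# Constants below are round figures chosen for illustration.

GIB = 1024 ** 3
N_PARAMS = 13_000_000_000

fp32_gib = N_PARAMS * 4 / GIB  # 4 bytes per float32 parameter
fp16_gib = N_PARAMS * 2 / GIB  # 2 bytes per float16 parameter

print(f"fp32 weights: {fp32_gib:.1f} GiB")  # ~48.4 GiB, about 2x A10 combined
print(f"fp16 weights: {fp16_gib:.1f} GiB")  # ~24.2 GiB, ~12 GiB per GPU when split
```

If that is what's happening here, passing `torch_dtype=torch.float16` to `from_pretrained` before calling `tp.tensor_parallel` should roughly halve the footprint, and it would also explain why the fp32 model fits once a third 24 GB GPU (72 GB total) is added.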