BlackSamorez / tensor_parallel

Automatically split your PyTorch models on multiple GPUs for training & inference
MIT License

Out of GPU memory for two A10 GPUs #126

Closed JunyiYe closed 11 months ago

JunyiYe commented 12 months ago

Hi,

I was attempting to run the "facebook/opt-13B" model on two A10 GPUs with 24 GB of memory each and ran into the issue below.

Any thoughts or feedback would be appreciated.

Using automatic config: tensor parallel config not provided and no custom config registered for the model
Traceback (most recent call last):
  File "/home/onsi/jye/GitLab/codellm/test_tensor_parallel.py", line 7, in <module>
    model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])  # <- each GPU has half the weights
  File "/home/onsi/jye/anaconda3/envs/hf/lib/python3.10/site-packages/tensor_parallel/factory.py", line 61, in tensor_parallel
    return TensorParallelPreTrainedModel(
  File "/home/onsi/jye/anaconda3/envs/hf/lib/python3.10/site-packages/tensor_parallel/pretrained_model.py", line 57, in __init__
    self.wrapped_model = TensorParallel(
  File "/home/onsi/jye/anaconda3/envs/hf/lib/python3.10/site-packages/tensor_parallel/tensor_parallel.py", line 82, in __init__
    shard, modified_parameters_names = make_shard(
  File "/home/onsi/jye/anaconda3/envs/hf/lib/python3.10/site-packages/tensor_parallel/shard.py", line 56, in make_shard
    modified_parameter_names = process_state_(shard, source_tensors, config, rank=rank, world_size=world_size)
  File "/home/onsi/jye/anaconda3/envs/hf/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/onsi/jye/anaconda3/envs/hf/lib/python3.10/site-packages/tensor_parallel/shard.py", line 160, in process_state_
    state.data = new_data.clone().detach().to(state.device).requires_grad_(state.requires_grad)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 0 has a total capacty of 21.99 GiB of which 5.00 MiB is free. Including non-PyTorch memory, this process has 21.97 GiB memory in use. Of the allocated memory 21.75 GiB is allocated by PyTorch, and 5.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
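For context, a minimal script matching the traceback would look roughly like the sketch below. The loading code from test_tensor_parallel.py is not shown in the issue, so the from_pretrained call and the dtype are assumptions; only the tp.tensor_parallel line is taken from the traceback itself.

```python
# Hypothetical reproduction sketch, not the exact contents of test_tensor_parallel.py.
import tensor_parallel as tp
from transformers import AutoModelForCausalLM

# Assumption: the checkpoint is first loaded on CPU in the default dtype
# (fp32 unless torch_dtype is overridden), then sharded across the two GPUs.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-13b")
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])  # <- each GPU has half the weights
```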

JunyiYe commented 11 months ago

After increasing the number of A10s from 2 to 3, the problem disappeared. It's surprising that a 13B model consumes more than 48 GB (24*2) of GPU memory.
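A likely explanation (an editorial note, not from the original thread): if the checkpoint is loaded in full precision, 13B parameters × 4 bytes ≈ 52 GB for the weights alone, which already exceeds the combined 48 GB of two A10s, and the sharding step in shard.py additionally clones tensors while moving them, raising the peak further. Loading the model in fp16 before calling tp.tensor_parallel roughly halves the weight footprint to about 26 GB; a sketch, assuming fp16 weights are acceptable for inference:

```python
# Lower-memory variant (sketch; torch_dtype choice is an assumption, not from the issue).
import torch
import tensor_parallel as tp
from transformers import AutoModelForCausalLM

# ~13e9 params * 2 bytes (fp16) ≈ 26 GB of weights, vs ≈ 52 GB in fp32.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-13b", torch_dtype=torch.float16)
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])
```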