Linear95 / SPAG

Self-playing Adversarial Language Game Enhances LLM Reasoning, NeurIPS 2024
Apache License 2.0

Running on two 3090s? #9

Open vijetadeshpande opened 2 months ago

vijetadeshpande commented 2 months ago

Hi authors, I do not have access to solid hardware. What I have for now is two 3090s (24GB each). I am planning to run/debug the code with this setup and then move the experiments to A100s. On these two 3090s, I have CUDA 12.4. Torch version 2.0.0 (pinned in the requirements) does not support CUDA 12.x. I found that torch 2.1.1 supports CUDA 12.1. This is the only change I have made to the requirements; otherwise the setup is as suggested.
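
As a side note, you can check which CUDA runtime a torch wheel was built against and whether it should be accepted by your system driver. A minimal sketch (the parsing helpers below are hypothetical, not part of this repo; the version numbers are the ones from this thread, and CUDA builds are generally forward-compatible within a major version, as the log later in this thread also notes):

```python
# Hypothetical helpers: compare a torch wheel's CUDA tag with the system CUDA version.

def wheel_cuda_version(torch_version: str) -> str:
    """Extract the CUDA tag from a wheel version, e.g. '2.1.1+cu121' -> '12.1'."""
    if "+cu" not in torch_version:
        return ""  # CPU-only build
    tag = torch_version.split("+cu", 1)[1]  # '121'
    return f"{tag[:-1]}.{tag[-1]}"          # '121' -> '12.1'

def majors_match(wheel_cuda: str, system_cuda: str) -> bool:
    """CUDA APIs are compatible within the same major version (12.x with 12.x)."""
    return wheel_cuda.split(".")[0] == system_cuda.split(".")[0]

print(wheel_cuda_version("2.1.1+cu121"))  # -> 12.1
print(majors_match("12.1", "12.4"))       # -> True: a cu121 wheel runs under CUDA 12.4
print(majors_match("11.8", "12.4"))       # -> False
```

In a live environment, `torch.__version__` and `torch.version.cuda` give you the installed build's actual values.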

When I run

torchrun --nproc_per_node=2 --master_port=6000 train.py ...

the code gets stuck at the following step:

LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-7b-hf', vocab_size=32000, model_max_length=32, is_fast=True, padding_side='right', truncation_side='left', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False), added_tokens_decoder={
        0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
/mnt/shared_home/vdeshpande/miniconda3/envs/env_spag/lib/python3.9/site-packages/accelerate/accelerator.py:457: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
  warnings.warn(
Installed CUDA version 12.2 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /mnt/shared_home/vdeshpande/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...                                               
Installed CUDA version 12.2 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /mnt/shared_home/vdeshpande/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...                                               
Detected CUDA files, patching ldflags                                                                                                            
Emitting ninja build file /mnt/shared_home/vdeshpande/.cache/torch_extensions/py39_cu121/cpu_adam/build.ninja...                                 
Building extension module cpu_adam...                                                                                                            
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)                                
ninja: no work to do.               
Loading extension module cpu_adam...                                    
Loading extension module cpu_adam...                                    
Time to load cpu_adam op: 3.3634016513824463 seconds                    
Time to load cpu_adam op: 3.0814285278320312 seconds                    
Parameter Offload: Total persistent parameters: 532480 in 130 params
[2024-09-04 15:34:07,215] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 50337 closing signal SIGTERM
[2024-09-04 15:34:22,297] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 50336) of binary: /mnt/shared_home/vdeshpande/miniconda3/envs/env_spag/bin/python

Any insights on resolving this issue?

Linear95 commented 2 months ago

It looks like your CUDA, accelerate, and PyTorch versions are incompatible. Also note that exitcode -9 means the process was killed with SIGKILL, which usually indicates running out of memory (here possibly CPU RAM, since DeepSpeed is offloading the optimizer to cpu_adam).

Also, I'm not sure whether a 3090 can run the training... A possible solution might be to use LoRA to train your model.
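
For context on why LoRA helps here: instead of updating the full weight matrix W, LoRA freezes W and trains a low-rank update B·A, so only r·(d_in + d_out) parameters need gradients and optimizer states, which is what makes 24GB cards plausible. A minimal NumPy sketch of the idea (the shapes and rank are illustrative, not the repo's settings; in practice you would use a library such as Hugging Face `peft` rather than this toy code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 4096, 4096, 8   # illustrative: one Llama-2-7B-sized projection, rank-8 LoRA

W = rng.standard_normal((d_in, d_out))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init so the update starts at 0
alpha = 16.0                               # LoRA scaling hyperparameter

def lora_forward(x):
    # y = x W + (alpha / r) * x A^T B^T  -- only A and B would receive gradients
    return x @ W + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((2, d_in))
y = lora_forward(x)
print(y.shape)  # (2, 4096)

full_params = d_in * d_out
lora_params = r * (d_in + d_out)
print(f"trainable: {lora_params} of {full_params} ({100 * lora_params / full_params:.2f}%)")
```

For this single matrix the trainable fraction is well under 1%, and the same ratio holds across the model's attention projections, which is where most of the optimizer-state memory goes.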