Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Why is only one GPU being used in the Kaggle kernel? #20424

Open KeesariVigneshwarReddy opened 1 week ago

KeesariVigneshwarReddy commented 1 week ago

Bug description

[Screenshot 2024-11-16 201845: GPU utilization during training]

I initialized my trainer as follows:

```python
import lightning as L
from lightning.pytorch.callbacks import DeviceStatsMonitor, StochasticWeightAveraging

trainer = L.Trainer(max_epochs=5,
                    devices=2,
                    strategy='ddp_notebook',
                    num_sanity_val_steps=0,
                    profiler='simple',
                    default_root_dir="/kaggle/working",
                    callbacks=[DeviceStatsMonitor(),
                               StochasticWeightAveraging(swa_lrs=1e-2),
                               #EarlyStopping(monitor='train_Loss', min_delta=0.001, patience=100, verbose=False, mode='min'),
                              ],
                    enable_progress_bar=True,
                    enable_model_summary=True,
                   )
```

Distributed training is initialized on both GPUs, but only one of them is actually being utilized.

Also, during the validation loop the GPUs are not being used at all.

[Screenshot 2024-11-16 202120: GPU utilization during the validation loop]

How can I resolve this so that both GPUs are used and training is faster?

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment

```
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
```

More info

No response

lantiga commented 1 week ago

It actually looks like both GPUs are being used.

The difference between the two utilization readings may be that one process is CPU-bound (e.g. rank 0 doing logging) while the other isn't. The model seems really small, so CPU-side operations essentially dominate.
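
One quick way to confirm that both ranks really are running on separate devices (a minimal sketch; `MyModel` and the hook body are illustrative, not taken from this report) is to print the rank and assigned device from a training hook:

```python
import lightning as L

class MyModel(L.LightningModule):
    # Hypothetical module standing in for the reporter's model.
    def on_train_start(self):
        # Each DDP process prints its own global rank and the CUDA device it
        # was assigned, e.g. "rank=0 device=cuda:0" and "rank=1 device=cuda:1".
        print(f"rank={self.global_rank} device={self.device}")
```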

I suggest you increase the size of the model, or the size of the batch, to bring the actual utilization up.
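
For example, a larger per-device batch gives each GPU more work per step relative to the CPU-side overhead (a sketch; `train_dataset` and the exact numbers are placeholders, not from the original report):

```python
from torch.utils.data import DataLoader

# train_dataset is a placeholder for the reporter's dataset.
# A bigger batch_size plus a few workers keeps both GPUs busier per step.
train_loader = DataLoader(train_dataset, batch_size=256, num_workers=4,
                          pin_memory=True, shuffle=True)
```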

You can also verify this by passing `barebones=True` to the Trainer: this minimizes non-model-related operations, and the two GPUs will probably look more similar.
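
Something along these lines would make that comparison (a sketch that keeps only the reporter's settings compatible with barebones mode; `model` and `train_loader` are placeholders):

```python
import lightning as L

# barebones=True turns off loggers, checkpointing, the progress bar and other
# Trainer-side extras, so the utilization you see is dominated by the model's
# own compute rather than bookkeeping on rank 0.
trainer = L.Trainer(max_epochs=5,
                    devices=2,
                    strategy="ddp_notebook",
                    barebones=True)
trainer.fit(model, train_loader)  # model / train_loader: placeholders
```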