Open KeesariVigneshwarReddy opened 1 week ago
It actually looks like both GPUs are being used.
The difference between the two utilization readings may be that one process is CPU-bound (e.g. rank 0 doing logging) while the other isn't. The model seems really small, so CPU operations essentially dominate.
I suggest you increase the size of the model, or the size of the batch, to bring the actual utilization up.
You can also verify this by passing `barebones=True` to the `Trainer`: this should minimize non-model-related operations, and the two GPUs will probably look more similar.
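A minimal sketch of that check, assuming the `lightning` 2.x package layout; the commented-out model and dataloader are placeholders, not code from this issue:

```python
import lightning as L

# Diagnostic Trainer configuration: barebones=True disables logging, checkpointing,
# the progress bar, and other non-model bookkeeping, so any imbalance caused by
# rank-0-only work should shrink.
trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp",
    barebones=True,
)

# Raising the per-device batch size (and/or the model size) increases the GPU work
# per step, which should also push utilization up on both devices.
# trainer.fit(model, train_dataloaders=train_loader)
```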
Bug description
I initialized my Trainer with 2 GPUs.
Distributed training is initialized on both GPUs, but only one of them is actually being hit.
Also, during the validation loop the GPUs are not in use.
How can I resolve this so that both GPUs are used and my training speeds up?
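No reproduction snippet was attached, so the following is only a hypothetical sketch of the kind of 2-GPU DDP setup described above; `ToyModel` and the random dataset are made-up placeholders, not the reporter's code:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning as L

# Hypothetical stand-in for the actual model (the real code was not posted).
class ToyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

if __name__ == "__main__":  # guard required because DDP relaunches the script per device
    data = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    loader = DataLoader(data, batch_size=64, num_workers=2)

    trainer = L.Trainer(
        accelerator="gpu",
        devices=2,        # both GPUs are visible; DDP spawns one process per device
        strategy="ddp",
        max_epochs=2,
    )
    trainer.fit(ToyModel(), train_dataloaders=loader, val_dataloaders=loader)
```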
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
```
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
```
More info
No response