BRIO-lab / LitJTML

Using PyTorch Lightning and WandbLogger for our JTML neural network segmentation code

Batch device print statement shows only device 0 on multi-GPU setup #11

Open sasank-desaraju opened 1 year ago

sasank-desaraju commented 1 year ago

When running on HPG, the print output that reads "Training batch is on device _" only ever shows "device 0". Are we missing computations on the other GPU (i.e. "device 1"), which the program reports is present at startup, or is the second GPU simply not being used during training for some reason?
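One possible explanation (an assumption about the training setup, not confirmed from this repo's code): under DDP, Lightning launches one process per GPU and a `DistributedSampler`-style split gives each rank a disjoint slice of the data, so a `print` inside `training_step` only ever reports that one process's device, and the captured console output is typically rank 0's. A minimal sketch of that round-robin sharding, with a hypothetical helper name:

```python
def shard_indices(dataset_len: int, world_size: int, rank: int) -> list[int]:
    """Round-robin sharding in the style of torch's DistributedSampler (no shuffle).

    Each rank processes only its own slice, so a device print inside that
    process can only ever name that rank's GPU.
    """
    return list(range(rank, dataset_len, world_size))

# With 2 GPUs, an 8-sample dataset splits like this:
print(shard_indices(8, world_size=2, rank=0))  # → [0, 2, 4, 6] (handled on cuda:0)
print(shard_indices(8, world_size=2, rank=1))  # → [1, 3, 5, 7] (handled on cuda:1)
```

If that's what is happening, both GPUs are doing work; only one process's stdout is being read.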

sasank-desaraju commented 1 year ago

Oh what lol. In Wandb, which is where I was checking, there are actually two runs going on. The second one has less information in its logs, but it does say which device the operations are on: both training and validation are on "device 1", the second GPU. Cool! (I think).
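Two Wandb runs appearing at once is a common symptom of each DDP worker calling `wandb.init` itself; Lightning's WandbLogger normally guards this so only the rank-0 process creates a run. A hedged sketch of such a guard (the environment-variable names are what torchrun/Lightning launchers typically export in each worker; treating that as an assumption here):

```python
import os

def is_global_rank_zero() -> bool:
    # torchrun exports RANK; Lightning's DDP launcher also sets
    # LOCAL_RANK / GLOBAL_RANK in worker environments (assumption).
    # If none are set, we're in a single-process run and count as rank 0.
    for var in ("GLOBAL_RANK", "RANK", "LOCAL_RANK"):
        if var in os.environ:
            return int(os.environ[var]) == 0
    return True

# Guarding run creation avoids one Wandb run per GPU, e.g.:
# if is_global_rank_zero():
#     wandb.init(project="LitJTML")
```

If the second run is coming from WandbLogger itself rather than a manual `wandb.init`, this guard wouldn't apply; worth checking where the runs are created.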

sasank-desaraju commented 1 year ago

Okay, now I can't find those initial logs, but both runs just show that all batches are on device 1. A question for another day...
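For when that day comes: prefixing each device print with the worker's rank would make the logs unambiguous, showing whether device 0's process is silent or its output is just being interleaved or dropped. A small sketch, assuming the `LOCAL_RANK` environment variable that Lightning's and torchrun's DDP launchers set per worker:

```python
import os

def rank_tagged(msg: str) -> str:
    # LOCAL_RANK is set per worker by the DDP launcher; defaulting to 0
    # for single-process runs is an assumption for this sketch.
    rank = os.environ.get("LOCAL_RANK", "0")
    return f"[rank {rank}] {msg}"

# e.g. inside training_step:
# print(rank_tagged(f"Training batch is on device {batch.device}"))
```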