Closed: adamcatto closed this issue 2 years ago
Hey @adamcatto
This happens when CUDA functions are called in the main process before Trainer.fit() is invoked, and then called again somewhere inside trainer.fit(). Please check that your code is not calling .to("cuda") or similar device operations when using the TPU accelerator. I see some device logic in that optimizer code.
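To illustrate the point above, here is a minimal sketch of the kind of device logic to avoid in an optimizer or module when training on TPU. The function names are hypothetical; the idea is to derive the device from an existing tensor rather than hard-coding "cuda":

```python
import torch

def make_buffer_bad(shape):
    # Hard-codes CUDA: this breaks (or silently diverges) when the
    # model actually lives on an XLA/TPU device.
    return torch.zeros(shape).to("cuda" if torch.cuda.is_available() else "cpu")

def make_buffer_good(shape, reference: torch.Tensor):
    # Derive the device from an existing tensor (e.g. a parameter or
    # gradient); this works the same for CPU, CUDA, and XLA tensors.
    return torch.zeros(shape, device=reference.device)
```

The second form is what the optimizer code should do wherever it creates fresh tensors.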
And if that doesn't help, would you mind sharing a reproducible snippet of code? If you don't mind, you could also share a link to a copy of your Colab directly.
Thanks @awaelchli. It turned out that assigning device = g_norm.get_device() did not retain the XLA device information, so initializing tensors with that device produced CUDA tensors. I changed the assignment to device = g_norm.device and it worked. Strangely enough, removing the device calls and initializing the tensor without a device resulted in a device mismatch, which I would have thought pytorch-lightning would handle. I wonder what the issue there is...
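The difference between the two accessors is the likely culprit: Tensor.get_device() returns only an integer ordinal (in recent PyTorch, -1 for CPU tensors), so the backend type (XLA vs. CUDA) is lost, and rebuilding a device from a bare ordinal falls back to CUDA semantics. Tensor.device returns a full torch.device that round-trips correctly. A small sketch on CPU, assuming a recent PyTorch where get_device() does not raise for CPU tensors:

```python
import torch

t = torch.zeros(3)  # CPU tensor for illustration

# Only an integer ordinal; the device *type* (cpu/cuda/xla) is discarded.
ordinal = t.get_device()

# A full torch.device (type + index); safe to reuse when creating tensors.
dev = t.device
new_t = torch.ones(3, device=dev)  # lands on the same device as t
```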
Lightning can only automatically handle moving data from the dataloader to the device and placing the model weights on the device. Any other tensors you create, you need to move yourself. You can use the self.device
attribute on the LightningModule if you need the device information.
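As a sketch of that pattern: inside a LightningModule you would write torch.randn(..., device=self.device); the equivalent in plain PyTorch (shown here so the snippet runs without Lightning installed) is to take the device of an existing parameter. The module below is hypothetical:

```python
import torch
from torch import nn

class NoisyLinear(nn.Module):
    # Pattern: create new tensors on the module's own device.
    # In a LightningModule, self.device plays the role of the
    # parameter-derived device used here.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        dev = self.linear.weight.device  # self.device in a LightningModule
        noise = torch.randn(x.shape, device=dev)  # created on the right device
        return self.linear(x + noise)
```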
🐛 Bug
I am attempting to train a model on a Colab TPU, and got this error:
I am not sure what's going on here, as no CUDA devices are present!
From the error message it appears there is an issue with the optimizer step. I am using a custom optimizer, so I will paste the optimizer code here:
I am wondering if the CUDA/Nvidia environment variables have something to do with this (see the output of
os.environ
in the Environment section)? Do I somehow need to remove CUDA-related information from the environment?

Environment

Output of
collect_env_details.py
script:

Additional details:
os.environ
cc @kaushikb11 @rohitgr7 @akihironitta