This link will be helpful: https://lightning.ai/docs/pytorch/stable/accelerators/tpu.html. Even though it is for the PyTorch Lightning Trainer, most of the information also applies to Fabric.
We should update it with any extra clarifications.
As far as I know, there are no TPU-specific docs for Fabric yet.
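Until those exist, something along these lines should be the rough shape of a single-host TPU script with Fabric (a minimal, untested sketch; the model and training step are placeholders, not taken from the linked docs):

```python
# Minimal sketch of a single-host TPU run with Fabric (untested here;
# the model, data, and train() body are placeholders).
import torch
import lightning as L


def train(fabric):
    model = torch.nn.Linear(32, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    model, optimizer = fabric.setup(model, optimizer)

    batch = torch.randn(4, 32, device=fabric.device)
    loss = model(batch).sum()
    fabric.backward(loss)
    optimizer.step()


if __name__ == "__main__":
    fabric = L.Fabric(accelerator="tpu", devices=8, precision="bf16-mixed")
    fabric.launch(train)
```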
Is TPU Pod supported by Lightning Fabric right now? It would be nice to have some examples.
Currently, when launching a job on a TPU Pod (v3-64) with the following command (run on each of the 8 host VMs at the same time), the world size remains 8 instead of 64:

```python
fabric = L.Fabric(accelerator='tpu', precision='bf16-mixed', num_nodes=8, devices=8)
```
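For reference, a minimal sketch of how I observe this on each host (the check function is only illustrative):

```python
# Run on every host VM: with num_nodes=8 and 8 devices per host I would
# expect world_size == 64, but it currently reports 8.
import lightning as L


def check(fabric):
    fabric.print(f"world size: {fabric.world_size}")  # expected 64, observed 8


fabric = L.Fabric(accelerator="tpu", precision="bf16-mixed", num_nodes=8, devices=8)
fabric.launch(check)
```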
I haven't tried it, but you still need to launch the command on each host separately by running it through `gcloud compute tpus tpu-vm ssh ... --worker=all`: https://cloud.google.com/tpu/docs/v4-users-guide#resnet-pytorch-pod
This applies to both Fabric and the Trainer
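Roughly, the launch from that guide looks like this (TPU name, zone, and script path are placeholders):

```bash
# Run the same command on all host VMs of the pod at once.
# TPU name, zone, and script path are placeholders.
gcloud compute tpus tpu-vm ssh my-tpu-pod \
  --zone=us-central1-a \
  --worker=all \
  --command="python3 train.py"
```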
@Xingyu-Lin did you follow this guide to start working on TPU VMs with `xla_dist`? You would need to set up the code on all TPU VMs in the TPU Pod, then SSH to worker 0 and issue the command through the `xla_dist` module.
We are working on PJRT support in Lightning, which would make @carmocca's suggestion above work, but for now, using the `xla_dist` module is required to start TPU Pod jobs.
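For reference, a rough sketch of what starting a pod job through `xla_dist` looks like from worker 0 (the TPU name and script path are placeholders, and the exact flags depend on the installed torch_xla version):

```bash
# From worker 0: start the pod job through the xla_dist module.
# TPU name and script path are placeholders; flags may differ across
# torch_xla versions.
python3 -m torch_xla.distributed.xla_dist \
  --tpu=my-tpu-pod \
  --restart-tpuvm-pod-server \
  -- python3 train.py
```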
@awaelchli Let's close this in favor of https://github.com/Lightning-AI/lightning/issues/17492?
📚 Documentation
The new Lightning Fabric is really nice! However, I am having issues running it on TPUs and TPU Pods.
A few specific questions that I hope can be answered here (and ideally also in the documentation):
@Liyang90 Would be great if you could help here. Thanks a lot!
cc @borda @carmocca @JackCaoG @steventk-g @Liyang90 @justusschock @awaelchli