Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Examples of using lightning fabric on TPU #17224

Closed · Xingyu-Lin closed this issue 1 year ago

Xingyu-Lin commented 1 year ago

📚 Documentation

The new Lightning Fabric is really nice! However, I am having issues running it on TPUs and TPU Pods.

A few specific questions that I hope can be answered here (and ideally also in the documentation):

  1. What are num_devices and world_size for TPU? On a single 8-device TPU, should num_devices be 8 or 1?
  2. Assuming num_devices means the number of TPU chips, what should I use for a TPU Pod with 64 chips? It seems that num_devices is limited to the range [0, 8].
  3. Could you provide examples of scripts for distributed training on TPU and TPU pod?

@Liyang90 Would be great if you could help here. Thanks a lot!

cc @borda @carmocca @JackCaoG @steventk-g @Liyang90 @justusschock @awaelchli

carmocca commented 1 year ago

This link will be helpful: https://lightning.ai/docs/pytorch/stable/accelerators/tpu.html. Even though it is for the PyTorch Lightning Trainer, most of the information also applies to Fabric.

We should update it with any extra clarifications.

As far as I know, there are no TPU-specific docs for Fabric yet.
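
For a single TPU host, a minimal Fabric program in the spirit of those docs might look like the sketch below. This is an illustration, not an official example: it assumes the public L.Fabric constructor (where the per-host device count is passed as devices), and the model, data, and loop are placeholders.

    import torch
    import lightning as L

    def train(fabric):
        # Toy model and optimizer; placeholders for a real setup.
        model = torch.nn.Linear(32, 2)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        model, optimizer = fabric.setup(model, optimizer)

        for _ in range(10):
            # Random data as a stand-in; fabric.device is the XLA device here.
            x = torch.randn(8, 32, device=fabric.device)
            loss = model(x).sum()
            optimizer.zero_grad()
            fabric.backward(loss)  # use fabric.backward instead of loss.backward()
            optimizer.step()

    # On a single v3-8 host, devices=8 runs one process per TPU core.
    fabric = L.Fabric(accelerator="tpu", devices=8, precision="bf16-mixed")
    fabric.launch(train)

On this reading, the answer to question 1 above would be 8 devices, giving a world size of 8 on a single host.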

Xingyu-Lin commented 1 year ago

Are TPU Pods supported by Lightning Fabric right now? It would be nice to have some examples.

Currently, when launching a job on a TPU Pod (v3-64) with the following call (run on each of the 8 host VMs at the same time), the world size remains 8 (instead of 64):

    fabric = L.Fabric(accelerator='tpu', precision='bf16-mixed', num_nodes=8, num_devices=8)

carmocca commented 1 year ago

I haven't tried it, but you still need to launch the command on each host by passing --worker=all to gcloud compute tpus tpu-vm ssh ...: https://cloud.google.com/tpu/docs/v4-users-guide#resnet-pytorch-pod

This applies to both Fabric and the Trainer.
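
Concretely, such a launch might look like the following sketch. The TPU name, zone, and script path are placeholders; the point is the --worker=all flag, which runs the command on every host VM in the Pod slice.

    gcloud compute tpus tpu-vm ssh my-tpu-pod \
      --zone=us-central1-a \
      --worker=all \
      --command="cd ~/project && python3 train.py"

Once the command runs on all 8 hosts, a Fabric created with num_nodes=8 and 8 devices per host should report a world size of 64.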

Liyang90 commented 1 year ago

@Xingyu-Lin did you follow this guide to start the job on TPU VMs with xla_dist? You would need to set up the code on all TPU VMs in the TPU Pod, then SSH into worker 0 and issue the command through the xla_dist module.

We are working on PJRT support in Lightning, which would make @carmocca's suggestion above work, but for now, the xla_dist module is required to start TPU Pod jobs.
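
For reference, a pre-PJRT xla_dist launch from worker 0 looks roughly like this sketch, following the Cloud TPU PyTorch pod docs; the TPU name and training script are placeholders.

    # Run from worker 0 of the TPU Pod, after the code has been
    # set up on every TPU VM in the Pod.
    python3 -m torch_xla.distributed.xla_dist \
      --tpu=my-tpu-pod \
      --restart-tpuvm-pod-server \
      -- python3 train.py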

carmocca commented 1 year ago

@awaelchli Let's close this in favor of https://github.com/Lightning-AI/lightning/issues/17492?