dstackai / dstack

dstack is an easy-to-use and flexible container orchestrator for running AI workloads in any cloud or data center.
https://dstack.ai
Mozilla Public License 2.0

Support multi-device TPU Pods #1337

Open r4victor opened 3 weeks ago

r4victor commented 3 weeks ago

#1323 added support for single-device TPU Pods. Multi-device TPU Pods are not yet supported because running multi-node tasks on them may require changes to dstack.

Currently, dstack runs the different jobs of a multi-node task on different instances. To run multi-node tasks on TPU Pods, we can create an instance for each device in the Pod. The possible downside is a suboptimal Pod management UX: users won't see TPU Pods in pools; instead, each TPU Pod device will appear as a separate instance. This can be mitigated by introducing a cluster concept to dstack.

Another solution would be to have one InstanceModel per TPU Pod but make it possible to run multiple jobs on such an instance simultaneously. This would require no changes to the dstack interface but may lead to significant internal refactoring.

r4victor commented 4 days ago

The most promising solution at the moment seems to be the instance-per-TPU-device model. Provisioning a multi-device TPU Pod creates an instance for each TPU device. For example, provisioning TPU v2-32 would create four TPU v2 instances. `dstack pool ps` shows each TPU device as a separate instance, but instances can be grouped, e.g. by TPU Pod name. Users delete a TPU Pod by specifying its name (or the name of any instance in the Pod).
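The grouping idea can be sketched roughly as follows. This is a hypothetical illustration, not dstack's actual models: the `Instance` type, the `pod_name` field, and the `<pod>-<index>` naming scheme are all assumptions.

```python
# Sketch: grouping per-device instances by TPU Pod name for a
# `dstack pool ps`-style listing. All names and types here are
# hypothetical, not part of dstack's real codebase.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Instance:
    name: str      # one instance per TPU device, e.g. "v2-32-pod-0"
    pod_name: str  # shared TPU Pod name, e.g. "v2-32-pod"


def group_by_pod(instances):
    """Group per-device instances under their TPU Pod name."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst.pod_name].append(inst)
    return dict(groups)


# A v2-32 Pod provisioned as four per-device instances:
instances = [Instance(f"v2-32-pod-{i}", "v2-32-pod") for i in range(4)]
grouped = group_by_pod(instances)
print(sorted(grouped), len(grouped["v2-32-pod"]))
```

Deleting by Pod name then amounts to resolving either a Pod name or any member instance name to one group and terminating every instance in it.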

To support multi-node TPU tasks, dstack can determine the number of nodes automatically from the number of devices in the TPU Pod. For example, if a user specifies tpu-v2-32, dstack will run four jobs. An alternative would be to ask only for a TPU type like tpu-v2 and determine the number of cores from the number of jobs. The downside of the latter is that users won't be able to specify an arbitrary number in `nodes`, so they'd have to calculate it from the TPU configuration they want to run.
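The first option (deriving the job count from the TPU name) can be sketched as below. This is a minimal, hypothetical helper, assuming TPU v2/v3 Pods with 8 TensorCores per host; other TPU generations count cores per host differently, and the function name is not part of dstack.

```python
# Hedged sketch: derive the number of multi-node jobs from a TPU
# accelerator name like "tpu-v2-32". Assumes v2/v3 Pods, where each
# host exposes 8 TensorCores (4 chips x 2 cores).
CORES_PER_HOST = {"v2": 8, "v3": 8}


def num_jobs(tpu_name: str) -> int:
    # "tpu-v2-32" -> version "v2", 32 cores in total
    _, version, cores = tpu_name.split("-")
    per_host = CORES_PER_HOST[version]
    total = int(cores)
    if total % per_host:
        raise ValueError(f"{tpu_name}: not a multiple of {per_host} cores")
    return total // per_host


print(num_jobs("tpu-v2-32"))  # -> 4, one job per TPU v2 host
```

The inverse mapping (jobs to cores) is the same arithmetic in reverse, which is why the alternative design forces users to do this calculation themselves.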

Implementation details: