Open r4victor opened 3 weeks ago
The most promising solution at the moment seems to be the instance-per-TPU-device model. Provisioning a multi-device TPU Pod creates an instance for each TPU device. For example, provisioning TPU v2-32 would create four TPU v2 instances. `dstack pool ps` shows each TPU device as a separate instance, but instances can be grouped, e.g. by TPU Pod name. Users delete a TPU Pod by specifying its name (the name of any instance in the Pod also works).
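To make the instance-per-device model concrete, here is a minimal sketch of how a pool listing could group device instances by Pod name and how a Pod could be deleted through any of its instance names. `DeviceInstance`, `provision_pod`, `group_by_pod`, and `delete_pod` are hypothetical names for illustration and are not part of dstack; the `<pod>-<index>` naming scheme is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DeviceInstance:
    """Hypothetical stand-in for one pool instance backed by one TPU device."""

    name: str
    pod_name: str
    device_index: int


def provision_pod(pod_name: str, num_devices: int) -> List[DeviceInstance]:
    # One instance per TPU device; the "<pod>-<index>" naming is an assumption.
    return [
        DeviceInstance(name=f"{pod_name}-{i}", pod_name=pod_name, device_index=i)
        for i in range(num_devices)
    ]


def group_by_pod(instances: List[DeviceInstance]) -> Dict[str, List[DeviceInstance]]:
    # Group instances so a listing can show them per Pod.
    groups: Dict[str, List[DeviceInstance]] = defaultdict(list)
    for inst in instances:
        groups[inst.pod_name].append(inst)
    return dict(groups)


def delete_pod(instances: List[DeviceInstance], name: str) -> List[DeviceInstance]:
    # Accept either the Pod name or the name of any instance in the Pod,
    # and remove every instance that belongs to the matched Pod.
    pod_names = {i.pod_name for i in instances if name in (i.pod_name, i.name)}
    return [i for i in instances if i.pod_name not in pod_names]
```

For example, with a pool built from `provision_pod("pod-a", 4)`, calling `delete_pod(pool, "pod-a-2")` removes all four `pod-a` instances, not only the one that was named.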
To support multi-node TPU tasks, dstack can determine `nodes` automatically based on the number of devices in the TPU Pod. For example, if a user specifies `tpu-v2-32`, dstack will run four jobs. An alternative solution would be to ask for a TPU type like `tpu-v2` and determine the number of cores based on the number of jobs. The downside of the latter is that users won't be able to specify an arbitrary number in `nodes`, so they'll need to calculate it depending on the TPU configuration they want to run.
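The arithmetic behind both options can be sketched as follows. This assumes 8 cores per TPU device, which matches the `tpu-v2-32` to four jobs example above and holds for v2 and v3; other generations may differ. The function names and the `tpu-<version>-<cores>` parsing are hypothetical, not dstack's actual implementation.

```python
import re

# Assumption: each TPU device exposes 8 cores (true for v2 and v3).
CORES_PER_DEVICE = 8


def nodes_from_tpu_name(tpu: str) -> int:
    """Option 1: derive the number of jobs from a name like 'tpu-v2-32'."""
    match = re.fullmatch(r"tpu-(v\d+)-(\d+)", tpu)
    if match is None:
        raise ValueError(f"unrecognized TPU name: {tpu!r}")
    cores = int(match.group(2))
    if cores == 0 or cores % CORES_PER_DEVICE != 0:
        raise ValueError(f"core count must be a positive multiple of {CORES_PER_DEVICE}")
    return cores // CORES_PER_DEVICE


def tpu_name_from_nodes(version: str, nodes: int) -> str:
    """Option 2: derive the full configuration from a TPU type and a job count."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return f"tpu-{version}-{nodes * CORES_PER_DEVICE}"
```

Under this assumption, `nodes_from_tpu_name("tpu-v2-32")` returns 4, and `tpu_name_from_nodes("v2", 4)` returns `"tpu-v2-32"`. The second option shows the downside mentioned above: only node counts that map to an existing Pod size are meaningful, so the user has to do this calculation up front.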
Implementation details:
- `Compute.create_instance()` should be able to return `List[JobProvisioningData]` to return provisioning data for each Pod device.
- For each `JobProvisioningData`, an `InstanceModel` needs to be created. The master job will occupy the newly provisioned device 0 instance. Other jobs will wait for master job provisioning and then occupy idle device instances from the pool.
- `InstanceGroupModel` of different types, including a "tpu_pod" type.
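The flow described in these points can be sketched roughly as below. The classes are simplified, hypothetical stand-ins that only borrow the names `JobProvisioningData` and `InstanceModel`; the fields, the module-level `create_instance()` signature, the fake hostnames, and the `assign_job()` helper are assumptions for illustration and do not reflect dstack's real models. The group is represented by a plain string where an `InstanceGroupModel` reference would go.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobProvisioningData:
    # Simplified, hypothetical subset of fields.
    instance_id: str
    hostname: str
    device_index: int


@dataclass
class InstanceModel:
    name: str
    group: str  # stands in for a reference to a "tpu_pod" instance group
    provisioning_data: JobProvisioningData
    job_num: Optional[int] = None  # None means the instance is idle


def create_instance(pod_name: str, num_devices: int) -> List[JobProvisioningData]:
    # Returns provisioning data for every device in the Pod (fake hostnames).
    return [
        JobProvisioningData(f"{pod_name}-{i}", f"10.0.0.{i + 1}", device_index=i)
        for i in range(num_devices)
    ]


def register_instances(pod_name: str, data: List[JobProvisioningData]) -> List[InstanceModel]:
    # One InstanceModel per JobProvisioningData, all in the same group.
    return [InstanceModel(d.instance_id, pod_name, d) for d in data]


def assign_job(pool: List[InstanceModel], group: str, job_num: int) -> Optional[InstanceModel]:
    """Place a job on an instance, or return None if it has to keep waiting."""
    idle = [i for i in pool if i.group == group and i.job_num is None]
    if job_num == 0:
        # The master job takes the device 0 instance.
        idle = [i for i in idle if i.provisioning_data.device_index == 0]
    elif not any(i.group == group and i.job_num == 0 for i in pool):
        # Other jobs wait until the master job has been provisioned.
        return None
    if not idle:
        return None
    idle[0].job_num = job_num
    return idle[0]
```

In this sketch, a non-master job submitted before the master gets `None` back and would be retried later, which mirrors the "wait for master job provisioning" step.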
#1323 added single-device TPU Pod support. Multi-device TPU Pods are not yet supported because running multi-node tasks on them may require changes to dstack.
Currently, dstack runs the different jobs of a multi-node task on different instances. To run multi-node tasks on TPU Pods, we can create an instance for each device in the Pod. The possible downside is that the Pod management UX will be suboptimal: users won't see TPU Pods in pools; instead, they'll see each TPU Pod device as a separate instance. This can be mitigated by introducing a cluster concept to dstack.
Another solution would be to have one InstanceModel per TPU Pod but make it possible to run multiple jobs on such an instance simultaneously. This would require no changes to the dstack interface but may lead to significant internal refactoring.