AI-Hypercomputer / xpk

xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerators such as TPUs and GPUs on GKE.
Apache License 2.0

Update XPK to support topology-aware scheduler for GPU workloads. #154

Closed · yangyuwei closed this 3 months ago

yangyuwei commented 3 months ago

Fixes / Features

Testing / Documentation

Manually tested two cases (a spec-level sketch of the difference follows the list):

  1. Use the topology-aware scheduler to run MaxText on a single A3+ node.
  2. Use the default-scheduler to run MaxText on a single A3+ node.
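
At the Kubernetes level, the difference between the two cases comes down to the `schedulerName` field in the generated Pod spec. The following is a minimal sketch of that idea; the helper name, image, and the scheduler name string are illustrative assumptions, not the exact identifiers used by xpk:

```python
# Minimal sketch: how the generated workload spec could differ between the two
# tested cases. The helper name, image, and the scheduler name string
# ("topology-aware-scheduler") are assumptions for illustration only.

def build_pod_spec(use_topology_scheduler: bool) -> dict:
    """Return a Kubernetes Pod spec that selects the scheduler for the workload."""
    pod_spec = {
        "containers": [
            {
                "name": "maxtext",
                "image": "gcr.io/my-project/maxtext:latest",  # hypothetical image
                "command": ["python3", "MaxText/train.py"],
            }
        ],
    }
    if use_topology_scheduler:
        # Case 1: point the Pod at the topology-aware scheduler (assumed name).
        pod_spec["schedulerName"] = "topology-aware-scheduler"
    # Case 2: omit schedulerName so Kubernetes falls back to default-scheduler.
    return pod_spec
```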
yangyuwei commented 3 months ago

Thanks, Victor, for the review. Please see my replies below:

  1. Does the topology-aware-auto scheduler need to be installed on the cluster before it can be used? If so we can include that as part of xpk cluster creation steps.

Yes, that's correct. If you don't mind, I'd like to add that support in a follow-up PR, so that we can get this PR in first to support running workloads with the topology-aware scheduler on GPU clusters, which may or may not have been created by XPK.
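
For what it's worth, the follow-up step during cluster creation could look roughly like the sketch below; the manifest path is a placeholder and the actual install procedure for the topology-aware scheduler may differ:

```python
# Rough sketch of the follow-up: installing the topology-aware scheduler as
# part of xpk cluster creation. The manifest location is a placeholder, not a
# real URL or path from this PR.
import subprocess

TOPOLOGY_SCHEDULER_MANIFEST = "path/to/topology-scheduler.yaml"  # placeholder


def install_topology_scheduler() -> bool:
    """Apply the scheduler manifests to the cluster; return True on success."""
    result = subprocess.run(
        ["kubectl", "apply", "-f", TOPOLOGY_SCHEDULER_MANIFEST],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"Failed to install topology-aware scheduler: {result.stderr}")
        return False
    return True
```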

  2. Is there any downside to not using the topology-aware-auto scheduler? I am wondering if we should make it a first-class use case in the xpk API? Basically make it the default for xpk GPU users?

Good question. I don't think there is any downside, but the topology-aware scheduler can only be used on machines that have compact placement.

There might be two options:

a. Rely on users to say whether to use it or not (as in the current PR).
b. Add logic to check whether the topology-aware scheduler is installed on the cluster; if it is, use it in the workload config, otherwise fall back to the original behavior (a rough sketch follows below).

However, I discussed this with grigsby@, and he said the topology-aware scheduler should be installable on either compact-placement or non-compact-placement clusters. He said they could add a precheck at installation, but it's not there yet.
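
For reference, option b could look roughly like the following; the deployment name, namespace, and scheduler name are assumptions for illustration, since there is no official precheck yet:

```python
# Rough sketch of option b: detect whether the topology-aware scheduler is
# installed and fall back to the default scheduler otherwise. The namespace,
# deployment name, and scheduler name are assumptions for illustration.
import subprocess
from typing import Optional


def topology_scheduler_installed() -> bool:
    """Return True if a topology-aware scheduler deployment exists on the cluster."""
    result = subprocess.run(
        ["kubectl", "get", "deployment", "topology-aware-scheduler",
         "-n", "kube-system"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def pick_scheduler() -> Optional[str]:
    """Pick a schedulerName for the workload, or None to keep default-scheduler."""
    if topology_scheduler_installed():
        return "topology-aware-scheduler"  # assumed scheduler name
    return None  # fall back to the original behavior (default-scheduler)
```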

Therefore, we can go with option a for now. Does that make sense?