alibaba / clusterdata

cluster data collected from production clusters in Alibaba for cluster management research

GPU sharing method #193

Closed arashasg closed 11 months ago

arashasg commented 11 months ago

Hi, I have a couple of questions regarding the GPU sharing method used in the simulation for the 2023 GPU cluster trace.

  1. In the paper, you mention dividing the GPU computation unit. I would like to understand whether this division concerns only GPU memory, or whether the GPU's computation capability is divided as well. Could you please elaborate on what exactly you mean by "0.6 GPU"? Does it refer only to GPU memory, or does it encompass other resources too? Additionally, I am curious whether your isolation framework also isolates GPU threads, similar to CUDA MPS or any other mechanism.
  2. In the paper, you mention a GPU sharing platform. I came across a GitHub repository (https://github.com/AliyunContainerService/gpushare-scheduler-extender) that seems to correspond to this platform. I would like to confirm whether this repository is the platform mentioned in your paper, or whether there is another internal tool developed at Alibaba specifically for GPU sharing.

Thank you in advance for your response.

qzweng commented 11 months ago

Dear Arash,

Thank you for your insightful feedback on our conference paper. I'm glad you found the technical details interesting and useful for your work. Let me try to address your queries:

1) "0.6 GPU" means that 60% of the GPU's memory capacity and at least 60% of its computation time are assigned to a task. For more details, please refer to the paragraph before Section 2.2 and footnote 3 in the paper. In our experiments, GPU computation is shared by time rather than by space (e.g., by dividing the GPU's computation units, as you mentioned). Still, the proposed fragmentation metric is applicable to other isolation frameworks such as CUDA MPS, NVIDIA MIG, vGPU, etc. (A minimal sketch of these fractional-GPU semantics is appended at the end of this reply.)

2) The paper was a collaboration with Alibaba Group, not Aliyun, so the platform differs from the one you referred to. However, after a quick look at the GitHub repo, I notice the gpushare-scheduler-extender is based on Kubernetes, which is similar to our prototype [1]. I think it would also be fine if you choose to explore the idea of fragmentation gradient descent on that code base (a toy sketch of the underlying packing intuition is also appended below).

[1] https://github.com/hkust-adsl/kubernetes-scheduler-simulator
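
To make point 1) concrete, here is a minimal sketch in Go of how a fractional "0.6 GPU" request could be modeled: 60% of one GPU's memory plus a guarantee of at least 60% of its compute time under time-sharing. The struct and function names are illustrative assumptions, not code from the released trace or the simulator in [1].

```go
// Minimal sketch (hypothetical names, not from the released trace tooling):
// a "0.6 GPU" request means 60% of one GPU's memory plus a guarantee of at
// least 60% of its compute time under time-sharing.
package main

import "fmt"

// gpu tracks how much of one physical GPU is already promised to tasks.
type gpu struct {
	memTotalGiB   float64 // physical memory on the device
	memUsedGiB    float64 // memory already reserved by co-located tasks
	timeShareUsed float64 // sum of compute-time fractions already guaranteed
}

// request is a fractional-GPU ask; 0.6 means 60% memory and >=60% time.
type request struct {
	fraction float64
}

// fits reports whether the request can be co-located on g: both the memory
// slice and the remaining compute-time budget must accommodate it.
func fits(g gpu, r request) bool {
	memNeeded := r.fraction * g.memTotalGiB
	memOK := g.memUsedGiB+memNeeded <= g.memTotalGiB
	timeOK := g.timeShareUsed+r.fraction <= 1.0
	return memOK && timeOK
}

func main() {
	// A 16 GiB GPU with a 0.3-GPU task already placed on it.
	g := gpu{memTotalGiB: 16, memUsedGiB: 4.8, timeShareUsed: 0.3}
	fmt.Println(fits(g, request{fraction: 0.6})) // true: 0.3 + 0.6 still fits
	fmt.Println(fits(g, request{fraction: 0.8})) // false: memory and time budgets exceeded
}
```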
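
And for point 2), a toy sketch of the packing intuition behind fragmentation-aware placement: prefer the candidate GPU whose leftover capacity after placement is least likely to be stranded as an unusable fragment. This is only an illustration under a single "typical" task size; the fragmentation metric in the paper is defined with respect to the whole target workload distribution.

```go
// Toy illustration of fragmentation-aware placement (not the paper's actual
// metric): place each request on the GPU whose leftover capacity is least
// likely to be stranded as a fragment too small for a typical task.
package main

import "fmt"

// frag returns the free fraction on a GPU that is too small to host a task
// of a typical size, i.e. capacity likely to be stranded.
func frag(free, typical float64) float64 {
	if free < typical {
		return free
	}
	return 0
}

// pick returns the index of the candidate GPU that minimizes the increase in
// stranded capacity when placing a request of size req, or -1 if none fits.
func pick(free []float64, req, typical float64) int {
	best, bestDelta := -1, 0.0
	for i, f := range free {
		if f < req {
			continue // the request does not fit on this GPU at all
		}
		delta := frag(f-req, typical) - frag(f, typical)
		if best == -1 || delta < bestDelta {
			best, bestDelta = i, delta
		}
	}
	return best
}

func main() {
	free := []float64{1.0, 0.6}       // free fraction on two candidate GPUs
	fmt.Println(pick(free, 0.6, 0.5)) // 1: fill the 0.6 GPU and keep the idle GPU whole
}
```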