arashasg closed this issue 11 months ago
Dear Arash,
Thank you for your insightful feedback on our conference paper. I'm glad you found the technical details interesting and useful for your work. Let me try to address your queries:
1) 0.6 GPU means that a task is assigned 60% of the GPU's memory capacity and at least 60% of its computation time. For more details, please refer to the paragraph before Section 2.2 and footnote 3 in the paper. In our experiments, GPU computation is shared in time rather than in space (e.g., by partitioning the GPU's compute units, as you mentioned). That said, the proposed fragmentation metric also applies to other isolation frameworks such as CUDA MPS, NVIDIA MIG, vGPU, etc.
2) The paper was a collaboration with Alibaba Group, not Aliyun, so the platform is different from the one you referred to. However, after a quick look at the GitHub repo, I noticed that the gpushare scheduler extender is based on Kubernetes, which is similar to our prototype [1]. I think it would also be fine if you choose to explore the idea of fragmentation gradient descent on that code base.
[1] https://github.com/hkust-adsl/kubernetes-scheduler-simulator
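To illustrate the fractional-GPU semantics in point 1, here is a minimal sketch. It assumes that a single number per GPU stands for both the memory share and the guaranteed share of computation time, as described above. The `GPU` class and its `fits`/`place` methods are hypothetical names made up for illustration; they are not taken from the paper or from the simulator in [1].

```python
from dataclasses import dataclass, field


@dataclass
class GPU:
    """One physical GPU modeled as a single fractional capacity (hypothetical)."""
    free: float = 1.0                          # unallocated share of memory and compute time
    tasks: dict = field(default_factory=dict)  # task name -> allocated fraction

    def fits(self, request: float) -> bool:
        # A request fits only if this GPU alone still has that much free share.
        return 0.0 < request <= self.free + 1e-9

    def place(self, name: str, request: float) -> bool:
        # Allocate the requested fraction, or refuse if it does not fit.
        if not self.fits(request):
            return False
        self.tasks[name] = request
        self.free = round(self.free - request, 6)
        return True


gpu = GPU()
a = gpu.place("task-a", 0.6)  # accepted, 0.4 of the GPU remains
b = gpu.place("task-b", 0.6)  # rejected, only 0.4 is free
c = gpu.place("task-c", 0.4)  # accepted, the GPU is now full
print(a, b, c, gpu.free)
```

In this toy model, the 0.4 left over after `task-a` is useless to any task that asks for more than 0.4 GPU, which is the kind of leftover capacity a fragmentation metric is meant to capture.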
Hi, I have a couple of questions regarding the GPU sharing method used in the simulation for the 2023 GPU cluster trace.
Thank you in advance for your response.