NTHU-LSALAB / KubeShare

Share GPU between Pods in Kubernetes
Apache License 2.0

how to test KubeShare throughput improvement? #11

Closed dyoung23 closed 3 years ago

dyoung23 commented 4 years ago

I deployed KubeShare on my cluster and used a DL inference example to test its throughput improvement. I created two jobs, each with half of a GPU card's cores and memory, and each job processed 5000 images. For comparison, I created the same job in plain Kubernetes using an entire GPU card to process all 10000 images. The shared setup was much slower, and GPU utilization showed no obvious difference. Is there anything wrong with my test? Could I have your evaluation examples to test on my cluster?

jchou-git commented 4 years ago

GPU sharing is only beneficial when the GPU cannot be fully utilized by a single job. Specifically, that means the job should have the same performance when it runs on the GPU alone with half of the allocated resources. This may not be true for many workloads; for instance, some DL frameworks perform better with more allocated memory, and job characteristics can differ greatly between models and between training and inference. So I suggest you first verify whether the performance of your application stays the same when you run the job alone on the GPU, but with half of the resources. We would also like to know which inference job you ran, and what utilization you observed from a single job without resource limits. We are interested in looking into evaluations from different workloads to improve our solution as well.
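For reference, a minimal sketch of one way to sample utilization while the job runs alone (this assumes the `pynvml` NVML bindings are installed and the job uses GPU index 0; it is not the script we used for our evaluation):

```python
# Sample GPU utilization periodically while the inference job runs alone.
# Assumes the `pynvml` package is installed and the job runs on GPU index 0.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
try:
    for _ in range(60):                      # sample once per second for ~60 s
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        samples.append(util.gpu)             # percent of time the GPU was busy
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()

print(f"avg utilization: {sum(samples) / len(samples):.1f}%, peak: {max(samples)}%")
```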

dyoung23 commented 4 years ago

@jchou-git Thanks for your reply. I don't know much about the example I used, which I got from another team in our lab. I thought the improvement might be related to GPU utilization, so I ran some more experiments. When this example runs alone, GPU utilization never goes over 80%. I then ran a single job allocated 80% of a GPU's resources, but it was similarly slower than running on an entire GPU. May I know what examples you used to test and get a good result?

jchou-git commented 4 years ago

80% is pretty high. That means the workload is already bound by the internal GPU memory access bandwidth, so adding more workload will not further improve the overall GPU throughput. BTW, this part is actually more related to our Gemini project, which KubeShare uses to throttle GPU usage. So I will let Jim (the developer of Gemini) tell you exactly what workload we used for the evaluation.

jim90247 commented 4 years ago

Hi @dyoung23, I'm the developer of Gemini.

GPU sharing in Gemini (and thus in KubeShare) is based on a time-slicing mechanism: applications take turns submitting work to the GPU. Gemini is most suitable for applications that have some inactive periods on the GPU, e.g. GPU-accelerated web applications. Our main goal is to share the GPU according to the specified fractions while eliminating GPU idle time when running multiple applications of this type.

When evaluating Gemini, we modified the detectron2 inference benchmarks by adding an idle interval (the application goes to sleep) of 20~100 milliseconds between inference requests, emulating the behavior of a GPU-accelerated web application.

You may try making the time slice smaller to see whether utilization improves when sharing the GPU between two applications. But note that a time slice that is too small may cause greater overhead, because the communication between the front end (hook library) and the back end (Gemini scheduler) becomes too frequent. Typically, setting the time slice to tens of milliseconds works fine.
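For illustration, a minimal sketch of the idle-interval pattern described above; `load_predictor` and `load_images` are hypothetical stand-ins for the detectron2 model setup and input pipeline used in the real benchmark:

```python
# Emulate a GPU-accelerated web application: run one inference request,
# then sleep for a random 20~100 ms idle interval before the next one.
import random
import time

def run_benchmark(load_predictor, load_images, num_requests=5000):
    predictor = load_predictor()   # hypothetical: builds the inference model
    images = load_images()         # hypothetical: loads the input images
    start = time.time()
    for i in range(num_requests):
        predictor(images[i % len(images)])      # one inference request on the GPU
        time.sleep(random.uniform(0.02, 0.10))  # 20~100 ms idle interval
    elapsed = time.time() - start
    print(f"{num_requests} requests in {elapsed:.1f}s "
          f"({num_requests / elapsed:.1f} req/s including idle time)")
```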

dyoung23 commented 4 years ago

@jchou-git The 80% test was run alone, but it was still much slower, so I think the hook library may be causing this overhead. @jim90247 I appreciate your introduction. Given that, in my opinion, KubeShare may not be suitable for DL training or inference workloads that use the GPU more intensively.

jim90247 commented 4 years ago

> Given that, in my opinion, KubeShare may not be suitable for DL training or inference workloads that use the GPU more intensively.

You're right. Typically, GPU sharing does not bring much benefit for DL training jobs. As for DL inference, GPU sharing is beneficial only if the inference task cannot fully utilize the GPU, for example when the application is sometimes inactive.

dyoung23 commented 4 years ago

@jim90247 @jchou-git OK. Thank you for your work!