AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0
1.39k stars 309 forks

Remove DeletionTimestamp!=nil condition in IsCompletePod function #221

Open zhangbc97 opened 9 months ago

zhangbc97 commented 9 months ago

This PR fixes a Pod management problem. The current IsCompletePod function treats a Pod as complete as soon as its DeletionTimestamp is set (`DeletionTimestamp != nil`). A set DeletionTimestamp only means the Pod has started terminating; it does not mean the service inside the Pod has released its GPU memory. If termination takes a long time, the memory is still occupied while the Pod is already counted as complete, so services sharing that GPU can run out of GPU memory (OOM). This happens with some probability on single-node, multi-GPU machines. The PR therefore removes the DeletionTimestamp check from IsCompletePod to avoid that risk.