NTHU-LSALAB / KubeShare

Share GPU between Pods in Kubernetes
Apache License 2.0

Scheduling before resource updated #18

Closed by hwan94 1 year ago

hwan94 commented 2 years ago

Hi, I appreciate your efforts in sharing GPUs in Kubernetes. We tried to run the KubeShare scheduler in our cluster and found an issue.

Some sharepods are waiting because there are not enough GPU resources. When KubeShare schedules sharepods, it needs to synchronize the current resources of the nodes in the cluster. However, instead of waiting for the just-scheduled sharepod to be updated, it schedules the next sharepod immediately, so the next decision is made against stale resource information.

So the scheduler probably needs code that waits for the scheduled sharepod to be updated. We solved the issue by adding the code below to the 'syncHandler' function in 'KubeShare/pkg/scheduler/controller.go'.

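// Re-read the SharePod from the lister while neither the node name nor the
// GPU ID annotation reflects the scheduling decision yet.
for sharepod.Spec.NodeName != schedNode && sharepod.ObjectMeta.Annotations[kubesharev1.KubeShareResourceGPUID] != schedGPUID {
    sharepod, err = c.sharepodsLister.SharePods(namespace).Get(name)
    if err != nil {
        // The SharePod was deleted in the meantime: drop it from the work queue.
        if errors.IsNotFound(err) {
            utilruntime.HandleError(fmt.Errorf("SharePod '%s' in work queue no longer exists", key))
            return nil
        }
        return err
    }
}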

Thank you for your great work!

jchou-git commented 2 years ago

Many thanks for your feedback. We are glad that you are interested in our work.

hwan94 commented 1 year ago

I also think a 'wait.Done()' call is missing:

if cannotScheduled {
    wait.Done()
    return
}

in the 'syncNodeResources' function in 'KubeShare/pkg/scheduler/sync_resources.go'.

When a node cannot be scheduled because of a taint, the function returns early without calling 'wait.Done()', so the scheduler waits forever.
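
To illustrate the failure mode in isolation, here is a small sketch that uses 'defer' so that every return path signals the WaitGroup. It assumes one goroutine per node, which may not match how 'syncNodeResources' is actually structured; 'node' and 'syncNodes' are hypothetical names, not KubeShare identifiers.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for a cluster node; Tainted marks a node
// the scheduler must skip.
type node struct {
	Name    string
	Tainted bool
}

// syncNodes starts one goroutine per node and waits for all of them.
// defer wg.Done() runs on every return path, including the early return
// for tainted nodes, so wg.Wait() cannot block forever.
func syncNodes(nodes []node) []string {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		synced []string
	)
	for _, n := range nodes {
		wg.Add(1)
		go func(n node) {
			defer wg.Done()
			if n.Tainted {
				return // skipped node: Done still runs via defer
			}
			mu.Lock()
			synced = append(synced, n.Name)
			mu.Unlock()
		}(n)
	}
	wg.Wait()
	return synced
}

func main() {
	nodes := []node{{"node-1", false}, {"node-2", true}, {"node-3", false}}
	fmt.Println("synced nodes:", len(syncNodes(nodes)))
}
```

Deleting the 'defer wg.Done()' line and calling 'wg.Done()' only after the append reproduces the hang described above whenever any node is tainted.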