alibaba / open-local

cloud-native local storage management system for stateful workload, low-latency with simplicity
Apache License 2.0

Nodecache and NLS's .status.nodeStorageInfo.volumeGroups inconsistencies #262

Closed Clara12062 closed 3 months ago

Clara12062 commented 4 months ago

Problem Description:

In a single-node environment, multiple pods in the same namespace share one PVC. I found that the Extender updates the nodecache's VG information in the CapacityPredicate during pod scheduling. For the first n pods, the capacity information was updated at scheduling time, but I suspect the PVC had not actually been created yet. Once the PVC was created, the available capacity in the nodecache was reduced, yet Open-Local still subtracted the shared PVC's capacity from the VG again, resulting in inconsistencies between the NLS's status and the nodecache's VGs. When I then tried to create another pod, the Extender reported that there was not enough capacity.
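
To make the suspected failure mode concrete, here is a small illustrative sketch (the numbers are invented, not taken from this environment): if the same PVC's size is charged against the VG once when the Extender assumes it and again after the created LV shows up in the NLS status, a later pod fails the capacity check even though the VG has room for one copy of the volume.

    // Illustrative arithmetic only; the sizes below are made up.
    package main

    import "fmt"

    func main() {
        vgAllocatable := int64(300 << 30) // e.g. a 300 GiB volume group
        pvcSize := int64(200 << 30)       // one shared 200 GiB PVC

        requested := pvcSize // charged when the Extender assumes the PVC
        requested += pvcSize // charged again once the created LV is reported

        fmt.Printf("requested=%dGiB allocatable=%dGiB fits=%v\n",
            requested>>30, vgAllocatable>>30, requested <= vgAllocatable)
        // Output: requested=400GiB allocatable=300GiB fits=false
        // which matches the "not enough capacity" the Extender then reports.
    }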

I'm not sure whether my usage is correct, so I have three questions to confirm:

  1. Will pod scheduling be affected if the Extender is not enabled during scheduling when there is only one node in the environment? I encountered a "not eligible" error when trying to create a PVC:

    I0716 01:20:03.254958      1 types.go:196] [PutPvc]pvc (apgpt/apgpt-log on apgpt/apgpt-eml-0) status changed to true
    I0716 01:20:03.255142      1 types.go:196] [PutPvc]pvc (apgpt/apgpt-eml-share on apgpt/apgpt-eml-0) status changed to true
    I0716 01:20:03.255160      1 types.go:196] [PutPvc]pvc (apgpt/apgpt-cloud-config on apgpt/apgpt-cloud-config-0) status changed to true
    I0716 01:20:03.256349      1 util.go:109] got pvc apgpt/apgpt-log as lvm pvc
    I0716 01:20:03.256372      1 types.go:167] [Put]pvc (apgpt/apgpt-log on apgpt/apgpt-ocr-0) status changed to true
    I0716 01:20:03.256397 1 cluster.go:89] The node cache is not set, it is nil or nodeName is nil.
    I0716 01:20:03.256697 1 routes.go:216] Path: /apis/scheduling/:namespace/persistentvolumeclaims/:name, request body:
    I0716 01:20:03.256749 1 scheduling.go:41] Scheduling PVC apgpt/apgpt-eml-share on node xos-862dd221
    I0716 01:20:03.256768 1 scheduling.go:64] PVC apgpt/apgpt-eml-share is not eligible for provisioning as related PVCs are still pending.
    E0716 01:20:03.256776 1 api_routes.go:61] Failed to schedule PVC apgpt/apgpt-eml-share: PVC apgpt/apgpt-eml-share is not eligible for provisioning as related PVCs are still pending.
  2. Does open-local support multiple pods on a single node sharing a PVC? (A reproduction sketch follows at the end of this comment.) K8s describes RWO volumes as follows:

    ReadWriteOnce the volume can be mounted as read-write by a single node. ReadWriteOnce access mode still can allow multiple pods to access the volume when the pods are running on the same node. For single pod access, please see ReadWriteOncePod.

However, I noticed that open-local does not implement the NodeStageVolume interface, so I'm not sure if this is possible. Additionally, there is an error on the agent side:

I0720 18:28:21.895995 1403411 mount_linux.go:447] Attempting to determine if disk "/dev/xoslocal-open-local-lvm/local-f48ae1af-c512-4c96-8d99- /dev/mapper/xoslocal--open--local--lvm-local--f48ae1af--c512--4c96--8d99--9013998d0737 is mounted.
e2fsck: Cannot continue, aborting.
However, this does not seem to affect anything.
  3. If the nodecache has pvcMapping.PodPvcInfo set to true, does it mean that the PVC has been bound successfully? Can we decide whether to skip the predicate operation based on this value?

Most importantly, I want to know whether open-local supports multiple pods on the same node sharing access to a PVC. Even if it does not, the capacity in the nodecache should still be consistent with .status.nodeStorageInfo.volumeGroups in the NLS.
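
To make question 2 concrete, here is a minimal client-go reproduction sketch (the namespace, pod names, image, and PVC name are placeholders, not taken from this environment): it creates two pods that both mount the same ReadWriteOnce PVC, so on a single-node cluster they necessarily share it on one node.

    // Hypothetical reproduction sketch: two pods sharing one RWO PVC.
    // All names here are placeholders.
    package main

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    // podSharingPVC builds a pod that mounts the given PVC.
    func podSharingPVC(name, claimName string) *corev1.Pod {
        return &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: name},
            Spec: corev1.PodSpec{
                Containers: []corev1.Container{{
                    Name:  "app",
                    Image: "nginx:latest",
                    VolumeMounts: []corev1.VolumeMount{{
                        Name:      "data",
                        MountPath: "/usr/share/nginx/html",
                    }},
                }},
                Volumes: []corev1.Volume{{
                    Name: "data",
                    VolumeSource: corev1.VolumeSource{
                        PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{
                            ClaimName: claimName, // both pods reference the same PVC
                        },
                    },
                }},
            },
        }
    }

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        ctx := context.Background()
        for _, name := range []string{"share-pvc-a", "share-pvc-b"} {
            // On a single-node cluster both pods co-locate, so the RWO PVC is
            // mounted read-write by two pods on the same node.
            _, err := client.CoreV1().Pods("default").Create(ctx, podSharingPVC(name, "shared-lvm-pvc"), metav1.CreateOptions{})
            if err != nil {
                panic(err)
            }
        }
    }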

peter-wangxu commented 4 months ago

Looks like the code needs an update so that the cache stays correct.

Do you have a use case so that I can reproduce this locally?

peter-wangxu commented 4 months ago

I checked the logic; we already cover bound PVs/PVCs:


    err, lvmPVCs, mpPVCs, devicePVCs := algorithm.GetPodUnboundPvcs(pvc, ctx)
    if err != nil {
        log.Errorf("failed to get pod unbound pvcs: %s", err.Error())
        return nil, err
    }

    if len(lvmPVCs)+len(mpPVCs)+len(devicePVCs) == 0 {
        msg := "unexpected schedulering request for all pvcs are bounded"
        log.Info(msg)
        return nil, fmt.Errorf(msg)
    }
Clara12062 commented 4 months ago

Yep. But there seems to be a timing problem, which does not always occur in my environment. The scheduler.log information is as follows.

scheduler.log

The corresponding LV appears to have been created and the nodecache updated, but the PVC is still not in the Bound state. The predicate log contains the following information:

{"phase":"Pending","conditions":[{"type":"PodScheduled","status":"False","lastProbeTime":null,"lastTransitionTime":"2024-07-16T01:20:03Z","reason":"SchedulerError","message":"running PreBind plugin \"VolumeBinding\": Operation cannot be fulfilled on persistentvolumeclaims \"apgpt-log\": the object has been modified; please apply your changes to the latest version and try again"}],"qosClass":"Burstable"}},"Nodes":null,"NodeNames":["xos-862dd221"]}

So at this point, the capacity predicate does not skip the same PVC when scheduling continues.

peter-wangxu commented 4 months ago

What's your case? Do you have reproduction steps?

I am reproducing with the following spec:


apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - name: test
      image: /nginx:latest
      volumeMounts:
        # mount for website data
        - name: config
          mountPath: /usr/share/nginx/html
          subPath: html
  volumes:
    - name: config
      persistentVolumeClaim:
        claimName: html-nginx-lvm-0

But I did not find any issue.

Clara12062 commented 4 months ago

ok wait a minute

Clara12062 commented 4 months ago

I deploy with kubectl apply -k k8s-resources/. Strangely, the problem hasn't come back today, and the code implementation seems fine, unless the PVC is not in the Bound state while the volume has already been created; in that case the problem appears, but I have not found why the PVC is not in the Bound phase. Please don't close this issue until I find the reproduction conditions and can add more information.

Clara12062 commented 4 months ago

(attachment: test.tar.gz)

Clara12062 commented 4 months ago

But I did not find any issue.

So far, this phenomenon has not recurred. I am following the logs for clues. Currently I can see that the nodecache capacity is updated in Assume:

func (c *ClusterNodeCache) Assume(units []AllocatedUnit) (err error) {
    // all pass, write cache now
    //TODO(yuzhi.wx) we need to move it out, after all check pass
    for _, u := range units {
        nodeCache := c.GetNodeCache(u.NodeName)
        if nodeCache == nil {
            return fmt.Errorf("node %s not found from cache when assume", u.NodeName)
        }
        volumeType := u.VolumeType
        switch volumeType {
        case pkg.VolumeTypeLVM:
            _, err = c.assumeLVMAllocatedUnit(u, nodeCache)

At this point, the PVC has been scheduled successfully. But in onPodUpdate, if one of a pod's PVCs is still Pending, does that affect the other PVCs, so that on the next scheduling pass Assume still updates the nodecache capacity?

    e.Ctx.ClusterNodeCache.PvcMapping.PutPod(podName, pvcs)
    // if a pvcs is pending, remove the selected node in a goroutine
    // so that to avoid ErrVolumeBindConflict(means the selected-node(on pvc)
    // does not match the newly selected node by scheduler
    for _, p := range pvcs {
        if p.Status.Phase != corev1.ClaimPending {
            return
        }
    }

In effect, a single PVC's capacity ends up being subtracted multiple times.
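
If that is what happens, one possible mitigation (a sketch only; the type and methods below are hypothetical and not part of open-local) is to remember which PVCs have already been assumed, so a later predicate pass does not subtract the same PVC's capacity again:

    // Hypothetical helper, not open-local code: track PVCs whose capacity has
    // already been assumed so the Extender never subtracts the same PVC twice.
    package extender

    import "sync"

    type assumeTracker struct {
        mu      sync.Mutex
        assumed map[string]struct{} // key: "namespace/name" of the PVC
    }

    func newAssumeTracker() *assumeTracker {
        return &assumeTracker{assumed: make(map[string]struct{})}
    }

    // markAssumed returns false if the PVC was already assumed, in which case
    // the caller should not reduce the cached VG capacity again.
    func (t *assumeTracker) markAssumed(pvcKey string) bool {
        t.mu.Lock()
        defer t.mu.Unlock()
        if _, ok := t.assumed[pvcKey]; ok {
            return false
        }
        t.assumed[pvcKey] = struct{}{}
        return true
    }

    // forget drops a PVC once it is Bound (or deleted); from then on the
    // authoritative usage comes from NLS .status.nodeStorageInfo.volumeGroups.
    func (t *assumeTracker) forget(pvcKey string) {
        t.mu.Lock()
        defer t.mu.Unlock()
        delete(t.assumed, pvcKey)
    }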

Clara12062 commented 4 months ago

And in UpdateNodeInfo, could 'Requested' be updated as well?

    for _, vg := range unchangedVGs {
        // update the size if the updatedName got extended
        v := cacheNode.VGs[ResourceName(vg)]
        v.Capacity = int64(vgMapInfo[vg].Allocatable)
        cacheNode.VGs[ResourceName(vg)] = v
        log.V(6).Infof("updating existing volume group %q(total:%d,allocatable:%d,used:%d) on node cache %s",
            vg, vgMapInfo[vg].Total, vgMapInfo[vg].Allocatable, vgMapInfo[vg].Total-vgMapInfo[vg].Available, cacheNode.NodeName)
    }
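
For reference, a self-contained sketch of what refreshing the requested size alongside Capacity could look like; the stub types below only mirror the names in the quoted loop and the "Requested" field asked about, and are not verified against open-local's actual cache types.

    // Hypothetical, self-contained sketch with stub types; not open-local code.
    package cache

    type ResourceName string

    type sharedResource struct {
        Capacity  int64
        Requested int64
    }

    type vgInfo struct {
        Total, Allocatable, Available uint64
    }

    type nodeCache struct {
        NodeName string
        VGs      map[ResourceName]sharedResource
    }

    // refreshUnchangedVGs refreshes not only Capacity but also Requested, so
    // capacity already consumed by created LVs is reconciled with what the
    // agent reports instead of being counted a second time by a later assume.
    func refreshUnchangedVGs(cacheNode *nodeCache, unchangedVGs []string, vgMapInfo map[string]vgInfo) {
        for _, vg := range unchangedVGs {
            v := cacheNode.VGs[ResourceName(vg)]
            v.Capacity = int64(vgMapInfo[vg].Allocatable)
            v.Requested = int64(vgMapInfo[vg].Total - vgMapInfo[vg].Available)
            cacheNode.VGs[ResourceName(vg)] = v
        }
    }
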
peter-wangxu commented 4 months ago

which version were you using?

On my local setup, I did not see the assume logic being triggered.

To confirm whether this happened, you can look for these logs about the PVC:

    if node == nil {
        log.Infof("scheduling pvc %s without node", utils.GetName(pvc.ObjectMeta))
    } else {
        log.Infof("scheduling pvc %s on node %s", utils.GetName(pvc.ObjectMeta), node.Name)
    }
Clara12062 commented 4 months ago

which version were you using?

0.6.0

I0716 01:20:03.256697       1 routes.go:216] path: /apis/scheduling/:namespace/persistentvolumeclaims/:name, request body: 
I0716 01:20:03.256869       1 scheduling.go:41] scheduling pvc apgpt/apgpt-log on node xos-862dd221
I0716 01:20:03.256888       1 util.go:109] got pvc apgpt/apgpt-log as lvm pvc
I0716 01:20:03.256898       1 common.go:426] storage class open-local-lvm has no parameter "vgName" set
I0716 01:20:03.256920       1 cluster.go:167] assume node cache successfully: node = xos-862dd221, vg = xoslocal-open-local-lvm
I0716 01:20:03.256924       1 cluster.go:96] node cache update
I0716 01:20:03.256928       1 scheduling.go:119] allocatedUnits of pvc apgpt/apgpt-log: [{NodeName:xos-862dd221 VolumeType:LVM Requested:214748364800 Allocated:214748364800 VgName:xoslocal-open-local-lvm Device: MountPoint: PVCName:apgpt/apgpt-log}]
...
I0716 01:20:04.980639       1 scheduling.go:41] scheduling pvc apgpt/apgpt-cloud-config on node xos-862dd221
I0716 01:20:04.980688       1 scheduling.go:64] pvc apgpt/apgpt-cloud-config is not eligible for provisioning as related pvcs are still pending
E0716 01:20:04.980713       1 api_routes.go:61] failed to scheduling pvc apgpt/apgpt-cloud-config: pvc apgpt/apgpt-cloud-config is not eligible for provisioning as related pvcs are still pending
I0716 01:20:04.980745       1 routes.go:218] path: /apis/scheduling/:namespace/persistentvolumeclaims/:name, code=500, response body=pvc apgpt/apgpt-cloud-config is not eligible for provisioning as related pvcs are still pending
peter-wangxu commented 4 months ago

Normally, k8s should not trigger the scheduling process in this case; the only thing left is to bind the PVC and PV. What's the k8s version?

Clara12062 commented 4 months ago

Normally, k8s should not trigger the scheduling process in this case; the only thing left is to bind the PVC and PV. What's the k8s version?

1.25.16

peter-wangxu commented 4 months ago

Per the log, we can try adding logic to check the PV bind status before assume.

What do you think? Maybe we can also check the provisioner log to make sure the above logic actually works in this scenario.
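
A minimal sketch of such a check, assuming the scheduling path has the PVC object at hand (the function name and package are made up, not open-local's API):

    // Hypothetical pre-assume guard: only reserve VG capacity for PVCs that
    // are not Bound yet. A Bound PVC already has its PV/LV reflected in the
    // node's .status.nodeStorageInfo.volumeGroups, so assuming it again would
    // double count the same storage.
    package extender

    import corev1 "k8s.io/api/core/v1"

    func shouldAssume(pvc *corev1.PersistentVolumeClaim) bool {
        if pvc == nil {
            return false
        }
        return pvc.Status.Phase != corev1.ClaimBound
    }

Checking the provisioner log alongside this would then show whether the PVC was in fact Bound before the extra assume happened.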