dell / csm

Dell Container Storage Modules (CSM)
Apache License 2.0

[BUG]: csi-powerstore can't fetch metrics of nfs PVCs #1367

Closed druesendieb closed 1 week ago

druesendieb commented 3 weeks ago

Bug Description

We're currently migrating from a Unity to a PowerStore storage system, so we have a cluster with both the csi-unity and csi-powerstore drivers installed. With both we use NFS and iSCSI storage classes.

After installing the csi-powerstore driver and migrating the first PVCs, we encountered an issue with the published NFS PVC metrics: instead of the metrics being fetched, we see errors in the driver container of the node daemonset.

Metrics from powerstore-iscsi and unity-nfs work as expected.

Logs

{"level":"info","msg":"/csi.v1.Node/NodeGetVolumeStats: REQ 0018: VolumeId=66866727-cf7a-4dd8-b15f-16ad14c055a8/PS4f022082b83d/nfs, VolumePath=/var/lib/kubelet/pods/e73489b6-17cf-4e78-bcde-59430aa8baea/volumes/kubernetes.io~csi/csivol-$NAME-cf0e8041e2/mount, XXX_NoUnkeyedLiteral={}, XXX_sizecache=0","time":"2024-07-05T11:11:19.602727737Z"}

{"level":"info","msg":"/csi.v1.Node/NodeGetVolumeStats: REP 0018: VolumeCondition=abnormal:true message:\"host csi-node-a122dec52e994b51bb2c21ee0113800e-$IP is not attached to NFS export for filesystem 66866727-cf7a-4dd8-b15f-16ad14c055a8\" , XXX_NoUnkeyedLiteral={}, XXX_sizecache=0","time":"2024-07-05T11:11:19.615126687Z"}

Screenshots

No response

Additional Environment Information

k8s 1.24

Steps to Reproduce

Configure csi-powerstore with Volume Health Monitoring enabled, as described at https://dell.github.io/csm-docs/v3/csidriver/installation/helm/powerstore/#volume-health-monitoring
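
For reference, the relevant values.yaml settings look roughly like the sketch below (key names follow the linked CSM docs; verify them against your chart version):

controller:
  healthMonitor:
    # deploys the external-health-monitor sidecar and enables controller-side health checks
    enabled: true
    # how often the volume condition is re-checked
    interval: 60s

node:
  healthMonitor:
    # enables volume condition reporting in NodeGetVolumeStats
    enabled: true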

Use NFS storageclass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: "powerstore-nfs"
provisioner: "csi-powerstore.dellemc.com"
parameters:
  arrayID: "$ARRAY"
  csi.storage.k8s.io/fstype: "nfs"
  nasName: "$NAME"
  allowRoot: "root"
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate

Create 1 PVC with the NFS storage class, then create a Pod consuming this PVC.
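
For example, a minimal PVC and Pod along these lines (names, size, and image are placeholders):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-metrics-test        # placeholder name
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: powerstore-nfs
---
apiVersion: v1
kind: Pod
metadata:
  name: nfs-metrics-test        # placeholder name
spec:
  containers:
    - name: app
      image: busybox            # any long-running image will do
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: nfs-metrics-test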

See errors in csi-powerstore-node driver container when fetching metrics
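
The errors can then be seen with something along these lines (namespace, pod, and container names are assumptions and depend on the installation):

# adjust namespace, pod name, and container name to your deployment
kubectl logs -n csi-powerstore csi-powerstore-node-xxxxx -c driver | grep NodeGetVolumeStats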

Expected Behavior

No errors when running NodeGetVolumeStats; Kubernetes should present volume metrics for NFS volumes.
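
When the volume condition is healthy, kubelet exposes per-volume usage metrics that can be checked directly, for example (node name is a placeholder):

# query kubelet's metrics endpoint through the API server proxy
kubectl get --raw /api/v1/nodes/<node-name>/proxy/metrics | grep kubelet_volume_stats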

CSM Driver(s)

csi-powerstore: 2.10.0

Installation Type

Helm

Container Storage Modules Enabled

No response

Container Orchestrator

RKE1

Operating System

Ubuntu 18.04

csmbot commented 3 weeks ago

@druesendieb: Thank you for submitting this issue!

The issue is currently awaiting triage. Please make sure you have given us as much context as possible.

If the maintainers determine this is a relevant issue, they will remove the needs-triage label and respond appropriately.


We want your feedback! If you have any questions or suggestions regarding our contributing process/workflow, please reach out to us at container.storage.modules@dell.com.

adarsh-dell commented 3 weeks ago

Hi @druesendieb ,

To start analyzing the issue step by step, I have a couple of questions:

  1. Have you tried creating a fresh NFS volume on PowerStore? Migration can be a little tricky here, and we will need the complete driver logs, i.e. from the controller and node pods.
  2. Given the volumeBindingMode: Immediate, was the volume successfully created on the array?
  3. Can you check the export on the Pstore UI? Does the export include the IP address of the worker node where the pod is being scheduled?
  4. JFYI, as long as the volume condition is not normal, we will not get metrics for the volume.

Thanks.

druesendieb commented 3 weeks ago

Hi @adarsh-dell ,

  1. My migration happened on the k8s side: it was basically creating a new PVC and copying the data from the Unity PVC to the new PowerStore PVC using https://github.com/utkuozdemir/pv-migrate
  2. I have no access to the Pstore appliance itself, as it is managed by another team. I assume so, as I got a working volume :)
  3. I can look into this with my colleagues if necessary.
  4. The volume condition seems to be normal on the k8s side; the pods are running normally and behave as they should.

adarsh-dell commented 3 weeks ago

https://github.com/kubernetes-csi/external-health-monitor

[image: code snippet from the external-health-monitor repository]

Thanks


druesendieb commented 3 weeks ago

Please provide the complete driver logs and let us know the exact steps that you are following for copying the data so it will be easy for us to reproduce the issue in our lab without any conflicts in the steps.

Already on the move, will provide more details next week.

The procedure is basically a job that mounts the two PVCs and rsyncs the data from old to new; see the pv-migrate repository.
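
Roughly, the kind of Job this boils down to looks like the sketch below (simplified, with placeholder names; it is not pv-migrate's actual manifest):

apiVersion: batch/v1
kind: Job
metadata:
  name: copy-volume-data            # placeholder name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: rsync
          image: alpine              # placeholder image; rsync is installed at runtime
          command: ["sh", "-c", "apk add --no-cache rsync && rsync -a /source/ /dest/"]
          volumeMounts:
            - name: source
              mountPath: /source
            - name: dest
              mountPath: /dest
      volumes:
        - name: source
          persistentVolumeClaim:
            claimName: volume        # old Unity-backed PVC
        - name: dest
          persistentVolumeClaim:
            claimName: volume-temp   # new PowerStore-backed PVC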

High level:

  • have a PVC named volume with data, using the old storage class
  • create a new PVC with the PowerStore SC, same size, named volume-temp
  • scale down consumers of volume
  • run k pv-migrate --source=volume --dest=volume-temp to copy the data to the temp PVC
  • delete the volume PVC
  • create a new volume PVC with the PowerStore storage class
  • run k pv-migrate --source=volume-temp --dest=volume -d to copy the data from the temp PVC to the new PowerStore PVC
  • scale consumers up again

adarsh-dell commented 3 weeks ago

Thanks for the detailed information about the steps to reproduce the issue. Please check the NFS export whenever you get time: as per the code shared earlier, the CSI driver tries to get the list of export IPs from the NFS export, and it looks like the worker node's IP is not present on that export, which is why the driver reports these volumes as abnormal.

Thanks, Adarsh

druesendieb commented 3 weeks ago

Hi @adarsh-dell, I got access to the UI now, let's continue:

Given the volumeBindingMode: Immediate, was the volume successfully created on the array?

On the Pstore UI I can see the file system for the NFS PVC - there have been no alerts since creation, so to me this looks fine. Is there anything I can check in the UI to see if this is not the case?

Can you check the export on the Pstore UI? Does the export include the IP address of the worker node where the pod is being scheduled?

Storage - File Systems - Tab NFS Exports in the Pstore UI shows me the NFS export titled with the PVC name, e.g. csivol-$NAME-cf0e8041e2. The NFS Export Path (IPv4) is the IP of the Pstore system followed by the NFS export name, e.g. 1.2.3.4:/csivol-$NAME-cf0e8041e2

You've linked https://github.com/kubernetes-csi/external-health-monitor; there it is stated for NodeVolumeStats that a feature gate may be necessary: "This feature in Kubelet is controlled by an Alpha feature gate CSIVolumeHealth." I will try to activate this.
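
Since the cluster is RKE1, the gate can be passed to kubelet via extra args in cluster.yml, roughly like this (a sketch; the exact mechanism depends on how kubelet is configured in your environment):

services:
  kubelet:
    extra_args:
      # CSIVolumeHealth is an alpha feature gate, disabled by default
      feature-gates: "CSIVolumeHealth=true"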

Additionally:

adarsh-dell commented 3 weeks ago

Hi @druesendieb,

As requested earlier, could you please share the driver logs (from the controller and node pods) with us? You mentioned that this issue occurs with all NFS-backed volumes, so I am interested to see whether the NFS export includes the worker nodes' IP addresses or not.

Thanks, Adarsh

hoppea2 commented 3 weeks ago

/sync

csmbot commented 3 weeks ago

link: 26117

adarsh-dell commented 2 weeks ago

Any update regarding sharing the logs?

gallacher commented 1 week ago

@falfaroc, the ticket is being closed but feel free to re-open it with logs if the issue persists. Thanks!