SynologyOpenSource / synology-csi

Apache License 2.0

Kubelet space/inodes usage (e.g. `kubelet_volume_stats_used_bytes`) is missing #51

Closed: vaskozl closed this issue 1 year ago

vaskozl commented 1 year ago

Typically metrics for volumes are available via the kubelet summary API (/stats/summary).

Monitoring solutions like Prometheus with Alertmanager scrape volume usage metrics from kubelet and alert when a disk is filling up. This doesn't work with synology-csi: no such metrics are exposed, because the CSI driver does not seem to implement them.
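To make the summary API concrete, here is a minimal Go sketch that pulls per-PVC usage out of a `/stats/summary` payload. The JSON field names (`pods`, `volume`, `pvcRef`, `capacityBytes`, `usedBytes`) follow the kubelet Summary API; the `pvcUsage` helper and the sample payload are made up for illustration, and only a small subset of the response is modelled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Subset of the kubelet Summary API response; only the fields used here.
type summary struct {
	Pods []struct {
		Volumes []volumeStats `json:"volume"`
	} `json:"pods"`
}

type volumeStats struct {
	Name   string `json:"name"`
	PVCRef *struct {
		Name      string `json:"name"`
		Namespace string `json:"namespace"`
	} `json:"pvcRef"`
	CapacityBytes uint64 `json:"capacityBytes"`
	UsedBytes     uint64 `json:"usedBytes"`
}

// pvcUsage (hypothetical helper) returns the percentage of capacity used,
// keyed by "namespace/name", for every PVC-backed volume that reports a
// non-zero capacity. Volumes without a pvcRef are skipped.
func pvcUsage(raw []byte) (map[string]float64, error) {
	var s summary
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, fmt.Errorf("decode summary: %w", err)
	}
	usage := make(map[string]float64)
	for _, pod := range s.Pods {
		for _, v := range pod.Volumes {
			if v.PVCRef == nil || v.CapacityBytes == 0 {
				continue
			}
			key := v.PVCRef.Namespace + "/" + v.PVCRef.Name
			usage[key] = float64(v.UsedBytes) / float64(v.CapacityBytes) * 100
		}
	}
	return usage, nil
}

// sampleSummary is invented data, not output captured from a real node.
const sampleSummary = `{
  "pods": [{
    "podRef": {"name": "db-0", "namespace": "default"},
    "volume": [
      {"name": "data",
       "pvcRef": {"name": "data-db-0", "namespace": "default"},
       "capacityBytes": 1000, "usedBytes": 250, "availableBytes": 750},
      {"name": "tmp", "capacityBytes": 500, "usedBytes": 100}
    ]
  }]
}`

func main() {
	usage, err := pvcUsage([]byte(sampleSummary))
	if err != nil {
		panic(err)
	}
	for pvc, pct := range usage {
		fmt.Printf("%s: %.1f%% used\n", pvc, pct)
	}
}
```

On a real cluster the payload would come from the node's `/stats/summary` endpoint rather than a constant.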

Missing:

There are some histogram metrics (less useful) that are available:

Reporting volume usage is critical to avoid running out of disk space, which ultimately leads to application failure.

chihyuwu commented 1 year ago

Hi @vaskozl, Thanks for bringing this to our attention. We understand its importance and will consider implementing this feature in the near future.

If you don't mind sharing, could you please let us know which protocol you are using? Is it iSCSI or SMB?

vaskozl commented 1 year ago

That's great @chihyuwu !

I predominantly use iSCSI, formatted with `-E nodiscard` so that volumes do not immediately appear full.

chihyuwu commented 1 year ago

@vaskozl Thank you for your feedback, and please continue to share your ideas with us. :) Feel free to reach out if you have any further questions or suggestions.

vaskozl commented 1 year ago

EBS implementation:

https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/677/files

Should be able to do it similarly. A bit silly that kubelet doesn't just look at the block device itself.

vaskozl commented 1 year ago

Looks like `NodeGetVolumeStats` is already implemented, but the `RPC_GET_VOLUME_STATS` capability is still commented out with a TODO.

@chihyuwu is there a reason for that? Looks like just reporting the capability should get it working.

I note that it uses the size reported by DSM, which makes the volume appear full when discard (the default) is used. We might want to check the filesystem usage directly on the node via the volumePath instead, as the EBS implementation does.

vaskozl commented 1 year ago

I've removed the comment and I now have stats in grafana!

Also I believe https://github.com/SynologyOpenSource/synology-csi/issues/36 is after the same thing.

As predicted, all the volumes provisioned with "discard" show as 100% used, so it's not terribly useful for those as is. I've switched my storage class to nodiscard now, but I still have lots of volumes from before the format params were added.

I think reporting the inodes like the EBS csi is the correct way to resolve this anyway.

vaskozl commented 1 year ago

In this commit I've made the nodeserver use statfs instead of just taking the whole filesystem size as returned by DSM. This is in line with what the other CSI drivers do.
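For readers who want to see what a statfs-based calculation looks like, here is a minimal Go sketch. It is not the code from the commit: the `fsUsage` type and the `usageFromCounts` and `statVolume` helpers are hypothetical names, and it assumes a Linux node where `syscall.Statfs` is available. Used bytes are computed as total blocks minus free blocks, and available bytes use the count of blocks available to unprivileged users, which is how the Kubernetes volume metrics helpers commonly derive them.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// fsUsage holds the byte and inode figures a node plugin could report.
type fsUsage struct {
	TotalBytes, AvailableBytes, UsedBytes int64
	TotalInodes, FreeInodes, UsedInodes   int64
}

// usageFromCounts (hypothetical helper) turns raw statfs counters into
// usage figures. Kept separate from the syscall so it is easy to test.
func usageFromCounts(bsize int64, blocks, bfree, bavail, files, ffree uint64) fsUsage {
	u := fsUsage{
		TotalBytes:     int64(blocks) * bsize,
		AvailableBytes: int64(bavail) * bsize,
		UsedBytes:      (int64(blocks) - int64(bfree)) * bsize,
		TotalInodes:    int64(files),
		FreeInodes:     int64(ffree),
	}
	u.UsedInodes = u.TotalInodes - u.FreeInodes
	return u
}

// statVolume calls statfs(2) on a mounted volume path.
func statVolume(path string) (fsUsage, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return fsUsage{}, fmt.Errorf("statfs %s: %w", path, err)
	}
	return usageFromCounts(int64(st.Bsize), uint64(st.Blocks), uint64(st.Bfree),
		uint64(st.Bavail), uint64(st.Files), uint64(st.Ffree)), nil
}

func main() {
	path := "/" // assumption: pass the volume's mount path as the first argument
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	u, err := statVolume(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("bytes:  total=%d used=%d available=%d\n", u.TotalBytes, u.UsedBytes, u.AvailableBytes)
	fmt.Printf("inodes: total=%d used=%d free=%d\n", u.TotalInodes, u.UsedInodes, u.FreeInodes)
}
```

In a CSI node plugin these figures would typically be returned from `NodeGetVolumeStats` as two usage entries, one in bytes and one in inodes.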

I'm now getting stats for all my LUN volumes based on actual filesystem usage, and those erroneous KubePersistentVolumeFillingUp alerts are gone! If anyone else would like to test or use this functionality, you can grab the image I built: `ghcr.io/vaskozl/synology-csi:1.1.2-7`

Happy to make a PR if you are interested in merging it.

benjamin-gentner-fnt commented 1 year ago

Can this be merged? Would be very helpful. @vaskozl @chihyuwu

chihyuwu commented 1 year ago

Hi @vaskozl Thank you for looking into this! Could you kindly create a PR for the merge?

newbenji commented 1 year ago

Do you have an example for a metrics exporter?

laghoule commented 1 year ago

@newbenji The metrics are exposed via the kubelet job in Prometheus.

newbenji commented 1 year ago

I just don't see a metrics exporter anywhere, that's why I ask.

newbenji commented 1 year ago

But I can see the metrics are there, so thanks.