Monitor NFS servers - critical diagnostics to understand issues

consideRatio commented 1 year ago

Ideally we would be able to monitor the NFS servers we rely on directly in the Grafana instances, but if we can't do that we need at least some way to understand whether the NFS servers are overloaded.

As I understand it, we rely on the cloud-provided NFS services GCP Filestore and AWS EFS. If we can't give the Grafana instances access to the data sources and import pre-defined dashboards, we should at least learn how to monitor these services from the cloud console.

Cloud services

GCP's Filestore service has notes on monitoring
AWS EFS service has notes on monitoring
Azure Files has notes on monitoring

Action points

[ ] Explore the options for monitoring NFS service performance and come up with refined action points

Related

An idea to pay more to get a more performant EFS service: https://github.com/2i2c-org/infrastructure/issues/1236
An idea to suggest using /tmp for anything temporary, as that could help reduce load on the NFS server: https://github.com/2i2c-org/infrastructure/issues/1236#issuecomment-1109164647
An idea to provide a temp folder directly in the home directory to nudge users towards this: https://github.com/2i2c-org/infrastructure/pull/2062

pnasrat commented 1 year ago

I believe @yuvipanda already has some graphs that could be added.

abkfenris commented 1 year ago

I just encountered this kind of issue on EFS, and it took a lot of digging to understand what was going on.

EFS has 3 different throughput modes. Bursting is the default, and AWS does some sneaky stuff to make sure it's initially fast, but if you don't put enough data on it right away you can hit a wall and end up with highly variable, hard-to-diagnose performance.
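To make the "wall" concrete, here is a rough sketch of the arithmetic. It assumes the Bursting-mode baseline of 50 KiB/s per GiB stored that the AWS documentation has described; that figure, and the function names (which are made up for illustration), should be checked against the current AWS docs before relying on them.

```python
KIB = 1024
GIB = 1024 ** 3

# Assumption: Bursting mode grants a baseline of 50 KiB/s per GiB stored.
BASELINE_BYTES_PER_SEC_PER_GIB = 50 * KIB


def baseline_throughput_bps(stored_bytes):
    """Baseline throughput (bytes/s) that can be sustained without burst credits."""
    return stored_bytes / GIB * BASELINE_BYTES_PER_SEC_PER_GIB


def seconds_until_credits_run_out(credit_balance_bytes, actual_bps, stored_bytes):
    """Estimate how long the current load can continue before credits hit zero.

    Returns None when the load is at or below baseline, because credits
    are not being drained in that case.
    """
    drain_bps = actual_bps - baseline_throughput_bps(stored_bytes)
    if drain_bps <= 0:
        return None
    return credit_balance_bytes / drain_bps
```

Under that assumption, a file system holding only 100 GiB has a baseline of roughly 5 MB/s, so a hub that routinely pushes more than that will eventually drain its credits and drop to the baseline.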

The key metrics for EFS to look at are Burst Credit Balance, Permitted Throughput, and Throughput Utilization.
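As a starting point for pulling these numbers outside the console, here is a minimal sketch using boto3 and CloudWatch. It assumes the metrics live in the `AWS/EFS` namespace under the names `BurstCreditBalance`, `PermittedThroughput`, and `MeteredIOBytes`, and that throughput utilization can be derived as metered bytes per second divided by permitted throughput. The helper names are hypothetical and the file system ID in the usage note is a placeholder.

```python
from datetime import datetime, timedelta, timezone


def throughput_utilization_percent(metered_io_bytes_sum, period_seconds,
                                   permitted_throughput_bps):
    """Share of the permitted throughput used during one period, in percent."""
    if permitted_throughput_bps <= 0:
        raise ValueError("permitted throughput must be positive")
    used_bps = metered_io_bytes_sum / period_seconds
    return 100.0 * used_bps / permitted_throughput_bps


def fetch_efs_metric(file_system_id, metric_name, statistic,
                     hours=3, period_seconds=300):
    """Fetch recent datapoints for one EFS metric, oldest first.

    Requires boto3 and AWS credentials with CloudWatch read access.
    """
    import boto3  # imported lazily so the helper above works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EFS",
        MetricName=metric_name,
        Dimensions=[{"Name": "FileSystemId", "Value": file_system_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=period_seconds,
        Statistics=[statistic],
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

For example, `fetch_efs_metric("fs-XXXXXXXX", "BurstCreditBalance", "Minimum")` would return the recent credit balance, and a steadily falling series is the sign that the file system is heading for the wall described above.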

If that's what you are encountering, I'd be happy to pull together some of the resources that I found while trying to diagnose it.

2i2c-org / infrastructure

Monitor NFS servers - critical diagnostics to understand issues #2242
