2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

Monitor NFS servers - critical diagnostics to understand issues #2242

Open consideRatio opened 1 year ago

consideRatio commented 1 year ago

Ideally we would be able to monitor the NFS servers we rely on directly in the Grafana instances, but if we can't do that, we need at least some way to tell whether the NFS servers are overloaded.

As I understand it, we rely on cloud-provided NFS services: GCP Filestore and AWS EFS. If we can't give the Grafana instances access to these as data sources and import pre-defined dashboards for them, we should at least learn how to monitor them from the cloud console.

Cloud services

Action points

Related

pnasrat commented 1 year ago

I believe @yuvipanda already has some graphs that could be added.

abkfenris commented 1 year ago

I just encountered this kind of issue on EFS, and it took a lot of digging to understand what was going on.

EFS has three throughput modes. Bursting is the default, and AWS does some sneaky stuff to make sure it's fast initially, but if you don't put enough data on it right away you can hit a wall and get really variable, hard-to-diagnose performance.

The key metrics for EFS to look at are Burst Credit Balance, Permitted Throughput, and Throughput Utilization.
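As a rough sketch of how these metrics relate, here is some plain Python with no AWS calls. It assumes the inputs were already pulled from CloudWatch's `AWS/EFS` namespace: the `MeteredIOBytes` sum over a period, `PermittedThroughput` in bytes per second, and `BurstCreditBalance` in bytes. The function names and the example threshold are made up for illustration and are not taken from any dashboard in this repo.

```python
MIB = 1024 * 1024


def throughput_utilization_percent(
    metered_io_bytes_sum: float,
    period_seconds: float,
    permitted_throughput_bytes_per_s: float,
) -> float:
    """Share of the permitted throughput used over one period, in percent.

    Assumes metered_io_bytes_sum is the Sum statistic of MeteredIOBytes
    for a period of period_seconds, and that the permitted throughput is
    the PermittedThroughput value in bytes per second.
    """
    metered_bytes_per_s = metered_io_bytes_sum / period_seconds
    return 100.0 * metered_bytes_per_s / permitted_throughput_bytes_per_s


def burst_credits_low(burst_credit_balance_bytes: float, threshold_bytes: float) -> bool:
    """True when the burst credit balance is below a chosen threshold.

    The threshold is an assumption; pick one that fits the file system.
    """
    return burst_credit_balance_bytes < threshold_bytes


# Hypothetical numbers: 50 MiB/s metered against 100 MiB/s permitted.
utilization = throughput_utilization_percent(50 * MIB * 60, 60, 100 * MIB)
low = burst_credits_low(burst_credit_balance_bytes=10 * MIB, threshold_bytes=1024 * MIB)
```

Utilization staying near 100% while the burst credit balance drops toward zero is the pattern that matched the "hit a wall" behaviour described above.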

If that's what you are encountering, I'd be happy to pull together some of the resources that I found while trying to diagnose it.