2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

[Support] dirsize-reporter lags for some dirs on nasa-cryo #4882

Closed GeorgianaElena closed 1 month ago

GeorgianaElena commented 1 month ago

The Freshdesk ticket link

https://2i2c.freshdesk.com/a/tickets/2186

Ticket request type

Something is not working

Ticket impact

🟨 Medium

Short ticket description

The Grafana dashboard for nasa-cryo that shows home directory sizes is not reporting accurate data for some users.

tsnow03 and jdmillstei should have GBs of data, but it is reported as MBs.

(Optional) Investigation results

Maybe the pod gets killed and doesn't manage to get through all of the directories, hence the outdated data? This needs to be verified, and we should consider increasing the pod's available memory.

consideRatio commented 1 month ago

I think you are spot on with:

Maybe the pod gets killed and doesn't manage to get through all of the directories, hence the outdated data?

yuvipanda commented 1 month ago

I think we should increase the memory limit / request here until it works again.
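Concretely, that suggestion means raising the memory request/limit on the reporter container in the deployment spec. A minimal sketch of what such a change looks like (field placement is illustrative; the real values live in this repo's helm chart config for the shared-dirsize-metrics deployment):

```yaml
# Illustrative fragment only: bump the reporter container's memory
# allocation (the change discussed below went from 256Mi to 512Mi).
resources:
  requests:
    memory: 512Mi
  limits:
    memory: 512Mi
```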

consideRatio commented 1 month ago

Pending verification that it no longer crashes now that #4945 is merged. I think it takes several hours, but not more than a day, to know whether it works.

A confirmation can be done by looking at https://grafana.cryointhecloud.2i2c.cloud/d/688c04dba0500904/home-directory-usage-dashboard?orgId=1, where tsnow03 should have more than 10MB of usage reported, and the shared-dirsize-metrics pod in prod shouldn't show any restarts.

consideRatio commented 1 month ago

While it doesn't go OOM, it doesn't work either.

In nasa-cryo it seems to use more and more memory and then manages to garbage collect a bit. Here is the change after going from a 256Mi limit to a 512Mi limit.

[Screenshot: memory usage after raising the limit from 256Mi to 512Mi]

Here is some behavior with the 512Mi limit, without k8s reporting that the container has restarted - but we can see clear memory resets of some kind.

[Screenshot: memory usage with the 512Mi limit, showing resets without any reported container restarts]

A healthy pattern seems to be like this in nasa-veda:

[Screenshot: healthy memory usage pattern in nasa-veda]

Note how there are stable plateaus; this is when it's sleeping and waiting between cycles. This never happens for nasa-cryo, which not only never completes a cycle, but never even completes reporting a single directory.
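For context on what a "cycle" involves: the reporter repeatedly walks every home directory and sums up file sizes. A minimal sketch of such a traversal (this is not the actual dirsize-reporter code; the function name and structure are assumptions) that streams entries with `os.scandir` so memory stays flat even on very large home directories:

```python
import os

def dir_size(path: str) -> int:
    """Recursively sum apparent file sizes under *path*.

    os.scandir yields entries lazily instead of materializing a full
    listing, which keeps memory usage roughly constant regardless of
    how many files a home directory contains.
    """
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_symlink():
                continue  # don't follow symlinks out of the tree
            if entry.is_dir(follow_symlinks=False):
                total += dir_size(entry.path)
            elif entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total
```

A reporter built this way would emit one metric per home directory after each `dir_size` call, then sleep between cycles, producing the plateau pattern seen in the healthy nasa-veda graph.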

consideRatio commented 1 month ago

This is now reporting correctly, but it took many hours - more than twenty, I think.

[Screenshot: dashboard now reporting correct usage]

consideRatio commented 1 month ago

I responded in Freshdesk as well; closing as completed.