GeorgianaElena closed this issue 1 month ago
I think you are spot on with:
Maybe the pod gets killed and doesn't manage to get through all of the directories, hence the outdated data?
I think we should increase the memory limit / request here until it works again.
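As a rough sketch, bumping the request/limit could look something like this, assuming the pod's resources are set in a Helm values file (the key names here are hypothetical, not the actual infrastructure config):

```yaml
# Hypothetical values fragment for the shared-dirsize-metrics pod.
# Raising the limit gives the scan headroom; keeping the request close
# to the limit avoids overcommitting the node.
resources:
  requests:
    memory: 512Mi
  limits:
    memory: 512Mi
```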
Pending verification that it no longer crashes now that #4945 is merged. I think it will take several hours, but no more than a day, to know it works.
Confirmation can be done by looking at https://grafana.cryointhecloud.2i2c.cloud/d/688c04dba0500904/home-directory-usage-dashboard?orgId=1, where tsnow03 should report more than 10 MB of usage, and the shared-dirsize-metrics pod in prod should show no restarts.
While it doesn't go OOM, it doesn't work either.
In nasa-cryo it seems to use more and more memory, then manages to garbage-collect a bit. Here is the change after going from a 256Mi limit to a 512Mi limit.
Here is some behavior with the 512Mi limit, without k8s reporting that the container has restarted, but we can see clear memory resets of some kind.
A healthy pattern seems to be like this in nasa-veda:
Note how there are stability plateaus; this is when it's sleeping and waiting between cycles. This never happens for nasa-cryo, which not only never completes a cycle, but never even finishes reporting a single directory.
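The cycle pattern described above can be sketched as a loop: size each home directory, report it as soon as it is done, then sleep between cycles (the plateau). This is a minimal illustration, not the actual shared-dirsize-metrics code; all function names here are hypothetical. Reporting per directory, rather than at the end of a full cycle, means a partial cycle still updates some users before an OOM kill:

```python
# Hypothetical sketch of a dirsize-metrics collector loop; not the
# actual shared-dirsize-metrics implementation.
import os
import time


def dir_size_bytes(path: str) -> int:
    """Walk `path` iteratively, summing file sizes without holding
    the whole tree in memory at once."""
    total = 0
    stack = [path]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    total += entry.stat(follow_symlinks=False).st_size
    return total


def collect_cycle(home_root: str, report) -> None:
    """One cycle: size each user directory, reporting it immediately
    so progress survives a mid-cycle crash."""
    for name in sorted(os.listdir(home_root)):
        user_dir = os.path.join(home_root, name)
        if os.path.isdir(user_dir):
            report(name, dir_size_bytes(user_dir))


def main_loop(home_root: str, report, interval_seconds: int = 3600) -> None:
    while True:
        collect_cycle(home_root, report)
        # The sleep between cycles is the "stability plateau" seen in
        # the healthy nasa-veda memory graph.
        time.sleep(interval_seconds)
```

A collector that never reaches even the first `report()` call, as appears to happen in nasa-cryo, would leave the dashboard data stale indefinitely.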
This is now reporting correctly, but it took many hours, I think more than twenty, actually.
I responded in Freshdesk as well; closing as completed.
The Freshdesk ticket link
https://2i2c.freshdesk.com/a/tickets/2186
Ticket request type
Something is not working
Ticket impact
🟨 Medium
Short ticket description
The Grafana dashboard for nasa-cryo showing home directory sizes is not reporting accurate data for some users:
tsnow03
andjdmillstei
should have GB of data, but instead MB of data is reported.
(Optional) Investigation results
The last data reported for
tsnow03
andjdmillstei
was 1y and 2y ago respectively, according to Grafana. The shared-dirsize-metrics pod in nasa-cryo shows that it was recently OOM-killed and has ~100 pod restarts. This is related to what @consideRatio has noticed in https://github.com/2i2c-org/infrastructure/issues/2950. Maybe the pod gets killed and doesn't manage to get through all of the directories, hence the outdated data? This needs to be verified, and we should consider increasing the pod's available memory.