2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

[Support] dirsize-reporter lags for some dirs on nasa-cryo #4882

Closed GeorgianaElena closed 1 month ago

GeorgianaElena commented 1 month ago

The Freshdesk ticket link

https://2i2c.freshdesk.com/a/tickets/2186

Ticket request type

Something is not working

Ticket impact

🟨 Medium

Short ticket description

The Grafana dashboard for nasa-cryo that shows home directory sizes is not reporting accurate data for some users.

tsnow03 and jdmillstei should have GBs of data, but it is reported as MBs.

(Optional) Investigation results

Maybe the pod gets killed and doesn't manage to get through all of the directories, hence the outdated data? This needs to be verified, and we should consider increasing the pod's available memory.

consideRatio commented 1 month ago

I think you are spot on with:

Maybe the pod gets killed and doesn't manage to get through all of the directories, hence the outdated data?

yuvipanda commented 1 month ago

I think we should increase the memory limit / request here until it works again.
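Concretely, that suggestion means raising the memory request/limit on the reporter container in the deployment spec. A minimal sketch of what such a change looks like (field placement is illustrative; the real values live in this repo's helm chart config for the shared-dirsize-metrics deployment):

```yaml
# Illustrative fragment only: bump the reporter container's memory
# allocation (the change discussed below went from 256Mi to 512Mi).
resources:
  requests:
    memory: 512Mi
  limits:
    memory: 512Mi
```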

consideRatio commented 1 month ago

Pending verification that it no longer crashes now that #4945 is merged. I think it takes several hours, but not more than a day, to know whether it works.

A confirmation can be done by looking at https://grafana.cryointhecloud.2i2c.cloud/d/688c04dba0500904/home-directory-usage-dashboard?orgId=1, where tsnow03 should have more than 10MB of usage reported, and the shared-dirsize-metrics pod in prod shouldn't show any restarts.

consideRatio commented 1 month ago

While it doesn't go OOM, it doesn't work either.

In nasa-cryo it seems to use more and more memory and then manages to garbage collect a bit. Here is the change after going from a 256Mi limit to a 512Mi limit.

[Screenshot: memory usage after raising the limit from 256Mi to 512Mi]

Here is some behavior with the 512Mi limit, without k8s reporting that the container has restarted - but we can see clear memory resets of some kind.

[Screenshot: memory usage with the 512Mi limit, showing resets without any reported container restarts]

A healthy pattern seems to be like this in nasa-veda:

[Screenshot: healthy memory usage pattern in nasa-veda]

Note how there are stable plateaus; this is when it's sleeping and waiting between cycles. This never happens for nasa-cryo, which not only never completes a cycle, but never even completes reporting a single directory.
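For context on what a "cycle" involves: the reporter repeatedly walks every home directory and sums up file sizes. A minimal sketch of such a traversal (this is not the actual dirsize-reporter code; the function name and structure are assumptions) that streams entries with `os.scandir` so memory stays flat even on very large home directories:

```python
import os

def dir_size(path: str) -> int:
    """Recursively sum apparent file sizes under *path*.

    os.scandir yields entries lazily instead of materializing a full
    listing, which keeps memory usage roughly constant regardless of
    how many files a home directory contains.
    """
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_symlink():
                continue  # don't follow symlinks out of the tree
            if entry.is_dir(follow_symlinks=False):
                total += dir_size(entry.path)
            elif entry.is_file(follow_symlinks=False):
                total += entry.stat(follow_symlinks=False).st_size
    return total
```

A reporter built this way would emit one metric per home directory after each `dir_size` call, then sleep between cycles, producing the plateau pattern seen in the healthy nasa-veda graph.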

consideRatio commented 1 month ago

This is now reporting correctly, but it took many hours - more than twenty, I think.

[Screenshot: dashboard now reporting correct usage]

consideRatio commented 1 month ago

I responded in Freshdesk as well; closing as completed.