Tried to log in, and got a 500 error immediately after.
The hub logs say the following:
--- Logging error ---
Traceback (most recent call last):
File "/usr/local/lib/python3.11/logging/__init__.py", line 1114, in emit
self.flush()
File "/usr/local/lib/python3.11/logging/__init__.py", line 1094, in flush
self.stream.flush()
OSError: [Errno 28] No space left on device
So, it looks like it's the same issue we had in https://github.com/2i2c-org/infrastructure/issues/1845
Got a shell inside the staging hub pod with:
deployer exec-hub-shell utoronto staging
The size of jupyterhub.log is:
-rw-rw-r-- 1 jovyan jovyan 957M Mar 2 09:35 jupyterhub.log
which is close to the capacity of the hub-db-dir PVC of 1GB:
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
home-azurefile Bound staging-home-azurefile 1Mi RWX 450d
home-nfs Bound staging-home-nfs 1Mi RWX 247d
hub-db-dir Bound pvc-440045ad-658e-4627-903f-ec79a51cbb60 1Gi RWO managed-premium 250d
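For future reference, here is a minimal sketch of how one could confirm what is filling hub-db-dir from inside the hub pod; the /srv/jupyterhub mount path is the z2jh default and an assumption here:

```bash
# Open a shell in the staging hub pod, as above
deployer exec-hub-shell utoronto staging

# Inside the pod: hub-db-dir is mounted at /srv/jupyterhub in a default z2jh
# setup (assumed path, check the `df -h` output if it differs)
df -h /srv/jupyterhub                      # overall PVC usage
du -ah /srv/jupyterhub | sort -rh | head   # largest files; jupyterhub.log should top the list
```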
The question now is: do we want to increase the size of the staging db disk as we did for prod in https://github.com/2i2c-org/infrastructure/pull/1848?
Or should we just delete the logs, since they are only for staging, and call it a day?
Given https://github.com/2i2c-org/infrastructure/pull/1847, my inclination would be to say that for staging we can delete them now to unblock deployment, and then push forward in the direction of no longer storing logs at all for utoronto.
@consideRatio, what do you think? Do you agree with deleting the staging logs? (Alternatively, maybe we could copy them locally (archived) and upload them to the 2i2c drive somewhere, in case we might need them?)
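In case we do go the delete-and-archive route, a rough sketch of what that could look like with plain kubectl; the namespace, label selector, and mount path are assumptions, and truncating (rather than deleting) the file avoids having to restart the hub:

```bash
# Find the hub pod (label selector is the standard z2jh one, namespace assumed)
HUB_POD=$(kubectl -n staging get pod -l component=hub -o jsonpath='{.items[0].metadata.name}')

# Copy the log out and compress it locally, so it can be uploaded somewhere safe
kubectl -n staging cp "$HUB_POD:/srv/jupyterhub/jupyterhub.log" utoronto-staging-jupyterhub.log
gzip utoronto-staging-jupyterhub.log

# Truncate in place; the hub's logger has the file open in append mode,
# so it keeps writing to the same (now empty) file
kubectl -n staging exec "$HUB_POD" -- truncate -s 0 /srv/jupyterhub/jupyterhub.log
```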
Ah nice work tracking this down @GeorgianaElena!!!
Short term, we can do a lot of things. I think wiping the logs and increasing the size of the disk from 1GB to 10GB makes perfect sense, for example.
I don't know if we are storing logs like this for utoronto only, or if we do it for all hubs? If we do it for all hubs, I think it's essential that we don't drop the ball on ensuring this doesn't happen to other hubs. If so, we must have an issue to track it and get working on a resolution. For reference, this relates to https://github.com/2i2c-org/infrastructure/issues/1890.
If this is utoronto-specific, because it is the sole AKS hub where, for example, I don't have access to the cloud console, then I'm open to retaining this setup. But I don't think we should keep increasing the size if it recurs; instead we should wipe the logs if 10GB on staging and 60GB on the prod hub isn't enough.
I believe this is UToronto-specific. From https://github.com/2i2c-org/infrastructure/issues/1860 and the discussion in the draft PR linked to this issue, I understand the logs are useful for figuring out nbgitpuller usage patterns. But this is relevant for the prod hub, right? I don't think such info is relevant for the staging hub.
This is why I believe we shouldn't increase the disk size to 10GB for staging. Instead, just delete the logs (as irrelevant) and keep the current 1GB size.
https://github.com/2i2c-org/infrastructure/issues/1860 suggests moving away from this persistent logging setup anyway. We just need to figure out where to store the current prod ones in order to be able to analyze them.
What do you think? Can there be useful info in the staging hub to retain and analyze? If yes, then resizing makes sense, if not, just delete them and keep the current disk size.
> But this is relevant for the prod hub, right? I don't think such info is relevant for the staging hub.
I think so, absolutely, but I don't really know.
> This is why I believe we shouldn't increase the disk size to 10GB for staging. Instead, just delete the logs (as irrelevant) and keep the current 1GB size.
I'm :+1: as long as we don't end up in this situation again before we have stopped persisting logs in this way.
To avoid getting into this situation again, as @consideRatio mentions, or at least to avoid getting caught by it without an alert:
Previous teams I've been on have set up non-paging ticket alerts for disk utilization (going just to Slack and email, not to someone's phone) at a threshold someone can still act on. The threshold should be chosen by looking at the growth metrics, but say 90%.
Prometheus can alert to PagerDuty, and adding a simple alert like this just for the hub would be a good way to get started with more automated alerting without a big alerting project. These sorts of alerts should be actionable and have accompanying playbook docs.
Some useful references on alerting philosophy:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit https://www.oreilly.com/radar/monitoring-distributed-systems/
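To make this concrete, here is a minimal sketch of what such a disk-usage alert could look like, assuming Prometheus is deployed via kube-prometheus-stack (so the PrometheusRule CRD exists) and scrapes kubelet volume stats; the namespace, labels, and threshold are all placeholders to adjust:

```bash
# Hypothetical PrometheusRule; adjust namespace and labels to match how
# Prometheus is actually deployed for this cluster
cat <<'EOF' | kubectl -n support apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hub-db-dir-disk-usage
spec:
  groups:
    - name: hub-db-dir
      rules:
        - alert: HubDbDirAlmostFull
          # Fire when less than 10% of the hub-db-dir PVC is free for 15 minutes
          expr: |
            kubelet_volume_stats_available_bytes{persistentvolumeclaim="hub-db-dir"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="hub-db-dir"} < 0.10
          for: 15m
          labels:
            severity: ticket
          annotations:
            summary: "hub-db-dir PVC in namespace {{ $labels.namespace }} is over 90% full"
EOF
```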
I'm 💯 for adding more alerting. I just realized that after the last hub-db-dir-full event, @yuvipanda also created a Grafana dashboard to spot such things 🔽
For now, I ended up manually increasing the size of the disk to 5GB (before seeing your alerting suggestion), which is reflected in https://github.com/2i2c-org/infrastructure/pull/2289. I hope that's ok.
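For anyone who has to do this again later, a rough sketch of what resizing the PVC in place can look like, assuming the storage class has allowVolumeExpansion enabled; the namespace and size here are just the values from this incident:

```bash
# Bump the requested size on the existing PVC (requires allowVolumeExpansion
# on the managed-premium storage class)
kubectl -n staging patch pvc hub-db-dir \
  --patch '{"spec": {"resources": {"requests": {"storage": "5Gi"}}}}'

# Watch until the new capacity is reflected; depending on the CSI driver the
# filesystem resize may only complete after the hub pod restarts
kubectl -n staging get pvc hub-db-dir -w
```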
@damianavila, maybe you can help out with increasing the priority of https://github.com/2i2c-org/infrastructure/issues/1860 and prioritization in next cycles?
> @damianavila, maybe you can help out with increasing the priority of https://github.com/2i2c-org/infrastructure/issues/1860 and prioritization in next cycles?
Added to the Sprint board (issues tab) with high priority so we can work on it in the next cycles.
In jobs to deploy to utoronto's staging, which involve running a health check, we are failing when the jhub_client software tries to ensure that a user exists: test_hub_healthy -> check_hub_health -> (deployer / jhub_client boundary passed) -> execute_notebook -> execute_code -> hub.ensure_server -> self.ensure_user returns a user object without a server key and fails.
Action points:
- Address this in jhub_client, or possibly just help it error more clearly.
Neither I nor Pris has access to utoronto's JupyterHub cloud console (AKS) or the hub itself, so we would have to handle this solely via deployer use-cluster-credentials, which makes it a bit less easy for us to do.
Related