2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org

Failure to run health check on utoronto #2288

Closed (consideRatio closed this issue 1 year ago)

consideRatio commented 1 year ago

In the jobs that deploy to utoronto's staging hub, which involve running a health check, we are failing when the jhub_client software tries to ensure that a user exists.

Running hub health check for staging...
Testing locally, do not redirect output
F                                                                                                                              [100%]
============================================================== FAILURES ==============================================================
__________________________________________________________ test_hub_healthy __________________________________________________________

hub_url = 'https://staging.utoronto.2i2c.cloud', api_token = '41515c727ecea362e053f00778095f4a71abf1dbc5e9aa10501c51788914c2d3'
notebook_dir = PosixPath('/home/erik/dev/2i2c-org/infrastructure/deployer/tests/test-notebooks/basehub'), check_dask_scaling = False

    @pytest.mark.asyncio
    async def test_hub_healthy(hub_url, api_token, notebook_dir, check_dask_scaling):
        try:
            print(f"Starting hub {hub_url} health validation...")
            for root, directories, files in os.walk(notebook_dir, topdown=False):
                for i, name in enumerate(files):
                    # We only want to run the "scale_dask_workers.ipynb" file if the
                    # check_dask_scaling variable is true. We continue in the loop if
                    # check_dask_scaling == False when we iterate over this file.
                    if (not check_dask_scaling) and (name == "scale_dask_workers.ipynb"):
                        continue

                    print(f"Running {name} test notebook...")
                    test_notebook_path = os.path.join(root, name)
                    await check_hub_health(hub_url, test_notebook_path, api_token)

            print(f"Hub {hub_url} is healthy!")
        except Exception as e:
            print(
                f"Hub {hub_url} not healthy! Stopping further deployments. Exception was {e}."
            )
>           raise (e)

deployer/tests/test_hub_health.py:84: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
deployer/tests/test_hub_health.py:77: in test_hub_healthy
    await check_hub_health(hub_url, test_notebook_path, api_token)
deployer/tests/test_hub_health.py:45: in check_hub_health
    await execute_notebook(
../../../mambaforge/lib/python3.10/site-packages/jhub_client/execute.py:153: in execute_notebook
    return await execute_code(hub_url, cells, **kwargs)
../../../mambaforge/lib/python3.10/site-packages/jhub_client/execute.py:79: in execute_code
    jupyter = await hub.ensure_server(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <jhub_client.api.JupyterHubAPI object at 0x7fed3db3e080>, username = 'deployment-service-check', timeout = 360
user_options = None, create_user = True

    async def ensure_server(
        self, username, timeout, user_options=None, create_user=False
    ):
        user = await self.ensure_user(username, create_user=create_user)
>       if user["server"] is None:
E       TypeError: 'NoneType' object is not subscriptable

../../../mambaforge/lib/python3.10/site-packages/jhub_client/api.py:103: TypeError
-------------------------------------------------------- Captured stdout call --------------------------------------------------------
Starting hub https://staging.utoronto.2i2c.cloud health validation...
Running simple.ipynb test notebook...
Hub https://staging.utoronto.2i2c.cloud not healthy! Stopping further deployments. Exception was 'NoneType' object is not subscriptable.
====================================================== short test summary info =======================================================
FAILED deployer/tests/test_hub_health.py::test_hub_healthy - TypeError: 'NoneType' object is not subscriptable
1 failed in 2.07s
Health check failed!

The call chain is test_hub_healthy -> check_hub_health -> (crossing the deployer/jhub_client boundary) -> execute_notebook -> execute_code -> hub.ensure_server, where self.ensure_user returns None instead of a user dict, so the subsequent user["server"] lookup fails.
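
For illustration, a minimal sketch of what a defensive check in jhub_client's ensure_server could look like, mirroring the snippet shown in the traceback above (a sketch only, not the actual upstream fix):

async def ensure_server(
    self, username, timeout, user_options=None, create_user=False
):
    user = await self.ensure_user(username, create_user=create_user)
    # ensure_user can come back with None when the hub errors out, which is what
    # currently surfaces as "'NoneType' object is not subscriptable".
    if user is None:
        raise RuntimeError(
            f"ensure_user returned None for {username!r}; the hub may be unhealthy"
        )
    if user["server"] is None:
        ...  # continue starting the server as jhub_client normally does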

Action points

Neither Pris nor I have access to utoronto's JupyterHub cloud console (AKS) or to the hub itself, so we would have to handle this solely through the deployer's use-cluster-credentials command, which makes it a bit harder for us.
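
For reference, that workflow looks roughly like the sketch below (the deployer subcommand is the one named above; the namespace and deployment names are assumptions based on how our hubs are usually laid out):

# fetch kubectl credentials for the utoronto cluster
deployer use-cluster-credentials utoronto
# then inspect the staging hub with plain kubectl, e.g.
kubectl -n staging get pods
kubectl -n staging logs deploy/hub --tail=100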

Related

GeorgianaElena commented 1 year ago

Tried to log in, and got a 500 error immediately after.

The hub logs say the following:

--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/logging/__init__.py", line 1114, in emit
    self.flush()
  File "/usr/local/lib/python3.11/logging/__init__.py", line 1094, in flush
    self.stream.flush()
OSError: [Errno 28] No space left on device

So, it looks like it is the same issue we had in https://github.com/2i2c-org/infrastructure/issues/1845

GeorgianaElena commented 1 year ago

Got a shell inside the staging hub pod with:

deployer exec-hub-shell utoronto staging

The size of jupyterhub.log is:

-rw-rw-r-- 1 jovyan jovyan 957M Mar  2 09:35 jupyterhub.log

which is close to the capacity of the hub-db-dir pvc of 1GB:

NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
home-azurefile   Bound    staging-home-azurefile                     1Mi        RWX                              450d
home-nfs         Bound    staging-home-nfs                           1Mi        RWX                              247d
hub-db-dir       Bound    pvc-440045ad-658e-4627-903f-ec79a51cbb60   1Gi        RWO            managed-premium   250d
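
(For reference, the listings above correspond roughly to commands like the following; the working directory inside the pod is an assumption.)

# inside the hub pod, via `deployer exec-hub-shell utoronto staging`
ls -lh jupyterhub.log
# from outside, with `deployer use-cluster-credentials utoronto` loaded
kubectl -n staging get pvc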

GeorgianaElena commented 1 year ago

The question is now, do we want to increase the size of the staging db disk as we did for prod in https://github.com/2i2c-org/infrastructure/pull/1848?

Or can we just delete the logs, since they are only for staging, and call it a day?

Given https://github.com/2i2c-org/infrastructure/pull/1847, my inclination would be to say that for staging we can delete them now to unblock the deployment, and then push forward on stopping storing logs for utoronto altogether.

@consideRatio, what do you think? Do you agree with deleting the staging logs? (Alternatively, maybe we could copy them locally (archived) and upload them to the 2i2c drive somewhere, in case we might need them?)
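
If we go the deletion route, a minimal sketch of what that could look like (the pod name and the /srv/jupyterhub mount path are assumptions; truncating in place keeps the file handle JupyterHub holds open valid):

# optionally archive a copy locally first
kubectl cp staging/<hub-pod-name>:/srv/jupyterhub/jupyterhub.log ./utoronto-staging-jupyterhub.log
# then, from a shell inside the pod (`deployer exec-hub-shell utoronto staging`):
truncate -s 0 jupyterhub.log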

consideRatio commented 1 year ago

Ah nice work tracking this down @GeorgianaElena!!!

Short term, we can do a lot of things. I think wiping the logs and increasing the size of the disk from 1GB to 10GB makes perfect sense, for example.
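
A rough sketch of the resize itself, assuming the storage class supports online volume expansion and that the new size also gets reflected in our config so it isn't reverted on the next deploy:

# grow the hub database PVC in place (10Gi is the size suggested above)
kubectl -n staging patch pvc hub-db-dir \
  --patch '{"spec": {"resources": {"requests": {"storage": "10Gi"}}}}'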

I don't know if we are storing logs like this for utoronto only, or if we do it for all hubs. If we do it for all hubs, I think it's essential that we don't drop the ball on ensuring this doesn't happen to other hubs, in which case we must have an issue to track it and get working on a resolution. For reference, this relates to https://github.com/2i2c-org/infrastructure/issues/1890.

If this is utoronto-specific, because it is the sole AKS hub where, for example, I don't have access to the cloud console, then I'm open to retaining this setup. But I think we shouldn't keep increasing the size if it recurs, and should instead wipe the logs if 10GB on staging and 60GB on the prod hub isn't enough.

GeorgianaElena commented 1 year ago

I believe this is specific to UToronto. From https://github.com/2i2c-org/infrastructure/issues/1860 and the discussion in the draft PR linked to this issue, I understand the logs are useful for figuring out nbgitpuller usage patterns. But this is relevant for the prod hub, right? I don't think such info is relevant for the staging hub.

This is why I believe we shouldn't increase the disk size to 10GB for staging. Instead, we should just delete the logs (as irrelevant) and keep the current 1GB size.

https://github.com/2i2c-org/infrastructure/issues/1860 suggests moving away from this persistent logging setup anyway. We just need to figure out where to store the current prod ones in order to be able to analyze them.

What do you think? Could there be useful info in the staging hub logs worth retaining and analyzing? If yes, then resizing makes sense; if not, just delete them and keep the current disk size.

consideRatio commented 1 year ago

But this is relevant for the prod hub, right? I don't think such info is relevant for the staging hub.

I think absolutely, but I don't know really.

This is why I believe we shouldn't increase the disk size to 10GB for staging. Instead, we should just delete the logs (as irrelevant) and keep the current 1GB size.

I'm :+1: as long as we don't end up in this situation again before we've stopped persisting logs in this way.

pnasrat commented 1 year ago

To avoid getting into the situation @consideRatio mentions again, or at least to avoid getting caught by it without an alert:

Previous teams I've been on have set up non-paging ticket alerts for disk utilization (e.g. going just to Slack and email, not to someone's phone) at a threshold someone can still act on. The exact threshold should be chosen by looking at the growth metrics, but say 90%.

Prometheus can alert to PagerDuty, and adding a simple alert like this just for the hub would be a good way to get started with more automated alerting without taking on a big alerting project. These sorts of alerts should be actionable and have accompanying playbook docs.
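
As a concrete sketch, a non-paging rule along those lines could be built on the kubelet volume stats metrics, something like the following (the exact metric labels, threshold, and severity routing are assumptions about our Prometheus setup, not a tested rule):

groups:
  - name: hub-db-dir-usage
    rules:
      - alert: HubDatabaseDiskAlmostFull
        expr: |
          kubelet_volume_stats_used_bytes{persistentvolumeclaim="hub-db-dir"}
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="hub-db-dir"} > 0.9
        for: 15m
        labels:
          severity: ticket   # non-paging: route to Slack/email, not a pager
        annotations:
          summary: "hub-db-dir PVC in {{ $labels.namespace }} is over 90% full"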

Some useful references on alerting philosophy:

https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
https://www.oreilly.com/radar/monitoring-distributed-systems/

GeorgianaElena commented 1 year ago

I'm 💯 for adding more alerting. I just realized that after the last hub-db-dir-full event, @yuvipanda also created a Grafana dashboard to spot such things 🔽

(Screenshot of the Grafana dashboard, 2023-03-02.)

For now, I ended up manually increasing the size of the disk to 5GB, which is reflected in https://github.com/2i2c-org/infrastructure/pull/2289 (before seeing your alerting suggestion). I hope that's ok.

@damianavila, maybe you can help out by increasing the priority of https://github.com/2i2c-org/infrastructure/issues/1860 so it gets prioritized in the next cycles?

damianavila commented 1 year ago

@damianavila, maybe you can help out by increasing the priority of https://github.com/2i2c-org/infrastructure/issues/1860 so it gets prioritized in the next cycles?

Added to the Sprint board (issues tab) with high priority so we can work on it in the next cycles.