Investigate why deployer fails to spawn a server for the deployment-service-check user on openscapes hubs

sgibson91 commented 2 years ago

Context

The deployer is failing to spawn a server for the deployment-service-check user (specifically on the staging hub), which is why CI/CD is failing.

========================================= FAILURES ==========================================
_____________________________________ test_hub_healthy ______________________________________

hub_url = 'https://staging.openscapes.2i2c.cloud'
api_token = 'xxxx'
notebook_dir = PosixPath('/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/tests/test-notebooks/daskhub')
check_dask_scaling = False

    @pytest.mark.asyncio
    async def test_hub_healthy(hub_url, api_token, notebook_dir, check_dask_scaling):
        try:
            print(f"Starting hub {hub_url} health validation...")
            for root, directories, files in os.walk(notebook_dir, topdown=False):
                for i, name in enumerate(files):
                    # We only want to run the "scale_dask_workers.ipynb" file if the
                    # check_dask_scaling variable is true. We continue in the loop if
                    # check_dask_scaling == False when we iterate over this file.
                    if (not check_dask_scaling) and (name == "scale_dask_workers.ipynb"):
                        continue

                    print(f"Running {name} test notebook...")
                    test_notebook_path = os.path.join(root, name)
                    await check_hub_health(hub_url, test_notebook_path, api_token)

            print(f"Hub {hub_url} is healthy!")
        except Exception as e:
            print(
                f"Hub {hub_url} not healthy! Stopping further deployments. Exception was {e}."
            )
>           raise (e)

deployer/tests/test_hub_health.py:84:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
deployer/tests/test_hub_health.py:77: in test_hub_healthy
    await check_hub_health(hub_url, test_notebook_path, api_token)
deployer/tests/test_hub_health.py:45: in check_hub_health
    await execute_notebook(
/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/site-packages/jhub_client/execute.py:153: in execute_notebook
    return await execute_code(hub_url, cells, **kwargs)
/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/site-packages/jhub_client/execute.py:79: in execute_code
    jupyter = await hub.ensure_server(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <jhub_client.api.JupyterHubAPI object at 0x7fcf2671d540>
username = 'deployment-service-check', timeout = 360, user_options = None, create_user = True

    async def ensure_server(
        self, username, timeout, user_options=None, create_user=False
    ):
        user = await self.ensure_user(username, create_user=create_user)
        if user["server"] is None:
            await self.create_server(username, user_options=user_options)

        start_time = time.time()
        while True:
            user = await self.get_user(username)
            if user["server"] and user["pending"] is None:
                return JupyterAPI(
                    self.hub_url / "user" / username,
                    self.api_token,
                    verify_ssl=self.verify_ssl,
                )

            await asyncio.sleep(5)
            total_time = time.time() - start_time
            if total_time > timeout:
                logger.error(f"jupyterhub server creation timeout={timeout:.0f} [s]")
>               raise TimeoutError(
                    f"jupyterhub server creation timeout={timeout:.0f} [s]"
                )
E               TimeoutError: jupyterhub server creation timeout=360 [s]

/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/site-packages/jhub_client/api.py:120: TimeoutError
----------------------------------- Captured stdout call ------------------------------------
Starting hub https://staging.openscapes.2i2c.cloud health validation...
Running dask_test_notebook.ipynb test notebook...
Hub https://staging.openscapes.2i2c.cloud not healthy! Stopping further deployments. Exception was jupyterhub server creation timeout=360 [s].
------------------------------------- Captured log call -------------------------------------
ERROR    jhub_client.api:api.py:119 jupyterhub server creation timeout=360 [s]
================================== short test summary info ==================================
FAILED deployer/tests/test_hub_health.py::test_hub_healthy - TimeoutError: jupyterhub serv...
1 failed in 365.63s (0:06:05)
Health check failed!

Proposal

No response

Updates and actions

No response

sgibson91 commented 2 years ago

I have confirmed that this is also the case for the prod hub. Hence, this is either an issue in the common config, or with the cluster itself.

sgibson91 commented 2 years ago

I think we should prioritise investigating this issue quite highly since openscapes will continue to fail in CI/CD (and hence require manual upgrading) until it is fixed.

GeorgianaElena commented 2 years ago

I hope it's OK that I self-assigned this; I plan to look into it soon.

GeorgianaElena commented 2 years ago

I believe the issue is related to https://github.com/jupyterhub/kubespawner/pull/631.
What I think is happening is:
- We're trying to start a server for the test user deployment-service-check without passing it any user options.
- With multiple profiles, each with multiple options, and each option with multiple choices, the expected behavior would be that a server matching the default of each of these options is spawned if no custom options are passed.
- But from what I understand, kubespawner only takes the user options into account when loading a profile, not the defaults, so the hub doesn't know which kind of server to spawn.
I opened https://github.com/jupyterhub/kubespawner/pull/631 to hopefully fix this upstream.
In the meantime, what we could do for the 2i2c CI/CD to pass is hack into https://github.com/2i2c-org/infrastructure/blob/02cb6b7862198a29f010ad416d5dc486d7390201/deployer/tests/test_hub_health.py#L45-L54 and pass it a dict of user options, only when we're deploying the openscapes hub, that tells it which kind of server to spawn.
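A minimal sketch of that workaround, for illustration only. The helper name `user_options_for`, the `HUBS_NEEDING_PROFILE` tuple and the `"small"` slug are hypothetical and not taken from the deployer; the slug would need to match an entry in the hub's actual profile list. The `{"profile": <slug>}` shape is an assumption about how the spawner expects a profile to be selected.

```python
from typing import Optional

# Hypothetical: URL fragments of hubs that need an explicit profile.
HUBS_NEEDING_PROFILE = ("openscapes",)

# Hypothetical placeholder: must match a slug in the hub's profile list.
ASSUMED_PROFILE_SLUG = "small"


def user_options_for(hub_url: str) -> Optional[dict]:
    """Return user options for hubs that need them, otherwise None."""
    # FIXME: temporary workaround, remove once the upstream fix is available.
    if any(name in hub_url for name in HUBS_NEEDING_PROFILE):
        return {"profile": ASSUMED_PROFILE_SLUG}
    # None keeps the current behaviour for every other hub.
    return None
```

The result could then be handed to the health check as `user_options=user_options_for(hub_url)`, assuming `execute_notebook` forwards that keyword down to `ensure_server`, as the traceback above suggests it does.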

Alternatively, we could skip checking the staging hub's health and allow the upgrade to happen to the prod hub too, since we're doing this manually anyway.

sgibson91 commented 2 years ago

@GeorgianaElena Should we keep this open until the upstream PR is merged and we can remove the fix, or should we track that in a new issue?

GeorgianaElena commented 2 years ago

Hmm, I'm not sure. My thinking was that the #fixme comment in the temporary fix code is enough to close this issue, and that we can then track the upstream kubespawner PR as part of https://github.com/2i2c-org/infrastructure/issues/1055 (maybe as a small check box there)? I believe the kubespawner version we're using comes from z2jh anyway. What do you think?

sgibson91 commented 2 years ago

My only concern is that we don't develop much in the deployer any more and so it might take a while to rediscover and remember to fix the #fixme. However, if it will be harmless after #1055 then I don't think my concern should be a blocker.

GeorgianaElena commented 2 years ago

I just opened https://github.com/2i2c-org/infrastructure/issues/1643 and added it to the list of tasks to pick up when https://github.com/2i2c-org/infrastructure/issues/1055 is tackled, to check whether that upgrade comes with the upstream fix, just to be sure.

Thank you @sgibson91 ✨

damianavila commented 2 years ago

Thanks for opening the tracker issue, @GeorgianaElena!!

sgibson91 commented 2 years ago

I am now seeing this problem on CarbonPlan (AWS) after https://github.com/2i2c-org/infrastructure/pull/1642#issuecomment-1232771961 ~~Both of these hubs use the environment chooser, so I wonder if this bug is tied to that config in some way?~~

Sorry, I just realised from your comment that it happens when we set default: true for a profile.
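To make the expected fallback concrete, here is a rough sketch of how defaults could be resolved when no user options are passed. It is not kubespawner's code. The functions `resolve_profile` and `default_choices` are hypothetical, and the dictionary keys (`slug`, `default`, `profile_options`, `choices`) are modelled on the profile list configuration as an assumption.

```python
from typing import Optional


def resolve_profile(profile_list: list, requested_slug: Optional[str] = None) -> dict:
    """Pick the requested profile, else the one marked default, else the first."""
    if requested_slug is not None:
        for profile in profile_list:
            if profile.get("slug") == requested_slug:
                return profile
        raise ValueError(f"No such profile: {requested_slug}")
    for profile in profile_list:
        if profile.get("default"):
            return profile
    return profile_list[0]


def default_choices(profile: dict) -> dict:
    """Map each option of a profile to its default choice key."""
    chosen = {}
    for option_name, option in profile.get("profile_options", {}).items():
        choices = option.get("choices", {})
        # Prefer the choice flagged as default, else fall back to the first one.
        key = next((k for k, c in choices.items() if c.get("default")), None)
        if key is None and choices:
            key = next(iter(choices))
        chosen[option_name] = key
    return chosen


# Hypothetical profile list, for illustration only.
EXAMPLE_PROFILES = [
    {"slug": "small", "profile_options": {}},
    {
        "slug": "large",
        "default": True,
        "profile_options": {
            "image": {
                "choices": {
                    "python": {"display_name": "Python"},
                    "r": {"display_name": "R", "default": True},
                }
            }
        },
    },
]
```

With this example data, a spawn request without user options would resolve to the `large` profile with the `r` image, while an explicit `requested_slug="small"` would bypass the default.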
