2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

Investigate why deployer fails to spawn a server for the deployment-service-check user on openscapes hubs #1611

Closed by sgibson91 2 years ago

sgibson91 commented 2 years ago

Context

The deployer is failing to spawn a server for the deployment-service-check user (specifically on the staging hub), which is why CI/CD is failing.

========================================= FAILURES ==========================================
_____________________________________ test_hub_healthy ______________________________________

hub_url = 'https://staging.openscapes.2i2c.cloud'
api_token = 'xxxx'
notebook_dir = PosixPath('/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/tests/test-notebooks/daskhub')
check_dask_scaling = False

    @pytest.mark.asyncio
    async def test_hub_healthy(hub_url, api_token, notebook_dir, check_dask_scaling):
        try:
            print(f"Starting hub {hub_url} health validation...")
            for root, directories, files in os.walk(notebook_dir, topdown=False):
                for i, name in enumerate(files):
                    # We only want to run the "scale_dask_workers.ipynb" file if the
                    # check_dask_scaling variable is true. We continue in the loop if
                    # check_dask_scaling == False when we iterate over this file.
                    if (not check_dask_scaling) and (name == "scale_dask_workers.ipynb"):
                        continue

                    print(f"Running {name} test notebook...")
                    test_notebook_path = os.path.join(root, name)
                    await check_hub_health(hub_url, test_notebook_path, api_token)

            print(f"Hub {hub_url} is healthy!")
        except Exception as e:
            print(
                f"Hub {hub_url} not healthy! Stopping further deployments. Exception was {e}."
            )
>           raise (e)

deployer/tests/test_hub_health.py:84:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
deployer/tests/test_hub_health.py:77: in test_hub_healthy
    await check_hub_health(hub_url, test_notebook_path, api_token)
deployer/tests/test_hub_health.py:45: in check_hub_health
    await execute_notebook(
/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/site-packages/jhub_client/execute.py:153: in execute_notebook
    return await execute_code(hub_url, cells, **kwargs)
/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/site-packages/jhub_client/execute.py:79: in execute_code
    jupyter = await hub.ensure_server(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <jhub_client.api.JupyterHubAPI object at 0x7fcf2671d540>
username = 'deployment-service-check', timeout = 360, user_options = None, create_user = True

    async def ensure_server(
        self, username, timeout, user_options=None, create_user=False
    ):
        user = await self.ensure_user(username, create_user=create_user)
        if user["server"] is None:
            await self.create_server(username, user_options=user_options)

        start_time = time.time()
        while True:
            user = await self.get_user(username)
            if user["server"] and user["pending"] is None:
                return JupyterAPI(
                    self.hub_url / "user" / username,
                    self.api_token,
                    verify_ssl=self.verify_ssl,
                )

            await asyncio.sleep(5)
            total_time = time.time() - start_time
            if total_time > timeout:
                logger.error(f"jupyterhub server creation timeout={timeout:.0f} [s]")
>               raise TimeoutError(
                    f"jupyterhub server creation timeout={timeout:.0f} [s]"
                )
E               TimeoutError: jupyterhub server creation timeout=360 [s]

/usr/local/Caskroom/miniconda/base/envs/infrastructure/lib/python3.10/site-packages/jhub_client/api.py:120: TimeoutError
----------------------------------- Captured stdout call ------------------------------------
Starting hub https://staging.openscapes.2i2c.cloud health validation...
Running dask_test_notebook.ipynb test notebook...
Hub https://staging.openscapes.2i2c.cloud not healthy! Stopping further deployments. Exception was jupyterhub server creation timeout=360 [s].
------------------------------------- Captured log call -------------------------------------
ERROR    jhub_client.api:api.py:119 jupyterhub server creation timeout=360 [s]
================================== short test summary info ==================================
FAILED deployer/tests/test_hub_health.py::test_hub_healthy - TimeoutError: jupyterhub serv...
1 failed in 365.63s (0:06:05)
Health check failed!
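
For reading the traceback: the loop in ensure_server only returns once user["server"] is set and user["pending"] is None, so a spawn that never leaves the pending state (or never starts) ends in the timeout shown. A minimal sketch of that same check, pulled out into a standalone helper for inspecting a user model by hand. The function name server_state and its return strings are hypothetical, not part of jhub_client:

```python
from typing import Any, Mapping


def server_state(user: Mapping[str, Any]) -> str:
    """Classify a user model using the condition from the ensure_server loop.

    Hypothetical helper: only the "server" and "pending" keys are taken
    from the traceback above; the return values are made up for illustration.
    """
    # Same condition that lets ensure_server return a JupyterAPI object.
    if user.get("server") and user.get("pending") is None:
        return "ready"
    # A non-None "pending" value means an action is still in progress.
    if user.get("pending") is not None:
        return f"pending:{user['pending']}"
    # No server and nothing pending: the spawn never started or has stopped.
    return "not-running"


if __name__ == "__main__":
    # Made-up user models, not real responses from the hub.
    print(server_state({"server": "/user/deployment-service-check/", "pending": None}))
    print(server_state({"server": None, "pending": "spawn"}))
    print(server_state({"server": None, "pending": None}))
```

If the model for deployment-service-check stays in one of the non-ready states for the full 360 s, that would match the TimeoutError above.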

sgibson91 commented 2 years ago

I have confirmed that this is also the case for the prod hub. Hence, this is either an issue in the common config, or with the cluster itself.

sgibson91 commented 2 years ago

I think we should prioritise investigating this issue quite highly since openscapes will continue to fail in CI/CD (and hence require manual upgrading) until it is fixed.

GeorgianaElena commented 2 years ago

Hope it's ok that I self-assigned this; I plan to look into it soon.

GeorgianaElena commented 2 years ago
sgibson91 commented 2 years ago

@GeorgianaElena Should we keep this open until the upstream PR is merged and we can remove the fix, or should we track that in a new issue?

GeorgianaElena commented 2 years ago

Hmm, I'm not sure. My thinking was that the #fixme comment in the temporary fix code is enough to close this issue, and that we can then track the upstream kubespawner PR as part of https://github.com/2i2c-org/infrastructure/issues/1055 (maybe as a small checkbox there)? I believe the kubespawner version we're using comes from z2jh anyway. What do you think?

sgibson91 commented 2 years ago

My only concern is that we don't develop much in the deployer any more and so it might take a while to rediscover and remember to fix the #fixme. However, if it will be harmless after #1055 then I don't think my concern should be a blocker.

GeorgianaElena commented 2 years ago

I just opened https://github.com/2i2c-org/infrastructure/issues/1643 and added it to the list of tasks for https://github.com/2i2c-org/infrastructure/issues/1055, so that we check whether that upgrade comes with the upstream fix, just to be sure.

Thank you @sgibson91 ✨

damianavila commented 2 years ago

Thanks for opening the tracker issue, @GeorgianaElena!!

sgibson91 commented 2 years ago

I am now seeing this problem on CarbonPlan (AWS) after https://github.com/2i2c-org/infrastructure/pull/1642#issuecomment-1232771961. Both of these hubs use the environment chooser, so I wonder if this bug is tied to that config in some way?

Sorry, I just realised from your comment that it happens when we set default: true for a profile.
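
To check quickly which hubs might be affected, something like the following could list the profiles that set default: true. This is only a sketch: it assumes the profile list has already been loaded from the hub config into a list of dicts with display_name and default keys, and both the helper name default_profiles and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def default_profiles(profile_list: List[Dict[str, Any]]) -> List[str]:
    """Return the display names of profiles that explicitly set default: true."""
    return [
        profile.get("display_name", "<unnamed>")
        for profile in profile_list
        if profile.get("default") is True
    ]


if __name__ == "__main__":
    # Hypothetical profile list, not copied from any real hub config.
    example_profiles = [
        {"display_name": "Small", "default": True},
        {"display_name": "Large"},
    ]
    print(default_profiles(example_profiles))
```

A hub whose profile list returns a non-empty result here would be a candidate for the same failure, assuming the default: true setting really is the trigger.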