olevski opened this issue 2 years ago
As of yesterday this started occurring on production deployments. On renkulab there were several sessions whose status was not being updated because the operator was stuck. Users could not use these sessions because the UI and the notebook service depend on the status of the JupyterServer resource to judge when a session is ready. Since the operator was stuck, the status of every new session stopped being updated.
Deleting the operator pod and letting it get recreated fixes this. But it should not occur in the first place.
Have not been able to replicate.
It seems that this is related to the following series of events (extracted from the operator's log):
I tried to replicate a scenario like this by deploying Amalthea and then editing its role so that it has no rights to touch JupyterServers. While the role was broken I created a new session. As expected, this new session was not created. But when the role was fixed, Amalthea still did not respond to the new session; it only created the session after the operator pod was fully deleted and restarted.
So this shows the same outcome as the issue described here, and the error messages in the log are the same as what we saw in production when this happens "naturally". I will attempt to trigger this once more by adding a network policy that blocks all egress for the operator pod.
Another attempt to replicate this issue has been made, this time by using a k8s NetworkPolicy that blocks all egress for the Amalthea operator.
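For reference, the policy I mean is essentially a deny-all-egress NetworkPolicy selecting the operator pod. A rough sketch of applying it with the kubernetes Python client; the namespace and pod labels are assumptions and need to be adjusted to the actual deployment:

```python
from kubernetes import client, config

config.load_kube_config()

# Selecting the operator pod with policyTypes=["Egress"] and no egress rules
# means all egress from that pod is denied, so it can no longer reach the
# k8s API server.  The namespace and labels below are assumptions.
policy = client.V1NetworkPolicy(
    api_version="networking.k8s.io/v1",
    kind="NetworkPolicy",
    metadata=client.V1ObjectMeta(name="block-amalthea-egress"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(
            match_labels={"app.kubernetes.io/name": "amalthea"}
        ),
        policy_types=["Egress"],  # no egress rules listed -> deny all egress
    ),
)
client.NetworkingV1Api().create_namespaced_network_policy(
    namespace="renku", body=policy
)
```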
This gets close to the original error, in that I can see messages like this:
"Request attempt #3/9 failed; will retry: GET https://10.43.0.1:443/api/v1/namespaces/tasko/pods?watch=true&resourceVersion=8314911 -> ClientConnectorError(ConnectionKey(host='10.43.0.1'
However, the difference between this and the original error is that the resource watches are never shut down by Amalthea. I think this is because applying the network policy always results in a 5XX-like (connection) error, which kopf treats as eventually recoverable, so it keeps retrying.
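For reference, the retry cadence for these connection-level errors is tunable in kopf. A minimal sketch of the relevant startup settings; the setting names come from kopf's configuration options, the values are made up, and I have not verified that tuning them changes the outcome here:

```python
import kopf

@kopf.on.startup()
def configure(settings: kopf.OperatorSettings, **_):
    # Delays (in seconds) between the "Request attempt #N/M failed; will retry"
    # attempts seen in the log above; once they are exhausted the error
    # escalates to the watcher.  Example values only.
    settings.networking.error_backoffs = [1, 2, 5, 10, 30, 60]

    # Bound how long a single watch request stays open before kopf re-issues it.
    settings.watching.server_timeout = 300
    settings.watching.client_timeout = 330
```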
Based on the logs of the original error, the k8s API eventually returns a 403:
{"message": "Watcher for jupyterservers.v1alpha1.amalthea.dev@renku has failed: ('jupyterservers.amalthea.dev is forbidden: User \"system:serviceaccount:renku:renku-amalthea\" cannot watch resource \"jupyterservers\" in API group \"amalthea.dev\" in the namespace \"renku\": RBAC: role.rbac.authorization.k8s.io \"renku-amalthea\" not found', {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'jupyterservers.amalthea.dev is forbidden: User \"system:serviceaccount:renku:renku-amalthea\" cannot watch resource \"jupyterservers\" in API group \"amalthea.dev\" in the namespace \"renku\": RBAC: role.rbac.authorization.k8s.io \"renku-amalthea\" not found', 'reason': 'Forbidden', 'details': {'group': 'amalthea.dev', 'kind': 'jupyterservers'}, 'code': 403})", "exc_info": "Traceback (most recent call last):\n File \"/usr/local/lib/python3.8/site-packages/kopf/_cogs/clients/errors.py\", line 148, in check_response\n response.raise_for_status()\n File \"/usr/local/lib/python3.8/site-packages/aiohttp/client_reqrep.py\", line 1000, in raise_for_status\n raise ClientResponseError(\naiohttp.client_exceptions.ClientResponseError: 403, message='Forbidden', url=URL('https://10.43.0.1:443/apis/amalthea.dev/v1alpha1/namespaces/renku/jupyterservers?watch=true&resourceVersion=23516041')\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/usr/local/lib/python3.8/site-packages/kopf/_cogs/aiokits/aiotasks.py\", line 69, in guard\n await coro\n File \"/usr/local/lib/python3.8/site-packages/kopf/_core/reactor/queueing.py\", line 184, in watcher\n async for raw_event in stream:\n File \"/usr/local/lib/python3.8/site-packages/kopf/_cogs/clients/watching.py\", line 82, in infinite_watch\n async for raw_event in stream:\n File \"/usr/local/lib/python3.8/site-packages/kopf/_cogs/clients/watching.py\", line 182, in continuous_watch\n async for raw_input in stream:\n File \"/usr/local/lib/python3.8/site-packages/kopf/_cogs/clients/watching.py\", line 247, in watch_objs\n async for raw_input in api.stream(\n File \"/usr/local/lib/python3.8/site-packages/kopf/_cogs/clients/api.py\", line 200, in stream\n response = await request(\n File \"/usr/local/lib/python3.8/site-packages/kopf/_cogs/clients/auth.py\", line 45, in wrapper\n return await fn(*args, **kwargs, context=context)\n File \"/usr/local/lib/python3.8/site-packages/kopf/_cogs/clients/api.py\", line 85, in request\n await errors.check_response(response) # but do not parse it!\n File \"/usr/local/lib/python3.8/site-packages/kopf/_cogs/clients/errors.py\", line 150, in check_response\n raise cls(payload, status=response.status) from e\nkopf._cogs.clients.errors.APIForbiddenError: ('jupyterservers.amalthea.dev is forbidden: User \"system:serviceaccount:renku:renku-amalthea\" cannot watch resource \"jupyterservers\" in API group \"amalthea.dev\" in the namespace \"renku\": RBAC: role.rbac.authorization.k8s.io \"renku-amalthea\" not found', {'kind': 'Status', 'apiVersion': 'v1', 'metadata': {}, 'status': 'Failure', 'message': 'jupyterservers.amalthea.dev is forbidden: User \"system:serviceaccount:renku:renku-amalthea\" cannot watch resource \"jupyterservers\" in API group \"amalthea.dev\" in the namespace \"renku\": RBAC: role.rbac.authorization.k8s.io \"renku-amalthea\" not found', 'reason': 'Forbidden', 'details': {'group': 'amalthea.dev', 'kind': 'jupyterservers'}, 'code': 403})", "timestamp": "2022-02-07T16:58:55.870799+00:00", "severity": "error"}
--
I can confirm that using a NetworkPolicy will temporarily block Amalthea from reacting to new resources. But after the deny-all egress network policy is deleted, Amalthea recovers without needing a full pod delete/restart. So to replicate this we need to get a real error response from the k8s API.
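One way to provoke such a real error might be to temporarily delete the operator's Role, so that the watch requests fail with the same `RBAC: role ... not found` 403 as in the production log above, and then restore it once kopf has given up on the watch. A rough sketch with the kubernetes Python client; the role and namespace names are taken from the log, and the sleep duration is a guess:

```python
import time
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

namespace = "renku"         # namespace seen in the production log above
role_name = "renku-amalthea"

# Save the Role, delete it so the operator's watch requests get a genuine
# 403 Forbidden ("role ... not found"), then re-create it after a while.
role = rbac.read_namespaced_role(role_name, namespace)
rbac.delete_namespaced_role(role_name, namespace)
try:
    time.sleep(600)  # rough guess: long enough for kopf to give up on the watch
finally:
    role.metadata.resource_version = None  # must be cleared before re-creating
    rbac.create_namespaced_role(namespace=namespace, body=role)
```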
I have only seen this in CI or dev deployments so far.
Sometimes the operator just gets stuck: even though the JupyterServer resources can be created, the operator does not act to create the required child resources.
In all cases simply deleting the operator pod fixes this, but we should try to catch or avoid this behaviour altogether.
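One possible mitigation (not a fix for the root cause) would be to let Kubernetes restart the operator automatically instead of requiring a manual pod delete. Kopf can expose a liveness endpoint when started with `kopf run --liveness=http://0.0.0.0:8080/healthz`, and probe handlers feed data into it. A minimal sketch, with the caveat that I have not verified that this endpoint actually goes unhealthy when only the watcher task dies:

```python
import datetime
import kopf

# Served at the URL given to `kopf run --liveness=...`; the deployment's
# livenessProbe can then point at it so the kubelet restarts the pod if the
# endpoint stops answering.  The handler id and payload are arbitrary.
@kopf.on.probe(id="now")
def get_current_timestamp(**_) -> str:
    return datetime.datetime.utcnow().isoformat()
```

The operator's Deployment would then need `--liveness=http://0.0.0.0:8080/healthz` added to the command and a livenessProbe pointing at port 8080, path `/healthz`.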