berkeley-dsep-infra / data100-19s


Error causing hub restarts, gateway timeout error #40

Open ryanlovett opened 5 years ago

ryanlovett commented 5 years ago

A student got a 504 Gateway Timeout this morning at 9:05a, when the hub process restarted. `kubectl get pod -o yaml` on the hub pod showed the following (timestamps are UTC, so 17:05Z is 9:05a Pacific):

    lastState:
      terminated:
        containerID: docker://82c3e2a0014c8afb15db34575710ba8e2f141eecb0a76bb3a6e8f5a05b2d40ef
        exitCode: 1
        finishedAt: 2019-01-24T17:05:42Z
        reason: Error
        startedAt: 2019-01-24T03:24:36Z
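The `lastState` timestamps are in UTC. The small Python sketch below converts them, assuming Pacific Standard Time (UTC-8) is the relevant local time in January. It checks that the 17:05:42Z exit lines up with the 9:05a restart, and it shows how long the previous hub container had been running before it exited with code 1.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the hub pod's lastState.terminated block above.
STARTED_AT = "2019-01-24T03:24:36Z"
FINISHED_AT = "2019-01-24T17:05:42Z"

# Assumption: local time is Pacific Standard Time (UTC-8, no DST in January).
# A fixed offset avoids depending on a tz database being installed.
PST = timezone(timedelta(hours=-8), "PST")


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes UTC timestamp such as '2019-01-24T17:05:42Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


started = parse_k8s_time(STARTED_AT)
finished = parse_k8s_time(FINISHED_AT)
uptime = finished - started  # how long the previous hub container ran

print("container exited at", finished.astimezone(PST).strftime("%H:%M %Z"))
print("container had been running for", uptime)
```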

`kubectl logs -p` on the hub pod (the log of the previous, terminated container) showed the following, starting at 8:50a (16:50 UTC):

    [I 2019-01-24 16:50:41.350 JupyterHub log:158] 200 GET /hub/api/users/asdf/server/progress (asdf@10.240.0.8) 36092.09ms
    [I 2019-01-24 16:50:41.351 JupyterHub log:158] 200 GET /hub/api/users/asdf/server/progress (asdf@10.240.0.8) 12988.69ms
    [I 2019-01-24 16:50:41.577 JupyterHub proxy:301] Checking routes
    [I 2019-01-24 16:50:43.346 JupyterHub app:1701] Cleaning up 1 services...
    [I 2019-01-24 16:50:43.384 JupyterHub app:1713] Leaving single-user servers running
    [I 2019-01-24 16:50:43.384 JupyterHub app:1721] I didn't start the proxy, I can't clean it up
    [I 2019-01-24 16:50:43.385 JupyterHub app:1739] ...done
    2019-01-24 17:00:21,001 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='data100-19-data100-19s-3f9d81-02bd0856.hcp.westus2.azmk8s.io', port=443): Read timed out. (read timeout=None)",)': /api/v1/namespaces/data100-prod/pods/jupyter-qwer?gracePeriodSeconds=1
    WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='data100-19-data100-19s-3f9d81-02bd0856.hcp.westus2.azmk8s.io', port=443): Read timed out. (read timeout=None)",)': /api/v1/namespaces/data100-prod/pods/jupyter-qwer?gracePeriodSeconds=1
    2019-01-24 17:00:29,192 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='data100-19-data100-19s-3f9d81-02bd0856.hcp.westus2.azmk8s.io', port=443): Read timed out. (read timeout=None)",)': /api/v1/namespaces/data100-prod/pods/jupyter-asdf?gracePeriodSeconds=1
    WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='data100-19-data100-19s-3f9d81-02bd0856.hcp.westus2.azmk8s.io', port=443): Read timed out. (read timeout=None)",)': /api/v1/namespaces/data100-prod/pods/jupyter-asdf?gracePeriodSeconds=1
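The urllib3 warnings show requests to the Kubernetes API server timing out after the hub logged `...done`. `read timeout=None` means no client-side read timeout was set, so each request waits until the connection breaks. The `gracePeriodSeconds` parameter belongs to Kubernetes delete options, which suggests these are pod deletions, although the log does not show the HTTP method. The helper below is a hypothetical sketch, not part of JupyterHub, for pulling those stalls out of `kubectl logs -p` output. It keeps only the timestamped copy of each warning, since every warning is also logged a second time with the `WARNING:urllib3.connectionpool:` prefix. It then extracts the namespace and pod from the request path.

```python
import re
from datetime import datetime

# Matches the timestamped urllib3 retry warnings; the untimestamped
# "WARNING:urllib3.connectionpool:" duplicates do not start with a date,
# so they are skipped automatically.
RETRY_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} WARNING Retrying .*"
    r"Read timed out.*/api/v1/namespaces/(?P<ns>[^/]+)/pods/(?P<pod>[^?\s]+)"
)


def stalled_pod_requests(lines):
    """Return (timestamp, namespace, pod) for each read-timeout retry in a hub log."""
    hits = []
    for line in lines:
        m = RETRY_RE.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            hits.append((ts, m.group("ns"), m.group("pod")))
    return hits


# Abbreviated copies of three lines from the hub log above.
SAMPLE_LOG = [
    "2019-01-24 17:00:21,001 WARNING Retrying (Retry(total=2)) after connection broken by "
    "'ReadTimeoutError(\"Read timed out. (read timeout=None)\",)': "
    "/api/v1/namespaces/data100-prod/pods/jupyter-qwer?gracePeriodSeconds=1",
    "WARNING:urllib3.connectionpool:Retrying (Retry(total=2)) after connection broken by "
    "'ReadTimeoutError(\"Read timed out. (read timeout=None)\",)': "
    "/api/v1/namespaces/data100-prod/pods/jupyter-qwer?gracePeriodSeconds=1",
    "2019-01-24 17:00:29,192 WARNING Retrying (Retry(total=2)) after connection broken by "
    "'ReadTimeoutError(\"Read timed out. (read timeout=None)\",)': "
    "/api/v1/namespaces/data100-prod/pods/jupyter-asdf?gracePeriodSeconds=1",
]

for ts, ns, pod in stalled_pod_requests(SAMPLE_LOG):
    print(ts.isoformat(sep=" "), ns, pod)
```

Run against a full `kubectl logs -p` dump, this gives a timeline of which single-user pods the hub was still waiting on between its shutdown message and the container exit.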

Unfortunately, `kubectl get events` doesn't go back far enough to cover the restart; by default the API server keeps events for only one hour.
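Because events expire, one workaround is to snapshot `kubectl get events -n data100-prod -o json` to a file periodically (for example from cron) and inspect the snapshots after an incident. The sketch below filters such a dump for events about the hub pod. The `hub-` name prefix is an assumption based on the zero-to-jupyterhub chart naming its hub deployment `hub`, and the snapshot workflow itself is hypothetical.

```python
import json


def hub_events(events_json: str, name_prefix: str = "hub-"):
    """Return sorted (time, reason, message) tuples for Pod events whose
    involved object name starts with name_prefix (assumed hub pod naming)."""
    events = json.loads(events_json)
    rows = []
    for ev in events.get("items", []):
        obj = ev.get("involvedObject", {})
        if obj.get("kind") == "Pod" and obj.get("name", "").startswith(name_prefix):
            # Fall back to eventTime when lastTimestamp is empty.
            when = ev.get("lastTimestamp") or ev.get("eventTime") or ""
            rows.append((when, ev.get("reason", ""), ev.get("message", "")))
    # ISO 8601 UTC strings sort chronologically as plain strings.
    return sorted(rows)
```

A saved snapshot can then be checked with something like `hub_events(open(path).read())`, where `path` is wherever the snapshot was written.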

The hub log shows an orderly shutdown ("Cleaning up 1 services...", "...done"), so the hub seems to know it is stopping, but it isn't clear to me why. Hub restarts are inevitable, so it would be good to make them less intrusive; however, that is an upstream issue. I'm more concerned with why the hub may be restarting unnecessarily.