berkeley-dsep-infra / data100-19s


Elevated Pending Spawn Error Rate #50

Closed simon-mo closed 5 years ago

simon-mo commented 5 years ago

We are getting a lot of timeout errors for users trying to start their servers.

On the user side, it shows:

Server requested
Server requested
Server requested
Server requested
Server requested
Spawn failed: Timeout

Checking the hub log, the corresponding error is a read timeout on a request to the k8s API server. Maybe we are hitting the API server too much?

[E 2019-01-28 16:49:38.947 JupyterHub gen:974] Exception in Future <Future finished exception=ReadTimeoutError("HTTPSConnectionPool(host='data100-19-data100-19s-3f9d81-02bd0856.hcp.westus2.azmk8s.io', port=443): Read timed out. (read timeout=None)",)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "<string>", line 16, in start
      File "/usr/local/lib/python3.6/dist-packages/kubespawner/spawner.py", line 1691, in _start
        pod,
      File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/usr/local/lib/python3.6/dist-packages/kubespawner/spawner.py", line 1477, in asynchronize
        return method(*args, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/core_v1_api.py", line 6115, in create_namespaced_pod
        (data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/apis/core_v1_api.py", line 6206, in create_namespaced_pod_with_http_info
        collection_formats=collection_formats)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 321, in call_api
        _return_http_data_only, collection_formats, _preload_content, _request_timeout)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 155, in __call_api
        _request_timeout=_request_timeout)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/api_client.py", line 364, in request
        body=body)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 266, in POST
        body=body)
      File "/usr/local/lib/python3.6/dist-packages/kubernetes/client/rest.py", line 166, in request
        headers=headers)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/request.py", line 72, in request
        **urlopen_kw)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/request.py", line 150, in request_encode_body
        return self.urlopen(method, url, **extra_kw)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/poolmanager.py", line 323, in urlopen
        response = conn.urlopen(method, u.request_uri, **kw)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
        _stacktrace=sys.exc_info()[2])
      File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 367, in increment
        raise six.reraise(type(error), error, _stacktrace)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 686, in reraise
        raise value
      File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
        chunked=chunked)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 386, in _make_request
        self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
      File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 317, in _raise_timeout
        raise ReadTimeoutError(self, url, "Read timed out. (read timeout=%s)" % timeout_value)
    urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='data100-19-data100-19s-3f9d81-02bd0856.hcp.westus2.azmk8s.io', port=443): Read timed out. (read timeout=None)
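
When many spawns fail, it can help to check which API host each failure was talking to and whether a read timeout was set. Below is a minimal sketch that pulls those fields out of a hub log line like the one above. The function name `parse_read_timeout` and the regular expression are illustrative assumptions, not part of JupyterHub or urllib3; the pattern is only matched against the message format shown in this traceback.

```python
import re

# Matches the urllib3 ReadTimeoutError message format seen in the hub log above.
# This pattern is an assumption based on that one message, not an official format.
READ_TIMEOUT_RE = re.compile(
    r"HTTPSConnectionPool\(host='(?P<host>[^']+)', port=(?P<port>\d+)\): "
    r"Read timed out\. \(read timeout=(?P<timeout>[^)]+)\)"
)


def parse_read_timeout(line):
    """Return host, port and read timeout from a log line, or None if absent."""
    match = READ_TIMEOUT_RE.search(line)
    if match is None:
        return None
    timeout = match.group("timeout")
    return {
        "host": match.group("host"),
        "port": int(match.group("port")),
        # "None" in the message means no read timeout was configured.
        "timeout": None if timeout == "None" else float(timeout),
    }
```

Run over the hub log, this would show that the failing requests go to the public `azmk8s.io` hostname with `read timeout=None`, meaning the client itself set no read timeout on the pod-creation request.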

We do have enough resources; @ryanlovett added a node a few hours ago.

ryanlovett commented 5 years ago

We think this is fixed by changing the k8s API host URL: the public hostname referenced in the exception was replaced with an internal hostname.
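
The thread does not record how the URL was changed or what the internal hostname was. As an illustration only: pods in a Kubernetes cluster normally receive the API server's in-cluster address through the `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` environment variables, so an internal URL could be derived roughly like this. The helper name `in_cluster_api_url` is hypothetical.

```python
import os


def in_cluster_api_url(environ=None):
    """Build the in-cluster API server URL from the standard pod env vars.

    Returns None when not running inside a pod (variables missing).
    This is a sketch, not the change actually made in this issue.
    """
    env = os.environ if environ is None else environ
    host = env.get("KUBERNETES_SERVICE_HOST")
    port = env.get("KUBERNETES_SERVICE_PORT", "443")
    if not host:
        return None
    if ":" in host:
        # Bracket IPv6 literals so the URL stays valid.
        host = "[" + host + "]"
    return "https://{}:{}".format(host, port)
```

Comparing this value with the host in the exception is a quick way to tell whether the hub is still going through the public endpoint.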