dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

Worker Pod is in pending state #205

Closed: MnvS closed this issue 2 years ago

MnvS commented 4 years ago

Hi all, thanks so much for such a great project. I am trying to run the example provided at https://kubernetes.dask.org/en/latest/, but when running the Dask array example the worker pod stays in the Pending state and the Python code keeps looping through the error below.
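
For reference, dask_example.py is essentially the Dask array example from that page (a sketch; my exact chunk sizes and scale may differ slightly):

from dask.distributed import Client
from dask_kubernetes import KubeCluster
import dask.array as da

# Create the cluster from the worker pod spec and start a worker
cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.scale(1)

# Connect a client and compute the mean of an array of ones
client = Client(cluster)
array = da.ones((1000, 1000, 1000), chunks=(100, 100, 100))
print(array.mean().compute())  # expected output: 1.0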

Pod status on Kubernetes

[root@k8s-master example]# kubectl get pods -o wide --all-namespaces
NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
default       workerpod                            0/1     Pending   0          19m   <none>         <none>         <none>           <none>
kube-system   coredns-5644d7b6d9-f2xbk             1/1     Running   0          30m   10.40.0.2      worker-node2   <none>           <none>
kube-system   coredns-5644d7b6d9-npj4c             1/1     Running   0          30m   10.40.0.1      worker-node2   <none>           <none>
kube-system   etcd-k8s-master                      1/1     Running   0          29m   172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-apiserver-k8s-master            1/1     Running   0          29m   172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-controller-manager-k8s-master   1/1     Running   0          29m   172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-proxy-2h6g9                     1/1     Running   0          28m   172.16.0.114   worker-node2   <none>           <none>
kube-system   kube-proxy-97nlm                     1/1     Running   0          29m   172.16.0.31    worker-node1   <none>           <none>
kube-system   kube-proxy-hztq5                     1/1     Running   0          30m   172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-scheduler-k8s-master            1/1     Running   0          29m   172.16.0.76    k8s-master     <none>           <none>
kube-system   weave-net-9dddd                      2/2     Running   0          27m   172.16.0.31    worker-node1   <none>           <none>
kube-system   weave-net-d7f4p                      2/2     Running   0          27m   172.16.0.114   worker-node2   <none>           <none>
kube-system   weave-net-xpfqr                      2/2     Running   0          27m   172.16.0.76    k8s-master     <none>           <none>
[root@k8s-master example]#

Error Message

AssertionError
distributed.scheduler - INFO - Receive client connection: Client-4125e45c-0808-11ea-a403-12bd5ffa93ff
distributed.core - INFO - Starting established connection
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /usr/lib64/python3.7/asyncio/tasks.py:623> exception=AssertionError()>
Traceback (most recent call last):
  File "/usr/lib64/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/usr/local/lib/python3.7/site-packages/distributed/deploy/spec.py", line 42, in _
    assert self.status == "running"
AssertionError
jacobtomlinson commented 4 years ago

Could you run kubectl describe pod workerpod? It would be interesting to see why it isn't getting placed.

MnvS commented 4 years ago

Below is the output of the describe pod command:

[root@k8s-master example]# kubectl describe pod workerpod
Name:         workerpod
Namespace:    default
Priority:     0
Node:         <none>
Labels:       app=dask
          dask.org/cluster-name=dask-root-ede98556-b
          dask.org/component=worker
          foo=bar
          user=root
Annotations:  <none>
Status:       Pending
IP:
IPs:          <none>
Containers:
  dask:
    Image:      daskdev/dask:latest
    Port:       <none>
    Host Port:  <none>
    Args:
      dask-worker
      --nthreads
      2
      --no-bokeh
      --memory-limit
      6GB
      --death-timeout
      60
    Limits:
      cpu:     2
      memory:  6G
    Requests:
      cpu:     2
      memory:  6G
    Environment:
      EXTRA_PIP_PACKAGES:      fastparquet git+https://github.com/dask/distributed
      DASK_SCHEDULER_ADDRESS:  tcp://172.16.0.76:34581
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-7mjzj (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-7mjzj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-7mjzj
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     k8s.dask.org/dedicated=worker:NoSchedule
                 k8s.dask.org_dedicated=worker:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  35s   default-scheduler  0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient memory.
[root@k8s-master example]# kubectl get nodes
NAME           STATUS   ROLES    AGE     VERSION
k8s-master     Ready    master   7m19s   v1.16.3
worker-node1   Ready    <none>   5m38s   v1.16.2
worker-node2   Ready    <none>   5m30s   v1.16.2
[root@k8s-master example]# kubectl get nodes
jacobtomlinson commented 4 years ago

3 Insufficient cpu, 3 Insufficient memory.

It looks like your cluster is not able to fulfil the requirements you have set for your pod. You will need to either use bigger nodes, enable autoscaling, or reduce your worker resource requirements.
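
For example, lowering the requests and limits in the resources section of your worker-spec.yml might look like this (a sketch; the numbers are illustrative, and the --memory-limit worker argument should be kept in line with the memory request):

    resources:
      limits:
        cpu: "1"
        memory: 3G
      requests:
        cpu: "1"
        memory: 3G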

What kind of cluster are you using (GKE, EKS, Bare metal, local, etc)?

MnvS commented 4 years ago

Thanks for the reply. I am able to run the Dask array example after increasing the memory and CPUs on the nodes, but I am still getting the errors below in the output (along with the expected result of 1.0). I am using a local Kubernetes cluster on EC2 instances.

(base) [root@k8s-master example]# python dask_example.py
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:   tcp://172.16.0.76:34237
distributed.scheduler - INFO - Receive client connection: Client-0eab0792-0b04-11ea-90ce-12bd5ffa93ff
distributed.core - INFO - Starting established connection
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /root/miniconda3/lib/python3.7/asyncio/tasks.py:623> exception=AssertionError()>
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 42, in _
    assert self.status == "running"
AssertionError

…..Same errors repeated...

…..Received output...

1.0

….errors repeated...

tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fc369298e90>>, <Task finished coro=<SpecCluster._correct_state_internal() done, defined at /root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py:284> exception=AssertionError()>)
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
  File "/root/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 317, in _correct_state_internal
await w  # for tornado gen.coroutine support
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 42, in _
assert self.status == "running"
AssertionError
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
distributed.scheduler - INFO - Remove worker tcp://10.44.0.1:38635
distributed.core - INFO - Removing comms to tcp://10.44.0.1:38635
distributed.scheduler - INFO - Lost all workers
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 186, in ignoring
yield
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 574, in close_clusters
cluster.close(timeout=10)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 83, in close
return self.sync(self._close, callback_timeout=timeout)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 162, in sync
return sync(self.loop, func, *args, **kwargs)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 334, in sync
raise exc.with_traceback(tb)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 318, in f
result[0] = yield future
  File "/root/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 372, in _close
assert w.status == "closed", w.status
AssertionError: created
2019-11-19 20:28:16,181 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc34d55dd10>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-ead3c94d-f%2Cuser%3Droot%2Capp%3Ddask

...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/weakref.py", line 648, in _exitfunc
f()
  File "/root/miniconda3/lib/python3.7/weakref.py", line 572, in __call__
return info.func(*info.args, **(info.kwargs or {}))
  File "/root/miniconda3/lib/python3.7/site-packages/dask_kubernetes/core.py", line 623, in _cleanup_resources
pods = core_api.list_namespaced_pod(namespace, label_selector=format_labels(labels))
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 12372, in list_namespaced_pod
(data) = self.list_namespaced_pod_with_http_info(namespace, **kwargs)
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 12472, in list_namespaced_pod_with_http_info
collection_formats=collection_formats)
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
_request_timeout=_request_timeout)
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 355, in request
headers=headers)
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in GET
query_params=query_params)

...

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='localhost', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-ead3c94d-f%2Cuser%3Droot%2Capp%3Ddask (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc362d1a090>: Failed to establish a new connection: [Errno 111] Connection refused'))
MnvS commented 4 years ago

Update: the worker pod is now giving an Error status, as shown below:

(base) [root@k8s-master example]# ls
dask_example.py  worker-spec.yml
(base) [root@k8s-master example]# nohup python dask_example.py &
[1] 3660
(base) [root@k8s-master example]# cat nohup.out
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:   tcp://172.16.0.76:40119
distributed.scheduler - INFO - Receive client connection: Client-df4caa18-0bc8-11ea-8e4c-12bd5ffa93ff
distributed.core - INFO - Starting established connection
(base) [root@k8s-master example]# kubectl get pods -o wide --all-namespaces
NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
default       workerpod                            1/1     Running   0          70s     10.32.0.2      worker-node1   <none>           <none>
kube-system   coredns-5644d7b6d9-l4jsd             1/1     Running   0          8m19s   10.32.0.4      k8s-master     <none>           <none>
kube-system   coredns-5644d7b6d9-q679h             1/1     Running   0          8m19s   10.32.0.3      k8s-master     <none>           <none>
kube-system   etcd-k8s-master                      1/1     Running   0          7m16s   172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-apiserver-k8s-master            1/1     Running   0          7m1s    172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-controller-manager-k8s-master   1/1     Running   0          7m27s   172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-proxy-ctgj8                     1/1     Running   0          5m7s    172.16.0.114   worker-node2   <none>           <none>
kube-system   kube-proxy-f78bm                     1/1     Running   0          8m18s   172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-proxy-ksk59                     1/1     Running   0          5m15s   172.16.0.31    worker-node1   <none>           <none>
kube-system   kube-scheduler-k8s-master            1/1     Running   0          7m2s    172.16.0.76    k8s-master     <none>           <none>
kube-system   weave-net-q2zwn                      2/2     Running   0          6m22s   172.16.0.76    k8s-master     <none>           <none>
kube-system   weave-net-r9tzs                      2/2     Running   0          5m15s   172.16.0.31    worker-node1   <none>           <none>
kube-system   weave-net-tm8xx                      2/2     Running   0          5m7s    172.16.0.114   worker-node2   <none>           <none>
(base) [root@k8s-master example]# kubectl get pods -o wide --all-namespaces
NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
default       workerpod                            0/1     Error     0          4m23s   10.32.0.2      worker-node1   <none>           <none>
kube-system   coredns-5644d7b6d9-l4jsd             1/1     Running   0          11m     10.32.0.4      k8s-master     <none>           <none>
kube-system   coredns-5644d7b6d9-q679h             1/1     Running   0          11m     10.32.0.3      k8s-master     <none>           <none>
kube-system   etcd-k8s-master                      1/1     Running   0          10m     172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-apiserver-k8s-master            1/1     Running   0          10m     172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-controller-manager-k8s-master   1/1     Running   0          10m     172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-proxy-ctgj8                     1/1     Running   0          8m20s   172.16.0.114   worker-node2   <none>           <none>
kube-system   kube-proxy-f78bm                     1/1     Running   0          11m     172.16.0.76    k8s-master     <none>           <none>
kube-system   kube-proxy-ksk59                     1/1     Running   0          8m28s   172.16.0.31    worker-node1   <none>           <none>
kube-system   kube-scheduler-k8s-master            1/1     Running   0          10m     172.16.0.76    k8s-master     <none>           <none>
kube-system   weave-net-q2zwn                      2/2     Running   0          9m35s   172.16.0.76    k8s-master     <none>           <none>
kube-system   weave-net-r9tzs                      2/2     Running   0          8m28s   172.16.0.31    worker-node1   <none>           <none>
kube-system   weave-net-tm8xx                      2/2     Running   0          8m20s   172.16.0.114   worker-node2   <none>           <none>
(base) [root@k8s-master example]# cat nohup.out
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:   tcp://172.16.0.76:40119
distributed.scheduler - INFO - Receive client connection: Client-df4caa18-0bc8-11ea-8e4c-12bd5ffa93ff
distributed.core - INFO - Starting established connection
 (base) [root@k8s-master example]# kubectl describe pod workerpod
Name:         workerpod
Namespace:    default
Priority:     0
Node:         worker-node1/172.16.0.31
Start Time:   Wed, 20 Nov 2019 19:06:36 +0000
Labels:       app=dask
          dask.org/cluster-name=dask-root-99dcf768-4
          dask.org/component=worker
          foo=bar
          user=root
Annotations:  <none>
Status:       Failed
 IP:           10.32.0.2
 IPs:
   IP:  10.32.0.2
Containers:
  dask:
    Container ID:  docker://578dc575fc263c4a3889a4f2cb5e06cd82a00e03cfc6acfd7a98fef703421390
    Image:         daskdev/dask:latest
    Image ID:      docker-pullable://daskdev/dask@sha256:0a936daa94c82cea371c19a2c90c695688ab4e1e7acc905f8b30dfd419adfb6f
Port:          <none>
Host Port:     <none>
Args:
  dask-worker
  --nthreads
  2
  --no-bokeh
  --memory-limit
  6GB
  --death-timeout
  60
State:          Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Wed, 20 Nov 2019 19:06:38 +0000
  Finished:     Wed, 20 Nov 2019 19:08:20 +0000
Ready:          False
Restart Count:  0
Limits:
  cpu:     2
  memory:  6G
Requests:
  cpu:     2
  memory:  6G
Environment:
  EXTRA_PIP_PACKAGES:      fastparquet git+https://github.com/dask/distributed
  DASK_SCHEDULER_ADDRESS:  tcp://172.16.0.76:40119
Mounts:
  /var/run/secrets/kubernetes.io/serviceaccount from default-token-p9f9v (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  default-token-p9f9v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-p9f9v
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     k8s.dask.org/dedicated=worker:NoSchedule
                 k8s.dask.org_dedicated=worker:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age    From                   Message
  ----    ------     ----   ----                   -------
  Normal  Scheduled  5m47s  default-scheduler      Successfully assigned default/workerpod to worker-node1
  Normal  Pulled     5m45s  kubelet, worker-node1  Container image "daskdev/dask:latest" already present on machine
  Normal  Created    5m45s  kubelet, worker-node1  Created container dask
  Normal  Started    5m45s  kubelet, worker-node1  Started container dask
(base) [root@k8s-master example]#
(base) [root@k8s-master example]# kubectl get events
LAST SEEN   TYPE     REASON                    OBJECT              MESSAGE
21m         Normal   Starting                  node/k8s-master     Starting kubelet.
21m         Normal   NodeHasSufficientMemory   node/k8s-master     Node k8s-master status is now: NodeHasSufficientMemory
21m         Normal   NodeHasNoDiskPressure     node/k8s-master     Node k8s-master status is now: NodeHasNoDiskPressure
21m         Normal   NodeHasSufficientPID      node/k8s-master     Node k8s-master status is now: NodeHasSufficientPID
21m         Normal   NodeAllocatableEnforced   node/k8s-master     Updated Node Allocatable limit across pods
21m         Normal   RegisteredNode            node/k8s-master     Node k8s-master event: Registered Node k8s-master in Controller
21m         Normal   Starting                  node/k8s-master     Starting kube-proxy.
18m         Normal   Starting                  node/worker-node1   Starting kubelet.
18m         Normal   NodeHasSufficientMemory   node/worker-node1   Node worker-node1 status is now: NodeHasSufficientMemory
18m         Normal   NodeHasNoDiskPressure     node/worker-node1   Node worker-node1 status is now: NodeHasNoDiskPressure
18m         Normal   NodeHasSufficientPID      node/worker-node1   Node worker-node1 status is now: NodeHasSufficientPID
18m         Normal   NodeAllocatableEnforced   node/worker-node1   Updated Node Allocatable limit across pods
18m         Normal   Starting                  node/worker-node1   Starting kube-proxy.
18m         Normal   RegisteredNode            node/worker-node1   Node worker-node1 event: Registered Node worker-node1 in Controller
17m         Normal   NodeReady                 node/worker-node1   Node worker-node1 status is now: NodeReady
18m         Normal   Starting                  node/worker-node2   Starting kubelet.
18m         Normal   NodeHasSufficientMemory   node/worker-node2   Node worker-node2 status is now: NodeHasSufficientMemory
18m         Normal   NodeHasNoDiskPressure     node/worker-node2   Node worker-node2 status is now: NodeHasNoDiskPressure
18m         Normal   NodeHasSufficientPID      node/worker-node2   Node worker-node2 status is now: NodeHasSufficientPID
18m         Normal   NodeAllocatableEnforced   node/worker-node2   Updated Node Allocatable limit across pods
18m         Normal   Starting                  node/worker-node2   Starting kube-proxy.
17m         Normal   RegisteredNode            node/worker-node2   Node worker-node2 event: Registered Node worker-node2 in Controller
17m         Normal   NodeReady                 node/worker-node2   Node worker-node2 status is now: NodeReady
14m         Normal   Scheduled                 pod/workerpod       Successfully assigned default/workerpod to worker-node1
14m         Normal   Pulled                    pod/workerpod       Container image "daskdev/dask:latest" already present on machine
14m         Normal   Created                   pod/workerpod       Created container dask
14m         Normal   Started                   pod/workerpod       Started container dask
(base) [root@k8s-master example]#
jacobtomlinson commented 4 years ago

Thanks for providing the extra info. There definitely seems to be something up with your k8s cluster, as the pod is erroring but not giving much reasoning behind it.

When you upped the memory, was there actually enough memory for it to use?

MnvS commented 4 years ago

Memory is increased to 64 GB on all 3 nodes, so that should not be the issue. The worker pod logs show that it is not able to resolve github.com, so it could be a DNS issue (output below):

(base) [root@k8s-master example]# free -mh
              total        used        free      shared  buff/cache   available
Mem:            62G        1.2G         59G        992K        1.2G         60G
Swap:            0B          0B          0B

(base) [root@k8s-master example]# kubectl logs workerpod
+ '[' '' ']'
+ '[' -e /opt/app/environment.yml ']'
no environment.yml
+ echo 'no environment.yml'
+ '[' '' ']'
EXTRA_PIP_PACKAGES environment variable found.  Installing.
+ '[' 'fastparquet git+https://github.com/dask/distributed' ']'
+ echo 'EXTRA_PIP_PACKAGES environment variable found.  Installing.'
+ /opt/conda/bin/pip install fastparquet git+https://github.com/dask/distributed
Collecting git+https://github.com/dask/distributed
  Cloning https://github.com/dask/distributed to /tmp/pip-req-build-auau5085
  Running command git clone -q https://github.com/dask/distributed /tmp/pip-req-build-auau5085
  fatal: unable to access 'https://github.com/dask/distributed/': Could not resolve host: github.com
ERROR: Command errored out with exit status 128: git clone -q https://github.com/dask/distributed /tmp/pip-req-build-auau5085 Check the logs for full command output.
+ exec dask-worker --nthreads 2 --no-bokeh --memory-limit 6GB --death-timeout 60
/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py:252: UserWarning: The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard.
  "The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard. "
distributed.nanny - INFO -         Start Nanny at: 'tcp://10.32.0.3:45001'
distributed.worker - INFO -       Start worker at:      tcp://10.32.0.3:39147
distributed.worker - INFO -          Listening to:      tcp://10.32.0.3:39147
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:41719
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                    6.00 GB
distributed.worker - INFO -       Local Directory:           /worker-2vb5q4k8
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:41719
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:41719
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:41719
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:41719
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:41719
distributed.nanny - INFO - Closing Nanny at 'tcp://10.32.0.3:45001'
distributed.worker - INFO - Stopping worker at tcp://10.32.0.3:39147
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
(base) [root@k8s-master example]#

DNS check while the worker pod was in the Running state:

(base) [root@k8s-master example]# kubectl exec workerpod cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
(base) [root@k8s-master example]# kubectl exec workerpod nslookup github.com
OCI runtime exec failed: exec failed: container_linux.go:345: starting container process caused "exec: \"nslookup\": executable file not found in $PATH": unknown
command terminated with exit code 126

Not sure how to get the executable onto the $PATH of the Dask worker pod. nslookup is installed and working on my master host:

(base) [root@k8s-master example]# nslookup github.com
Server:         172.31.0.2
Address:        172.31.0.2#53

Non-authoritative answer:
Name:   github.com
Address: 140.82.114.3
jacobtomlinson commented 4 years ago

Yes, it seems like your cluster is not able to access the internet. That is a requirement here.

OCI runtime exec failed: exec failed: container_linux.go:345: starting container process caused "exec: \"nslookup\": executable file not found in $PATH": unknown

This is failing because nslookup is not included in the base image. It's quite a minimal image, so you won't find many network tools. The easiest way to test connectivity is probably to use Python requests, as we know that will be available. Here's an example running Docker locally; it should be almost the same on k8s.

$ docker exec -it 8d22 python -c "import requests; print(requests.get('https://github.com').headers)"

{'Date': 'Fri, 22 Nov 2019 08:58:52 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked',
'Server': 'GitHub.com', 'Status': '200 OK', 'Vary': 'X-PJAX, Accept-Encoding', 'ETag': 'W/"8ec94cb60917f9348f3965fa3f6
341fe"', 'Cache-Control': 'max-age=0, private, must-revalidate', 'Set-Cookie': 'has_recent_activity=1; path=/; expires
=Fri, 22 Nov 2019 09:58:52 -0000, _octo=GH1.1.358802128.1574413132; domain=.github.com; path=/; expires=Mon, 22 Nov 20
21 08:58:52 -0000, logged_in=no; domain=.github.com; path=/; expires=Tue, 22 Nov 2039 08:58:52 -0000; secure; HttpOnly
, _gh_sess=NXBjTmhrNURKZnRrS294Q1llTDU2c25tVFlVVTNCZTJFKzIyeFN0KzJHM1lGdUR4d1F5Zzh6aWgySExFNXBOVDZjeXZHLzhZaHJCMFhEWjk
ra252NEZML2sySHRkdnh3TE8vUC9Ia21iRHFHYUNnQlVveDdRMTRndzV5OStoL0daandQQis1c0ppQ05RVDA3ZzFZNWNRPT0tLWxRUUhwOHJ2REpuNTc1c
Dhobk9hNEE9PQ%3D%3D--2f084d8c69d414d743a33055c134c010f40da5de; path=/; secure; HttpOnly', 'X-Request-Id': '73fb8c58-58
0f-4e74-b072-971356412e67', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Opti
ons': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Referrer-Policy': 'origin-whe
n-cross-origin, strict-origin-when-cross-origin', 'Expect-CT': 'max-age=2592000, report-uri="https://api.github.com/_p
rivate/browser/errors"', 'Content-Security-Policy': "default-src 'none'; base-uri 'self'; block-all-mixed-content; con
nect-src 'self' uploads.github.com www.githubstatus.com collector.githubapp.com api.github.com www.google-analytics.co
m github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-man
ifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com wss://live.github.com; font-sr
c github.githubassets.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; frame-src render.git
hubusercontent.com; img-src 'self' data: github.githubassets.com identicons.github.com collector.githubapp.com github-
cloud.s3.amazonaws.com *.githubusercontent.com customer-stories-feed.github.com spotlights-feed.github.com; manifest-s
rc 'self'; media-src 'none'; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com", '
Content-Encoding': 'gzip', 'X-GitHub-Request-Id': 'A2E2:1D58E:B8DF62:11696C7:5DD7A34C'}
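
On Kubernetes the equivalent check can be run with kubectl exec against the worker pod while it is running (a sketch, assuming the pod is still named workerpod):

$ kubectl exec workerpod -- python -c "import requests; print(requests.get('https://github.com').status_code)"

If that fails with a connection or DNS error, the pod cannot reach the internet.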
MnvS commented 4 years ago

Thanks so much for the reply. The nslookup issue is resolved after re-installing Kubernetes and Dask; now I am getting issues while building fastparquet on the worker pod. Logs below:

(base) [root@k8s-master example]# kubectl logs workerpod
+ '[' '' ']'
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
no environment.yml
+ '[' '' ']'
+ '[' 'fastparquet git+https://github.com/dask/distributed' ']'
+ echo 'EXTRA_PIP_PACKAGES environment variable found.  Installing.'
+ /opt/conda/bin/pip install fastparquet git+https://github.com/dask/distributed
EXTRA_PIP_PACKAGES environment variable found.  Installing.
Collecting git+https://github.com/dask/distributed
  Cloning https://github.com/dask/distributed to /tmp/pip-req-build-9pgvdhjf
  Running command git clone -q https://github.com/dask/distributed /tmp/pip-req-build-9pgvdhjf
Collecting fastparquet
  Downloading https://files.pythonhosted.org/packages/58/49/dccb790fa17ab3fbf84a6b848050083c7a1899e9586000e34e3e4fbf5538/fastparquet-0.3.2.tar.gz (151kB)
Requirement already satisfied: click>=6.6 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (7.0)
Requirement already satisfied: cloudpickle>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (1.2.2)
Requirement already satisfied: dask>=2.7.0 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (2.8.0)
Requirement already satisfied: msgpack in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (0.6.2)
Requirement already satisfied: psutil>=5.0 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (5.6.5)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (2.1.0)
Requirement already satisfied: tblib in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (1.4.0)
Requirement already satisfied: toolz>=0.7.4 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (0.10.0)
Requirement already satisfied: tornado>=5 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (6.0.3)
Requirement already satisfied: zict>=0.1.3 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (1.0.0)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (5.1.2)
Requirement already satisfied: pandas>=0.19 in /opt/conda/lib/python3.7/site-packages (from fastparquet) (0.25.2)
Collecting numba>=0.28 (from fastparquet)
  Downloading https://files.pythonhosted.org/packages/57/66/7ebc88e87b4ddf9b1c204e24d467cb0d13a7a890ecb4f83d20949f768929/numba-0.46.0-cp37-cp37m-manylinux1_x86_64.whl (3.6MB)
Requirement already satisfied: numpy>=1.11 in /opt/conda/lib/python3.7/site-packages (from fastparquet) (1.17.3)
Collecting thrift>=0.11.0 (from fastparquet)
  Downloading https://files.pythonhosted.org/packages/97/1e/3284d19d7be99305eda145b8aa46b0c33244e4a496ec66440dac19f8274d/thrift-0.13.0.tar.gz (59kB)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from fastparquet) (1.12.0)
Requirement already satisfied: heapdict in /opt/conda/lib/python3.7/site-packages (from zict>=0.1.3->distributed==2.8.0+8.g5b33d54c) (1.0.1)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.19->fastparquet) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.19->fastparquet) (2.8.1)
Collecting llvmlite>=0.30.0dev0 (from numba>=0.28->fastparquet)
  Downloading https://files.pythonhosted.org/packages/1f/3e/642ffb29ed35ca5e93f171ba327452bdee81ec76f2d711ef0f15b411928a/llvmlite-0.30.0-cp37-cp37m-manylinux1_x86_64.whl (20.2MB)
Building wheels for collected packages: fastparquet, distributed, thrift
  Building wheel for fastparquet (setup.py): started
  Building wheel for fastparquet (setup.py): finished with status 'error'
  ERROR: Command errored out with exit status 1:
   command: /opt/conda/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-m6berq44/fastparquet/setup.py'"'"'; __file__='"'"'/tmp/pip-install-m6berq44/fastparquet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-tcex8z1e --python-tag cp37
       cwd: /tmp/pip-install-m6berq44/fastparquet/
  Complete output (61 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.7
  creating build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/__init__.py -> build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/api.py -> build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/compression.py -> build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/converted_types.py -> build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/core.py -> build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/dataframe.py -> build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/encoding.py -> build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/schema.py -> build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/thrift_structures.py -> build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/util.py -> build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/writer.py -> build/lib.linux-x86_64-3.7/fastparquet
  running egg_info
  writing fastparquet.egg-info/PKG-INFO
  writing dependency_links to fastparquet.egg-info/dependency_links.txt
  writing requirements to fastparquet.egg-info/requires.txt
  writing top-level names to fastparquet.egg-info/top_level.txt
  reading manifest file 'fastparquet.egg-info/SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  no previously-included directories found matching 'docs/_build'
  writing manifest file 'fastparquet.egg-info/SOURCES.txt'
  copying fastparquet/parquet.thrift -> build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/speedups.c -> build/lib.linux-x86_64-3.7/fastparquet
  copying fastparquet/speedups.pyx -> build/lib.linux-x86_64-3.7/fastparquet
  creating build/lib.linux-x86_64-3.7/fastparquet/benchmarks
  copying fastparquet/benchmarks/columns.py -> build/lib.linux-x86_64-3.7/fastparquet/benchmarks
  creating build/lib.linux-x86_64-3.7/fastparquet/parquet_thrift
  copying fastparquet/parquet_thrift/__init__.py -> build/lib.linux-x86_64-3.7/fastparquet/parquet_thrift
  creating build/lib.linux-x86_64-3.7/fastparquet/parquet_thrift/parquet
  copying fastparquet/parquet_thrift/parquet/__init__.py -> build/lib.linux-x86_64-3.7/fastparquet/parquet_thrift/parquet
  copying fastparquet/parquet_thrift/parquet/constants.py -> build/lib.linux-x86_64-3.7/fastparquet/parquet_thrift/parquet
  copying fastparquet/parquet_thrift/parquet/ttypes.py -> build/lib.linux-x86_64-3.7/fastparquet/parquet_thrift/parquet
  creating build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/__init__.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_api.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_aroundtrips.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_compression.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_converted_types.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_dataframe.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_encoding.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_output.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_partition_filters_specialstrings.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_read.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_schema.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_speedups.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_thrift_structures.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_util.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/test_with_n.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  copying fastparquet/test/util.py -> build/lib.linux-x86_64-3.7/fastparquet/test
  running build_ext
  building 'fastparquet.speedups' extension
  creating build/temp.linux-x86_64-3.7
  creating build/temp.linux-x86_64-3.7/fastparquet
  gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.7m -I/opt/conda/lib/python3.7/site-packages/numpy/core/include -c fastparquet/speedups.c -o build/temp.linux-x86_64-3.7/fastparquet/speedups.o
  unable to execute 'gcc': No such file or directory
  error: command 'gcc' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for fastparquet
  Running setup.py clean for fastparquet
  Building wheel for distributed (setup.py): started
  Building wheel for distributed (setup.py): finished with status 'done'
  Created wheel for distributed: filename=distributed-2.8.0+8.g5b33d54c-cp37-none-any.whl size=568764 sha256=9712974396e1221fa5dd195616e85031da70894222c2c7ff574bcfb318b5f80c
  Stored in directory: /tmp/pip-ephem-wheel-cache-v5jnd4bs/wheels/aa/21/a7/d9548d684f8e074360b7ad1bd8633843dba9658288b68b3dd5
  Building wheel for thrift (setup.py): started
  Building wheel for thrift (setup.py): finished with status 'done'
  Created wheel for thrift: filename=thrift-0.13.0-cp37-none-any.whl size=154884 sha256=c32af6aa5c4cfced68fadc2997e173f14ed0595a4bb9bb407eb7ef62794fafd8
  Stored in directory: /root/.cache/pip/wheels/02/a2/46/689ccfcf40155c23edc7cdbd9de488611c8fdf49ff34b1706e
Successfully built distributed thrift
Failed to build fastparquet
Installing collected packages: llvmlite, numba, thrift, fastparquet, distributed
  Running setup.py install for fastparquet: started
    Running setup.py install for fastparquet: finished with status 'error'
    ERROR: Command errored out with exit status 1:
     command: /opt/conda/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-m6berq44/fastparquet/setup.py'"'"'; __file__='"'"'/tmp/pip-install-m6berq44/fastparquet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-qrgpngpc/install-record.txt --single-version-externally-managed --compile
         cwd: /tmp/pip-install-m6berq44/fastparquet/
    Complete output (19 lines):
    running install
    running build
    running build_py
    running egg_info
    writing fastparquet.egg-info/PKG-INFO
    writing dependency_links to fastparquet.egg-info/dependency_links.txt
    writing requirements to fastparquet.egg-info/requires.txt
    writing top-level names to fastparquet.egg-info/top_level.txt
    reading manifest file 'fastparquet.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    no previously-included directories found matching 'docs/_build'
    writing manifest file 'fastparquet.egg-info/SOURCES.txt'
    running build_ext
    building 'fastparquet.speedups' extension
    creating build/temp.linux-x86_64-3.7
    creating build/temp.linux-x86_64-3.7/fastparquet
    gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.7m -I/opt/conda/lib/python3.7/site-packages/numpy/core/include -c fastparquet/speedups.c -o build/temp.linux-x86_64-3.7/fastparquet/speedups.o
    unable to execute 'gcc': No such file or directory
    error: command 'gcc' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /opt/conda/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-m6berq44/fastparquet/setup.py'"'"'; __file__='"'"'/tmp/pip-install-m6berq44/fastparquet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-qrgpngpc/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
+ exec dask-worker --nthreads 2 --no-bokeh --memory-limit 6GB --death-timeout 60
/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py:252: UserWarning: The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard.
  "The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard. "
distributed.nanny - INFO -         Start Nanny at: 'tcp://10.44.0.1:46597'
distributed.worker - INFO -       Start worker at:      tcp://10.44.0.1:36897
distributed.worker - INFO -          Listening to:      tcp://10.44.0.1:36897
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:42143
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                    6.00 GB
distributed.worker - INFO -       Local Directory:           /worker-vlqvrk15
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:42143
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:42143
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:42143
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:42143
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:42143
distributed.nanny - INFO - Closing Nanny at 'tcp://10.44.0.1:46597'
distributed.worker - INFO - Stopping worker at tcp://10.44.0.1:36897
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
(base) [root@k8s-master example]#
jacobtomlinson commented 4 years ago

Looks like you also need to specify gcc in the EXTRA_APT_PACKAGES env var.

MnvS commented 4 years ago

After adding the following:

env:
  - name: EXTRA_APT_PACKAGES
    value: gcc
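
(For context, this env entry sits under the worker container in worker-spec.yml, alongside the EXTRA_PIP_PACKAGES entry already shown in the pod description above; a sketch of that section, reconstructed from the describe output:)

spec:
  containers:
    - name: dask
      image: daskdev/dask:latest
      args: [dask-worker, --nthreads, '2', --no-bokeh, --memory-limit, 6GB, --death-timeout, '60']
      env:
        - name: EXTRA_APT_PACKAGES
          value: gcc
        - name: EXTRA_PIP_PACKAGES
          value: fastparquet git+https://github.com/dask/distributed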

The gcc-related issue is resolved, but the worker pod is still giving an Error status.

(base) [root@k8s-master example]# kubectl get service,pods -o wide --all-namespaces
NAMESPACE     NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE     SELECTOR
default       service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP                  6h32m   <none>
kube-system   service/kube-dns     ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   6h32m   k8s-app=kube-dns

NAMESPACE     NAME                                     READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
default       pod/workerpod                            0/1     Error     0          16m     10.44.0.1      worker-node1   <none>           <none>
kube-system   pod/coredns-5644d7b6d9-82kqk             1/1     Running   0          6h32m   10.32.0.3      k8s-master     <none>           <none>
kube-system   pod/coredns-5644d7b6d9-xh4pg             1/1     Running   0          6h32m   10.32.0.4      k8s-master     <none>           <none>
kube-system   pod/etcd-k8s-master                      1/1     Running   0          6h31m   172.16.0.76    k8s-master     <none>           <none>
kube-system   pod/kube-apiserver-k8s-master            1/1     Running   0          6h31m   172.16.0.76    k8s-master     <none>           <none>
kube-system   pod/kube-controller-manager-k8s-master   1/1     Running   0          6h31m   172.16.0.76    k8s-master     <none>           <none>
kube-system   pod/kube-proxy-25nnf                     1/1     Running   0          6h28m   172.16.0.114   worker-node2   <none>           <none>
kube-system   pod/kube-proxy-cr84h                     1/1     Running   0          6h28m   172.16.0.31    worker-node1   <none>           <none>
kube-system   pod/kube-proxy-lvs9g                     1/1     Running   0          6h32m   172.16.0.76    k8s-master     <none>           <none>
kube-system   pod/kube-scheduler-k8s-master            1/1     Running   0          6h30m   172.16.0.76    k8s-master     <none>           <none>
kube-system   pod/weave-net-d5jsg                      2/2     Running   1          6h28m   172.16.0.31    worker-node1   <none>           <none>
kube-system   pod/weave-net-nnfzh                      2/2     Running   0          6h29m   172.16.0.76    k8s-master     <none>           <none>
kube-system   pod/weave-net-zcv8v                      2/2     Running   1          6h28m   172.16.0.114   worker-node2   <none>           <none>
(base) [root@k8s-master example]#
(base) [root@k8s-master example]# kubectl logs workerpod
+ '[' gcc ']'
+ echo 'EXTRA_APT_PACKAGES environment variable found.  Installing.'
+ apt update -y
EXTRA_APT_PACKAGES environment variable found.  Installing.

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://deb.debian.org/debian buster InRelease [122 kB]
Get:3 http://security.debian.org/debian-security buster/updates/main amd64 Packages [158 kB]
Get:4 http://deb.debian.org/debian buster-updates InRelease [49.3 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 Packages [7908 kB]
Fetched 8302 kB in 17s (497 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
21 packages can be upgraded. Run 'apt list --upgradable' to see them.
+ apt install -y gcc

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  binutils binutils-common binutils-x86-64-linux-gnu cpp cpp-8 gcc-8 libasan5
  libatomic1 libbinutils libc-dev-bin libc6-dev libcc1-0 libgcc-8-dev libgomp1
  libisl19 libitm1 liblsan0 libmpc3 libmpfr6 libmpx2 libquadmath0 libtsan0
  libubsan1 linux-libc-dev manpages manpages-dev
Suggested packages:
  binutils-doc cpp-doc gcc-8-locales gcc-multilib make autoconf automake
  libtool flex bison gdb gcc-doc gcc-8-multilib gcc-8-doc libgcc1-dbg
  libgomp1-dbg libitm1-dbg libatomic1-dbg libasan5-dbg liblsan0-dbg
  libtsan0-dbg libubsan1-dbg libmpx2-dbg libquadmath0-dbg glibc-doc
  man-browser
The following NEW packages will be installed:
  binutils binutils-common binutils-x86-64-linux-gnu cpp cpp-8 gcc gcc-8
  libasan5 libatomic1 libbinutils libc-dev-bin libc6-dev libcc1-0 libgcc-8-dev
  libgomp1 libisl19 libitm1 liblsan0 libmpc3 libmpfr6 libmpx2 libquadmath0
  libtsan0 libubsan1 linux-libc-dev manpages manpages-dev
0 upgraded, 27 newly installed, 0 to remove and 21 not upgraded.
Need to get 35.5 MB of archives.
After this operation, 135 MB of additional disk space will be used.
Get:1 http://deb.debian.org/debian buster/main amd64 manpages all 4.16-2 [1295 kB]
Get:2 http://security.debian.org/debian-security buster/updates/main amd64 linux-libc-dev amd64 4.19.67-2+deb10u2 [1234 kB]
Get:3 http://deb.debian.org/debian buster/main amd64 binutils-common amd64 2.31.1-16 [2073 kB]
Get:4 http://deb.debian.org/debian buster/main amd64 libbinutils amd64 2.31.1-16 [478 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 binutils-x86-64-linux-gnu amd64 2.31.1-16 [1823 kB]
Get:6 http://deb.debian.org/debian buster/main amd64 binutils amd64 2.31.1-16 [56.8 kB]
Get:7 http://deb.debian.org/debian buster/main amd64 libisl19 amd64 0.20-2 [587 kB]
Get:8 http://deb.debian.org/debian buster/main amd64 libmpfr6 amd64 4.0.2-1 [775 kB]
Get:9 http://deb.debian.org/debian buster/main amd64 libmpc3 amd64 1.1.0-1 [41.3 kB]
Get:10 http://deb.debian.org/debian buster/main amd64 cpp-8 amd64 8.3.0-6 [8914 kB]
Get:11 http://deb.debian.org/debian buster/main amd64 cpp amd64 4:8.3.0-1 [19.4 kB]
Get:12 http://deb.debian.org/debian buster/main amd64 libcc1-0 amd64 8.3.0-6 [46.6 kB]
Get:13 http://deb.debian.org/debian buster/main amd64 libgomp1 amd64 8.3.0-6 [75.8 kB]
Get:14 http://deb.debian.org/debian buster/main amd64 libitm1 amd64 8.3.0-6 [27.7 kB]
Get:15 http://deb.debian.org/debian buster/main amd64 libatomic1 amd64 8.3.0-6 [9032 B]
Get:16 http://deb.debian.org/debian buster/main amd64 libasan5 amd64 8.3.0-6 [362 kB]
Get:17 http://deb.debian.org/debian buster/main amd64 liblsan0 amd64 8.3.0-6 [131 kB]
Get:18 http://deb.debian.org/debian buster/main amd64 libtsan0 amd64 8.3.0-6 [283 kB]
Get:19 http://deb.debian.org/debian buster/main amd64 libubsan1 amd64 8.3.0-6 [120 kB]
Get:20 http://deb.debian.org/debian buster/main amd64 libmpx2 amd64 8.3.0-6 [11.4 kB]
Get:21 http://deb.debian.org/debian buster/main amd64 libquadmath0 amd64 8.3.0-6 [133 kB]
Get:22 http://deb.debian.org/debian buster/main amd64 libgcc-8-dev amd64 8.3.0-6 [2298 kB]
Get:23 http://deb.debian.org/debian buster/main amd64 gcc-8 amd64 8.3.0-6 [9452 kB]
Get:24 http://deb.debian.org/debian buster/main amd64 gcc amd64 4:8.3.0-1 [5196 B]
Get:25 http://deb.debian.org/debian buster/main amd64 libc-dev-bin amd64 2.28-10 [275 kB]
Get:26 http://deb.debian.org/debian buster/main amd64 libc6-dev amd64 2.28-10 [2691 kB]
Get:27 http://deb.debian.org/debian buster/main amd64 manpages-dev all 4.16-2 [2232 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 35.5 MB in 1min 1s (583 kB/s)
Selecting previously unselected package manpages.
(Reading database ... 12557 files and directories currently installed.)
Preparing to unpack .../00-manpages_4.16-2_all.deb ...
Unpacking manpages (4.16-2) ...
Selecting previously unselected package binutils-common:amd64.
Preparing to unpack .../01-binutils-common_2.31.1-16_amd64.deb ...
Unpacking binutils-common:amd64 (2.31.1-16) ...
Selecting previously unselected package libbinutils:amd64.
Preparing to unpack .../02-libbinutils_2.31.1-16_amd64.deb ...
Unpacking libbinutils:amd64 (2.31.1-16) ...
Selecting previously unselected package binutils-x86-64-linux-gnu.
Preparing to unpack .../03-binutils-x86-64-linux-gnu_2.31.1-16_amd64.deb ...
Unpacking binutils-x86-64-linux-gnu (2.31.1-16) ...
Selecting previously unselected package binutils.
Preparing to unpack .../04-binutils_2.31.1-16_amd64.deb ...
Unpacking binutils (2.31.1-16) ...
Selecting previously unselected package libisl19:amd64.
Preparing to unpack .../05-libisl19_0.20-2_amd64.deb ...
Unpacking libisl19:amd64 (0.20-2) ...
Selecting previously unselected package libmpfr6:amd64.
Preparing to unpack .../06-libmpfr6_4.0.2-1_amd64.deb ...
Unpacking libmpfr6:amd64 (4.0.2-1) ...
Selecting previously unselected package libmpc3:amd64.
Preparing to unpack .../07-libmpc3_1.1.0-1_amd64.deb ...
Unpacking libmpc3:amd64 (1.1.0-1) ...
Selecting previously unselected package cpp-8.
Preparing to unpack .../08-cpp-8_8.3.0-6_amd64.deb ...
Unpacking cpp-8 (8.3.0-6) ...
Selecting previously unselected package cpp.
Preparing to unpack .../09-cpp_4%3a8.3.0-1_amd64.deb ...
Unpacking cpp (4:8.3.0-1) ...
Selecting previously unselected package libcc1-0:amd64.
Preparing to unpack .../10-libcc1-0_8.3.0-6_amd64.deb ...
Unpacking libcc1-0:amd64 (8.3.0-6) ...
Selecting previously unselected package libgomp1:amd64.
Preparing to unpack .../11-libgomp1_8.3.0-6_amd64.deb ...
Unpacking libgomp1:amd64 (8.3.0-6) ...
Selecting previously unselected package libitm1:amd64.
Preparing to unpack .../12-libitm1_8.3.0-6_amd64.deb ...
Unpacking libitm1:amd64 (8.3.0-6) ...
Selecting previously unselected package libatomic1:amd64.
Preparing to unpack .../13-libatomic1_8.3.0-6_amd64.deb ...
Unpacking libatomic1:amd64 (8.3.0-6) ...
Selecting previously unselected package libasan5:amd64.
Preparing to unpack .../14-libasan5_8.3.0-6_amd64.deb ...
Unpacking libasan5:amd64 (8.3.0-6) ...
Selecting previously unselected package liblsan0:amd64.
Preparing to unpack .../15-liblsan0_8.3.0-6_amd64.deb ...
Unpacking liblsan0:amd64 (8.3.0-6) ...
Selecting previously unselected package libtsan0:amd64.
Preparing to unpack .../16-libtsan0_8.3.0-6_amd64.deb ...
Unpacking libtsan0:amd64 (8.3.0-6) ...
Selecting previously unselected package libubsan1:amd64.
Preparing to unpack .../17-libubsan1_8.3.0-6_amd64.deb ...
Unpacking libubsan1:amd64 (8.3.0-6) ...
Selecting previously unselected package libmpx2:amd64.
Preparing to unpack .../18-libmpx2_8.3.0-6_amd64.deb ...
Unpacking libmpx2:amd64 (8.3.0-6) ...
Selecting previously unselected package libquadmath0:amd64.
Preparing to unpack .../19-libquadmath0_8.3.0-6_amd64.deb ...
Unpacking libquadmath0:amd64 (8.3.0-6) ...
Selecting previously unselected package libgcc-8-dev:amd64.
Preparing to unpack .../20-libgcc-8-dev_8.3.0-6_amd64.deb ...
Unpacking libgcc-8-dev:amd64 (8.3.0-6) ...
Selecting previously unselected package gcc-8.
Preparing to unpack .../21-gcc-8_8.3.0-6_amd64.deb ...
Unpacking gcc-8 (8.3.0-6) ...
Selecting previously unselected package gcc.
Preparing to unpack .../22-gcc_4%3a8.3.0-1_amd64.deb ...
Unpacking gcc (4:8.3.0-1) ...
Selecting previously unselected package libc-dev-bin.
Preparing to unpack .../23-libc-dev-bin_2.28-10_amd64.deb ...
Unpacking libc-dev-bin (2.28-10) ...
Selecting previously unselected package linux-libc-dev:amd64.
Preparing to unpack .../24-linux-libc-dev_4.19.67-2+deb10u2_amd64.deb ...
Unpacking linux-libc-dev:amd64 (4.19.67-2+deb10u2) ...
Selecting previously unselected package libc6-dev:amd64.
Preparing to unpack .../25-libc6-dev_2.28-10_amd64.deb ...
Unpacking libc6-dev:amd64 (2.28-10) ...
Selecting previously unselected package manpages-dev.
Preparing to unpack .../26-manpages-dev_4.16-2_all.deb ...
Unpacking manpages-dev (4.16-2) ...
Setting up manpages (4.16-2) ...
Setting up binutils-common:amd64 (2.31.1-16) ...
Setting up linux-libc-dev:amd64 (4.19.67-2+deb10u2) ...
Setting up libgomp1:amd64 (8.3.0-6) ...
Setting up libasan5:amd64 (8.3.0-6) ...
Setting up libmpfr6:amd64 (4.0.2-1) ...
Setting up libquadmath0:amd64 (8.3.0-6) ...
Setting up libmpc3:amd64 (1.1.0-1) ...
Setting up libatomic1:amd64 (8.3.0-6) ...
Setting up libmpx2:amd64 (8.3.0-6) ...
Setting up libubsan1:amd64 (8.3.0-6) ...
Setting up libisl19:amd64 (0.20-2) ...
Setting up libbinutils:amd64 (2.31.1-16) ...
Setting up cpp-8 (8.3.0-6) ...
Setting up libc-dev-bin (2.28-10) ...
Setting up libcc1-0:amd64 (8.3.0-6) ...
Setting up liblsan0:amd64 (8.3.0-6) ...
Setting up libitm1:amd64 (8.3.0-6) ...
Setting up binutils-x86-64-linux-gnu (2.31.1-16) ...
Setting up libtsan0:amd64 (8.3.0-6) ...
Setting up manpages-dev (4.16-2) ...
Setting up binutils (2.31.1-16) ...
Setting up libgcc-8-dev:amd64 (8.3.0-6) ...
Setting up cpp (4:8.3.0-1) ...
Setting up libc6-dev:amd64 (2.28-10) ...
Setting up gcc-8 (8.3.0-6) ...
Setting up gcc (4:8.3.0-1) ...
Processing triggers for libc-bin (2.28-10) ...
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
+ '[' '' ']'
+ '[' 'fastparquet git+https://github.com/dask/distributed' ']'
+ echo 'EXTRA_PIP_PACKAGES environment variable found.  Installing.'
no environment.yml
EXTRA_PIP_PACKAGES environment variable found.  Installing.
+ /opt/conda/bin/pip install fastparquet git+https://github.com/dask/distributed
Collecting git+https://github.com/dask/distributed
  Cloning https://github.com/dask/distributed to /tmp/pip-req-build-yszrzcnf
  Running command git clone -q https://github.com/dask/distributed /tmp/pip-req-build-yszrzcnf
Collecting fastparquet
  Downloading https://files.pythonhosted.org/packages/58/49/dccb790fa17ab3fbf84a6b848050083c7a1899e9586000e34e3e4fbf5538/fastparquet-0.3.2.tar.gz (151kB)
Requirement already satisfied: click>=6.6 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (7.0)
Requirement already satisfied: cloudpickle>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (1.2.2)
Requirement already satisfied: dask>=2.7.0 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (2.8.0)
Requirement already satisfied: msgpack in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (0.6.2)
Requirement already satisfied: psutil>=5.0 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (5.6.5)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (2.1.0)
Requirement already satisfied: tblib in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (1.4.0)
Requirement already satisfied: toolz>=0.7.4 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (0.10.0)
Requirement already satisfied: tornado>=5 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (6.0.3)
Requirement already satisfied: zict>=0.1.3 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (1.0.0)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (5.1.2)
Requirement already satisfied: pandas>=0.19 in /opt/conda/lib/python3.7/site-packages (from fastparquet) (0.25.2)
Collecting numba>=0.28 (from fastparquet)
  Downloading https://files.pythonhosted.org/packages/57/66/7ebc88e87b4ddf9b1c204e24d467cb0d13a7a890ecb4f83d20949f768929/numba-0.46.0-cp37-cp37m-manylinux1_x86_64.whl (3.6MB)
Requirement already satisfied: numpy>=1.11 in /opt/conda/lib/python3.7/site-packages (from fastparquet) (1.17.3)
Collecting thrift>=0.11.0 (from fastparquet)
  Downloading https://files.pythonhosted.org/packages/97/1e/3284d19d7be99305eda145b8aa46b0c33244e4a496ec66440dac19f8274d/thrift-0.13.0.tar.gz (59kB)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from fastparquet) (1.12.0)
Requirement already satisfied: heapdict in /opt/conda/lib/python3.7/site-packages (from zict>=0.1.3->distributed==2.8.1+3.ga285267d) (1.0.1)
Requirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.19->fastparquet) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.19->fastparquet) (2019.3)
Collecting llvmlite>=0.30.0dev0 (from numba>=0.28->fastparquet)
  Downloading https://files.pythonhosted.org/packages/1f/3e/642ffb29ed35ca5e93f171ba327452bdee81ec76f2d711ef0f15b411928a/llvmlite-0.30.0-cp37-cp37m-manylinux1_x86_64.whl (20.2MB)
Building wheels for collected packages: fastparquet, distributed, thrift
  Building wheel for fastparquet (setup.py): started
  Building wheel for fastparquet (setup.py): finished with status 'done'
  Created wheel for fastparquet: filename=fastparquet-0.3.2-cp37-cp37m-linux_x86_64.whl size=276808 sha256=cca6b01eacdc3d2180bb70e6ebf8a5f5b31f4b15771b04919bbd3353564f9c6a
  Stored in directory: /root/.cache/pip/wheels/b9/36/13/01416a760ddcab0eb8281ec9c9ffcbed945c9b831647c8b904
  Building wheel for distributed (setup.py): started
  Building wheel for distributed (setup.py): finished with status 'done'
  Created wheel for distributed: filename=distributed-2.8.1+3.ga285267d-cp37-none-any.whl size=569076 sha256=b359838b03314bbb4ef849e748d8b76bfa067da042cccb242347271db4c4c050
  Stored in directory: /tmp/pip-ephem-wheel-cache-ektmufm7/wheels/aa/21/a7/d9548d684f8e074360b7ad1bd8633843dba9658288b68b3dd5
  Building wheel for thrift (setup.py): started
  Building wheel for thrift (setup.py): finished with status 'done'
  Created wheel for thrift: filename=thrift-0.13.0-cp37-none-any.whl size=154884 sha256=e8a35252f5581d04a5b334cc37950ebd76c68de61f30a65b0738d392a373e27d
  Stored in directory: /root/.cache/pip/wheels/02/a2/46/689ccfcf40155c23edc7cdbd9de488611c8fdf49ff34b1706e
Successfully built fastparquet distributed thrift
Installing collected packages: llvmlite, numba, thrift, fastparquet, distributed
  Found existing installation: distributed 2.8.0
    Uninstalling distributed-2.8.0:
      Successfully uninstalled distributed-2.8.0
Successfully installed distributed-2.8.1+3.ga285267d fastparquet-0.3.2 llvmlite-0.30.0 numba-0.46.0 thrift-0.13.0
+ exec dask-worker --nthreads 2 --no-bokeh --memory-limit 6GB --death-timeout 60
/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py:252: UserWarning: The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard.
  "The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard. "
distributed.nanny - INFO -         Start Nanny at: 'tcp://10.44.0.1:32869'
distributed.worker - INFO -       Start worker at:      tcp://10.44.0.1:41721
distributed.worker - INFO -          Listening to:      tcp://10.44.0.1:41721
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:33017
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                    6.00 GB
distributed.worker - INFO -       Local Directory:           /worker-77vchaja
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:33017
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:33017
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:33017
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:33017
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:33017
distributed.nanny - INFO - Closing Nanny at 'tcp://10.44.0.1:32869'
distributed.worker - INFO - Stopping worker at tcp://10.44.0.1:41721
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
jacobtomlinson commented 4 years ago

Looks like the worker is now behaving correctly but failing to connect to the scheduler. Could you share the scheduler logs?

MnvS commented 4 years ago

Scheduler logs below:

(base) [root@k8s-master example]#
(base) [root@k8s-master example]# kubectl -n kube-system logs kube-scheduler-k8s-master
I1126 15:34:16.048901       1 serving.go:319] Generated self-signed cert in-memory
W1126 15:34:18.709418       1 authentication.go:262] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLEBINDING_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
W1126 15:34:18.709438       1 authentication.go:199] Error looking up in-cluster authentication configuration: configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-scheduler" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
W1126 15:34:18.709447       1 authentication.go:200] Continuing without authentication configuration. This may treat all requests as anonymous.
W1126 15:34:18.709453       1 authentication.go:201] To require authentication configuration lookup to succeed, set --authentication-tolerate-lookup-failure=false
I1126 15:34:18.714711       1 server.go:148] Version: v1.16.3
I1126 15:34:18.714796       1 defaults.go:91] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
W1126 15:34:18.724908       1 authorization.go:47] Authorization is disabled
W1126 15:34:18.724921       1 authentication.go:79] Authentication is disabled
I1126 15:34:18.724930       1 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I1126 15:34:18.725582       1 secure_serving.go:123] Serving securely on 127.0.0.1:10259
E1126 15:34:18.726754       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:kube-scheduler" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
E1126 15:34:18.727678       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.StatefulSet: statefulsets.apps is forbidden: User "system:kube-scheduler" cannot list resource "statefulsets" in API group "apps" at the cluster scope
E1126 15:34:18.727685       1 reflector.go:123] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:250: Failed to list *v1.Pod: pods is forbidden: User "system:kube-scheduler" cannot list resource "pods" in API group "" at the cluster scope
E1126 15:34:18.727682       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.ReplicaSet: replicasets.apps is forbidden: User "system:kube-scheduler" cannot list resource "replicasets" in API group "apps" at the cluster scope
E1126 15:34:18.727695       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:kube-scheduler" cannot list resource "persistentvolumes" in API group "" at the cluster scope
E1126 15:34:18.727743       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Service: services is forbidden: User "system:kube-scheduler" cannot list resource "services" in API group "" at the cluster scope
E1126 15:34:18.727819       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.PersistentVolumeClaim: persistentvolumeclaims is forbidden: User "system:kube-scheduler" cannot list resource "persistentvolumeclaims" in API group "" at the cluster scope
E1126 15:34:18.727828       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Node: nodes is forbidden: User "system:kube-scheduler" cannot list resource "nodes" in API group "" at the cluster scope
E1126 15:34:18.727875       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.StorageClass: storageclasses.storage.k8s.io is forbidden: User "system:kube-scheduler" cannot list resource "storageclasses" in API group "storage.k8s.io" at the cluster scope
E1126 15:34:18.727907       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:kube-scheduler" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E1126 15:34:18.728054       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.ReplicationController: replicationcontrollers is forbidden: User "system:kube-scheduler" cannot list resource "replicationcontrollers" in API group "" at the cluster scope
E1126 15:34:19.729111       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:kube-scheduler" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
E1126 15:34:19.729119       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.StatefulSet: statefulsets.apps is forbidden: User "system:kube-scheduler" cannot list resource "statefulsets" in API group "apps" at the cluster scope
E1126 15:34:19.729697       1 reflector.go:123] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:250: Failed to list *v1.Pod: pods is forbidden: User "system:kube-scheduler" cannot list resource "pods" in API group "" at the cluster scope
E1126 15:34:19.730823       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.ReplicaSet: replicasets.apps is forbidden: User "system:kube-scheduler" cannot list resource "replicasets" in API group "apps" at the cluster scope
E1126 15:34:19.731811       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:kube-scheduler" cannot list resource "persistentvolumes" in API group "" at the cluster scope
E1126 15:34:19.732952       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Service: services is forbidden: User "system:kube-scheduler" cannot list resource "services" in API group "" at the cluster scope
E1126 15:34:19.733921       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.PersistentVolumeClaim: persistentvolumeclaims is forbidden: User "system:kube-scheduler" cannot list resource "persistentvolumeclaims" in API group "" at the cluster scope
E1126 15:34:19.735081       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Node: nodes is forbidden: User "system:kube-scheduler" cannot list resource "nodes" in API group "" at the cluster scope
E1126 15:34:19.736108       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.StorageClass: storageclasses.storage.k8s.io is forbidden: User "system:kube-scheduler" cannot list resource "storageclasses" in API group "storage.k8s.io" at the cluster scope
E1126 15:34:19.737238       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:kube-scheduler" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E1126 15:34:19.738284       1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.ReplicationController: replicationcontrollers is forbidden: User "system:kube-scheduler" cannot list resource "replicationcontrollers" in API group "" at the cluster scope
I1126 15:34:20.825768       1 leaderelection.go:241] attempting to acquire leader lease  kube-system/kube-scheduler...
I1126 15:34:20.832408       1 leaderelection.go:251] successfully acquired lease kube-system/kube-scheduler
E1126 15:34:28.839414       1 factory.go:585] pod is already present in the activeQ
(base) [root@k8s-master example]#
jacobtomlinson commented 4 years ago

That doesn't look like the Dask scheduler logs.

As these issues seem to be related to your setup rather than to dask-kubernetes itself, I would recommend taking a look at the Dask scheduler logs to see if you can identify the issue yourself.
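
If it helps, you can raise the scheduler's log level with the standard logging module before creating the cluster (a sketch; with the default local deploy mode the scheduler runs inside your Python session, so its output lands in the same terminal or nohup.out as your script):

import logging

# "distributed.scheduler" and "distributed.core" are the logger names
# visible in the log lines above; DEBUG also shows connection attempts.
logging.getLogger("distributed.scheduler").setLevel(logging.DEBUG)
logging.getLogger("distributed.core").setLevel(logging.DEBUG)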

MnvS commented 4 years ago

Thanks for the reply, Jacob. For my setup I followed https://kubernetes.dask.org/en/latest/, which describes the native installation via pip install dask-kubernetes (without Helm):

Steps I followed:

  1. Installed Kubernetes on 3 nodes (1 master and 2 workers).
  2. Installed Miniconda3.
  3. pip install dask-kubernetes
  4. Created dask_example.py with the dask array code (same as the example on the link; a sketch follows this list).
  5. Created the Worker-spec.yml file with the pod configuration (same as the example on the link).
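
For reference, dask_example.py is roughly the following (a minimal sketch of the documented example; the file name, the scale of one worker, and the chunk sizes are just what I used):

from dask_kubernetes import KubeCluster
from dask.distributed import Client
import dask.array as da

# Build the cluster from the pod spec created in step 5
# (adjust the file name if yours differs).
cluster = KubeCluster.from_yaml('worker-spec.yml')
cluster.scale(1)          # ask for one worker pod

# Connect the local session to the cluster and run the array example.
client = Client(cluster)
array = da.ones((1000, 1000, 1000), chunks=(100, 100, 10))
print(array.mean().compute())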

The output of the code shows the Dask scheduler logs:

(base) [root@k8s-master example]# cat nohup.out
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:   tcp://172.16.0.76:40641
distributed.scheduler - INFO - Receive client connection: Client-932e205e-1062-11ea-a09d-12bd5ffa93ff
distributed.core - INFO - Starting established connection
(base) [root@k8s-master example]#

Workerpod logs:

(base) [root@k8s-master example]# kubectl logs pod/workerpod
+ '[' gcc ']'
...
Successfully installed distributed-2.8.1+7.g856bba7c fastparquet-0.3.2 llvmlite-0.30.0 numba-0.46.0 thrift-0.13.0
+ exec dask-worker --nthreads 2 --no-bokeh --memory-limit 6GB --death-timeout 60
/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py:252: UserWarning: The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard.
  "The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard. "
distributed.nanny - INFO -         Start Nanny at: 'tcp://10.32.0.2:34363'
distributed.worker - INFO -       Start worker at:      tcp://10.32.0.2:38961
distributed.worker - INFO -          Listening to:      tcp://10.32.0.2:38961
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:34895
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          2
distributed.worker - INFO -                Memory:                    6.00 GB
distributed.worker - INFO -       Local Directory:           /worker-2xaci4as
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:34895
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:34895
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:34895
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:34895
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.0.76:34895
distributed.nanny - INFO - Closing Nanny at 'tcp://10.32.0.2:34363'
distributed.worker - INFO - Stopping worker at tcp://10.32.0.2:38961
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
(base) [root@k8s-master example]# cat nohup.out
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:   tcp://172.16.0.76:34895
distributed.scheduler - INFO - Receive client connection: Client-882faa9e-108f-11ea-a662-12bd5ffa93ff
distributed.core - INFO - Starting established connection

I do not see a Dask scheduler pod created on my system.

(base) [root@k8s-master example]# kubectl get nodes,service,pods --all-namespaces
NAME                STATUS   ROLES    AGE    VERSION
node/k8s-master     Ready    master   146m   v1.16.3
node/worker-node1   Ready    <none>   145m   v1.16.2
node/worker-node2   Ready    <none>   145m   v1.16.2

NAMESPACE     NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP                  146m
kube-system   service/kube-dns     ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   146m

NAMESPACE     NAME                                     READY   STATUS    RESTARTS   AGE
default       pod/workerpod                            0/1     Error     0          143m
kube-system   pod/coredns-5644d7b6d9-ht9dq             1/1     Running   0          146m
kube-system   pod/coredns-5644d7b6d9-vt6c9             1/1     Running   0          146m
kube-system   pod/etcd-k8s-master                      1/1     Running   0          145m
kube-system   pod/kube-apiserver-k8s-master            1/1     Running   0          145m
kube-system   pod/kube-controller-manager-k8s-master   1/1     Running   0          145m
kube-system   pod/kube-proxy-htvlr                     1/1     Running   0          145m
kube-system   pod/kube-proxy-mswm2                     1/1     Running   0          146m
kube-system   pod/kube-proxy-vls4w                     1/1     Running   0          145m
kube-system   pod/kube-scheduler-k8s-master            1/1     Running   0          145m
kube-system   pod/weave-net-kgrqz                      2/2     Running   0          144m
kube-system   pod/weave-net-lfndv                      2/2     Running   0          144m
kube-system   pod/weave-net-vgpxs                      2/2     Running   0          144m
(base) [root@k8s-master example]#

As far as I understand, dask-kubernetes is starting the distributed scheduler on the master node itself rather than as a scheduler pod on the Kubernetes cluster, so the worker pod is unable to connect to a dask-scheduler pod. Please correct me if that's not the case.

jacobtomlinson commented 4 years ago

By default dask-kubernetes starts a scheduler within your Python session. So the workers must be able to send traffic to wherever you are running Python.

You can specify deploy_mode='remote' to have dask-kubernetes launch the scheduler within a pod. But your local Python session will still need to be able to connect to the service that it creates (a LoadBalancer by default).
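
For example, something along these lines (a rough sketch; the yaml file name is a placeholder for your own spec):

from dask_kubernetes import KubeCluster
from dask.distributed import Client

# deploy_mode='remote' launches the scheduler inside the cluster
# as its own pod instead of in the local Python process.
cluster = KubeCluster.from_yaml('worker-spec.yml', deploy_mode='remote')
client = Client(cluster)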

MnvS commented 4 years ago

Thanks for the reply, Jacob. I tried deploy_mode='remote'. I can see the scheduler created as a dask-root service, and the worker pod now shows status Completed, but the code does not produce any result and the output shows the errors below:

(base) [root@k8s-master example]# kubectl get nodes,service,pods --all-namespaces -o wide
NAME                STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
node/k8s-master     Ready    master   18m   v1.16.3   172.16.0.76    <none>        Amazon Linux 2   4.14.154-128.181.amzn2.x86_64   docker://18.9.9
node/worker-node1   Ready    worker   16m   v1.16.2   172.16.0.31    <none>        Amazon Linux 2   4.14.146-120.181.amzn2.x86_64   docker://18.9.9
node/worker-node2   Ready    worker   16m   v1.16.2   172.16.0.114   <none>        Amazon Linux 2   4.14.146-120.181.amzn2.x86_64   docker://18.9.9

NAMESPACE     NAME                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE   SELECTOR
default       service/dask-root-c69fb20b-d   ClusterIP   10.104.149.32   <none>        8786/TCP,8787/TCP        11m   dask.org/cluster-name=dask-root-c69fb20b-d,dask.org/component=scheduler
default       service/kubernetes             ClusterIP   10.96.0.1       <none>        443/TCP                  18m   <none>
kube-system   service/kube-dns               ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   18m   k8s-app=kube-dns

NAMESPACE     NAME                                     READY   STATUS      RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
default       pod/workerpod                            0/1     Completed   0          14m   10.44.0.1      worker-node1   <none>           <none>
kube-system   pod/coredns-5644d7b6d9-l5xgh             1/1     Running     0          18m   10.32.0.2      k8s-master     <none>           <none>
kube-system   pod/coredns-5644d7b6d9-wr5cz             1/1     Running     0          18m   10.32.0.3      k8s-master     <none>           <none>
kube-system   pod/etcd-k8s-master                      1/1     Running     0          17m   172.16.0.76    k8s-master     <none>           <none>
kube-system   pod/kube-apiserver-k8s-master            1/1     Running     0          17m   172.16.0.76    k8s-master     <none>           <none>
kube-system   pod/kube-controller-manager-k8s-master   1/1     Running     0          17m   172.16.0.76    k8s-master     <none>           <none>
kube-system   pod/kube-proxy-p5khx                     1/1     Running     0          16m   172.16.0.114   worker-node2   <none>           <none>
kube-system   pod/kube-proxy-ss464                     1/1     Running     0          18m   172.16.0.76    k8s-master     <none>           <none>
kube-system   pod/kube-proxy-w8st5                     1/1     Running     0          16m   172.16.0.31    worker-node1   <none>           <none>
kube-system   pod/kube-scheduler-k8s-master            1/1     Running     0          17m   172.16.0.76    k8s-master     <none>           <none>
kube-system   pod/weave-net-g4xsq                      2/2     Running     0          17m   172.16.0.76    k8s-master     <none>           <none>
kube-system   pod/weave-net-hd54z                      2/2     Running     1          16m   172.16.0.114   worker-node2   <none>           <none>
kube-system   pod/weave-net-pjw8x                      2/2     Running     1          16m   172.16.0.31    worker-node1   <none>           <none>

Output of dask array code:

(base) [root@k8s-master example]# cat nohup.out
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 222, in connect
    _raise(error)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 205, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://dask-root-c69fb20b-d.default:8786' after 10 s: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dask_example.py", line 5, in <module>
    cluster = KubeCluster.from_yaml('worker-spec_2.yml', deploy_mode="remote")
  File "/root/miniconda3/lib/python3.7/site-packages/dask_kubernetes/core.py", line 566, in from_yaml
    return cls.from_dict(d, **kwargs)
  File "/root/miniconda3/lib/python3.7/site-packages/dask_kubernetes/core.py", line 528, in from_dict
    return cls(make_pod_from_dict(pod_spec), **kwargs)
  File "/root/miniconda3/lib/python3.7/site-packages/dask_kubernetes/core.py", line 380, in __init__
    super().__init__(**self.kwargs)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 242, in __init__
    self.sync(self._start)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 162, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 334, in sync
    raise exc.with_traceback(tb)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 318, in f
    result[0] = yield future
  File "/root/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/root/miniconda3/lib/python3.7/site-packages/dask_kubernetes/core.py", line 500, in _start
    await super()._start()
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 273, in _start
    await super()._start()
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 59, in _start
    comm = await self.scheduler_comm.live_comm()
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 637, in live_comm
    connection_args=self.connection_args,
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 231, in connect
    _raise(error)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 205, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://dask-root-c69fb20b-d.default:8786' after 10 s: Timed out trying to connect to 'tcp://dask-root-c69fb20b-d.default:8786' after 10 s: [Errno -2] Name or service not known
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 222, in connect
    _raise(error)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 205, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://dask-root-c69fb20b-d.default:8786' after 10 s: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 186, in ignoring
    yield
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 574, in close_clusters
    cluster.close(timeout=10)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 83, in close
    return self.sync(self._close, callback_timeout=timeout)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 162, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 334, in sync
    raise exc.with_traceback(tb)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 318, in f
    result[0] = yield future
  File "/root/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 368, in _close
    await self.scheduler_comm.close(close_workers=True)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 675, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 637, in live_comm
    connection_args=self.connection_args,
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 231, in connect
    _raise(error)
  File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 205, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://dask-root-c69fb20b-d.default:8786' after 10 s: Timed out trying to connect to 'tcp://dask-root-c69fb20b-d.default:8786' after 10 s: [Errno -2] Name or service not known
2019-12-02 16:25:35,780 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa51b34e950>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa51b34e950>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask
2019-12-02 16:25:35,781 WARNING Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa52a9cc990>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa52a9cc990>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask
2019-12-02 16:25:35,781 WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa52a9cc3d0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa52a9cc3d0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connection.py", line 159, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 80, in create_connection
    raise err
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 70, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 343, in _make_request
    self._validate_conn(conn)
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 839, in _validate_conn
    conn.connect()
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connection.py", line 301, in connect
    conn = self._new_conn()
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connection.py", line 168, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fa52a9ccfd0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.7/weakref.py", line 648, in _exitfunc
    f()
  File "/root/miniconda3/lib/python3.7/weakref.py", line 572, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/root/miniconda3/lib/python3.7/site-packages/dask_kubernetes/core.py", line 623, in _cleanup_resources
    pods = core_api.list_namespaced_pod(namespace, label_selector=format_labels(labels))
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 12372, in list_namespaced_pod
    (data) = self.list_namespaced_pod_with_http_info(namespace, **kwargs)
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 12472, in list_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 355, in request
    headers=headers)
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in GET
    query_params=query_params)
  File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 205, in request
    headers=headers)
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/request.py", line 68, in request
    **urlopen_kw)
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/request.py", line 89, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 324, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 667, in urlopen
    **response_kw)
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 667, in urlopen
    **response_kw)
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 667, in urlopen
    **response_kw)
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/root/miniconda3/lib/python3.7/site-packages/urllib3/util/retry.py", line 399, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='localhost', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa52a9ccfd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
ERROR:asyncio:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7fa53f3a4d10>
ERROR:asyncio:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7fa52b28ab90>
ERROR:asyncio:Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x7fa52aa1b670>, 1140.130852326)]']
connector: <aiohttp.connector.TCPConnector object at 0x7fa52b286dd0>

Workerpod logs:

(base) [root@k8s-master example]# kubectl logs pod/workerpod
+ '[' gcc ']'
...
...
Successfully installed distributed-2.8.1+20.gf15abc58 fastparquet-0.3.2 llvmlite-0.30.0 numba-0.46.0 thrift-0.13.0
+ exec dask-scheduler --idle-timeout '5 minutes'
distributed.scheduler - INFO - -----------------------------------------------
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.scheduler - INFO - Local Directory:    /tmp/scheduler-wb_uyvs8
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:      tcp://10.44.0.1:8786
distributed.scheduler - INFO -   dashboard at:                     :8787
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
distributed.scheduler - INFO - End scheduler at 'tcp://10.44.0.1:8786'
(base) [root@k8s-master example]#

Do I need to create a cluster role binding for communication in order to run this dask array example on a native Kubernetes and dask-kubernetes installation?

jacobtomlinson commented 2 years ago

Given the age of this issue I'm going to close it out as unresolved. Apologies that we never got to the bottom of this.