Closed: MnvS closed this issue 2 years ago.
Could you run kubectl describe pod workerpod? It would be interesting to see why it isn't getting placed.
Below is the output of the describe pod command:
[root@k8s-master example]# kubectl describe pod workerpod
Name: workerpod
Namespace: default
Priority: 0
Node: <none>
Labels: app=dask
dask.org/cluster-name=dask-root-ede98556-b
dask.org/component=worker
foo=bar
user=root
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Containers:
dask:
Image: daskdev/dask:latest
Port: <none>
Host Port: <none>
Args:
dask-worker
--nthreads
2
--no-bokeh
--memory-limit
6GB
--death-timeout
60
Limits:
cpu: 2
memory: 6G
Requests:
cpu: 2
memory: 6G
Environment:
EXTRA_PIP_PACKAGES: fastparquet git+https://github.com/dask/distributed
DASK_SCHEDULER_ADDRESS: tcp://172.16.0.76:34581
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-7mjzj (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-7mjzj:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-7mjzj
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: k8s.dask.org/dedicated=worker:NoSchedule
k8s.dask.org_dedicated=worker:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 35s default-scheduler 0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient memory.
[root@k8s-master example]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 7m19s v1.16.3
worker-node1 Ready <none> 5m38s v1.16.2
worker-node2 Ready <none> 5m30s v1.16.2
3 Insufficient cpu, 3 Insufficient memory.
It looks like your cluster is not able to fulfil the requirements you have set for your pod. You will either need to use bigger nodes, enable autoscaling, or reduce your worker requirements.
What kind of cluster are you using (GKE, EKS, bare metal, local, etc.)?
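For example, the resources section of your worker-spec.yml could be reduced to something the nodes can actually fit (a rough sketch; pick numbers that match your nodes' free capacity and keep the --memory-limit argument in line with the memory value):

resources:
  limits:
    cpu: "1"
    memory: 2G
  requests:
    cpu: "1"
    memory: 2G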
Thanks for the reply. I am able to run the Dask array example after increasing the memory and CPUs on the nodes, but I am still getting the errors below as output (along with the expected output). I am using a local Kubernetes cluster on EC2 instances.
(base) [root@k8s-master example]# python dask_example.py
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://172.16.0.76:34237
distributed.scheduler - INFO - Receive client connection: Client-0eab0792-0b04-11ea-90ce-12bd5ffa93ff
distributed.core - INFO - Starting established connection
ERROR:asyncio:Task exception was never retrieved
future: <Task finished coro=<_wrap_awaitable() done, defined at /root/miniconda3/lib/python3.7/asyncio/tasks.py:623> exception=AssertionError()>
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.7/asyncio/tasks.py", line 630, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 42, in _
assert self.status == "running"
AssertionError
... same errors repeated ...
... received output ...
1.0
... errors repeated ...
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fc369298e90>>, <Task finished coro=<SpecCluster._correct_state_internal() done, defined at /root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py:284> exception=AssertionError()>)
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/root/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 317, in _correct_state_internal
await w # for tornado gen.coroutine support
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 42, in _
assert self.status == "running"
AssertionError
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
distributed.scheduler - INFO - Remove worker tcp://10.44.0.1:38635
distributed.core - INFO - Removing comms to tcp://10.44.0.1:38635
distributed.scheduler - INFO - Lost all workers
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 186, in ignoring
yield
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 574, in close_clusters
cluster.close(timeout=10)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 83, in close
return self.sync(self._close, callback_timeout=timeout)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 162, in sync
return sync(self.loop, func, *args, **kwargs)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 334, in sync
raise exc.with_traceback(tb)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 318, in f
result[0] = yield future
File "/root/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 372, in _close
assert w.status == "closed", w.status
AssertionError: created
2019-11-19 20:28:16,181 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc34d55dd10>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-ead3c94d-f%2Cuser%3Droot%2Capp%3Ddask
...
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.7/weakref.py", line 648, in _exitfunc
f()
File "/root/miniconda3/lib/python3.7/weakref.py", line 572, in __call__
return info.func(*info.args, **(info.kwargs or {}))
File "/root/miniconda3/lib/python3.7/site-packages/dask_kubernetes/core.py", line 623, in _cleanup_resources
pods = core_api.list_namespaced_pod(namespace, label_selector=format_labels(labels))
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 12372, in list_namespaced_pod
(data) = self.list_namespaced_pod_with_http_info(namespace, **kwargs)
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 12472, in list_namespaced_pod_with_http_info
collection_formats=collection_formats)
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
_request_timeout=_request_timeout)
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 355, in request
headers=headers)
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in GET
query_params=query_params)
...
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='localhost', port=443): Max retries
exceeded with url: /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-ead3c94d-f%2Cuser%3Droot%2Capp%3Ddask (Caused by
NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc362d1a090>:
Failed to establish a new connection: [Errno 111] Connection refused'))
Update: the worker pod is now showing an Error status, as below:
(base) [root@k8s-master example]# ls
dask_example.py worker-spec.yml
(base) [root@k8s-master example]# nohup python dask_example.py &
[1] 3660
(base) [root@k8s-master example]# cat nohup.out
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://172.16.0.76:40119
distributed.scheduler - INFO - Receive client connection: Client-df4caa18-0bc8-11ea-8e4c-12bd5ffa93ff
distributed.core - INFO - Starting established connection
(base) [root@k8s-master example]# kubectl get pods -o wide --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default workerpod 1/1 Running 0 70s 10.32.0.2 worker-node1 <none> <none>
kube-system coredns-5644d7b6d9-l4jsd 1/1 Running 0 8m19s 10.32.0.4 k8s-master <none> <none>
kube-system coredns-5644d7b6d9-q679h 1/1 Running 0 8m19s 10.32.0.3 k8s-master <none> <none>
kube-system etcd-k8s-master 1/1 Running 0 7m16s 172.16.0.76 k8s-master <none> <none>
kube-system kube-apiserver-k8s-master 1/1 Running 0 7m1s 172.16.0.76 k8s-master <none> <none>
kube-system kube-controller-manager-k8s-master 1/1 Running 0 7m27s 172.16.0.76 k8s-master <none> <none>
kube-system kube-proxy-ctgj8 1/1 Running 0 5m7s 172.16.0.114 worker-node2 <none> <none>
kube-system kube-proxy-f78bm 1/1 Running 0 8m18s 172.16.0.76 k8s-master <none> <none>
kube-system kube-proxy-ksk59 1/1 Running 0 5m15s 172.16.0.31 worker-node1 <none> <none>
kube-system kube-scheduler-k8s-master 1/1 Running 0 7m2s 172.16.0.76 k8s-master <none> <none>
kube-system weave-net-q2zwn 2/2 Running 0 6m22s 172.16.0.76 k8s-master <none> <none>
kube-system weave-net-r9tzs 2/2 Running 0 5m15s 172.16.0.31 worker-node1 <none> <none>
kube-system weave-net-tm8xx 2/2 Running 0 5m7s 172.16.0.114 worker-node2 <none> <none>
(base) [root@k8s-master example]# kubectl get pods -o wide --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default workerpod 0/1 Error 0 4m23s 10.32.0.2 worker-node1 <none> <none>
kube-system coredns-5644d7b6d9-l4jsd 1/1 Running 0 11m 10.32.0.4 k8s-master <none> <none>
kube-system coredns-5644d7b6d9-q679h 1/1 Running 0 11m 10.32.0.3 k8s-master <none> <none>
kube-system etcd-k8s-master 1/1 Running 0 10m 172.16.0.76 k8s-master <none> <none>
kube-system kube-apiserver-k8s-master 1/1 Running 0 10m 172.16.0.76 k8s-master <none> <none>
kube-system kube-controller-manager-k8s-master 1/1 Running 0 10m 172.16.0.76 k8s-master <none> <none>
kube-system kube-proxy-ctgj8 1/1 Running 0 8m20s 172.16.0.114 worker-node2 <none> <none>
kube-system kube-proxy-f78bm 1/1 Running 0 11m 172.16.0.76 k8s-master <none> <none>
kube-system kube-proxy-ksk59 1/1 Running 0 8m28s 172.16.0.31 worker-node1 <none> <none>
kube-system kube-scheduler-k8s-master 1/1 Running 0 10m 172.16.0.76 k8s-master <none> <none>
kube-system weave-net-q2zwn 2/2 Running 0 9m35s 172.16.0.76 k8s-master <none> <none>
kube-system weave-net-r9tzs 2/2 Running 0 8m28s 172.16.0.31 worker-node1 <none> <none>
kube-system weave-net-tm8xx 2/2 Running 0 8m20s 172.16.0.114 worker-node2 <none> <none>
(base) [root@k8s-master example]# cat nohup.out
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://172.16.0.76:40119
distributed.scheduler - INFO - Receive client connection: Client-df4caa18-0bc8-11ea-8e4c-12bd5ffa93ff
distributed.core - INFO - Starting established connection
(base) [root@k8s-master example]# kubectl describe pod workerpod
Name: workerpod
Namespace: default
Priority: 0
Node: worker-node1/172.16.0.31
Start Time: Wed, 20 Nov 2019 19:06:36 +0000
Labels: app=dask
dask.org/cluster-name=dask-root-99dcf768-4
dask.org/component=worker
foo=bar
user=root
Annotations: <none>
Status: Failed
IP: 10.32.0.2
IPs:
IP: 10.32.0.2
Containers:
dask:
Container ID: docker://578dc575fc263c4a3889a4f2cb5e06cd82a00e03cfc6acfd7a98fef703421390
Image: daskdev/dask:latest
Image ID: docker-pullable://daskdev/dask@sha256:0a936daa94c82cea371c19a2c90c695688ab4e1e7acc905f8b30dfd419adfb6f
Port: <none>
Host Port: <none>
Args:
dask-worker
--nthreads
2
--no-bokeh
--memory-limit
6GB
--death-timeout
60
State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 20 Nov 2019 19:06:38 +0000
Finished: Wed, 20 Nov 2019 19:08:20 +0000
Ready: False
Restart Count: 0
Limits:
cpu: 2
memory: 6G
Requests:
cpu: 2
memory: 6G
Environment:
EXTRA_PIP_PACKAGES: fastparquet git+https://github.com/dask/distributed
DASK_SCHEDULER_ADDRESS: tcp://172.16.0.76:40119
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-p9f9v (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-p9f9v:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-p9f9v
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: k8s.dask.org/dedicated=worker:NoSchedule
k8s.dask.org_dedicated=worker:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m47s default-scheduler Successfully assigned default/workerpod to worker-node1
Normal Pulled 5m45s kubelet, worker-node1 Container image "daskdev/dask:latest" already present on machine
Normal Created 5m45s kubelet, worker-node1 Created container dask
Normal Started 5m45s kubelet, worker-node1 Started container dask
(base) [root@k8s-master example]#
(base) [root@k8s-master example]# kubectl get events
LAST SEEN TYPE REASON OBJECT MESSAGE
21m Normal Starting node/k8s-master Starting kubelet.
21m Normal NodeHasSufficientMemory node/k8s-master Node k8s-master status is now: NodeHasSufficientMemory
21m Normal NodeHasNoDiskPressure node/k8s-master Node k8s-master status is now: NodeHasNoDiskPressure
21m Normal NodeHasSufficientPID node/k8s-master Node k8s-master status is now: NodeHasSufficientPID
21m Normal NodeAllocatableEnforced node/k8s-master Updated Node Allocatable limit across pods
21m Normal RegisteredNode node/k8s-master Node k8s-master event: Registered Node k8s-master in Controller
21m Normal Starting node/k8s-master Starting kube-proxy.
18m Normal Starting node/worker-node1 Starting kubelet.
18m Normal NodeHasSufficientMemory node/worker-node1 Node worker-node1 status is now: NodeHasSufficientMemory
18m Normal NodeHasNoDiskPressure node/worker-node1 Node worker-node1 status is now: NodeHasNoDiskPressure
18m Normal NodeHasSufficientPID node/worker-node1 Node worker-node1 status is now: NodeHasSufficientPID
18m Normal NodeAllocatableEnforced node/worker-node1 Updated Node Allocatable limit across pods
18m Normal Starting node/worker-node1 Starting kube-proxy.
18m Normal RegisteredNode node/worker-node1 Node worker-node1 event: Registered Node worker-node1 in Controller
17m Normal NodeReady node/worker-node1 Node worker-node1 status is now: NodeReady
18m Normal Starting node/worker-node2 Starting kubelet.
18m Normal NodeHasSufficientMemory node/worker-node2 Node worker-node2 status is now: NodeHasSufficientMemory
18m Normal NodeHasNoDiskPressure node/worker-node2 Node worker-node2 status is now: NodeHasNoDiskPressure
18m Normal NodeHasSufficientPID node/worker-node2 Node worker-node2 status is now: NodeHasSufficientPID
18m Normal NodeAllocatableEnforced node/worker-node2 Updated Node Allocatable limit across pods
18m Normal Starting node/worker-node2 Starting kube-proxy.
17m Normal RegisteredNode node/worker-node2 Node worker-node2 event: Registered Node worker-node2 in Controller
17m Normal NodeReady node/worker-node2 Node worker-node2 status is now: NodeReady
14m Normal Scheduled pod/workerpod Successfully assigned default/workerpod to worker-node1
14m Normal Pulled pod/workerpod Container image "daskdev/dask:latest" already present on machine
14m Normal Created pod/workerpod Created container dask
14m Normal Started pod/workerpod Started container dask
(base) [root@k8s-master example]#
Thanks for providing the extra info. There definitely seems to be something up with your k8s cluster, as the pod is erroring but not giving much of a reason for it.
When you upped the memory, was there actually enough memory for it to use?
Memory is increased to 64 GB on all 3 nodes, so that should not be the issue. The logs of the worker pod show that it is not able to resolve github.com, so it could be a DNS issue (output below):
(base) [root@k8s-master example]# free -mh
total used free shared buff/cache available
Mem: 62G 1.2G 59G 992K 1.2G 60G
Swap: 0B 0B 0B
(base) [root@k8s-master example]# kubectl logs workerpod
+ '[' '' ']'
+ '[' -e /opt/app/environment.yml ']'
no environment.yml
+ echo 'no environment.yml'
+ '[' '' ']'
EXTRA_PIP_PACKAGES environment variable found. Installing.
+ '[' 'fastparquet git+https://github.com/dask/distributed' ']'
+ echo 'EXTRA_PIP_PACKAGES environment variable found. Installing.'
+ /opt/conda/bin/pip install fastparquet git+https://github.com/dask/distributed
Collecting git+https://github.com/dask/distributed
Cloning https://github.com/dask/distributed to /tmp/pip-req-build-auau5085
Running command git clone -q https://github.com/dask/distributed /tmp/pip-req-build-auau5085
fatal: unable to access 'https://github.com/dask/distributed/': Could not resolve host: github.com
ERROR: Command errored out with exit status 128: git clone -q https://github.com/dask/distributed /tmp/pip-req-build-auau5085 Check the logs for full command output.
+ exec dask-worker --nthreads 2 --no-bokeh --memory-limit 6GB --death-timeout 60
/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py:252: UserWarning: The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard.
"The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard. "
distributed.nanny - INFO - Start Nanny at: 'tcp://10.32.0.3:45001'
distributed.worker - INFO - Start worker at: tcp://10.32.0.3:39147
distributed.worker - INFO - Listening to: tcp://10.32.0.3:39147
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:41719
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 6.00 GB
distributed.worker - INFO - Local Directory: /worker-2vb5q4k8
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:41719
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:41719
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:41719
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:41719
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:41719
distributed.nanny - INFO - Closing Nanny at 'tcp://10.32.0.3:45001'
distributed.worker - INFO - Stopping worker at tcp://10.32.0.3:39147
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
(base) [root@k8s-master example]#
DNS check while the worker pod was in the Running state:
(base) [root@k8s-master example]# kubectl exec workerpod cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5
(base) [root@k8s-master example]# kubectl exec workerpod nslookup github.com
OCI runtime exec failed: exec failed: container_linux.go:345: starting container process caused "exec: \"nslookup\": executable file not found in $PATH": unknown
command terminated with exit code 126
I'm not sure how to get the executable onto the $PATH of the dask worker pod. nslookup is installed and working on my master host:
(base) [root@k8s-master example]# nslookup github.com
Server: 172.31.0.2
Address: 172.31.0.2#53
Non-authoritative answer:
Name: github.com
Address: 140.82.114.3
Yes, it seems like your cluster is not able to access the internet. That is a requirement here.
OCI runtime exec failed: exec failed: container_linux.go:345: starting container process caused "exec: \"nslookup\": executable file not found in $PATH": unknown
This is failing because nslookup is not included in the base image. It's quite a minimal image, so you won't find many network tools. The easiest way to test connectivity is probably to use Python requests, as we know that will be available. Here's an example running Docker locally; it should be almost the same on k8s.
$ docker exec -it 8d22 python -c "import requests; print(requests.get('https://github.com').headers)"
{'Date': 'Fri, 22 Nov 2019 08:58:52 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked',
'Server': 'GitHub.com', 'Status': '200 OK', 'Vary': 'X-PJAX, Accept-Encoding', 'ETag': 'W/"8ec94cb60917f9348f3965fa3f6
341fe"', 'Cache-Control': 'max-age=0, private, must-revalidate', 'Set-Cookie': 'has_recent_activity=1; path=/; expires
=Fri, 22 Nov 2019 09:58:52 -0000, _octo=GH1.1.358802128.1574413132; domain=.github.com; path=/; expires=Mon, 22 Nov 20
21 08:58:52 -0000, logged_in=no; domain=.github.com; path=/; expires=Tue, 22 Nov 2039 08:58:52 -0000; secure; HttpOnly
, _gh_sess=NXBjTmhrNURKZnRrS294Q1llTDU2c25tVFlVVTNCZTJFKzIyeFN0KzJHM1lGdUR4d1F5Zzh6aWgySExFNXBOVDZjeXZHLzhZaHJCMFhEWjk
ra252NEZML2sySHRkdnh3TE8vUC9Ia21iRHFHYUNnQlVveDdRMTRndzV5OStoL0daandQQis1c0ppQ05RVDA3ZzFZNWNRPT0tLWxRUUhwOHJ2REpuNTc1c
Dhobk9hNEE9PQ%3D%3D--2f084d8c69d414d743a33055c134c010f40da5de; path=/; secure; HttpOnly', 'X-Request-Id': '73fb8c58-58
0f-4e74-b072-971356412e67', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Opti
ons': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '1; mode=block', 'Referrer-Policy': 'origin-whe
n-cross-origin, strict-origin-when-cross-origin', 'Expect-CT': 'max-age=2592000, report-uri="https://api.github.com/_p
rivate/browser/errors"', 'Content-Security-Policy': "default-src 'none'; base-uri 'self'; block-all-mixed-content; con
nect-src 'self' uploads.github.com www.githubstatus.com collector.githubapp.com api.github.com www.google-analytics.co
m github-cloud.s3.amazonaws.com github-production-repository-file-5c1aeb.s3.amazonaws.com github-production-upload-man
ifest-file-7fdce7.s3.amazonaws.com github-production-user-asset-6210df.s3.amazonaws.com wss://live.github.com; font-sr
c github.githubassets.com; form-action 'self' github.com gist.github.com; frame-ancestors 'none'; frame-src render.git
hubusercontent.com; img-src 'self' data: github.githubassets.com identicons.github.com collector.githubapp.com github-
cloud.s3.amazonaws.com *.githubusercontent.com customer-stories-feed.github.com spotlights-feed.github.com; manifest-s
rc 'self'; media-src 'none'; script-src github.githubassets.com; style-src 'unsafe-inline' github.githubassets.com", '
Content-Encoding': 'gzip', 'X-GitHub-Request-Id': 'A2E2:1D58E:B8DF62:11696C7:5DD7A34C'}
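On Kubernetes the equivalent check would be something like the following (a sketch, assuming the pod is named workerpod and is still running); a 200 status code means both DNS resolution and outbound HTTPS work from inside the pod:
$ kubectl exec workerpod -- python -c "import requests; print(requests.get('https://github.com').status_code)"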
Thanks so much for the reply. The nslookup issue is resolved after re-installing Kubernetes and dask; now I'm getting issues while building fastparquet on the worker pod. Logs below:
(base) [root@k8s-master example]# kubectl logs workerpod
+ '[' '' ']'
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
no environment.yml
+ '[' '' ']'
+ '[' 'fastparquet git+https://github.com/dask/distributed' ']'
+ echo 'EXTRA_PIP_PACKAGES environment variable found. Installing.'
+ /opt/conda/bin/pip install fastparquet git+https://github.com/dask/distributed
EXTRA_PIP_PACKAGES environment variable found. Installing.
Collecting git+https://github.com/dask/distributed
Cloning https://github.com/dask/distributed to /tmp/pip-req-build-9pgvdhjf
Running command git clone -q https://github.com/dask/distributed /tmp/pip-req-build-9pgvdhjf
Collecting fastparquet
Downloading https://files.pythonhosted.org/packages/58/49/dccb790fa17ab3fbf84a6b848050083c7a1899e9586000e34e3e4fbf5538/fastparquet-0.3.2.tar.gz (151kB)
Requirement already satisfied: click>=6.6 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (7.0)
Requirement already satisfied: cloudpickle>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (1.2.2)
Requirement already satisfied: dask>=2.7.0 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (2.8.0)
Requirement already satisfied: msgpack in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (0.6.2)
Requirement already satisfied: psutil>=5.0 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (5.6.5)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (2.1.0)
Requirement already satisfied: tblib in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (1.4.0)
Requirement already satisfied: toolz>=0.7.4 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (0.10.0)
Requirement already satisfied: tornado>=5 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (6.0.3)
Requirement already satisfied: zict>=0.1.3 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (1.0.0)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.0+8.g5b33d54c) (5.1.2)
Requirement already satisfied: pandas>=0.19 in /opt/conda/lib/python3.7/site-packages (from fastparquet) (0.25.2)
Collecting numba>=0.28 (from fastparquet)
Downloading https://files.pythonhosted.org/packages/57/66/7ebc88e87b4ddf9b1c204e24d467cb0d13a7a890ecb4f83d20949f768929/numba-0.46.0-cp37-cp37m-manylinux1_x86_64.whl (3.6MB)
Requirement already satisfied: numpy>=1.11 in /opt/conda/lib/python3.7/site-packages (from fastparquet) (1.17.3)
Collecting thrift>=0.11.0 (from fastparquet)
Downloading https://files.pythonhosted.org/packages/97/1e/3284d19d7be99305eda145b8aa46b0c33244e4a496ec66440dac19f8274d/thrift-0.13.0.tar.gz (59kB)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from fastparquet) (1.12.0)
Requirement already satisfied: heapdict in /opt/conda/lib/python3.7/site-packages (from zict>=0.1.3->distributed==2.8.0+8.g5b33d54c) (1.0.1)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.19->fastparquet) (2019.3)
Requirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.19->fastparquet) (2.8.1)
Collecting llvmlite>=0.30.0dev0 (from numba>=0.28->fastparquet)
Downloading https://files.pythonhosted.org/packages/1f/3e/642ffb29ed35ca5e93f171ba327452bdee81ec76f2d711ef0f15b411928a/llvmlite-0.30.0-cp37-cp37m-manylinux1_x86_64.whl (20.2MB)
Building wheels for collected packages: fastparquet, distributed, thrift
Building wheel for fastparquet (setup.py): started
Building wheel for fastparquet (setup.py): finished with status 'error'
ERROR: Command errored out with exit status 1:
command: /opt/conda/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-m6berq44/fastparquet/setup.py'"'"'; __file__='"'"'/tmp/pip-install-m6berq44/fastparquet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-tcex8z1e --python-tag cp37
cwd: /tmp/pip-install-m6berq44/fastparquet/
Complete output (61 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/__init__.py -> build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/api.py -> build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/compression.py -> build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/converted_types.py -> build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/core.py -> build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/dataframe.py -> build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/encoding.py -> build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/schema.py -> build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/thrift_structures.py -> build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/util.py -> build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/writer.py -> build/lib.linux-x86_64-3.7/fastparquet
running egg_info
writing fastparquet.egg-info/PKG-INFO
writing dependency_links to fastparquet.egg-info/dependency_links.txt
writing requirements to fastparquet.egg-info/requires.txt
writing top-level names to fastparquet.egg-info/top_level.txt
reading manifest file 'fastparquet.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching 'docs/_build'
writing manifest file 'fastparquet.egg-info/SOURCES.txt'
copying fastparquet/parquet.thrift -> build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/speedups.c -> build/lib.linux-x86_64-3.7/fastparquet
copying fastparquet/speedups.pyx -> build/lib.linux-x86_64-3.7/fastparquet
creating build/lib.linux-x86_64-3.7/fastparquet/benchmarks
copying fastparquet/benchmarks/columns.py -> build/lib.linux-x86_64-3.7/fastparquet/benchmarks
creating build/lib.linux-x86_64-3.7/fastparquet/parquet_thrift
copying fastparquet/parquet_thrift/__init__.py -> build/lib.linux-x86_64-3.7/fastparquet/parquet_thrift
creating build/lib.linux-x86_64-3.7/fastparquet/parquet_thrift/parquet
copying fastparquet/parquet_thrift/parquet/__init__.py -> build/lib.linux-x86_64-3.7/fastparquet/parquet_thrift/parquet
copying fastparquet/parquet_thrift/parquet/constants.py -> build/lib.linux-x86_64-3.7/fastparquet/parquet_thrift/parquet
copying fastparquet/parquet_thrift/parquet/ttypes.py -> build/lib.linux-x86_64-3.7/fastparquet/parquet_thrift/parquet
creating build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/__init__.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_api.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_aroundtrips.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_compression.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_converted_types.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_dataframe.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_encoding.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_output.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_partition_filters_specialstrings.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_read.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_schema.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_speedups.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_thrift_structures.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_util.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/test_with_n.py -> build/lib.linux-x86_64-3.7/fastparquet/test
copying fastparquet/test/util.py -> build/lib.linux-x86_64-3.7/fastparquet/test
running build_ext
building 'fastparquet.speedups' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/fastparquet
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.7m -I/opt/conda/lib/python3.7/site-packages/numpy/core/include -c fastparquet/speedups.c -o build/temp.linux-x86_64-3.7/fastparquet/speedups.o
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Failed building wheel for fastparquet
Running setup.py clean for fastparquet
Building wheel for distributed (setup.py): started
Building wheel for distributed (setup.py): finished with status 'done'
Created wheel for distributed: filename=distributed-2.8.0+8.g5b33d54c-cp37-none-any.whl size=568764 sha256=9712974396e1221fa5dd195616e85031da70894222c2c7ff574bcfb318b5f80c
Stored in directory: /tmp/pip-ephem-wheel-cache-v5jnd4bs/wheels/aa/21/a7/d9548d684f8e074360b7ad1bd8633843dba9658288b68b3dd5
Building wheel for thrift (setup.py): started
Building wheel for thrift (setup.py): finished with status 'done'
Created wheel for thrift: filename=thrift-0.13.0-cp37-none-any.whl size=154884 sha256=c32af6aa5c4cfced68fadc2997e173f14ed0595a4bb9bb407eb7ef62794fafd8
Stored in directory: /root/.cache/pip/wheels/02/a2/46/689ccfcf40155c23edc7cdbd9de488611c8fdf49ff34b1706e
Successfully built distributed thrift
Failed to build fastparquet
Installing collected packages: llvmlite, numba, thrift, fastparquet, distributed
Running setup.py install for fastparquet: started
Running setup.py install for fastparquet: finished with status 'error'
ERROR: Command errored out with exit status 1:
command: /opt/conda/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-m6berq44/fastparquet/setup.py'"'"'; __file__='"'"'/tmp/pip-install-m6berq44/fastparquet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-qrgpngpc/install-record.txt --single-version-externally-managed --compile
cwd: /tmp/pip-install-m6berq44/fastparquet/
Complete output (19 lines):
running install
running build
running build_py
running egg_info
writing fastparquet.egg-info/PKG-INFO
writing dependency_links to fastparquet.egg-info/dependency_links.txt
writing requirements to fastparquet.egg-info/requires.txt
writing top-level names to fastparquet.egg-info/top_level.txt
reading manifest file 'fastparquet.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching 'docs/_build'
writing manifest file 'fastparquet.egg-info/SOURCES.txt'
running build_ext
building 'fastparquet.speedups' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/fastparquet
gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.7m -I/opt/conda/lib/python3.7/site-packages/numpy/core/include -c fastparquet/speedups.c -o build/temp.linux-x86_64-3.7/fastparquet/speedups.o
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /opt/conda/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-m6berq44/fastparquet/setup.py'"'"'; __file__='"'"'/tmp/pip-install-m6berq44/fastparquet/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-qrgpngpc/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
+ exec dask-worker --nthreads 2 --no-bokeh --memory-limit 6GB --death-timeout 60
/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py:252: UserWarning: The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard.
"The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard. "
distributed.nanny - INFO - Start Nanny at: 'tcp://10.44.0.1:46597'
distributed.worker - INFO - Start worker at: tcp://10.44.0.1:36897
distributed.worker - INFO - Listening to: tcp://10.44.0.1:36897
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:42143
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 6.00 GB
distributed.worker - INFO - Local Directory: /worker-vlqvrk15
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:42143
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:42143
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:42143
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:42143
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:42143
distributed.nanny - INFO - Closing Nanny at 'tcp://10.44.0.1:46597'
distributed.worker - INFO - Stopping worker at tcp://10.44.0.1:36897
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
(base) [root@k8s-master example]#
Looks like you also need to specify gcc in the EXTRA_APT_PACKAGES env var.
After adding the below:
env:
  - name: EXTRA_APT_PACKAGES
    value: gcc
the gcc-related issue is resolved, but the worker pod is still showing an Error status.
(base) [root@k8s-master example]# kubectl get service,pods -o wide --all-namespaces
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 6h32m <none>
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 6h32m k8s-app=kube-dns
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default pod/workerpod 0/1 Error 0 16m 10.44.0.1 worker-node1 <none> <none>
kube-system pod/coredns-5644d7b6d9-82kqk 1/1 Running 0 6h32m 10.32.0.3 k8s-master <none> <none>
kube-system pod/coredns-5644d7b6d9-xh4pg 1/1 Running 0 6h32m 10.32.0.4 k8s-master <none> <none>
kube-system pod/etcd-k8s-master 1/1 Running 0 6h31m 172.16.0.76 k8s-master <none> <none>
kube-system pod/kube-apiserver-k8s-master 1/1 Running 0 6h31m 172.16.0.76 k8s-master <none> <none>
kube-system pod/kube-controller-manager-k8s-master 1/1 Running 0 6h31m 172.16.0.76 k8s-master <none> <none>
kube-system pod/kube-proxy-25nnf 1/1 Running 0 6h28m 172.16.0.114 worker-node2 <none> <none>
kube-system pod/kube-proxy-cr84h 1/1 Running 0 6h28m 172.16.0.31 worker-node1 <none> <none>
kube-system pod/kube-proxy-lvs9g 1/1 Running 0 6h32m 172.16.0.76 k8s-master <none> <none>
kube-system pod/kube-scheduler-k8s-master 1/1 Running 0 6h30m 172.16.0.76 k8s-master <none> <none>
kube-system pod/weave-net-d5jsg 2/2 Running 1 6h28m 172.16.0.31 worker-node1 <none> <none>
kube-system pod/weave-net-nnfzh 2/2 Running 0 6h29m 172.16.0.76 k8s-master <none> <none>
kube-system pod/weave-net-zcv8v 2/2 Running 1 6h28m 172.16.0.114 worker-node2 <none> <none>
(base) [root@k8s-master example]#
(base) [root@k8s-master example]# kubectl logs workerpod
+ '[' gcc ']'
+ echo 'EXTRA_APT_PACKAGES environment variable found. Installing.'
+ apt update -y
EXTRA_APT_PACKAGES environment variable found. Installing.
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Get:1 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:2 http://deb.debian.org/debian buster InRelease [122 kB]
Get:3 http://security.debian.org/debian-security buster/updates/main amd64 Packages [158 kB]
Get:4 http://deb.debian.org/debian buster-updates InRelease [49.3 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 Packages [7908 kB]
Fetched 8302 kB in 17s (497 kB/s)
Reading package lists...
Building dependency tree...
Reading state information...
21 packages can be upgraded. Run 'apt list --upgradable' to see them.
+ apt install -y gcc
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
binutils binutils-common binutils-x86-64-linux-gnu cpp cpp-8 gcc-8 libasan5
libatomic1 libbinutils libc-dev-bin libc6-dev libcc1-0 libgcc-8-dev libgomp1
libisl19 libitm1 liblsan0 libmpc3 libmpfr6 libmpx2 libquadmath0 libtsan0
libubsan1 linux-libc-dev manpages manpages-dev
Suggested packages:
binutils-doc cpp-doc gcc-8-locales gcc-multilib make autoconf automake
libtool flex bison gdb gcc-doc gcc-8-multilib gcc-8-doc libgcc1-dbg
libgomp1-dbg libitm1-dbg libatomic1-dbg libasan5-dbg liblsan0-dbg
libtsan0-dbg libubsan1-dbg libmpx2-dbg libquadmath0-dbg glibc-doc
man-browser
The following NEW packages will be installed:
binutils binutils-common binutils-x86-64-linux-gnu cpp cpp-8 gcc gcc-8
libasan5 libatomic1 libbinutils libc-dev-bin libc6-dev libcc1-0 libgcc-8-dev
libgomp1 libisl19 libitm1 liblsan0 libmpc3 libmpfr6 libmpx2 libquadmath0
libtsan0 libubsan1 linux-libc-dev manpages manpages-dev
0 upgraded, 27 newly installed, 0 to remove and 21 not upgraded.
Need to get 35.5 MB of archives.
After this operation, 135 MB of additional disk space will be used.
Get:1 http://deb.debian.org/debian buster/main amd64 manpages all 4.16-2 [1295 kB]
Get:2 http://security.debian.org/debian-security buster/updates/main amd64 linux-libc-dev amd64 4.19.67-2+deb10u2 [1234 kB]
Get:3 http://deb.debian.org/debian buster/main amd64 binutils-common amd64 2.31.1-16 [2073 kB]
Get:4 http://deb.debian.org/debian buster/main amd64 libbinutils amd64 2.31.1-16 [478 kB]
Get:5 http://deb.debian.org/debian buster/main amd64 binutils-x86-64-linux-gnu amd64 2.31.1-16 [1823 kB]
Get:6 http://deb.debian.org/debian buster/main amd64 binutils amd64 2.31.1-16 [56.8 kB]
Get:7 http://deb.debian.org/debian buster/main amd64 libisl19 amd64 0.20-2 [587 kB]
Get:8 http://deb.debian.org/debian buster/main amd64 libmpfr6 amd64 4.0.2-1 [775 kB]
Get:9 http://deb.debian.org/debian buster/main amd64 libmpc3 amd64 1.1.0-1 [41.3 kB]
Get:10 http://deb.debian.org/debian buster/main amd64 cpp-8 amd64 8.3.0-6 [8914 kB]
Get:11 http://deb.debian.org/debian buster/main amd64 cpp amd64 4:8.3.0-1 [19.4 kB]
Get:12 http://deb.debian.org/debian buster/main amd64 libcc1-0 amd64 8.3.0-6 [46.6 kB]
Get:13 http://deb.debian.org/debian buster/main amd64 libgomp1 amd64 8.3.0-6 [75.8 kB]
Get:14 http://deb.debian.org/debian buster/main amd64 libitm1 amd64 8.3.0-6 [27.7 kB]
Get:15 http://deb.debian.org/debian buster/main amd64 libatomic1 amd64 8.3.0-6 [9032 B]
Get:16 http://deb.debian.org/debian buster/main amd64 libasan5 amd64 8.3.0-6 [362 kB]
Get:17 http://deb.debian.org/debian buster/main amd64 liblsan0 amd64 8.3.0-6 [131 kB]
Get:18 http://deb.debian.org/debian buster/main amd64 libtsan0 amd64 8.3.0-6 [283 kB]
Get:19 http://deb.debian.org/debian buster/main amd64 libubsan1 amd64 8.3.0-6 [120 kB]
Get:20 http://deb.debian.org/debian buster/main amd64 libmpx2 amd64 8.3.0-6 [11.4 kB]
Get:21 http://deb.debian.org/debian buster/main amd64 libquadmath0 amd64 8.3.0-6 [133 kB]
Get:22 http://deb.debian.org/debian buster/main amd64 libgcc-8-dev amd64 8.3.0-6 [2298 kB]
Get:23 http://deb.debian.org/debian buster/main amd64 gcc-8 amd64 8.3.0-6 [9452 kB]
Get:24 http://deb.debian.org/debian buster/main amd64 gcc amd64 4:8.3.0-1 [5196 B]
Get:25 http://deb.debian.org/debian buster/main amd64 libc-dev-bin amd64 2.28-10 [275 kB]
Get:26 http://deb.debian.org/debian buster/main amd64 libc6-dev amd64 2.28-10 [2691 kB]
Get:27 http://deb.debian.org/debian buster/main amd64 manpages-dev all 4.16-2 [2232 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 35.5 MB in 1min 1s (583 kB/s)
Selecting previously unselected package manpages.
(Reading database ... 12557 files and directories currently installed.)
Preparing to unpack .../00-manpages_4.16-2_all.deb ...
Unpacking manpages (4.16-2) ...
Selecting previously unselected package binutils-common:amd64.
Preparing to unpack .../01-binutils-common_2.31.1-16_amd64.deb ...
Unpacking binutils-common:amd64 (2.31.1-16) ...
Selecting previously unselected package libbinutils:amd64.
Preparing to unpack .../02-libbinutils_2.31.1-16_amd64.deb ...
Unpacking libbinutils:amd64 (2.31.1-16) ...
Selecting previously unselected package binutils-x86-64-linux-gnu.
Preparing to unpack .../03-binutils-x86-64-linux-gnu_2.31.1-16_amd64.deb ...
Unpacking binutils-x86-64-linux-gnu (2.31.1-16) ...
Selecting previously unselected package binutils.
Preparing to unpack .../04-binutils_2.31.1-16_amd64.deb ...
Unpacking binutils (2.31.1-16) ...
Selecting previously unselected package libisl19:amd64.
Preparing to unpack .../05-libisl19_0.20-2_amd64.deb ...
Unpacking libisl19:amd64 (0.20-2) ...
Selecting previously unselected package libmpfr6:amd64.
Preparing to unpack .../06-libmpfr6_4.0.2-1_amd64.deb ...
Unpacking libmpfr6:amd64 (4.0.2-1) ...
Selecting previously unselected package libmpc3:amd64.
Preparing to unpack .../07-libmpc3_1.1.0-1_amd64.deb ...
Unpacking libmpc3:amd64 (1.1.0-1) ...
Selecting previously unselected package cpp-8.
Preparing to unpack .../08-cpp-8_8.3.0-6_amd64.deb ...
Unpacking cpp-8 (8.3.0-6) ...
Selecting previously unselected package cpp.
Preparing to unpack .../09-cpp_4%3a8.3.0-1_amd64.deb ...
Unpacking cpp (4:8.3.0-1) ...
Selecting previously unselected package libcc1-0:amd64.
Preparing to unpack .../10-libcc1-0_8.3.0-6_amd64.deb ...
Unpacking libcc1-0:amd64 (8.3.0-6) ...
Selecting previously unselected package libgomp1:amd64.
Preparing to unpack .../11-libgomp1_8.3.0-6_amd64.deb ...
Unpacking libgomp1:amd64 (8.3.0-6) ...
Selecting previously unselected package libitm1:amd64.
Preparing to unpack .../12-libitm1_8.3.0-6_amd64.deb ...
Unpacking libitm1:amd64 (8.3.0-6) ...
Selecting previously unselected package libatomic1:amd64.
Preparing to unpack .../13-libatomic1_8.3.0-6_amd64.deb ...
Unpacking libatomic1:amd64 (8.3.0-6) ...
Selecting previously unselected package libasan5:amd64.
Preparing to unpack .../14-libasan5_8.3.0-6_amd64.deb ...
Unpacking libasan5:amd64 (8.3.0-6) ...
Selecting previously unselected package liblsan0:amd64.
Preparing to unpack .../15-liblsan0_8.3.0-6_amd64.deb ...
Unpacking liblsan0:amd64 (8.3.0-6) ...
Selecting previously unselected package libtsan0:amd64.
Preparing to unpack .../16-libtsan0_8.3.0-6_amd64.deb ...
Unpacking libtsan0:amd64 (8.3.0-6) ...
Selecting previously unselected package libubsan1:amd64.
Preparing to unpack .../17-libubsan1_8.3.0-6_amd64.deb ...
Unpacking libubsan1:amd64 (8.3.0-6) ...
Selecting previously unselected package libmpx2:amd64.
Preparing to unpack .../18-libmpx2_8.3.0-6_amd64.deb ...
Unpacking libmpx2:amd64 (8.3.0-6) ...
Selecting previously unselected package libquadmath0:amd64.
Preparing to unpack .../19-libquadmath0_8.3.0-6_amd64.deb ...
Unpacking libquadmath0:amd64 (8.3.0-6) ...
Selecting previously unselected package libgcc-8-dev:amd64.
Preparing to unpack .../20-libgcc-8-dev_8.3.0-6_amd64.deb ...
Unpacking libgcc-8-dev:amd64 (8.3.0-6) ...
Selecting previously unselected package gcc-8.
Preparing to unpack .../21-gcc-8_8.3.0-6_amd64.deb ...
Unpacking gcc-8 (8.3.0-6) ...
Selecting previously unselected package gcc.
Preparing to unpack .../22-gcc_4%3a8.3.0-1_amd64.deb ...
Unpacking gcc (4:8.3.0-1) ...
Selecting previously unselected package libc-dev-bin.
Preparing to unpack .../23-libc-dev-bin_2.28-10_amd64.deb ...
Unpacking libc-dev-bin (2.28-10) ...
Selecting previously unselected package linux-libc-dev:amd64.
Preparing to unpack .../24-linux-libc-dev_4.19.67-2+deb10u2_amd64.deb ...
Unpacking linux-libc-dev:amd64 (4.19.67-2+deb10u2) ...
Selecting previously unselected package libc6-dev:amd64.
Preparing to unpack .../25-libc6-dev_2.28-10_amd64.deb ...
Unpacking libc6-dev:amd64 (2.28-10) ...
Selecting previously unselected package manpages-dev.
Preparing to unpack .../26-manpages-dev_4.16-2_all.deb ...
Unpacking manpages-dev (4.16-2) ...
Setting up manpages (4.16-2) ...
Setting up binutils-common:amd64 (2.31.1-16) ...
Setting up linux-libc-dev:amd64 (4.19.67-2+deb10u2) ...
Setting up libgomp1:amd64 (8.3.0-6) ...
Setting up libasan5:amd64 (8.3.0-6) ...
Setting up libmpfr6:amd64 (4.0.2-1) ...
Setting up libquadmath0:amd64 (8.3.0-6) ...
Setting up libmpc3:amd64 (1.1.0-1) ...
Setting up libatomic1:amd64 (8.3.0-6) ...
Setting up libmpx2:amd64 (8.3.0-6) ...
Setting up libubsan1:amd64 (8.3.0-6) ...
Setting up libisl19:amd64 (0.20-2) ...
Setting up libbinutils:amd64 (2.31.1-16) ...
Setting up cpp-8 (8.3.0-6) ...
Setting up libc-dev-bin (2.28-10) ...
Setting up libcc1-0:amd64 (8.3.0-6) ...
Setting up liblsan0:amd64 (8.3.0-6) ...
Setting up libitm1:amd64 (8.3.0-6) ...
Setting up binutils-x86-64-linux-gnu (2.31.1-16) ...
Setting up libtsan0:amd64 (8.3.0-6) ...
Setting up manpages-dev (4.16-2) ...
Setting up binutils (2.31.1-16) ...
Setting up libgcc-8-dev:amd64 (8.3.0-6) ...
Setting up cpp (4:8.3.0-1) ...
Setting up libc6-dev:amd64 (2.28-10) ...
Setting up gcc-8 (8.3.0-6) ...
Setting up gcc (4:8.3.0-1) ...
Processing triggers for libc-bin (2.28-10) ...
+ '[' -e /opt/app/environment.yml ']'
+ echo 'no environment.yml'
+ '[' '' ']'
+ '[' 'fastparquet git+https://github.com/dask/distributed' ']'
+ echo 'EXTRA_PIP_PACKAGES environment variable found. Installing.'
no environment.yml
EXTRA_PIP_PACKAGES environment variable found. Installing.
+ /opt/conda/bin/pip install fastparquet git+https://github.com/dask/distributed
Collecting git+https://github.com/dask/distributed
Cloning https://github.com/dask/distributed to /tmp/pip-req-build-yszrzcnf
Running command git clone -q https://github.com/dask/distributed /tmp/pip-req-build-yszrzcnf
Collecting fastparquet
Downloading https://files.pythonhosted.org/packages/58/49/dccb790fa17ab3fbf84a6b848050083c7a1899e9586000e34e3e4fbf5538/fastparquet-0.3.2.tar.gz (151kB)
Requirement already satisfied: click>=6.6 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (7.0)
Requirement already satisfied: cloudpickle>=0.2.2 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (1.2.2)
Requirement already satisfied: dask>=2.7.0 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (2.8.0)
Requirement already satisfied: msgpack in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (0.6.2)
Requirement already satisfied: psutil>=5.0 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (5.6.5)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (2.1.0)
Requirement already satisfied: tblib in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (1.4.0)
Requirement already satisfied: toolz>=0.7.4 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (0.10.0)
Requirement already satisfied: tornado>=5 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (6.0.3)
Requirement already satisfied: zict>=0.1.3 in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (1.0.0)
Requirement already satisfied: pyyaml in /opt/conda/lib/python3.7/site-packages (from distributed==2.8.1+3.ga285267d) (5.1.2)
Requirement already satisfied: pandas>=0.19 in /opt/conda/lib/python3.7/site-packages (from fastparquet) (0.25.2)
Collecting numba>=0.28 (from fastparquet)
Downloading https://files.pythonhosted.org/packages/57/66/7ebc88e87b4ddf9b1c204e24d467cb0d13a7a890ecb4f83d20949f768929/numba-0.46.0-cp37-cp37m-manylinux1_x86_64.whl (3.6MB)
Requirement already satisfied: numpy>=1.11 in /opt/conda/lib/python3.7/site-packages (from fastparquet) (1.17.3)
Collecting thrift>=0.11.0 (from fastparquet)
Downloading https://files.pythonhosted.org/packages/97/1e/3284d19d7be99305eda145b8aa46b0c33244e4a496ec66440dac19f8274d/thrift-0.13.0.tar.gz (59kB)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from fastparquet) (1.12.0)
Requirement already satisfied: heapdict in /opt/conda/lib/python3.7/site-packages (from zict>=0.1.3->distributed==2.8.1+3.ga285267d) (1.0.1)
Requirement already satisfied: python-dateutil>=2.6.1 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.19->fastparquet) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/lib/python3.7/site-packages (from pandas>=0.19->fastparquet) (2019.3)
Collecting llvmlite>=0.30.0dev0 (from numba>=0.28->fastparquet)
Downloading https://files.pythonhosted.org/packages/1f/3e/642ffb29ed35ca5e93f171ba327452bdee81ec76f2d711ef0f15b411928a/llvmlite-0.30.0-cp37-cp37m-manylinux1_x86_64.whl (20.2MB)
Building wheels for collected packages: fastparquet, distributed, thrift
Building wheel for fastparquet (setup.py): started
Building wheel for fastparquet (setup.py): finished with status 'done'
Created wheel for fastparquet: filename=fastparquet-0.3.2-cp37-cp37m-linux_x86_64.whl size=276808 sha256=cca6b01eacdc3d2180bb70e6ebf8a5f5b31f4b15771b04919bbd3353564f9c6a
Stored in directory: /root/.cache/pip/wheels/b9/36/13/01416a760ddcab0eb8281ec9c9ffcbed945c9b831647c8b904
Building wheel for distributed (setup.py): started
Building wheel for distributed (setup.py): finished with status 'done'
Created wheel for distributed: filename=distributed-2.8.1+3.ga285267d-cp37-none-any.whl size=569076 sha256=b359838b03314bbb4ef849e748d8b76bfa067da042cccb242347271db4c4c050
Stored in directory: /tmp/pip-ephem-wheel-cache-ektmufm7/wheels/aa/21/a7/d9548d684f8e074360b7ad1bd8633843dba9658288b68b3dd5
Building wheel for thrift (setup.py): started
Building wheel for thrift (setup.py): finished with status 'done'
Created wheel for thrift: filename=thrift-0.13.0-cp37-none-any.whl size=154884 sha256=e8a35252f5581d04a5b334cc37950ebd76c68de61f30a65b0738d392a373e27d
Stored in directory: /root/.cache/pip/wheels/02/a2/46/689ccfcf40155c23edc7cdbd9de488611c8fdf49ff34b1706e
Successfully built fastparquet distributed thrift
Installing collected packages: llvmlite, numba, thrift, fastparquet, distributed
Found existing installation: distributed 2.8.0
Uninstalling distributed-2.8.0:
Successfully uninstalled distributed-2.8.0
Successfully installed distributed-2.8.1+3.ga285267d fastparquet-0.3.2 llvmlite-0.30.0 numba-0.46.0 thrift-0.13.0
+ exec dask-worker --nthreads 2 --no-bokeh --memory-limit 6GB --death-timeout 60
/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py:252: UserWarning: The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard.
"The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard. "
distributed.nanny - INFO - Start Nanny at: 'tcp://10.44.0.1:32869'
distributed.worker - INFO - Start worker at: tcp://10.44.0.1:41721
distributed.worker - INFO - Listening to: tcp://10.44.0.1:41721
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:33017
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 6.00 GB
distributed.worker - INFO - Local Directory: /worker-77vchaja
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:33017
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:33017
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:33017
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:33017
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:33017
distributed.nanny - INFO - Closing Nanny at 'tcp://10.44.0.1:32869'
distributed.worker - INFO - Stopping worker at tcp://10.44.0.1:41721
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
Looks like the worker is now behaving correctly but failing to connect to the scheduler. Could you share the scheduler logs?
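In the meantime, a quick way to check whether the worker pod can reach the scheduler port at all is to open a plain TCP connection from inside the pod (a sketch, assuming the pod is still in the Running state; substitute whatever scheduler address your current run prints, e.g. tcp://172.16.0.76:33017):
$ kubectl exec workerpod -- python -c "import socket; socket.create_connection(('172.16.0.76', 33017), timeout=5); print('scheduler port reachable')"
If that times out or is refused, the problem is network connectivity between the pod network and the host running the scheduler, rather than anything in Dask itself.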
Scheduler logs below:
(base) [root@k8s-master example]#
(base) [root@k8s-master example]# kubectl -n kube-system logs kube-scheduler-k8s-master
I1126 15:34:16.048901 1 serving.go:319] Generated self-signed cert in-memory
W1126 15:34:18.709418 1 authentication.go:262] Unable to get configmap/extension-apiserver-authentication in kube-system. Usually fixed by 'kubectl create rolebinding -n kube-system ROLEBINDING_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
W1126 15:34:18.709438 1 authentication.go:199] Error looking up in-cluster authentication configuration: configmaps "extension-apiserver-authentication" is forbidden: User "system:kube-scheduler" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
W1126 15:34:18.709447 1 authentication.go:200] Continuing without authentication configuration. This may treat all requests as anonymous.
W1126 15:34:18.709453 1 authentication.go:201] To require authentication configuration lookup to succeed, set --authentication-tolerate-lookup-failure=false
I1126 15:34:18.714711 1 server.go:148] Version: v1.16.3
I1126 15:34:18.714796 1 defaults.go:91] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
W1126 15:34:18.724908 1 authorization.go:47] Authorization is disabled
W1126 15:34:18.724921 1 authentication.go:79] Authentication is disabled
I1126 15:34:18.724930 1 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
I1126 15:34:18.725582 1 secure_serving.go:123] Serving securely on 127.0.0.1:10259
E1126 15:34:18.726754 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:kube-scheduler" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
E1126 15:34:18.727678 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.StatefulSet: statefulsets.apps is forbidden: User "system:kube-scheduler" cannot list resource "statefulsets" in API group "apps" at the cluster scope
E1126 15:34:18.727685 1 reflector.go:123] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:250: Failed to list *v1.Pod: pods is forbidden: User "system:kube-scheduler" cannot list resource "pods" in API group "" at the cluster scope
E1126 15:34:18.727682 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.ReplicaSet: replicasets.apps is forbidden: User "system:kube-scheduler" cannot list resource "replicasets" in API group "apps" at the cluster scope
E1126 15:34:18.727695 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:kube-scheduler" cannot list resource "persistentvolumes" in API group "" at the cluster scope
E1126 15:34:18.727743 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Service: services is forbidden: User "system:kube-scheduler" cannot list resource "services" in API group "" at the cluster scope
E1126 15:34:18.727819 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.PersistentVolumeClaim: persistentvolumeclaims is forbidden: User "system:kube-scheduler" cannot list resource "persistentvolumeclaims" in API group "" at the cluster scope
E1126 15:34:18.727828 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Node: nodes is forbidden: User "system:kube-scheduler" cannot list resource "nodes" in API group "" at the cluster scope
E1126 15:34:18.727875 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.StorageClass: storageclasses.storage.k8s.io is forbidden: User "system:kube-scheduler" cannot list resource "storageclasses" in API group "storage.k8s.io" at the cluster scope
E1126 15:34:18.727907 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:kube-scheduler" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E1126 15:34:18.728054 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.ReplicationController: replicationcontrollers is forbidden: User "system:kube-scheduler" cannot list resource "replicationcontrollers" in API group "" at the cluster scope
E1126 15:34:19.729111 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.CSINode: csinodes.storage.k8s.io is forbidden: User "system:kube-scheduler" cannot list resource "csinodes" in API group "storage.k8s.io" at the cluster scope
E1126 15:34:19.729119 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.StatefulSet: statefulsets.apps is forbidden: User "system:kube-scheduler" cannot list resource "statefulsets" in API group "apps" at the cluster scope
E1126 15:34:19.729697 1 reflector.go:123] k8s.io/kubernetes/cmd/kube-scheduler/app/server.go:250: Failed to list *v1.Pod: pods is forbidden: User "system:kube-scheduler" cannot list resource "pods" in API group "" at the cluster scope
E1126 15:34:19.730823 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.ReplicaSet: replicasets.apps is forbidden: User "system:kube-scheduler" cannot list resource "replicasets" in API group "apps" at the cluster scope
E1126 15:34:19.731811 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:kube-scheduler" cannot list resource "persistentvolumes" in API group "" at the cluster scope
E1126 15:34:19.732952 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Service: services is forbidden: User "system:kube-scheduler" cannot list resource "services" in API group "" at the cluster scope
E1126 15:34:19.733921 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.PersistentVolumeClaim: persistentvolumeclaims is forbidden: User "system:kube-scheduler" cannot list resource "persistentvolumeclaims" in API group "" at the cluster scope
E1126 15:34:19.735081 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.Node: nodes is forbidden: User "system:kube-scheduler" cannot list resource "nodes" in API group "" at the cluster scope
E1126 15:34:19.736108 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.StorageClass: storageclasses.storage.k8s.io is forbidden: User "system:kube-scheduler" cannot list resource "storageclasses" in API group "storage.k8s.io" at the cluster scope
E1126 15:34:19.737238 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1beta1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:kube-scheduler" cannot list resource "poddisruptionbudgets" in API group "policy" at the cluster scope
E1126 15:34:19.738284 1 reflector.go:123] k8s.io/client-go/informers/factory.go:134: Failed to list *v1.ReplicationController: replicationcontrollers is forbidden: User "system:kube-scheduler" cannot list resource "replicationcontrollers" in API group "" at the cluster scope
I1126 15:34:20.825768 1 leaderelection.go:241] attempting to acquire leader lease kube-system/kube-scheduler...
I1126 15:34:20.832408 1 leaderelection.go:251] successfully acquired lease kube-system/kube-scheduler
E1126 15:34:28.839414 1 factory.go:585] pod is already present in the activeQ
(base) [root@k8s-master example]#
That doesn't look like the Dask scheduler logs.
As these issues seem to be related to your setup rather than to dask-kubernetes itself, I would recommend taking a look at the Dask scheduler logs to see whether you can identify the issue yourself.
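For example (a sketch, assuming your client script is dask_example.py — adjust the filename to your setup), you could capture the output of the locally started scheduler and follow it:

nohup python dask_example.py > nohup.out 2>&1 &   # scheduler and client output lands in nohup.out
tail -f nohup.out                                  # watch the scheduler log live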
Thanks for the reply, Jacob. For setup I followed https://kubernetes.dask.org/en/latest/, which lists the requirements for a native installation (without Helm) via pip install dask-kubernetes (see the snippet below):
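(For reference, the native install described there is simply the following, run on the machine where the client/scheduler Python session runs:)

pip install dask-kubernetes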
Steps I followed:
The output of the code shows the Dask scheduler logs:
(base) [root@k8s-master example]# cat nohup.out
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://172.16.0.76:40641
distributed.scheduler - INFO - Receive client connection: Client-932e205e-1062-11ea-a09d-12bd5ffa93ff
distributed.core - INFO - Starting established connection
(base) [root@k8s-master example]#
Workerpod logs:
(base) [root@k8s-master example]# kubectl logs pod/workerpod
+ '[' gcc ']'
...
Successfully installed distributed-2.8.1+7.g856bba7c fastparquet-0.3.2 llvmlite-0.30.0 numba-0.46.0 thrift-0.13.0
+ exec dask-worker --nthreads 2 --no-bokeh --memory-limit 6GB --death-timeout 60
/opt/conda/lib/python3.7/site-packages/distributed/cli/dask_worker.py:252: UserWarning: The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard.
"The --bokeh/--no-bokeh flag has been renamed to --dashboard/--no-dashboard. "
distributed.nanny - INFO - Start Nanny at: 'tcp://10.32.0.2:34363'
distributed.worker - INFO - Start worker at: tcp://10.32.0.2:38961
distributed.worker - INFO - Listening to: tcp://10.32.0.2:38961
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:34895
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 6.00 GB
distributed.worker - INFO - Local Directory: /worker-2xaci4as
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:34895
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:34895
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:34895
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:34895
distributed.worker - INFO - Waiting to connect to: tcp://172.16.0.76:34895
distributed.nanny - INFO - Closing Nanny at 'tcp://10.32.0.2:34363'
distributed.worker - INFO - Stopping worker at tcp://10.32.0.2:38961
distributed.worker - INFO - Closed worker has not yet started: None
distributed.dask_worker - INFO - Timed out starting worker
distributed.dask_worker - INFO - End worker
(base) [root@k8s-master example]# cat nohup.out
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://172.16.0.76:34895
distributed.scheduler - INFO - Receive client connection: Client-882faa9e-108f-11ea-a662-12bd5ffa93ff
distributed.core - INFO - Starting established connection
I do not see a Dask scheduler pod created on my system.
(base) [root@k8s-master example]# kubectl get nodes,service,pods --all-namespaces
NAME STATUS ROLES AGE VERSION
node/k8s-master Ready master 146m v1.16.3
node/worker-node1 Ready <none> 145m v1.16.2
node/worker-node2 Ready <none> 145m v1.16.2
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 146m
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 146m
NAMESPACE NAME READY STATUS RESTARTS AGE
default pod/workerpod 0/1 Error 0 143m
kube-system pod/coredns-5644d7b6d9-ht9dq 1/1 Running 0 146m
kube-system pod/coredns-5644d7b6d9-vt6c9 1/1 Running 0 146m
kube-system pod/etcd-k8s-master 1/1 Running 0 145m
kube-system pod/kube-apiserver-k8s-master 1/1 Running 0 145m
kube-system pod/kube-controller-manager-k8s-master 1/1 Running 0 145m
kube-system pod/kube-proxy-htvlr 1/1 Running 0 145m
kube-system pod/kube-proxy-mswm2 1/1 Running 0 146m
kube-system pod/kube-proxy-vls4w 1/1 Running 0 145m
kube-system pod/kube-scheduler-k8s-master 1/1 Running 0 145m
kube-system pod/weave-net-kgrqz 2/2 Running 0 144m
kube-system pod/weave-net-lfndv 2/2 Running 0 144m
kube-system pod/weave-net-vgpxs 2/2 Running 0 144m
(base) [root@k8s-master example]#
As I understand it, dask-kubernetes is starting the distributed scheduler on the master node rather than as a scheduler pod on the Kubernetes cluster, so the worker pod is unable to connect to a dask-scheduler pod. Please correct me if that's not the case.
By default dask-kubernetes starts a scheduler within your Python session, so the workers must be able to send traffic to wherever you are running Python. You can specify deploy_mode='remote' to have dask-kubernetes launch the scheduler within a pod. But your local Python session will still need to be able to connect to the service that it creates (a LoadBalancer by default).
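A minimal sketch of the remote deploy mode (assuming a dask-kubernetes release from around this time and a worker pod spec saved as worker-spec.yml — the filename is illustrative):

from dask.distributed import Client
from dask_kubernetes import KubeCluster

# deploy_mode="remote" runs the scheduler in its own pod instead of the local Python session
cluster = KubeCluster.from_yaml("worker-spec.yml", deploy_mode="remote")
cluster.scale(2)           # request two worker pods
client = Client(cluster)   # the local session must still be able to reach the scheduler Service

Note that if the client runs outside the cluster (for example on the master host), an in-cluster service name such as dask-root-<id>.default will generally not resolve from the host's DNS; the service has to be exposed (LoadBalancer/NodePort) or forwarded (for example with kubectl port-forward) for the client to reach it.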
Thanks for the reply, Jacob. I tried with deploy_mode='remote'; I can see the scheduler created as a dask-root service, and the worker pod shows status Completed, but it is not producing any result and the output shows the errors below:
(base) [root@k8s-master example]# kubectl get nodes,service,pods --all-namespaces -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node/k8s-master Ready master 18m v1.16.3 172.16.0.76 <none> Amazon Linux 2 4.14.154-128.181.amzn2.x86_64 docker://18.9.9
node/worker-node1 Ready worker 16m v1.16.2 172.16.0.31 <none> Amazon Linux 2 4.14.146-120.181.amzn2.x86_64 docker://18.9.9
node/worker-node2 Ready worker 16m v1.16.2 172.16.0.114 <none> Amazon Linux 2 4.14.146-120.181.amzn2.x86_64 docker://18.9.9
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
default service/dask-root-c69fb20b-d ClusterIP 10.104.149.32 <none> 8786/TCP,8787/TCP 11m dask.org/cluster-name=dask-root-c69fb20b-d,dask.org/component=scheduler
default service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 18m <none>
kube-system service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 18m k8s-app=kube-dns
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default pod/workerpod 0/1 Completed 0 14m 10.44.0.1 worker-node1 <none> <none>
kube-system pod/coredns-5644d7b6d9-l5xgh 1/1 Running 0 18m 10.32.0.2 k8s-master <none> <none>
kube-system pod/coredns-5644d7b6d9-wr5cz 1/1 Running 0 18m 10.32.0.3 k8s-master <none> <none>
kube-system pod/etcd-k8s-master 1/1 Running 0 17m 172.16.0.76 k8s-master <none> <none>
kube-system pod/kube-apiserver-k8s-master 1/1 Running 0 17m 172.16.0.76 k8s-master <none> <none>
kube-system pod/kube-controller-manager-k8s-master 1/1 Running 0 17m 172.16.0.76 k8s-master <none> <none>
kube-system pod/kube-proxy-p5khx 1/1 Running 0 16m 172.16.0.114 worker-node2 <none> <none>
kube-system pod/kube-proxy-ss464 1/1 Running 0 18m 172.16.0.76 k8s-master <none> <none>
kube-system pod/kube-proxy-w8st5 1/1 Running 0 16m 172.16.0.31 worker-node1 <none> <none>
kube-system pod/kube-scheduler-k8s-master 1/1 Running 0 17m 172.16.0.76 k8s-master <none> <none>
kube-system pod/weave-net-g4xsq 2/2 Running 0 17m 172.16.0.76 k8s-master <none> <none>
kube-system pod/weave-net-hd54z 2/2 Running 1 16m 172.16.0.114 worker-node2 <none> <none>
kube-system pod/weave-net-pjw8x 2/2 Running 1 16m 172.16.0.31 worker-node1 <none> <none>
Output of dask array code:
(base) [root@k8s-master example]# cat nohup.out
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 222, in connect
_raise(error)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 205, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://dask-root-c69fb20b-d.default:8786' after 10 s: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "dask_example.py", line 5, in <module>
cluster = KubeCluster.from_yaml('worker-spec_2.yml', deploy_mode="remote")
File "/root/miniconda3/lib/python3.7/site-packages/dask_kubernetes/core.py", line 566, in from_yaml
return cls.from_dict(d, **kwargs)
File "/root/miniconda3/lib/python3.7/site-packages/dask_kubernetes/core.py", line 528, in from_dict
return cls(make_pod_from_dict(pod_spec), **kwargs)
File "/root/miniconda3/lib/python3.7/site-packages/dask_kubernetes/core.py", line 380, in __init__
super().__init__(**self.kwargs)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 242, in __init__
self.sync(self._start)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 162, in sync
return sync(self.loop, func, *args, **kwargs)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 334, in sync
raise exc.with_traceback(tb)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 318, in f
result[0] = yield future
File "/root/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/root/miniconda3/lib/python3.7/site-packages/dask_kubernetes/core.py", line 500, in _start
await super()._start()
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 273, in _start
await super()._start()
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 59, in _start
comm = await self.scheduler_comm.live_comm()
File "/root/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 637, in live_comm
connection_args=self.connection_args,
File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 231, in connect
_raise(error)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 205, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://dask-root-c69fb20b-d.default:8786' after 10 s: Timed out trying to connect to 'tcp://dask-root-c69fb20b-d.default:8786' after 10 s: [Errno -2] Name or service not known
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 222, in connect
_raise(error)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 205, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://dask-root-c69fb20b-d.default:8786' after 10 s: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.7/contextlib.py", line 130, in __exit__
self.gen.throw(type, value, traceback)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 186, in ignoring
yield
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 574, in close_clusters
cluster.close(timeout=10)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 83, in close
return self.sync(self._close, callback_timeout=timeout)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/cluster.py", line 162, in sync
return sync(self.loop, func, *args, **kwargs)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 334, in sync
raise exc.with_traceback(tb)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 318, in f
result[0] = yield future
File "/root/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/root/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py", line 368, in _close
await self.scheduler_comm.close(close_workers=True)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 675, in send_recv_from_rpc
comm = await self.live_comm()
File "/root/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 637, in live_comm
connection_args=self.connection_args,
File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 231, in connect
_raise(error)
File "/root/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 205, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://dask-root-c69fb20b-d.default:8786' after 10 s: Timed out trying to connect to 'tcp://dask-root-c69fb20b-d.default:8786' after 10 s: [Errno -2] Name or service not known
2019-12-02 16:25:35,780 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa51b34e950>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask
WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa51b34e950>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask
2019-12-02 16:25:35,781 WARNING Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa52a9cc990>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask
WARNING:urllib3.connectionpool:Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa52a9cc990>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask
2019-12-02 16:25:35,781 WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa52a9cc3d0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask
WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa52a9cc3d0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connection.py", line 159, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 80, in create_connection
raise err
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/util/connection.py", line 70, in create_connection
sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 839, in _validate_conn
conn.connect()
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connection.py", line 301, in connect
conn = self._new_conn()
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connection.py", line 168, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fa52a9ccfd0>: Failed to establish a new connection: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.7/weakref.py", line 648, in _exitfunc
f()
File "/root/miniconda3/lib/python3.7/weakref.py", line 572, in __call__
return info.func(*info.args, **(info.kwargs or {}))
File "/root/miniconda3/lib/python3.7/site-packages/dask_kubernetes/core.py", line 623, in _cleanup_resources
pods = core_api.list_namespaced_pod(namespace, label_selector=format_labels(labels))
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 12372, in list_namespaced_pod
(data) = self.list_namespaced_pod_with_http_info(namespace, **kwargs)
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 12472, in list_namespaced_pod_with_http_info
collection_formats=collection_formats)
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
_request_timeout=_request_timeout)
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 355, in request
headers=headers)
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 231, in GET
query_params=query_params)
File "/root/miniconda3/lib/python3.7/site-packages/kubernetes/client/rest.py", line 205, in request
headers=headers)
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/request.py", line 68, in request
**urlopen_kw)
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/request.py", line 89, in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/poolmanager.py", line 324, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 667, in urlopen
**response_kw)
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 667, in urlopen
**response_kw)
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 667, in urlopen
**response_kw)
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/root/miniconda3/lib/python3.7/site-packages/urllib3/util/retry.py", line 399, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='localhost', port=443): Max retries exceeded with url: /api/v1/namespaces/default/pods?labelSelector=foo%3Dbar%2Cdask.org%2Fcluster-name%3Ddask-root-c69fb20b-d%2Cuser%3Droot%2Capp%3Ddask (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa52a9ccfd0>: Failed to establish a new connection: [Errno 111] Connection refused'))
ERROR:asyncio:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7fa53f3a4d10>
ERROR:asyncio:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7fa52b28ab90>
ERROR:asyncio:Unclosed connector
connections: ['[(<aiohttp.client_proto.ResponseHandler object at 0x7fa52aa1b670>, 1140.130852326)]']
connector: <aiohttp.connector.TCPConnector object at 0x7fa52b286dd0>
workerpod logs:
(base) [root@k8s-master example]# kubectl logs pod/workerpod
+ '[' gcc ']'
...
...
Successfully installed distributed-2.8.1+20.gf15abc58 fastparquet-0.3.2 llvmlite-0.30.0 numba-0.46.0 thrift-0.13.0
+ exec dask-scheduler --idle-timeout '5 minutes'
distributed.scheduler - INFO - -----------------------------------------------
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.scheduler - INFO - Local Directory: /tmp/scheduler-wb_uyvs8
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://10.44.0.1:8786
distributed.scheduler - INFO - dashboard at: :8787
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
distributed.scheduler - INFO - End scheduler at 'tcp://10.44.0.1:8786'
(base) [root@k8s-master example]#
Do I need to create a ClusterRoleBinding so that this dask array example can communicate when running on a native Kubernetes and dask-kubernetes installation?
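(For reference, a sketch of the kind of namespaced RBAC dask-kubernetes typically needs for the service account the client runs under — names and namespace here are illustrative, and the dask-kubernetes docs are the authoritative source for the exact rules:)

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: dask-kubernetes      # illustrative name
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log", "services"]
  verbs: ["get", "list", "watch", "create", "delete"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: dask-kubernetes
  namespace: default
subjects:
- kind: ServiceAccount
  name: default              # the service account the client/scheduler runs under
  namespace: default
roleRef:
  kind: Role
  name: dask-kubernetes
  apiGroup: rbac.authorization.k8s.io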
Given the age of this issue I'm going to close it out as unresolved. Apologies that we never got to the bottom of this.
Hi all, thanks so much for such a great project. I am trying to run the example provided at https://kubernetes.dask.org/en/latest/, but while running the dask array example the worker pod stays in Pending state and the Python code loops through the error below.
Pod status on kube8s
Error Message