NTHU-LSALAB / KubeShare

Share GPU between Pods in Kubernetes
Apache License 2.0

kubeshare-node-daemon pod in gpu node is crashed #1

Closed Iamlovingit closed 4 years ago

Iamlovingit commented 4 years ago

Hi, I followed the documentation to install KubeShare, but the daemon pod on the GPU node fails to run. My Docker version is:

nvidia-docker  version
NVIDIA Docker: 2.2.2
Client: Docker Engine - Community
 Version:           19.03.4
 API version:       1.40
 Go version:        go1.12.10
 Git commit:        9013bf583a
 Built:             Fri Oct 18 15:53:51 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.4
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.10
  Git commit:       9013bf583a
  Built:            Fri Oct 18 15:52:23 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 nvidia:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Is my Docker version too new? Which version is supported?

ncy9371 commented 4 years ago

Hi, could you provide the logs of both kubeshare-device-manager and crashed kubeshare-node-daemon?

kubectl -n kube-system logs kubeshare-device-manager
kubectl -n kube-system logs $(kubectl -n kube-system get pod -o name -l lsalab=kubeshare-node-daemon --field-selector 'status.phase=Failed') -c config-client
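Since the node-daemon pod runs more than one container, the same idea can be extended to print the logs of each container in turn. The following is only a rough sketch: the function names `container_names` and `dump_pod_logs` and the `NS` variable are hypothetical and not part of KubeShare. It tries `--previous` first, which shows the last terminated instance of a restarting container, and falls back to the current logs.

```shell
#!/bin/sh
# Sketch only: container_names, dump_pod_logs and NS are hypothetical helpers,
# not part of KubeShare.
NS=kube-system

# container_names POD: print the pod's container names, one per line.
container_names() {
    kubectl -n "$NS" get pod "$1" -o jsonpath='{.spec.containers[*].name}' | tr ' ' '\n'
}

# dump_pod_logs POD: print a header and the logs for every container in POD.
dump_pod_logs() {
    pod=$1
    for c in $(container_names "$pod"); do
        echo "=== $pod / $c ==="
        # Prefer the previous (crashed) instance; fall back to the current one.
        kubectl -n "$NS" logs "$pod" -c "$c" --previous 2>/dev/null ||
            kubectl -n "$NS" logs "$pod" -c "$c"
    done
}
```

For example, `dump_pod_logs kubeshare-node-daemon-rqhkz` would print one section per container of that pod.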

I have updated the README to clarify that the docker and nvidia-docker2 versions should be < 19, because KubeShare restricts GPU device access in containers by adding the environment variable 'NVIDIA_VISIBLE_DEVICES' (https://github.com/NVIDIA/nvidia-container-runtime).

Iamlovingit commented 4 years ago

Thanks for your reply. Here is the kubeshare-device-manager log:

kubectl -n kube-system logs kubeshare-device-manager

W0305 07:11:22.105403       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0305 07:11:22.128064       1 controller.go:89] Creating event broadcaster
I0305 07:11:22.128238       1 controller.go:106] Setting up event handlers
I0305 07:11:22.128279       1 controller.go:148] Starting SharePod controller
I0305 07:11:22.128289       1 controller.go:151] Waiting for informer caches to sync
I0305 07:11:22.128392       1 reflector.go:150] Starting reflector *v1.SharePod (30s) from pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105
I0305 07:11:22.128410       1 reflector.go:185] Listing and watching *v1.SharePod from pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105
I0305 07:11:22.128450       1 reflector.go:150] Starting reflector *v1.Pod (30s) from pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105
I0305 07:11:22.128475       1 reflector.go:185] Listing and watching *v1.Pod from pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105
I0305 07:11:22.157507       1 controller.go:417] Processing object: kube-state-metrics-5cb5c6986b-48crt
I0305 07:11:22.157540       1 controller.go:417] Processing object: kube-scheduler-k8s-master
I0305 07:11:22.157548       1 controller.go:417] Processing object: kubeshare-node-daemon-42n68
I0305 07:11:22.157558       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:11:22.157569       1 controller.go:417] Processing object: model-deploy-rnhkt-deployment-59c7594758-clslg
I0305 07:11:22.157576       1 controller.go:417] Processing object: speaker-99k5w
I0305 07:11:22.157582       1 controller.go:417] Processing object: speaker-d5ppj
I0305 07:11:22.157589       1 controller.go:417] Processing object: kubeshare-device-manager
I0305 07:11:22.157595       1 controller.go:417] Processing object: autoscaler-hpa-6746679d8d-vfn9h
I0305 07:11:22.157604       1 controller.go:417] Processing object: controller-6cd646bb5b-kpfsh
I0305 07:11:22.157676       1 controller.go:417] Processing object: eventing-controller-5d98dc9989-pk9zm
I0305 07:11:22.157709       1 controller.go:417] Processing object: elasticsearch-logging-0
I0305 07:11:22.157740       1 controller.go:417] Processing object: calico-node-vlns6
I0305 07:11:22.157762       1 controller.go:417] Processing object: kube-proxy-xdfsr
I0305 07:11:22.157789       1 controller.go:417] Processing object: config-map-hm57m-deployment-6b4f596bc4-kgbt5
I0305 07:11:22.157809       1 controller.go:417] Processing object: node-exporter-znmxp
I0305 07:11:22.157827       1 controller.go:417] Processing object: kube-controller-manager-k8s-master
I0305 07:11:22.157837       1 controller.go:417] Processing object: nfsmodel-flower-shawn-1-predictor-default-kqbr9-deployment4whpx
I0305 07:11:22.157879       1 controller.go:417] Processing object: kibana-logging-b5d75f556-vn2p2
I0305 07:11:22.157889       1 controller.go:417] Processing object: nfs-model-manager-sta5371918badac8f63d802cab9ff00049-deplopsjzq
I0305 07:11:22.157903       1 controller.go:417] Processing object: node-exporter-qgv6p
I0305 07:11:22.157911       1 controller.go:417] Processing object: zipkin-54b7b4cb87-glj7p
I0305 07:11:22.157918       1 controller.go:417] Processing object: kube-proxy-kcdns
I0305 07:11:22.157927       1 controller.go:417] Processing object: kube-proxy-z7jrj
I0305 07:11:22.157933       1 controller.go:417] Processing object: calico-node-hrq9k
I0305 07:11:22.157939       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:11:22.157946       1 controller.go:417] Processing object: istio-pilot-7c9d4ff745-jw4dg
I0305 07:11:22.157958       1 controller.go:417] Processing object: etcd-k8s-master
I0305 07:11:22.157965       1 controller.go:417] Processing object: networking-istio-b54f6b445-sj66x
I0305 07:11:22.157976       1 controller.go:417] Processing object: nvidia-device-plugin-daemonset-fph6c
I0305 07:11:22.157984       1 controller.go:417] Processing object: account-server-f5wnl-deployment-545c7489f6-ngdd9
I0305 07:11:22.157993       1 controller.go:417] Processing object: cluster-local-gateway-67c89b5dd9-2tkzw
I0305 07:11:22.158000       1 controller.go:417] Processing object: node-exporter-427zt
I0305 07:11:22.158013       1 controller.go:417] Processing object: kubeshare-scheduler
I0305 07:11:22.158025       1 controller.go:417] Processing object: nvidia-device-plugin-daemonset-mdbnw
I0305 07:11:22.158037       1 controller.go:417] Processing object: activator-5cb7c7f5b-jnhq2
I0305 07:11:22.158044       1 controller.go:417] Processing object: kubeshare-node-daemon-244bv
I0305 07:11:22.158054       1 controller.go:417] Processing object: imc-dispatcher-67dd977c6d-4vc7b
I0305 07:11:22.158062       1 controller.go:417] Processing object: model-manager-62pc9-deployment-6bfd5fd46b-4zvtx
I0305 07:11:22.158071       1 controller.go:417] Processing object: pytorch-cifar10-batching-predictor-default-dhhzk-deploymenkpflt
I0305 07:11:22.158081       1 controller.go:417] Processing object: istio-ingressgateway-75bd464c5d-42k88
I0305 07:11:22.158086       1 controller.go:417] Processing object: elasticsearch-logging-1
I0305 07:11:22.158094       1 controller.go:417] Processing object: controller-6d4c5459f6-5zvbq
I0305 07:11:22.158099       1 controller.go:417] Processing object: webhook-9474b654b-fscbq
I0305 07:11:22.158106       1 controller.go:417] Processing object: kube-apiserver-k8s-master
I0305 07:11:22.158111       1 controller.go:417] Processing object: imc-controller-676ccb778-8zrhh
I0305 07:11:22.158118       1 controller.go:417] Processing object: prometheus-system-1
I0305 07:11:22.158127       1 controller.go:417] Processing object: autoscaler-df6489797-9djjt
I0305 07:11:22.158134       1 controller.go:417] Processing object: calico-node-ms86k
I0305 07:11:22.158147       1 controller.go:417] Processing object: speaker-4ks66
I0305 07:11:22.158161       1 controller.go:417] Processing object: kfserving-controller-manager-0
I0305 07:11:22.158169       1 controller.go:417] Processing object: sources-controller-b8774f9cc-hjfjp
I0305 07:11:22.158175       1 controller.go:417] Processing object: coredns-5c98db65d4-k8c99
I0305 07:11:22.158181       1 controller.go:417] Processing object: coredns-5c98db65d4-mpxgk
I0305 07:11:22.158189       1 controller.go:417] Processing object: prometheus-system-0
I0305 07:11:22.158196       1 controller.go:417] Processing object: calico-kube-controllers-56cd854695-skt9j
I0305 07:11:22.158203       1 controller.go:417] Processing object: eventing-webhook-69d44558dd-f528k
I0305 07:11:22.158209       1 controller.go:417] Processing object: grafana-79bcd7c778-jkcsd
I0305 07:11:22.228425       1 shared_informer.go:227] caches populated
I0305 07:11:22.228521       1 controller.go:164] Starting workers
I0305 07:11:22.228534       1 controller.go:170] Started workers
I0305 07:11:22.228568       1 config.go:52] Start listening on 0.0.0.0:9797...
I0305 07:11:22.228873       1 config.go:64] Waiting for clients...
I0305 07:11:22.812127       1 controller.go:417] Processing object: kubeshare-device-manager
I0305 07:11:29.571366       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:11:52.130469       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:11:52.157599       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:11:59.749234       1 controller.go:417] Processing object: kubeshare-scheduler
I0305 07:12:22.130626       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:12:22.157877       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:12:52.130770       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:12:52.158146       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:13:22.130936       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:13:22.158408       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:13:28.347664       1 controller.go:417] Processing object: nfs-model-manager-sta5371918badac8f63d802cab9ff00049-deplopsjzq
I0305 07:13:29.409090       1 controller.go:417] Processing object: nfs-model-manager-sta5371918badac8f63d802cab9ff00049-deplopsjzq
I0305 07:13:41.545758       1 controller.go:417] Processing object: nfs-model-manager-sta5371918badac8f63d802cab9ff00049-deplopsjzq
I0305 07:13:52.131077       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:13:52.158696       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:14:21.494790       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:14:22.131234       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:14:22.159018       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:14:23.546408       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:14:35.697780       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:14:52.131382       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:14:52.159298       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:15:22.131532       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:15:22.159574       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:15:52.131696       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:15:52.159874       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:16:22.131864       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:16:22.160162       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:16:52.132006       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:16:52.160448       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:17:22.132165       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:17:22.160727       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:17:52.132326       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:17:52.161037       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:18:22.132499       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:18:22.161328       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:18:31.332341       1 controller.go:417] Processing object: nfs-model-manager-sta5371918badac8f63d802cab9ff00049-deplopsjzq
I0305 07:18:32.395458       1 controller.go:417] Processing object: nfs-model-manager-sta5371918badac8f63d802cab9ff00049-deplopsjzq
I0305 07:18:43.548932       1 controller.go:417] Processing object: nfs-model-manager-sta5371918badac8f63d802cab9ff00049-deplopsjzq
I0305 07:18:52.132638       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:18:52.161612       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:19:22.132812       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:19:22.161871       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:19:23.131723       1 reflector.go:418] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Watch close - *v1.SharePod total 0 items received
I0305 07:19:30.853420       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:19:31.870540       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:19:43.695402       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:19:52.132983       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:19:52.162163       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:20:22.133150       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:20:22.162464       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:20:52.133284       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:20:52.162714       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:21:04.158569       1 reflector.go:418] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Watch close - *v1.Pod total 15 items received
I0305 07:21:22.133417       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:21:22.162938       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:21:52.133590       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:21:52.163229       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:22:22.133762       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:22:22.163544       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:22:26.086090       1 controller.go:417] Processing object: kubeshare-node-daemon-42n68
I0305 07:22:52.133903       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:22:52.163858       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:23:22.134059       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:23:22.164112       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:23:42.736320       1 controller.go:417] Processing object: kubeshare-node-daemon-244bv
I0305 07:23:43.730258       1 controller.go:417] Processing object: nfs-model-manager-sta5371918badac8f63d802cab9ff00049-deplopsjzq
I0305 07:23:44.795910       1 controller.go:417] Processing object: nfs-model-manager-sta5371918badac8f63d802cab9ff00049-deplopsjzq
I0305 07:23:52.134216       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:23:52.164357       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:23:55.551845       1 controller.go:417] Processing object: nfs-model-manager-sta5371918badac8f63d802cab9ff00049-deplopsjzq
I0305 07:24:22.134348       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:24:22.164661       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:24:38.926482       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:24:39.961426       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:24:52.134486       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:24:52.164928       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:24:52.709432       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:25:19.133655       1 reflector.go:418] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Watch close - *v1.SharePod total 0 items received
I0305 07:25:22.134629       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:25:22.165200       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:25:51.203176       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:25:52.134775       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:25:52.165465       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:25:52.242963       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:26:12.636110       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:26:15.689546       1 controller.go:417] Processing object: kube-proxy-z7jrj
I0305 07:26:22.134934       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:26:22.165745       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:26:23.690228       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:26:25.511487       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:26:52.135091       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:26:52.166022       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:26:55.055653       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:26:56.689009       1 controller.go:417] Processing object: kube-proxy-z7jrj
I0305 07:27:06.690567       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:27:19.512421       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:27:22.135222       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:27:22.166272       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:27:22.689577       1 controller.go:417] Processing object: kube-proxy-z7jrj
I0305 07:27:35.687819       1 controller.go:417] Processing object: kube-proxy-z7jrj
I0305 07:27:50.062727       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:27:52.135381       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:27:52.166553       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:28:03.689683       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:28:22.135537       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:28:22.166838       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:28:25.160464       1 reflector.go:418] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Watch close - *v1.Pod total 22 items received
I0305 07:28:32.688537       1 controller.go:417] Processing object: kube-proxy-z7jrj
I0305 07:28:33.441485       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:28:45.688251       1 controller.go:417] Processing object: kube-proxy-z7jrj
I0305 07:28:45.983992       1 controller.go:417] Processing object: nfs-model-manager-sta5371918badac8f63d802cab9ff00049-deplopsjzq
I0305 07:28:47.042833       1 controller.go:417] Processing object: nfs-model-manager-sta5371918badac8f63d802cab9ff00049-deplopsjzq
I0305 07:28:52.135677       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:28:52.167136       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:28:57.883220       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:29:11.688815       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:29:22.135825       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:29:22.167416       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:29:45.751590       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:29:46.791622       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:29:52.135974       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:29:52.167713       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:30:00.690528       1 controller.go:417] Processing object: hdfsmodel-flower-shawn-123-predictor-default-gpsgr-deploymklch4
I0305 07:30:08.688195       1 controller.go:417] Processing object: kube-proxy-z7jrj
I0305 07:30:21.456837       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:30:22.136143       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:30:22.167994       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:30:22.687831       1 controller.go:417] Processing object: kube-proxy-z7jrj
I0305 07:30:52.031076       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:30:52.136298       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:30:52.168292       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:31:06.688057       1 controller.go:417] Processing object: kubeshare-node-daemon-rqhkz
I0305 07:31:22.136444       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:31:22.168580       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:31:52.136619       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:31:52.168876       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:32:22.136768       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:32:22.169160       1 reflector.go:268] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: forcing resync
I0305 07:32:50.135465       1 reflector.go:418] pkg/mod/k8s.io/client-go@v0.17.2/tools/cache/reflector.go:105: Watch close - *v1.SharePod total 0 items received

daemon log:

2020/03/05 09:08:32 Loading NVML

Iamlovingit commented 4 years ago

If the problem is the Docker version, I'll try reinstalling Docker later.

ncy9371 commented 4 years ago

Hi, but I think that if the Docker version were wrong, the problem would appear only after creating a SharePod. I should add a Docker version check to the initialization phase in the future. BTW, the node-daemon log you provided above suggests the daemon was still at its initialization step. I think it would print some messages if NVML failed. Could you check whether there is anything more in the daemon log? Thanks!
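A version check of the kind mentioned above could look roughly like the following. This is a minimal sketch, not KubeShare's implementation: the function name `docker_version_supported` is hypothetical, and the threshold (major version below 19) is taken from the README note quoted earlier in this thread.

```shell
#!/bin/sh
# Sketch only: docker_version_supported is a hypothetical helper.
# The "< 19" threshold is an assumption taken from this thread.

# docker_version_supported VERSION
# Returns 0 when the major component of VERSION is below 19, 1 otherwise.
docker_version_supported() {
    major=${1%%.*}                # "19.03.4" -> "19"
    case $major in
        ''|*[!0-9]*) return 1 ;;  # empty or non-numeric: treat as unsupported
    esac
    [ "$major" -lt 19 ]
}

# Usage on a node (needs a running Docker daemon):
#   v=$(docker version --format '{{.Server.Version}}')
#   docker_version_supported "$v" || echo "unsupported Docker version: $v" >&2
```

With the versions from this thread, 19.03.4 would be rejected and 18.09.7 accepted.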

Iamlovingit commented 4 years ago

Sorry for getting back to you late.

I have checked all three containers of the kubeshare-node-daemon pod. There are some messages in gemini-scheduler:

kubectl logs -n kube-system kubeshare-node-daemon-rqhkz -c gemini-scheduler
/usr/bin/nvidia-smi
[launcher] scheduler started on 0.0.0.0:49901
[launcher] scheduler started on 0.0.0.0:49902
2020-03-05 07:25:50.950682 Gemini I/ There are 0 clients in the system...
2020-03-05 07:25:50.950933 Gemini I/ Monitor thread created.
2020-03-05 07:25:50.950958 Gemini I/ Waiting for incoming connection
2020-03-05 07:25:50.950985 Gemini I/ Watching '/kubeshare/scheduler/config'.
[launcher] scheduler started on 0.0.0.0:49904
2020-03-05 07:25:50.952255 Gemini I/ There are 0 clients in the system...
2020-03-05 07:25:50.952538 Gemini I/ Monitor thread created.
2020-03-05 07:25:50.952560 Gemini I/ Waiting for incoming connection
2020-03-05 07:25:50.952584 Gemini I/ Watching '/kubeshare/scheduler/config'.
[launcher] scheduler started on 0.0.0.0:49903
2020-03-05 07:25:50.953586 Gemini I/ There are 0 clients in the system...
2020-03-05 07:25:50.953821 Gemini I/ Monitor thread created.
2020-03-05 07:25:50.953843 Gemini I/ Waiting for incoming connection
2020-03-05 07:25:50.953865 Gemini I/ Watching '/kubeshare/scheduler/config'.
2020-03-05 07:25:50.955070 Gemini I/ There are 0 clients in the system...
2020-03-05 07:25:50.955294 Gemini I/ Monitor thread created.
2020-03-05 07:25:50.955309 Gemini I/ Waiting for incoming connection
2020-03-05 07:25:50.955347 Gemini I/ Watching '/kubeshare/scheduler/config'.

Finally, which docker and nvidia-docker versions do you recommend? Is version 18 OK?

ncy9371 commented 4 years ago

Hi, the log from the gemini-scheduler container looks good. The latest supported docker and nvidia-docker2 version for the NVIDIA GPU device plugin is 18.09.7.

Iamlovingit commented 4 years ago

Thanks, I installed an older Docker version and now it works! Closing.