d2iq-archive / kubernetes-mesos

A Kubernetes Framework for Apache Mesos

nil pointer dereference error in mesos procurement.go #768

Closed ravilr closed 8 years ago

ravilr commented 8 years ago

@sttts @jdef version: v0.7.2-v1.1.5

Looks like a race condition arises when there are pods using nodeSelector and mesos slave attributes are exposed as k8s node labels.

I0205 02:16:14.065151 6249 errorhandler.go:59] Error scheduling k8s-router-mv96c: No suitable offers for pod/task; retrying
I0205 02:16:14.165429 6249 queuer.go:164] attempting to yield a pod
E0205 02:16:15.065531 6249 util.go:82] Recovered from panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/pkg/util/util.go:76
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/pkg/util/util.go:64
/usr/local/go/src/runtime/asm_amd64.s:402
/usr/local/go/src/runtime/panic.go:387
/usr/local/go/src/runtime/panic.go:42
/usr/local/go/src/runtime/sigpanic_unix.go:26
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/podtask/procurement.go:132
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/podtask/procurement.go:96
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/podtask/procurement.go:108
:11
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/components/algorithm/podschedulers/fcfs.go:100
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/components/scheduler.go:101
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/offers/offers.go:472
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/offers/offers.go:508
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/runtime/util.go:115
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/runtime/util.go:116
/usr/local/go/src/runtime/asm_amd64.s:2232

ravilr commented 8 years ago

Apart from the nil pointer dereference on the *api.Node labels, I'm also observing that the node registrator doesn't add a mesos slave host back into the k8s node registry after the host goes offline and comes back some time later. The slave host gets deleted from the k8s API on the slave-down event:

I0205 02:03:05.514811 6249 service.go:710] deleting node "s1.www.com" from registry

but it never seems to get registered again when it comes back up and starts sending offers. Had to restart the k8sm-scheduler to recover from this.

ravilr commented 8 years ago

Also seeing this in the controller-manager logs:

E0205 23:53:43.761186 1 statusupdater.go:68] Error listing slaves without kubelet: Get http://master1.www.com:5050/state: dial tcp 1.1.1.1:5050: connection refused

Mesos cluster version: 0.24. But it looks like the code falls back to /state.json, so the above message is harmless?

jdef commented 8 years ago

yes, it should be falling back to state.json

jdef commented 8 years ago

Pretty sure that the problem is that the Fit func is passing a nil and the procurement funcs aren't checking for that:

$ find contrib/mesos -type f -exec grep -Hn -e 'Fit(' \{\} \;
contrib/mesos/pkg/scheduler/components/scheduler.go:102:                                return !task.Has(podtask.Launched) && ps.Fit(task, offer, nil)
contrib/mesos/pkg/scheduler/components/algorithm/podschedulers/types.go:40:     Fit(*podtask.T, *mesosproto.Offer, *api.Node) bool
contrib/mesos/pkg/scheduler/components/algorithm/podschedulers/fcfs.go:99:func (fps *fcfsPodScheduler) Fit(t *podtask.T, offer *mesosproto.Offer, n *api.Node) bool {
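
To illustrate the failure mode described above, here is a minimal sketch of a nil guard in a function that reads node labels. It is not the code from the actual fix: `Node` is a hypothetical stand-in for api.Node, `matchesSelector` is an invented helper, and rejecting a nil node when a selector is set is an assumption about the desired behavior.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for api.Node; only the labels matter here.
type Node struct {
	Labels map[string]string
}

// matchesSelector reports whether the node's labels satisfy every key/value
// pair in selector. Without the nil check, n.Labels would panic for a nil node.
func matchesSelector(n *Node, selector map[string]string) bool {
	if len(selector) == 0 {
		return true // nothing to match, so the node is irrelevant
	}
	if n == nil {
		return false // assumption: no node information means no match
	}
	for k, v := range selector {
		if got, ok := n.Labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	sel := map[string]string{"rack": "a"}
	fmt.Println(matchesSelector(nil, sel))
	fmt.Println(matchesSelector(&Node{Labels: map[string]string{"rack": "a"}}, sel))
	fmt.Println(matchesSelector(nil, nil))
}
```

The guard sits in the callee because a caller such as the `ps.Fit(task, offer, nil)` call shown above may legitimately have no node to pass.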
jdef commented 8 years ago

fixed https://github.com/kubernetes/kubernetes/pull/20936

ravilr commented 8 years ago

Pulled in the above fix and it seems to be working in my cluster, running a couple of pods with nodeSelector without any panics.

jdef commented 8 years ago

awesome - thanks for verifying!

jdef commented 8 years ago

cherry-picked into 0.7.3, removing label