Apart from the nil pointer dereference on *api.Node labels, I'm also observing that the node registrator doesn't add a Mesos slave host back into the k8s node registry after it goes offline and comes back some time later. The slave host gets deleted from the k8s API on the slave-down event:

I0205 02:03:05.514811 6249 service.go:710] deleting node "s1.www.com" from registry

but it never seems to get registered again once the slave comes back up and starts sending offers. I had to restart the k8sm-scheduler to recover from this.
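For context, the missing step would look roughly like the hypothetical sketch below: on an offer from a slave whose node object was deleted, recreate the node. The onOffer wiring and client calls are illustrative assumptions against the v1.1-era client API, not the scheduler's actual code path.

```go
package sketch

import (
	"github.com/mesos/mesos-go/mesosproto"

	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/api/errors"
	client "k8s.io/kubernetes/pkg/client/unversioned"
)

// onOffer re-adds a node that was deleted on a slave-down event once the
// slave starts offering resources again. Hypothetical; not the real code path.
func onOffer(offer *mesosproto.Offer, c *client.Client) error {
	host := offer.GetHostname()
	if _, err := c.Nodes().Get(host); err == nil {
		return nil // still registered; nothing to do
	} else if !errors.IsNotFound(err) {
		return err // unexpected API error
	}
	// the slave came back: put it into the k8s node registry again
	_, err := c.Nodes().Create(&api.Node{
		ObjectMeta: api.ObjectMeta{Name: host},
	})
	return err
}
```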
Also seeing this in the controller-manager logs:

E0205 23:53:43.761186 1 statusupdater.go:68] Error listing slaves without kubelet: Get http://master1.www.com:5050/state: dial tcp 1.1.1.1:5050: connection refused

Mesos cluster version: 0.24. It looks like the code falls back to /state.json, though, so is the above message harmless?
yes, it should be falling back to state.json
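For illustration, the fallback amounts to something like the sketch below. fetchMasterState is a hypothetical standalone helper, not the actual statusupdater code; only the /state and /state.json endpoint names come from the thread.

```go
package sketch

import (
	"fmt"
	"io"
	"net/http"
)

// fetchMasterState tries the newer /state endpoint first, then falls back
// to the legacy /state.json served by older masters such as Mesos 0.24.
func fetchMasterState(master string) ([]byte, error) {
	for _, endpoint := range []string{"/state", "/state.json"} {
		resp, err := http.Get("http://" + master + endpoint)
		if err != nil {
			continue // unreachable or refused; try the next endpoint
		}
		body, readErr := io.ReadAll(resp.Body)
		resp.Body.Close()
		if readErr == nil && resp.StatusCode == http.StatusOK {
			return body, nil
		}
	}
	return nil, fmt.Errorf("no usable state endpoint on %s", master)
}
```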
Pretty sure the problem is that Fit is being called with a nil *api.Node and the procurement funcs aren't checking for that:
$ find contrib/mesos -type f -exec grep -Hn -e 'Fit(' \{\} \;
contrib/mesos/pkg/scheduler/components/scheduler.go:102: return !task.Has(podtask.Launched) && ps.Fit(task, offer, nil)
contrib/mesos/pkg/scheduler/components/algorithm/podschedulers/types.go:40: Fit(*podtask.T, *mesosproto.Offer, *api.Node) bool
contrib/mesos/pkg/scheduler/components/algorithm/podschedulers/fcfs.go:99:func (fps *fcfsPodScheduler) Fit(t *podtask.T, offer *mesosproto.Offer, n *api.Node) bool {
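The shape of the fix is a nil check before touching node labels. A minimal sketch, assuming the *api.Node from the signatures above; the function below is illustrative, not the actual patch in procurement.go:

```go
package sketch

import (
	"k8s.io/kubernetes/pkg/api"
	"k8s.io/kubernetes/pkg/labels"
)

// fitsNodeSelector stands in for the label-matching step inside procurement.
func fitsNodeSelector(nodeSelector map[string]string, n *api.Node) bool {
	if len(nodeSelector) == 0 {
		return true // no selector, so any offer's node fits
	}
	// scheduler.go:102 calls ps.Fit(task, offer, nil), so n may be nil;
	// reading n.Labels without this guard is the reported panic.
	if n == nil {
		return false
	}
	return labels.SelectorFromSet(labels.Set(nodeSelector)).Matches(labels.Set(n.Labels))
}
```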
Pulled in the above fix and it seems to be working in my cluster, running a couple of pods with nodeSelector, without any panics.
awesome - thanks for verifying!
cherry-picked into 0.7.3, removing label
@sttts @jdef version: v0.7.2-v1.1.5
Looks like some kind of race condition arises when there are pods using nodeSelector and Mesos slave attributes are exposed as k8s node labels.
I0205 02:16:14.065151 6249 errorhandler.go:59] Error scheduling k8s-router-mv96c: No suitable offers for pod/task; retrying
I0205 02:16:14.165429 6249 queuer.go:164] attempting to yield a pod
E0205 02:16:15.065531 6249 util.go:82] Recovered from panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/pkg/util/util.go:76
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/pkg/util/util.go:64
/usr/local/go/src/runtime/asm_amd64.s:402
/usr/local/go/src/runtime/panic.go:387
/usr/local/go/src/runtime/panic.go:42
/usr/local/go/src/runtime/sigpanic_unix.go:26
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/podtask/procurement.go:132
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/podtask/procurement.go:96
/var/builds/workspace/69819-v3-component/BUILD_CONTAINER/rhel7/label/DOCKER-LOW/app_root/kubernetes-1.1.5-v0.7.2/_output/local/go/src/k8s.io/kubernetes/contrib/mesos/pkg/scheduler/podtask/procurement.go:108
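For reference, the attribute-to-label exposure involved in that race looks roughly like the sketch below. This is an illustration, not the actual registrator code; the k8s.mesosphere.io/attribute- prefix and the helper name are assumptions here.

```go
package sketch

import "github.com/mesos/mesos-go/mesosproto"

// nodeLabelsFor shows how text-valued slave attributes could surface as
// node labels; the k8s.mesosphere.io/attribute- prefix is an assumption.
func nodeLabelsFor(attrs []*mesosproto.Attribute) map[string]string {
	ls := map[string]string{}
	for _, a := range attrs {
		// only TEXT attributes map cleanly onto string-valued labels
		if a.GetType() == mesosproto.Value_TEXT {
			ls["k8s.mesosphere.io/attribute-"+a.GetName()] = a.GetText().GetValue()
		}
	}
	return ls
}
```

If a pod's nodeSelector targets one of these labels before the corresponding node object exists, Fit ends up matching against a nil node, which is exactly the dereference in the procurement.go frames above.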