d2iq-archive / kubernetes-mesos

A Kubernetes Framework for Apache Mesos

master election of k8sm scheduler's broken in v0.7.2-1.1.5 #787

Closed ravilr closed 8 years ago

ravilr commented 8 years ago

@jdef @s-urbaniak

When multiple k8sm scheduler instances are running, all of them register for master election with the same 'id', leading to a death spiral of all scheduler instances.

I0211 23:02:18.638476 25929 service.go:586] registering for election at /mesos/k8sm/framework/Kubernetes/leader with id 14732d8d4c8e1382_k8sm-executor

Previously, each scheduler instance got its own uid (with the same executor group), and that uid was used in master election: https://github.com/mesosphere/kubernetes/blob/v0.7.0-v1.1.1/contrib/mesos/pkg/scheduler/service/service.go#L562

But this seems to have changed since the PR below: https://github.com/kubernetes/kubernetes/pull/15775
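To illustrate why a shared id matters, here is a minimal toy model in Go. It is only a sketch and not the real election package: `toyElector`, `campaign`, and `countLeaders` are hypothetical names, and the model assumes an elector in which a candidate considers itself leader whenever the value stored at the election key equals its own id.

```go
package main

import "fmt"

// toyElector is a hypothetical in-memory stand-in for an etcd-backed elector.
// Assumption: the first id written to the key holds leadership, and any
// candidate whose id equals the stored value believes it is the leader.
type toyElector struct {
	holder string
}

// campaign takes the key if it is free and reports whether the caller
// considers itself the leader afterwards.
func (e *toyElector) campaign(id string) bool {
	if e.holder == "" {
		e.holder = id
	}
	return e.holder == id
}

// countLeaders runs one campaign per id against a fresh elector and counts
// how many candidates end up believing they are the leader.
func countLeaders(ids []string) int {
	e := &toyElector{}
	n := 0
	for _, id := range ids {
		if e.campaign(id) {
			n++
		}
	}
	return n
}

func main() {
	// Three schedulers that all register with the id from the log line above.
	shared := []string{
		"14732d8d4c8e1382_k8sm-executor",
		"14732d8d4c8e1382_k8sm-executor",
		"14732d8d4c8e1382_k8sm-executor",
	}
	// Three schedulers with distinct (made-up) ids.
	unique := []string{"host-a", "host-b", "host-c"}

	fmt.Println("leaders with shared id:", countLeaders(shared))
	fmt.Println("leaders with unique ids:", countLeaders(unique))
}
```

Under this assumption, the shared id yields three self-declared leaders while distinct ids yield exactly one, which is consistent with the behaviour reported here.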

jdef commented 8 years ago

thanks so much for reporting this!

jdef commented 8 years ago

TODO: as part of this ticket, update the test plan docs to validate scheduler HA isn't broken --> #751

ravilr commented 8 years ago

For our use case of running one scheduler instance on each of three different VMs/hosts, using os.Hostname() as the etcd election key's value has been working fine.

contrib/mesos/pkg/scheduler/service/service.go
-       log.Infof("registering for election at %v with id %v", path, eid.GetValue())
-       go election.Notify(election.NewEtcdMasterElector(etcdClient), path, eid.GetValue(), srv, nil)
+       hostname, err := os.Hostname()
+       if err != nil {
+               log.Fatalf("Failed to get hostname: %v", err)
+       }
+       log.Infof("registering for election at %v with id %v", path, hostname)
+       go election.Notify(election.NewEtcdMasterElector(etcdClient), path, hostname, srv, nil)
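The hostname works as an election id only while each host runs a single scheduler. If several instances could share a host, one possible variation is to append the process id. This is a sketch only: `electionID` is a hypothetical helper and the pid suffix is an assumption, not part of the patch above.

```go
package main

import (
	"fmt"
	"os"
)

// electionID builds a per-instance election id from a hostname and a pid.
// Assumption: the pid suffix is added here for illustration so that two
// schedulers on the same host do not collide; the patch above uses the
// bare hostname.
func electionID(hostname string, pid int) string {
	return fmt.Sprintf("%s-%d", hostname, pid)
}

func main() {
	hostname, err := os.Hostname()
	if err != nil {
		fmt.Fprintf(os.Stderr, "failed to get hostname: %v\n", err)
		os.Exit(1)
	}
	fmt.Println("election id:", electionID(hostname, os.Getpid()))
}
```

One trade-off: a pid-based id changes on every restart, so a restarted scheduler would campaign as a new candidate rather than under its previous id.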
jdef commented 8 years ago

fixed here: https://github.com/kubernetes/kubernetes/pull/21768