GoogleCloudPlatform / k8s-multicluster-ingress

kubemci: Command line tool to configure L7 load balancers using multiple kubernetes clusters
Apache License 2.0

Availability when killing all pods in a cluster #228

Closed jonatanwulcan closed 4 years ago

jonatanwulcan commented 4 years ago

Hey, I'm playing around with kubemci to figure out if it's a good match for the product I'm currently working on. I tried the zone-printer demo and then manually deleted the pod that was running in the cluster closest to me.

The result was that the service went down until the pod had restarted. Is this expected behaviour? I was hoping that the traffic would fail over to another cluster.

nikhiljindal commented 4 years ago

Yes, traffic should fail over to another cluster. Maybe the pod was restarted before GCLB detected that the pod was down?

Can you try changing the health check configuration so that it detects failures faster? You cannot use kubemci to modify it, but you can use gcloud or the Google Cloud Console directly to update the health check created by kubemci. https://github.com/GoogleCloudPlatform/k8s-multicluster-ingress/issues/135 has some relevant discussion about this.
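
As a rough illustration of the gcloud route, the sketch below only builds and prints a `gcloud compute health-checks update` command; it does not run it. The health check name `my-kubemci-health-check`, the `http` protocol, and the threshold values are assumptions for illustration. Look up the real name and type of the health check kubemci created before running anything.

```python
import shlex


def build_update_command(name, interval_s, timeout_s, unhealthy, healthy, protocol="http"):
    """Build (but do not run) a gcloud command that tightens a health check.

    `name` and `protocol` are placeholders: use the name and type of the
    health check that kubemci actually created in your project.
    """
    # The Compute Engine API rejects a timeout longer than the check interval.
    if timeout_s > interval_s:
        raise ValueError("timeout must not exceed the check interval")
    return [
        "gcloud", "compute", "health-checks", "update", protocol, name,
        f"--check-interval={interval_s}s",
        f"--timeout={timeout_s}s",
        f"--unhealthy-threshold={unhealthy}",
        f"--healthy-threshold={healthy}",
    ]


# Hypothetical name and example values; adjust to your own setup.
cmd = build_update_command("my-kubemci-health-check", 5, 5, 1, 1)
print(shlex.join(cmd))
```

Printing the command first makes it easy to review before pasting it into a shell.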

Many customers run multiple replicas in their cluster to mitigate this issue. Setting up Cluster autoscaling and Pod autoscaling will help as well.

jonatanwulcan commented 4 years ago

Thanks for your reply Nikhil. I'll look into updating the health check configuration and I'll report back if this solves the problem.

How fast can I expect failover to happen when a cluster goes down?

Also, I was wondering about cluster autoscaling and kubemci. Since you're recommending it, I suppose it's supported. How fast will GCLB discover new nodes added to the cluster by the autoscaler?

jonatanwulcan commented 4 years ago

I tried changing the health check configuration. I set it to a 5s check interval and a 5s timeout, with an unhealthy threshold of 1 consecutive failure and a healthy threshold of 1 consecutive success.

For others reading this: you can find the health check configuration in the Google Cloud Console under Compute Engine -> Health Checks.
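
To get a feel for what these settings mean for detection time, here is a back-of-the-envelope sketch. It assumes a simplified model that is not taken from the GCLB documentation: a single prober, a failure that happens right after a successful probe, and every failing probe taking the full timeout. It ignores the time the load balancer needs to act on the result, so treat the number as a rough estimate only. The function name is made up for this example.

```python
def worst_case_detection_seconds(interval_s, timeout_s, unhealthy_threshold):
    """Rough upper bound on the time until a backend is marked unhealthy.

    Simplified model (an assumption, not documented GCLB behaviour):
    the backend fails just after a successful probe, so up to
    `unhealthy_threshold` intervals pass before the last required probe
    starts, and that probe takes the full timeout to fail.
    """
    return unhealthy_threshold * interval_s + timeout_s


# Settings from this comment: 5s interval, 5s timeout, fail on 1 probe.
print(worst_case_detection_seconds(5, 5, 1))
```

Under this model, lowering the unhealthy threshold or the interval shortens detection, at the cost of more probe traffic and a higher chance of marking a briefly slow backend as unhealthy.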

Works just as expected now! Thanks for the help!