cloudfoundry / cf-k8s-networking

building a cloud foundry without gorouter....
Apache License 2.0

Add ADR proposal for handling stopped app routing #46

Closed christianang closed 4 years ago

christianang commented 4 years ago

Summary of changes

Add an ADR proposal for how to handle routing to stopped applications. No decision has been made yet. We are opening this PR for comments on the best way forward to solve this problem.

Link to ADR

Related Issue(s)

https://github.com/cloudfoundry/cf-for-k8s/issues/158

Additional Context

https://www.pivotaltracker.com/story/show/172504850

@cloudfoundry/cf-networking @cloudfoundry/cf-capi

cf-gitbot commented 4 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/173088152

The labels on this github issue will be updated when the story is started.

rosenhouse commented 4 years ago

Thanks all. This one is tricky, and I learned a lot reading through this discussion.

I'm supportive of the idea of starting with Proposal 1 (where the Route is more of an intermediate representation of app state), and iterating towards a model where it is fully declarative of high-level intent, once App status is easier to discover from the k8s api.

rosenhouse commented 4 years ago

Though we said crashed apps are out of scope here, won't we be facing a similar regression there as well?

Suppose someone has Route R with destination apps A and B. A is healthy. B is a broken build, and all of its instances crash on startup.

christianang commented 4 years ago

> Though we said crashed apps are out of scope here, won't we be facing a similar regression there as well?

Possibly. I've been assuming that crashed pods behave the same way as non-existent pods w.r.t. a Service. However, reading about readiness probes makes me believe they may act differently. It's probably worth some experimenting to see whether they actually behave the same.

tcdowney commented 4 years ago

One thing for any @cloudfoundry/cf-capi folk reviewing this: I think if we go through with Proposal 1 it might be worth revisiting the decision to defer the CCDB->k8s Route bulk sync work.

Proposal 1 will require Cloud Controller to update Kubernetes Route resources more often: whenever a Cloud Controller Process has a state transition. Previously, they would only have been updated by calls that explicitly dealt with Cloud Controller Routes (e.g. `cf create-route`, `cf delete-route`, `cf map-route`, `cf unmap-route`, and server-side manifests). Now we also have to consider `cf restart`, `cf stop`, `cf start`, `cf push`, etc., since they all tweak Process state.
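For context, here is a rough sketch of a Route resource (fields approximate the cf-k8s-networking Route CRD; all GUIDs and names are made up). Under Proposal 1, Cloud Controller would have to rewrite spec.destinations on every Process state transition, not just on explicit route operations:

```yaml
apiVersion: networking.cloudfoundry.org/v1alpha1
kind: Route
metadata:
  name: route-guid-1234             # made-up GUID
  namespace: cf-workloads
spec:
  host: my-app
  domain:
    name: apps.example.com
    internal: false
  # Under Proposal 1 this list would also change on cf start/stop/
  # restart/push, not just on cf map-route / unmap-route.
  destinations:
  - guid: destination-guid-5678     # made-up GUID
    port: 8080
    app:
      guid: app-guid-9012           # made-up GUID
      process:
        type: web
```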

Also, since the obvious way of implementing this in Cloud Controller involves the ProcessObserver, which gets invoked in "After Commit" hooks, I think it's more likely we'll see cases where CCDB updates have been made but the Kubernetes update has failed. A lot of the implementation decisions in that part of the Cloud Controller code were made with the assumption that the Cloud Controller/Diego bulk sync would make things eventually consistent.

tl;dr I think we're going to encounter more Route consistency issues and will need the bulk sync.

christianang commented 4 years ago

> Though we said crashed apps are out of scope here, won't we be facing a similar regression there as well?

Back to this question. I ran a couple experiments to understand how crashed apps are handled at the moment.

Scenario: Pushed two apps mapped to the same route, then crashed one app.
Result: Intermittent 503s.
Reason: The VirtualService lists two Services as destinations, and it balances traffic between them even when one Service can only return 503s.
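For illustration, a rough sketch of the kind of VirtualService involved in the first scenario (names are made up; the real resource is generated and will differ in detail):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-route-vs                 # hypothetical name
spec:
  hosts:
  - my-route.apps.example.com       # hypothetical route FQDN
  http:
  - route:
    # Envoy picks one of these weighted destinations *before* it looks
    # at endpoint health, so roughly half of all requests still go to
    # app B's Service while every one of B's pods is crashing.
    - destination:
        host: s-app-a               # Service fronting app A (healthy)
      weight: 50
    - destination:
        host: s-app-b               # Service fronting app B (crashed)
      weight: 50
```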

Scenario: Pushed a single app with one route and multiple instances, then crashed one app instance.
Result: No 503s, only 200s.
Reason: The VirtualService has a single Service as its destination, so it can take advantage of readiness probes on the pods.
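And a minimal readiness probe sketch for the second scenario (image and health endpoint are made up): a pod whose probe fails is marked NotReady and dropped from the Service's endpoints, so the single destination only load-balances across healthy instances.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-instance-0           # hypothetical
  labels:
    app: my-app                     # matched by the Service's selector
spec:
  containers:
  - name: app
    image: example/my-app:latest    # hypothetical image
    ports:
    - containerPort: 8080
    # While this probe fails (e.g. after a crash), the pod is removed
    # from the Service's endpoints and receives no traffic.
    readinessProbe:
      httpGet:
        path: /healthz              # hypothetical health endpoint
        port: 8080
      periodSeconds: 5
      failureThreshold: 1
```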

So this is also a problem that will need addressing, and it wouldn't be solved by anything proposed here so far. Perhaps routing to crashed apps and routing to stopped apps can be solved in the same way, but I'm not sure at the moment. However, I agree with @jenspinney that it's probably worth solving crashed apps first, as that solution may solve, or at least shape, how we solve routing to stopped apps.

XanderStrike commented 4 years ago

We're gonna call this blocked on #173159123

XanderStrike commented 4 years ago

We're gonna go to GA with https://github.com/cloudfoundry/cf-for-k8s/issues/158 as a known issue and keep our ears open for feedback from users on whether and how they want this handled.

cf-gitbot commented 4 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/173538495

The labels on this github issue will be updated when the story is started.

loewenstein commented 4 years ago

Did we consider using Envoy's health checking mechanism? There is an open Istio issue with a proposed solution, although most of that issue focuses on external services.

BTW, why doesn't Istio's passive health checking logic help in this case? I'd expect that not routing traffic to pods that are unhealthy, or that don't even exist, would be covered by default.
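For reference, passive health checking here would be configured through outlier detection on a DestinationRule, something like this (Service host and name are hypothetical):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: app-b-outlier-detection     # hypothetical name
spec:
  host: s-app-b.cf-workloads.svc.cluster.local   # hypothetical Service
  trafficPolicy:
    outlierDetection:
      # Eject an endpoint from the load-balancing pool once it returns
      # five consecutive 5xx responses; re-admit it after 30s.
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
```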

XanderStrike commented 4 years ago

Envoy's health checking can reject individual endpoints in a cluster for being unhealthy, but it cannot reject an entire cluster and redistribute its traffic within a weighted_clusters set. Envoy chooses the cluster before checking which endpoints in that cluster are available. For this to work for us, we'd have to contribute a change to Envoy that flips that ordering, and I'm not sure they'd go for it. From their perspective, asking Envoy to route to a cluster with no available endpoints is more of a misconfiguration than something the proxy should be expected to solve.
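To sketch the ordering problem in Envoy's route configuration (abbreviated; cluster names hypothetical):

```yaml
# Abbreviated Envoy route config. The router resolves weighted_clusters
# first and only then picks an endpoint inside the chosen cluster.
# Outlier detection ejects endpoints *within* a cluster; it never
# removes a cluster from the weighted set below.
virtual_hosts:
- name: my-route
  domains: ["my-route.apps.example.com"]
  routes:
  - match: { prefix: "/" }
    route:
      weighted_clusters:
        clusters:
        - name: app-a-cluster       # endpoints healthy
          weight: 50
        - name: app-b-cluster       # all endpoints ejected -> 503s
          weight: 50
```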

KauzClay commented 4 years ago

We are going to close this PR, as we believe the proposed solution is outdated. Based on the comments in this story, we believe this isn't an immediate priority either.

We can either make a new PR or reopen this one when we get to it in the future.