Use a reverse proxy to avoid routes/ingress creation at workspace startup

l0rd commented 5 years ago

Description

In the past 2 years of running Che in production we have seen that OpenShift routes do not always fit our needs (we need to bring up 3 or more routes within a few seconds for every user workspace). The same applies to Kubernetes Ingresses (still in Beta).

Thus the need to investigate alternatives. One proposal is to pre-create one workspace route/ingress for Che server and link it to a reverse proxy that will route all workspace traffic (e.g. re-use JWT proxy, envoy or traefik).

That would allow us to:

limit the number of routes/ingresses Che needs
create the route before a user asks to start a workspace
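
To illustrate the proposal, here is a minimal sketch of a gateway that sits behind one pre-created route/ingress and forwards requests to workspace endpoints by path prefix. It is not the JWT proxy, Envoy or Traefik; the `Gateway` type, the prefixes and the backend addresses are hypothetical and only show the routing idea, using Go's standard `net/http/httputil` reverse proxy.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
	"sync"
)

// Gateway maps a URL path prefix (one per workspace endpoint) to a backend.
// Hypothetical type, for illustration only.
type Gateway struct {
	mu     sync.RWMutex
	routes map[string]*url.URL
}

func NewGateway() *Gateway {
	return &Gateway{routes: make(map[string]*url.URL)}
}

// AddRoute registers a backend for a prefix such as "/ws1/theia".
// Routes can be added while the gateway is serving; no new ingress is needed.
func (g *Gateway) AddRoute(prefix, backend string) error {
	u, err := url.Parse(backend)
	if err != nil {
		return err
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	g.routes[prefix] = u
	return nil
}

// Match returns the longest registered prefix that matches path.
func (g *Gateway) Match(path string) (string, *url.URL, bool) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	best := ""
	for p := range g.routes {
		if (path == p || strings.HasPrefix(path, p+"/")) && len(p) > len(best) {
			best = p
		}
	}
	if best == "" {
		return "", nil, false
	}
	return best, g.routes[best], true
}

// ServeHTTP strips the matched prefix and proxies the request to the backend.
func (g *Gateway) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	prefix, target, ok := g.Match(r.URL.Path)
	if !ok {
		http.NotFound(w, r)
		return
	}
	out := r.Clone(r.Context())
	out.URL.Path = strings.TrimPrefix(r.URL.Path, prefix)
	if out.URL.Path == "" {
		out.URL.Path = "/"
	}
	out.URL.RawPath = ""
	httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, out)
}

func main() {
	g := NewGateway()
	// Hypothetical workspace endpoints.
	if err := g.AddRoute("/ws1/theia", "http://10.0.0.5:3100"); err != nil {
		fmt.Println("bad route:", err)
		return
	}
	prefix, target, ok := g.Match("/ws1/theia/index.html")
	fmt.Println(prefix, target, ok)
	// To actually serve traffic: http.ListenAndServe(":8080", g)
}
```

With this shape, starting a workspace only means adding entries to the routing table of an already exposed gateway, which is the property the two points above rely on.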

If this approach is validated we could divide the work into 4 steps:

[ ] Analyse whether the existing JWT Proxy would satisfy our needs
[ ] All workspaces inbound and outbound traffic should be routed through the JWT Proxy
[ ] Share one JWT Proxy instance amongst all the workspaces of a single user
[ ] Share one JWT Proxy instance amongst all the Che users of a Kube cluster

UPDATE: Another important use case is associated with this issue: allow running Che workspaces on OpenShift (where single-host strategy is not available) when using wildcard SSL certificates is not possible.

Implementation

[x] [Study] Possible implementations for single-host strategy on OpenShift https://github.com/eclipse/che/issues/16702
[ ] Implement a single host version of the happy-path test https://github.com/eclipse/che/issues/16842
- [x] ~~Reuse openshift-router~~ - Impossible
- [x] POC for Traefik - https://github.com/metlos/che-singlehost-poc
- [x] POC for Traefik using CRDs - https://github.com/skabashnyuk/openshift-traefik
- [x] POC for HAProxy - https://github.com/sparkoo/che-singlehost-haproxy-POC
- [x] POC for nginx - #16883
- [x] Investigate a controller able to sync configmaps and send a signal to a gateway process - https://github.com/metlos/cm-bump
- [x] Investigate the best way for hot-reloading the configuration in gateway
- [x] External controller - #16886
- [x] In-container controller - #16887
- [x] ~~Sidecar controller~~ - not working on default OpenShift 3.x due to shared process namespace being switched off by default
- [x] Testable POCs
- [x] Traefik - #16888
- [x] nginx - should be part of #16883
- [x] HAProxy - the tests are being implemented using this POC
- [x] Envoy - #17182
- [x] Do performance tests - #16889
- [x] Run perf tests for Envoy #17243
- [x] Che Server implementation
- [x] Enable singlehost strategy to not be ingress-bound #17059
- [x] Implement gateway-based singlehost strategy for Kubernetes #17060
- [x] Implement gateway-based singlehost strategy for OpenShift #17061
- [ ] ~Implement cookie path rewriting in Traefik :question: #17062~
- [x] Finalize our config-fetching controller #17063
- [x] Update Helm chart to support gateway-based singlehost mode #17064
- [x] Update the che-operator to support gateway-based singlehost mode #17065
- [x] Document how to deploy gateway singlehost with Chectl #17525
- [x] Keycloak behind the gateway #17809
- [x] Ensure Configubump tool has all necessary CQs open | https://github.com/eclipse/che/issues/17568
- [x] Configubump tool produces incorrect configuration in some cases | https://github.com/eclipse/che/issues/17567

l0rd commented 5 years ago

cc @gorkem

che-bot commented 5 years ago

Issues go stale after 180 days of inactivity. lifecycle/stale issues rot after an additional 7 days of inactivity and eventually close.

Mark the issue as fresh with /remove-lifecycle stale in a new comment.

If this issue is safe to close now please do so.

Moderators: Add lifecycle/frozen label to avoid stale mode.

skabashnyuk commented 5 years ago

/remove-lifecycle stale

benoitf commented 5 years ago

Should it go into a backlog?

l0rd commented 5 years ago

We should put it into a backlog during prioritization.

che-bot commented 4 years ago

Issues go stale after 180 days of inactivity. lifecycle/stale issues rot after an additional 7 days of inactivity and eventually close.

Mark the issue as fresh with /remove-lifecycle stale in a new comment.

If this issue is safe to close now please do so.

Moderators: Add lifecycle/frozen label to avoid stale mode.

l0rd commented 4 years ago

/remove-lifecycle stale

metlos commented 4 years ago

We've started implementing the performance tests for the individual POCs. Take a look at https://github.com/che-incubator/che-gateway-poc.

metlos commented 4 years ago

In the above-mentioned POC repository, we now have 3 POCs implemented:

haproxy-scripted - this is a vanilla haproxy image operated by oc commands from the test scripts
nginx-custom-image - this is a custom image using our cm-bump utility and nginxinc/nginx-unprivileged official image of nginx
traefik-sidecar - this is a combination of traefik and cm-bump (we don't require a custom image in this case because traefik can watch for config changes on its own)

We're working on haproxy-custom-image which is very similar to nginx-custom-image only with haproxy as the gateway solution. This is to be able to quantify the effect of a custom controller vs externally executed commands.
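
For readers unfamiliar with the custom-controller approach, the following is a rough sketch of the general pattern: detect that the gateway configuration changed and signal the gateway process to reload. It is not the cm-bump code. As simplifying assumptions it polls a local directory instead of watching ConfigMaps through the Kubernetes API, and it sends SIGHUP, which nginx treats as a reload request; other gateways may expect a different signal. All names (`Watcher`, `readConfigDir`, `signalReload`) are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"syscall"
	"time"
)

// digest hashes file names and contents in a stable (sorted) order.
func digest(files map[string][]byte) string {
	names := make([]string, 0, len(files))
	for n := range files {
		names = append(names, n)
	}
	sort.Strings(names)
	h := sha256.New()
	for _, n := range names {
		// Length prefixes keep name/content boundaries unambiguous.
		fmt.Fprintf(h, "%d:%s%d:", len(n), n, len(files[n]))
		h.Write(files[n])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// readConfigDir loads all regular files directly inside dir.
func readConfigDir(dir string) (map[string][]byte, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	files := make(map[string][]byte)
	for _, e := range entries {
		path := filepath.Join(dir, e.Name())
		info, err := os.Stat(path) // follows symlinks
		if err != nil {
			return nil, err
		}
		if !info.Mode().IsRegular() {
			continue
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return nil, err
		}
		files[e.Name()] = data
	}
	return files, nil
}

// Watcher calls Reload whenever the snapshot digest changes.
type Watcher struct {
	Snapshot func() (map[string][]byte, error)
	Reload   func() error
	last     string
}

// Check reports whether a change was detected and Reload was invoked.
// The first call always triggers a reload because no digest is known yet.
func (w *Watcher) Check() (bool, error) {
	files, err := w.Snapshot()
	if err != nil {
		return false, err
	}
	d := digest(files)
	if d == w.last {
		return false, nil
	}
	if err := w.Reload(); err != nil {
		return false, err
	}
	w.last = d
	return true, nil
}

// signalReload returns a Reload function that sends SIGHUP to pid.
func signalReload(pid int) func() error {
	return func() error {
		p, err := os.FindProcess(pid)
		if err != nil {
			return err
		}
		return p.Signal(syscall.SIGHUP)
	}
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: watcher <config-dir> <gateway-pid>")
		return
	}
	pid, err := strconv.Atoi(os.Args[2])
	if err != nil {
		fmt.Println("invalid pid:", err)
		return
	}
	w := &Watcher{
		Snapshot: func() (map[string][]byte, error) { return readConfigDir(os.Args[1]) },
		Reload:   signalReload(pid),
	}
	for {
		if changed, err := w.Check(); err != nil {
			fmt.Println("check failed:", err)
		} else if changed {
			fmt.Println("configuration changed, gateway signalled")
		}
		time.Sleep(2 * time.Second)
	}
}
```

Signalling another process by PID requires the controller to see the gateway's process, so it either runs in the same container or needs a shared process namespace.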

We're also working on the test suite. We're developing a number of load test scenarios (https://github.com/che-incubator/che-gateway-poc/tree/master/test#testcases). We have not yet started the websocket and cookie handling tests; we will start them once the haproxy-custom-image POC is implemented.

skabashnyuk commented 4 years ago

@l0rd one of the POCs that we have is CR-based Traefik: https://github.com/skabashnyuk/openshift-traefik. @metlos raised a concern about the cluster roles and cluster role bindings it needs: https://github.com/skabashnyuk/openshift-traefik/blob/master/001-rbac.yaml. How big a problem is this for us? Can we afford to require Traefik plus all the roles necessary to read CRs for this feature? WDYT? CC @benoitf @davidfestal

metlos commented 4 years ago

My main concern there is that a) we're creating a pod with cluster-wide permissions and b) we're creating cluster-wide "generic" CRDs (i.e. Traefik-specified CRDs like IngressRoute) that are only meant for our usage.

In other words, with the Traefik CRDs we're creating a new routing facility for the whole cluster, not just for our usage.

l0rd commented 4 years ago

For the cluster-wide permission that's ok imo under 2 conditions: it should be optional (i.e. if you do not have enough privileges you can still use Che but you need to stick with multi-host) and it should be deployed via a separate operator (so that Che Operator won't need extra privileges).

For IngressRoute, isn't it possible to use Ingress with Traefik-specific annotations instead?

jfaltermeier commented 4 years ago

Hi, sorry for side-tracking a bit. I have a question about the scope of this ticket. When looking at workspace startup times recently I noticed that a lot of time is taken between the ingress creation and its update. Once the ingress is available the workspace is then scaled up.

Would this ticket help to avoid waiting for the ingress to update because it is pre-created already?

sparkoo commented 4 years ago

@jfaltermeier as a side effect, yes, it will help. We will have only one Ingress for the whole of Che and will do the routing to workspaces ourselves, so we can do it more effectively than the cluster.

metlos commented 4 years ago

We have concluded the performance tests. I have created a number of subtasks to guide us through the implementation and referenced them in the description of this epic.

We have not yet chosen the gateway solution though, because there was no clear winner. I have sent out an email to the Che-dev mailing list detailing our current thinking and progress.

metlos commented 4 years ago

Note that we have concluded our testing of the candidate solutions for a reverse proxy. We chose Traefik and will commence the implementation with #17063 - making our Rust-based POC a fully maintained controller written in Go.

To read more about the selection process and reasoning behind the choice of Traefik, please read through https://www.eclipse.org/lists/che-dev/msg03828.html
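
As a rough illustration of what a config-producing controller has to emit for Traefik, the sketch below renders a routing table as a dynamic configuration file. This is not the controller from #17063. It assumes Traefik v2's file provider layout (`http.routers`, `http.middlewares` with `stripPrefix`, `http.services` with `loadBalancer`), and the `Route` type, the names and the addresses are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Route describes one workspace endpoint exposed through the gateway.
// Hypothetical type, for illustration only.
type Route struct {
	Name       string // unique router/service name, e.g. "ws1-theia"
	PathPrefix string // public path prefix, e.g. "/ws1/theia"
	BackendURL string // in-cluster address of the endpoint
}

// renderTraefikConfig renders routes as a Traefik dynamic configuration
// (assumed v2 file provider layout). Output is sorted by name so that
// unchanged routes always produce identical files. Assumes at least one route.
func renderTraefikConfig(routes []Route) string {
	sorted := append([]Route(nil), routes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	var b strings.Builder
	b.WriteString("http:\n")
	b.WriteString("  routers:\n")
	for _, r := range sorted {
		fmt.Fprintf(&b, "    %s:\n", r.Name)
		fmt.Fprintf(&b, "      rule: \"PathPrefix(`%s`)\"\n", r.PathPrefix)
		fmt.Fprintf(&b, "      service: %s\n", r.Name)
		fmt.Fprintf(&b, "      middlewares: [%s-strip]\n", r.Name)
	}
	b.WriteString("  middlewares:\n")
	for _, r := range sorted {
		// Strip the public prefix before the request reaches the workspace.
		fmt.Fprintf(&b, "    %s-strip:\n", r.Name)
		b.WriteString("      stripPrefix:\n")
		fmt.Fprintf(&b, "        prefixes: [\"%s\"]\n", r.PathPrefix)
	}
	b.WriteString("  services:\n")
	for _, r := range sorted {
		fmt.Fprintf(&b, "    %s:\n", r.Name)
		b.WriteString("      loadBalancer:\n")
		b.WriteString("        servers:\n")
		fmt.Fprintf(&b, "          - url: \"%s\"\n", r.BackendURL)
	}
	return b.String()
}

func main() {
	fmt.Print(renderTraefikConfig([]Route{
		{Name: "ws1-theia", PathPrefix: "/ws1/theia", BackendURL: "http://10.0.0.5:3100"},
	}))
}
```

Writing such a file into a directory that Traefik watches is enough for the routes to take effect, which matches the earlier note that Traefik can pick up config changes on its own.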

sparkoo commented 4 years ago

All issues in the scope of this epic are closed. Related single-host issues will be solved separately.

eclipse-che / che

Use a reverse proxy to avoid routes/ingress creation at workspace startup #12914
