Aptomi / k8s-app-engine

Application delivery engine for k8s
Apache License 2.0

Need a much better solution for long-running actions such as getting endpoints #332

Closed romangithub1024 closed 6 years ago

romangithub1024 commented 6 years ago

Right now it sometimes takes up to 20 minutes to fully bring up the stack consisting of consul + wordpress + jenkins + prometheus + grafana and to retrieve its endpoints.

There are several problems with this approach:

1) The 15-minute timeout for getting endpoints is very long. Because revisions are applied sequentially, other applies wait until this one completes. That is not acceptable, as "policy apply --wait" will just keep timing out for other users.
2) While the applier waits for endpoints, the connection to k8s sometimes times out and dies.
3) If endpoints are not available (e.g. PVC fails -> pod fails), users submitting subsequent changes to the policy are affected, because the applier gets stuck on every revision trying to re-retrieve the endpoints.

Possibly, we need to separate endpoints from apply and make them part of a "readiness" phase. This definitely requires some thinking.

a-1rv51cf01fgrk-consul-0                                         1/1       Running   0          21m
a-1rv51cf01fgrk-consul-1                                         1/1       Running   0          15m
a-1rv51cf01fgrk-consul-2                                         1/1       Running   0          13m
a-282956ms8shvi-mariadb-66cbbb464c-tx6pl                         1/1       Running   0          21m
a-282956ms8shvi-wordpress-66f948b978-94z42                       1/1       Running   0          21m
a-8f9vjqbag2g40-jenkins-6ccdcdc577-rs7pz                         1/1       Running   0          21m
a-99914m1q44qlo-prometheus-alertmanager-848d97d5fd-88lmq         2/2       Running   0          21m
a-99914m1q44qlo-prometheus-kube-state-metrics-545749c464-n2fpt   1/1       Running   0          21m
a-99914m1q44qlo-prometheus-node-exporter-f62tj                   1/1       Running   0          21m
a-99914m1q44qlo-prometheus-pushgateway-7cf65f5b54-rsnw2          1/1       Running   0          21m
a-99914m1q44qlo-prometheus-server-6ffdf9948f-lhddq               2/2       Running   0          21m
a-fpta50ubo1ojo-grafana-554bbc5585-m7bjt                         1/1       Running   0          21m

romangithub1024 commented 6 years ago

Remove retry.Do() from the endpoints action. If it fails, we will retry it during the next enforcement.