Shaked opened this issue 4 years ago
Thanks @Shaked, yes this does make sense, and this is exactly what we would recommend setting up manually.

Regarding timeouts, let me find out what we use internally, as we have never encountered timeout issues due to the load balancer...

One last remark: I think we should also add the trains-prod-example.com suffix as a parameter. Since all prefixes are fixed, it makes sense to export the only part that changes from one deployment to another. What do you think?
> Thanks @Shaked, yes this does make sense, and this is exactly what we would recommend setting up manually.
Great, I'll create a PR for that. I guess it should be under the trains-server-k8s repository, right?
> Regarding timeouts, let me find out what we use internally, as we have never encountered timeout issues due to the load balancer...
Great. Not sure why we faced it, but I added this yesterday:

```yaml
nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
```

I haven't experienced any timeouts yet, but that might well be because I haven't played with it much.
> One last remark: I think we should also add the trains-prod-example.com suffix as a parameter. Since all prefixes are fixed, it makes sense to export the only part that changes from one deployment to another. What do you think?
Yeah, that actually makes a lot of sense, so we can support two different cases: either developers use `ingress.host=trains-prod.example.com`, which will automatically be appended to all three (`app`, `api` and `files`), or, if for some reason they would rather have different hosts, they use `ingress.app_host=trains-prod.example.com`, `ingress.api_host=else-prod.example.com` and `ingress.files_host=something-else-prod.example.com`.

Not sure if the 2nd option is even needed, but I don't mind adding it. What do you think?
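As a rough sketch of the two options (parameter names follow the discussion above, but the layout is illustrative, not the actual chart schema):

```yaml
# Illustrative values.yaml sketch for the two cases discussed above.
ingress:
  # Option 1: a single suffix, automatically prefixed with app./api./files.
  host: trains-prod.example.com
  # Option 2: explicit per-service overrides (would take precedence over `host`):
  # app_host: trains-prod.example.com
  # api_host: else-prod.example.com
  # files_host: something-else-prod.example.com
```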
@bmartinn
I have an update regarding the timeouts. Now I'm seeing other 50x errors, such as 502, 503 (504 disappeared for now).
The nginx LB shows:

```shell
$ ku default logs -f steely-mule-nginx-ingress-controller-74d54f944f-9sxbz --since 20m | grep -v "HTTP/1.1\" 200" | grep -v '.well-known'
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:27 +0000] "GET /v2.1/events.add_batch HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1298 0.000 [trains-apiserver-service-8008] - - - - f863f3c72ec17b139f2074a16f8bff04
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:27 +0000] "GET /v2.1/tasks.get_by_id HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1289 0.000 [trains-apiserver-service-8008] - - - - 5872faf664f5f08d45d2c4a9402f637c
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:35 +0000] "GET /v2.1/events.add_batch HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1298 0.000 [trains-apiserver-service-8008] - - - - 9c1dfc69dca4cfecf6abb150f938c827
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:35 +0000] "GET /v2.1/tasks.get_by_id HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1289 0.000 [trains-apiserver-service-8008] - - - - d09b50fd480c22fafcfa93ecff0f377d
[07/Jan/2020:15:26:39 +0000] TCP 200 0 0 0.000
2020/01/07 15:26:51 [warn] 18318#18318: *137904157 a client request body is buffered to a temporary file /tmp/client-body/0000003046, client: 10.240.0.5, server: api.trains-stage.example.com, request: "GET /v2.1/events.add_batch HTTP/1.1", host: "api.trains-stage.example.com"
```
```shell
$ ku trains logs -f apiserver-75fc489669-x9k76 --since 20m | grep -vi 'returned 200'
[2020-01-07 15:41:49,579] [8] [INFO] [trains.non_responsive_tasks_watchdog] Starting cleanup cycle for running tasks last updated before 2020-01-07 13:41:49.579258
[2020-01-07 15:41:49,581] [8] [INFO] [trains.non_responsive_tasks_watchdog] Done
```

```shell
$ kubectl -n trains get pods
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-75fc489669-x9k76   1/1     Running   11         14d
```

```
[2020-01-07 15:24:15,477] [8] [INFO] [trains.updates] TRAINS-SERVER new version available: upgrade to v0.13.0 is recommended!
[2020-01-07 15:24:16,662] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_by_id in 80ms
[2020-01-07 15:24:18,460] [8] [INFO] [trains.service_repo] Returned 200 for users.get_preferences in 3ms
[2020-01-07 15:24:18,753] [8] [INFO] [trains.service_repo] Returned 200 for tasks.ping in 3ms
[2020-01-07 15:24:18,783] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_by_id in 10ms
[2020-01-07 15:24:18,840] [8] [INFO] [trains.service_repo] Returned 200 for users.get_current_user in 4ms
[2020-01-07 15:24:18,991] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_all_ex in 6ms
[2020-01-07 15:24:19,606] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_all_ex in 2ms
[2020-01-07 15:24:20,551] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_all_ex in 396ms
[2020-01-07 15:24:20,562] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_all_ex in 413ms
/opt/trains/wrapper.sh: line 28: 8 Killed python3 server.py
```
Maybe it's related to the timeouts as well? What am I missing?
Note: the main reason I haven't upgraded to v0.13.0 is because of my previous Azure FlexVolume PR https://github.com/allegroai/trains-server-k8s/pull/2
Thank you!
Hi @Shaked ,
Regarding timeouts, our defaults are also 5min on all three connection types, and it seems stable on our setups...
The 50x error codes are, I think, a byproduct of the pod restarts, which we believe stem from the k8s memory limit configuration. This is why we increased the memory limit in v0.13.0, and to be honest I think we should be more generous with it. I suggest you set it to 500M and check whether the errors/restarts continue.
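A 500M memory limit on the apiserver would look roughly like this (a sketch only; the field paths follow the standard Kubernetes container spec, and the container name is illustrative, not necessarily what the chart uses):

```yaml
# Sketch: raising the apiserver memory limit to 500M, per the suggestion above.
spec:
  containers:
    - name: apiserver   # illustrative name
      resources:
        limits:
          memory: "500M"
```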
p.s.
> Great, I'll create a PR for that. I guess it should be under the trains-server-k8s repository, right?
Yes please :smile:
> Note: the main reason I haven't upgraded to v0.13.0 ...
With the 0.13 release things got delayed, but they promised to get the FlexVolume PR merged in the next couple of days, so I'm hoping you can upgrade soon :)
Hey @bmartinn
> Regarding timeouts, our defaults are also 5min on all three connection types, and it seems stable on our setups...
>
> The 50x error codes are, I think, a byproduct of the pod restarts, which we believe stem from the k8s memory limit configuration. This is why we increased the memory limit in v0.13.0, and to be honest I think we should be more generous with it. I suggest you set it to 500M and check whether the errors/restarts continue.
I'm going to try this ASAP.
> Yes please 😄
PR is available: https://github.com/allegroai/trains-server-k8s/pull/3
> With the 0.13 release things got delayed, but they promised to get the FlexVolume PR merged in the next couple of days, so I'm hoping you can upgrade soon :)
Merged :)
Awesome! I'll make sure we see to it :)
Hey,
In order to connect the services (app, files, api) to our nginx LB, we use an ingress that looks like this:
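(The original YAML appears to have been lost in extraction. Based on the rest of the thread, it was presumably something along these lines; the API version matches what was current in early 2020, `trains-apiserver-service` and port 8008 appear in the logs elsewhere in this thread, while the other service names, hosts and the TLS secret name are illustrative guesses:)

```yaml
# Hypothetical reconstruction of the ingress described above.
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: trains-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt   # integrates with the cert manager
spec:
  tls:
    - hosts:
        - app.trains-prod.example.com
        - api.trains-prod.example.com
        - files.trains-prod.example.com
      secretName: trains-tls   # illustrative secret name
  rules:
    - host: app.trains-prod.example.com
      http:
        paths:
          - backend:
              serviceName: trains-webserver-service   # illustrative name
              servicePort: 80
    - host: api.trains-prod.example.com
      http:
        paths:
          - backend:
              serviceName: trains-apiserver-service   # appears in the nginx logs
              servicePort: 8008
    - host: files.trains-prod.example.com
      http:
        paths:
          - backend:
              serviceName: trains-fileserver-service   # illustrative name
              servicePort: 8081
```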
This integrates with our certificate manager (letsencrypt) as well.
I was thinking, would it make sense to add a PR that supports something like:
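(The proposed snippet is also missing from the extracted text; judging from the parameter names discussed in this thread, it was presumably a chart value along these lines, with the exact layout being an illustrative guess:)

```yaml
# Illustrative sketch of the proposed chart parameter.
ingress:
  enabled: true
  host: trains-prod.example.com   # suffix appended to app./api./files.
```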
Something that would automatically generate the above YAML?
Thank you Shaked
EDIT: It seems it's important to add timeouts, otherwise the nginx LB might sometimes return a 504:
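(The annotations referred to here are presumably the nginx ingress proxy timeout settings, which are quoted elsewhere in this thread:)

```yaml
# nginx ingress proxy timeout annotations (values in seconds)
nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
```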