Subdomains and ingress - Githubissues

Shaked commented 4 years ago

Hey,

In order to connect between the services (app, files, api) and our nginx LB, we use an ingress that looks like this:

# Source: trains/templates/ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: release-name-trains
  labels:
    app.kubernetes.io/name: trains
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/managed-by: Tiller
  annotations:
    certmanager.k8s.io/cluster-issuer: letsencrypt-prod
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "180"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "180"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "180"

spec:
  tls:
    - hosts:
        - "app.trains-prod.example.com"
        - "files.trains-prod.example.com"
        - "api.trains-prod.example.com"
      secretName: tls-secret-prod.app.trains-prod-example.com
  rules:
    - host: "app.trains-prod.example.com"
      http:
        paths:
          - path: /
            backend:
              serviceName: webserver-service
              servicePort: 80
    - host: "api.trains-prod.example.com"
      http:
        paths:
          - path: /
            backend:
              serviceName: apiserver-service
              servicePort: 8008
    - host: "files.trains-prod.example.com"
      http:
        paths:
          - path: /
            backend:
              serviceName: fileserver-service
              servicePort: 8081

This integrates with our certificate manager (letsencrypt) as well.

I was thinking, would it make sense to add a PR that supports something like:

helm install.... --set ingress.enabled=true --set ingress.annotations=THE_ANNOTATION --set ingress.app_host=app.trains-prod.example.com  --set ingress.files_host=files.trains-prod.example.com --set ingress.api_host=api.trains-prod.example.com

Which will automatically support the above YAML?

Thank you Shaked

EDIT:

It seems like it's important to add timeouts otherwise nginx LB might return a 504 timeout sometimes:

bmartinn commented 4 years ago

Thanks @Shaked , yes this does makes sense, and this is exactly what we would recommend setting up manually.

Regrading timeouts, let me find out what we use internally, as we have never encountered timeout issues due to the load balancer...

One last remark, I think we should also add the trains-prod-example.com suffix as a parameter, since all prefixes are fixed, it makes sense to export the only part that changes form one deployment to another.

What do you think?

Shaked commented 4 years ago

Thanks @Shaked , yes this does makes sense, and this is exactly what we would recommend setting up manually.

Great, I'll create a PR for that. I guess it should be under the trains-server-k8s repository, right?

Regrading timeouts, let me find out what we use internally, as we have never encountered timeout issues due to the load balancer...

Great. Not sure why we faced it, but I added this yesterday:

    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"

I haven't experienced any timeouts yet, but might as well be because I didn't played with it much.

One last remark, I think we should also add the trains-prod-example.com suffix as a parameter, since all prefixes are fixed, it makes sense to export the only part that changes form one deployment to another. What do you think?

Yea that makes a lot of sense actually, so we can support 2 different cases:

Either developers could use ingress.host=trains-prod.example.com which will automatically be appended to all 3 app, api and files or if they, for some reason, would rather have different hosts, they could use ingress.app_host=trains-prod.example.com, ingress.api_host=else-prod.example.com and ingress.files_host=something-else-prod.example.com

Not sure if the 2nd option is even needed, but I don't mind to add it.

What do you think?

Shaked commented 4 years ago

@bmartinn

I have an update regarding the timeouts. Now I'm seeing other 50x errors, such as 502, 503 (504 disappeared for now).

Looking into `nginx LB` shows:

ku default logs -f steely-mule-nginx-ingress-controller-74d54f944f-9sxbz --since 20m | grep -v "HTTP/1.1\" 200" | grep -v '.well-known'

10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:27 +0000] "GET /v2.1/events.add_batch HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1298 0.000 [trains-apiserver-service-8008] - - - - f863f3c72ec17b139f2074a16f8bff04
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:27 +0000] "GET /v2.1/tasks.get_by_id HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1289 0.000 [trains-apiserver-service-8008] - - - - 5872faf664f5f08d45d2c4a9402f637c
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:35 +0000] "GET /v2.1/events.add_batch HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1298 0.000 [trains-apiserver-service-8008] - - - - 9c1dfc69dca4cfecf6abb150f938c827
10.240.0.5 - [10.240.0.5] - - [07/Jan/2020:15:26:35 +0000] "GET /v2.1/tasks.get_by_id HTTP/1.1" 503 198 "-" "python-requests/2.22.0" 1289 0.000 [trains-apiserver-service-8008] - - - - d09b50fd480c22fafcfa93ecff0f377d
[07/Jan/2020:15:26:39 +0000]TCP200000.000
2020/01/07 15:26:51 [warn] 18318#18318: *137904157 a client request body is buffered to a temporary file /tmp/client-body/0000003046, client: 10.240.0.5, server: api.trains-stage.example.com, request: "GET /v2.1/events.add_batch HTTP/1.1", host: "api.trains-stage.example.com"

ku trains logs -f apiserver-75fc489669-x9k76 --since 20m | grep -vi 'returned 200'
[2020-01-07 15:41:49,579] [8] [INFO] [trains.non_responsive_tasks_watchdog] Starting cleanup cycle for running tasks last updated before 2020-01-07 13:41:49.579258
[2020-01-07 15:41:49,581] [8] [INFO] [trains.non_responsive_tasks_watchdog] Done

API server failed and restarted 11 times

kubectl -n trains get pods
NAME                                    READY   STATUS      RESTARTS   AGE
apiserver-75fc489669-x9k76              1/1     Running     11         14d

Using --previous

[2020-01-07 15:24:15,477] [8] [INFO] [trains.updates] TRAINS-SERVER new version available: upgrade to v0.13.0 is recommended!
[2020-01-07 15:24:16,662] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_by_id in 80ms
[2020-01-07 15:24:18,460] [8] [INFO] [trains.service_repo] Returned 200 for users.get_preferences in 3ms
[2020-01-07 15:24:18,753] [8] [INFO] [trains.service_repo] Returned 200 for tasks.ping in 3ms
[2020-01-07 15:24:18,783] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_by_id in 10ms
[2020-01-07 15:24:18,840] [8] [INFO] [trains.service_repo] Returned 200 for users.get_current_user in 4ms
[2020-01-07 15:24:18,991] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_all_ex in 6ms
[2020-01-07 15:24:19,606] [8] [INFO] [trains.service_repo] Returned 200 for projects.get_all_ex in 2ms
[2020-01-07 15:24:20,551] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_all_ex in 396ms
[2020-01-07 15:24:20,562] [8] [INFO] [trains.service_repo] Returned 200 for tasks.get_all_ex in 413ms
/opt/trains/wrapper.sh: line 28:     8 Killed                  python3 server.py

Maybe it's related to the timeouts as well? What am I missing?

Note: the main reason I haven't upgraded to v0.13.0 is because of my previous Azure FlexVolume PR https://github.com/allegroai/trains-server-k8s/pull/2

Thank you!

bmartinn commented 4 years ago

Hi @Shaked ,

Regarding timeouts, our defaults are also 5min on all three connection types, and it seems stable on our setups...

The 50x error codes, I think, are a byproduct of the pod restarts, which we think are derived from k8s memory limit configuration. This is why on v0.13.0 we increased the memory limit, and to be honest I think we should be more generous with that. I suggest you set it at 500M and check if the errors/restarts continue.

p.s.

Great, I'll create a PR for that. I guess it should be under the trains-server-k8s repository, right?

Yes please :smile:

Note: the main reason I haven't upgraded to v0.13.0 ...

With the 0.13 release things got delayed, but they promised to get the FlexVolume PR merged in the next couple of days, so I'm hoping you can upgrade soon :)

Shaked commented 4 years ago

Hey @bmartinn

Regarding timeouts, our defaults are also 5min on all three connection types, and it seems stable on our setups...

The 50x error codes, I think, are a byproduct of the pod restarts, which we think are derived from k8s memory limit configuration. This is why on v0.13.0 we increased the memory limit, and to be honest I think we should be more generous with that. I suggest you set it at 500M and check if the errors/restarts continue.

I'm going to try this ASAP.

Yes please 😄

PR is available https://github.com/allegroai/trains-server-k8s/pull/3

With the 0.13 release things got delayed, but they promised to get the FlexVolume PR merged in the next couple of days, so I'm hoping you can upgrade soon :)

Merged :)

bmartinn commented 4 years ago

Awesome! I'll make sure we see to it :)

allegroai / clearml-server-helm

Subdomains and ingress #3

Looking into `nginx LB` shows:

API server failed and restarted 11 times

Using --previous

allegroai / clearml-server-helm

Subdomains and ingress #3

Looking into nginx LB shows:

API server failed and restarted 11 times

Using --previous

Looking into `nginx LB` shows: