DataONEorg / k8s-cluster

Documentation on the DataONE Kubernetes cluster

Config k8s for ports 80/443 #16

Closed · gothub closed this issue 2 years ago

gothub commented 2 years ago

This ticket is a continuation of the discussion from https://github.com/DataONEorg/gnis-deployment/issues/5.

The NGINX Ingress Controller will be configured and tested to expose ports 80/443 to external traffic (external to the cluster).

gothub commented 2 years ago

The NGINX Ingress Controller was installed with Helm and the hello-kubernetes app was started on the dev k8s cluster, after the existing NGINX install was removed.

This test replicated the test @mbjones did with Docker Desktop, as described in the GNIS repo issue mentioned above.
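
For reference, this was essentially the stock chart install into the default namespace (which is why the controller appears under default below); a minimal sketch of that form, not necessarily the exact command used:

helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace default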

Starting NGINX and the app in this way made the NGINX Ingress Controller available as a Service of type 'LoadBalancer' on ports 80:31199/TCP,443:31964/TCP, as shown by the kubectl get pods,services command:

NAMESPACE          NAME                                            READY   STATUS    RESTARTS          AGE     IP               NODE                NOMINATED NODE   READINESS GATES
default            pod/ingress-nginx-controller-748d8ff6c7-m4sz7   1/1     Running   0                 4m28s   192.168.3.138    docker-dev-ucsb-1   <none>           <none>
hello-kubernetes   pod/hello-deployment-77975745cc-9sq7l           1/1     Running   0                 2m13s   192.168.50.157   docker-dev-ucsb-2   <none>           <none>
hello-kubernetes   pod/hello-deployment-77975745cc-vlbqv           1/1     Running   0                 2m13s   192.168.3.178    docker-dev-ucsb-1   <none>           <none>
hello-kubernetes   pod/hello-deployment-77975745cc-whrt5           1/1     Running   0                 2m13s   192.168.50.186   docker-dev-ucsb-2   <none>           <none>

NAMESPACE          NAME                                         TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE     SELECTOR
default            service/ingress-nginx-controller             LoadBalancer   10.106.90.173    <pending>     80:31199/TCP,443:31964/TCP   4m29s   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
default            service/ingress-nginx-controller-admission   ClusterIP      10.96.219.179    <none>        443/TCP                      4m29s   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
default            service/kubernetes                           ClusterIP      10.96.0.1        <none>        443/TCP                      611d    <none>
hello-kubernetes   service/hello-service                        ClusterIP      10.96.203.197    <none>        80/TCP                       2m13s   app=hellodemo

The NGINX LoadBalancer service's EXTERNAL-IP remains in the <pending> state, which is expected on a bare-metal/VM cluster where nothing allocates external addresses for LoadBalancer Services.

The files, including a log file, are available at https://github.com/DataONEorg/k8s-cluster/tree/feature-%2316-nginx-load-balancer/control-plane/nginx-load-balancer

An external connection can be made to k8s via port 31964 but not 443:

On avatar (remote MBP):
avatar:~ slaughter$ curl -k https://api.test.dataone.org:31964
Hello Kubernetes!avatar:~ slaughter$

avatar:Desktop slaughter$ curl -k https://api.test.dataone.org
curl: (35) LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to api.test.dataone.org:443
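
One quick check (not from the thread) that would confirm why 443 refuses while the NodePort answers is to look at what is actually listening on the k8s host:

# run on the k8s node to see what, if anything, is listening on 443 vs. the NodePort
sudo ss -tlnp | grep -E ':(443|31964)\b'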
gothub commented 2 years ago

Using the hostPort approach (mentioned in GNIS issue 5) may not be the solution chosen for DataONE k8s, but it is summarized here for future reference. Here is the general outline:

gothub commented 2 years ago

If anyone has any additional ideas on how to solve this problem, please detail them here.

amoeba commented 2 years ago

Hey @gothub, thanks for continuing to try getting this working.

I gleaned from the above that we're close but that the <pending> state in your output above is the current sticking point. I wanted to help but didn't want to fiddle with the NCEAS clusters (in case I broke something), so I decided to try setting up a cluster from scratch with a couple of DigitalOcean droplets.

I think it ended up working and I was able to get things to where I could deploy a basic webapp and route traffic to it over a subdomain I own, with a working LetsEncrypt cert. I was also able to test that requests were being load balanced across pods and across nodes.

I wrote up my steps and ran through them twice to check for errors. I put them in a gist: https://gist.github.com/amoeba/07f713bb4fca8b74fbc0d314bf9e70d6.

Maybe we could go over this next week to see if it could help us get our cluster working. Let me know.

gothub commented 2 years ago

@bryce thanks for looking into this. How does DigitalOcean handle the routing from port 443 to the ingress controller? I had a look at the DO tutorial and it's not clear to me how this is done.

amoeba commented 2 years ago

As far as I can tell, the ingress controller is just binding port 80/443 on the Node (droplet), making any request like http(s)://$NODE_EXTERNAL_IP:{80,443} route to the ingress controller. DO Droplets are pretty much analogous to our static VMs at NCEAS so I don't think there's any magic going on.
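
For illustration (my wording, not an exact manifest), it is the hostPort binding on the controller container that makes the node itself answer on 80/443:

# excerpt from an ingress controller container spec (illustrative only)
ports:
- containerPort: 80
  hostPort: 80
  name: http
  protocol: TCP
- containerPort: 443
  hostPort: 443
  name: https
  protocol: TCP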

amoeba commented 2 years ago

Just to test that my steps didn't rely on anything DigitalOcean offered that we didn't have on our end, I ran through my steps on my NCEAS development VM (neutral-cat). Aside from some hiccups that weren't really related, I think things are set up and working. See https://neutral-cat.nceas.ucsb.edu (I'll be firewalling this host off soon, so you should probably be on the VPN to visit).

gothub commented 2 years ago

@bryce here are differences between the neutral-cat k8s and the DataONE k8s that may be significant:

I'm not sure how significant these differences are, but my point is:

amoeba commented 2 years ago

neutral-cat is running a DigitalOcean version of k8s, not the kubeadm installed k8s on the D1 k8s

Could you explain this a bit more? I didn't install DO managed k8s, I just installed kubeadm on an empty Ubuntu VM. And I re-did the test on my NCEAS VM which was also a kubeadm install.

Re: the nginx ingress controller being installed with --set controller.publishService.enabled=true, I wasn't sure, but I see this in the docs for that chart:

if true, the controller will set the endpoint records on the ingress objects to reflect those on the service

There's a similar flag for the nginx-ingress-controller itself:

publish-service: Service fronting the Ingress controller. Takes the form "namespace/name". When used together with update-status, the controller mirrors the address of this service's endpoints to the load-balancer status of all Ingress objects it satisfies.

So is that doing this to my Ingress?

status:
  loadBalancer:
    ingress:
    - ip: 128.111.85.32
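
(Not from the thread, but a quick way to check whether the controller is writing that status is to list the Ingress objects; the ADDRESS column should show the mirrored address:)

kubectl get ingress --all-namespaces
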
amoeba commented 2 years ago

After talking with @mbjones and @gothub on Slack, we decided I'd try to replicate what I had done on my own VM on the Dev cluster's control plane VM.

I...

  1. Applied the already-written Service def for the NGINX Ingress Controller at https://github.com/DataONEorg/k8s-cluster/blob/main/control-plane/nginx-ingress-controller/controller-service.yaml
  2. Added the following section by kubectl edit'ing the Service def
    externalIPs:
    - 128.111.85.190
  3. Removed the host port declaration from the NGINX Ingress Controller Deployment and verified the host ports are no longer showing as bound
    ; k describe --namespace=ingress-nginx deployments ingress-nginx-controller | grep -i port
        Ports:       80/TCP, 443/TCP
        Host Ports:  0/TCP, 0/TCP

I think things are working but would love @gothub to check it over. I also left up the hello-kubernetes deployment I set up before and deployed it at https://api.test.dataone.org just to test that load balancing was happening.
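
A non-interactive equivalent of the kubectl edit in step 2, as a sketch (assuming the Service name and namespace shown in step 3's describe command):

kubectl patch service ingress-nginx-controller \
  --namespace ingress-nginx \
  --type merge \
  --patch '{"spec": {"externalIPs": ["128.111.85.190"]}}'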

gothub commented 2 years ago

The helm command that replicates the externalIP config @amoeba used to get ports 80/443 working is:

helm upgrade --install ingress-nginx ingress-nginx \
  --version=4.0.6 \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  --set controller.service.type=ClusterIP \
  --set controller.service.externalIPs={128.111.85.190} \
  --set controller.ingressClassResource.default=true \
  --set controller.ingressClassResource.name=nginx \
  --set controller.admissionWebhooks.enabled=false
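
(Not part of the original commands, but a quick way to confirm the Service picked up the external IP after the upgrade:)

kubectl get service --namespace ingress-nginx ingress-nginx-controller -o wide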

With this installed on dev k8s, https://stage.gnis-ld.org/data is working and provides a valid LE cert, so the ingress/secret is set up correctly.

Also, https://api.test.dataone.org/quality is available but is not providing a valid LE cert. This may be due to a config problem or to cert-manager; more debugging is needed.
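
A hedged debugging sketch for that, assuming cert-manager is issuing the LE certs from the standard cert-manager namespace (resource names below are placeholders):

# list cert-manager objects and look for any stuck out of the Ready state
kubectl get certificate,certificaterequest,order,challenge --all-namespaces
# then inspect the suspect Certificate and the cert-manager logs for ACME failures
kubectl describe certificate <certificate-name> --namespace <namespace>
kubectl logs --namespace cert-manager deploy/cert-manager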

mbjones commented 2 years ago

w00t 🎉 🥇 🎉

amoeba commented 2 years ago

Great to see, @gothub, thanks. I'm really glad the Helm chart had enough CLI flags for our use case.

amoeba commented 2 years ago

@mbjones dropped a note in Slack today to have us check on source IP preservation. Apparently ESS-DIVE has been bitten by this before. It looks like, by default, source IPs don't get preserved, which means applications only see a cluster IP address as the source IP of incoming HTTP requests. I'm not sure if this is an immediate problem for the services we're running today, but it's almost guaranteed to be a problem in the future (i.e. if we want to deploy Metacat or if a service wants to do usage tracking or rate limiting).

I did a test on my development cluster that's running nginx-ingress-controller and I'm seeing a cluster IP instead of the real source IP when I hit a pod running its own version of nginx:

::ffff:10.32.0.8 - - [20/Nov/2021:01:16:47 +0000] "GET /favicon.ico HTTP/1.1" 404 150 "https://neutral-cat.nceas.ucsb.edu/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36"

@mbjones mentioned two links we might start on,

@gothub maybe you were already planning on figuring this out along with what you're already doing but I thought I'd drop this note for posterity.

amoeba commented 2 years ago

I tested the Local externalTrafficPolicy config mentioned in https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip and confirmed it works: (1) I see the real request source IP and (2) traffic is distributed over the available pods. You can see this in action at https://neutral-cat.nceas.ucsb.edu/ (VPN required). Let me know if you don't see your IP listed.
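
For the record, the change under test boils down to a one-field edit on the controller Service; a sketch using the Service name/namespace from my dev cluster (yours will differ):

kubectl patch service nginx-ingress-ingress-nginx-controller \
  --namespace default \
  --type merge \
  --patch '{"spec": {"externalTrafficPolicy": "Local"}}'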

They spookily mention that setting the policy to Local:

risks potentially imbalanced traffic spreading.

I don't understand this risk, but I do know how to use R, so I just tested it. Over 1000 requests, I get this spread across three pods:

> table(result)
hello-kubernetes-6c4fb97c77-4kbxr hello-kubernetes-6c4fb97c77-kh28b hello-kubernetes-6c4fb97c77-pwl5w 
                              334                               334                               332 

Looks pretty even to me so maybe Local mode isn't a big deal?
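
In case anyone wants to repeat this, a rough shell version of the same check (it assumes the hello app's response body includes the serving pod's name, which it does here):

for i in $(seq 1 1000); do
  curl -sk https://neutral-cat.nceas.ucsb.edu/ \
    | grep -o 'hello-kubernetes-[a-z0-9-]*' | head -n 1
done | sort | uniq -c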

mbjones commented 2 years ago

Great sleuthing and nice empirical demo, @amoeba. Seems like a good way to start for our config.

gothub commented 2 years ago

@bryce - could you post your k8s Service YAML that allowed you to retain the source IP?

amoeba commented 2 years ago

Here's my full NIC service spec:

services_nginx-ingress-ingress-nginx-controller.yml

```
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: nginx-ingress
    meta.helm.sh/release-namespace: default
  creationTimestamp: "2021-11-08T21:28:42Z"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: nginx-ingress
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/version: 1.0.4
    helm.sh/chart: ingress-nginx-4.0.6
  name: nginx-ingress-ingress-nginx-controller
  namespace: default
  resourceVersion: "1626286"
  uid: 910d7eb1-c5ce-4826-b835-8d36a21b3537
spec:
  clusterIP: 10.100.93.5
  clusterIPs:
  - 10.100.93.5
  externalIPs:
  - 128.111.85.32
  externalTrafficPolicy: Local
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - appProtocol: http
    name: http
    nodePort: 31400
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    nodePort: 31178
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: nginx-ingress
    app.kubernetes.io/name: ingress-nginx
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}
```

gothub commented 2 years ago

The fix to preserve the source IP didn't work in the test I ran recently. The NGINX ingress controller service definition was reconfigured to use

...
spec:
  type: NodePort
  externalIPs:
    - 128.111.85.190
  externalTrafficPolicy: Local   # <-- added this line for the test
  ipFamilyPolicy: SingleStack    # <-- added this line for the test
  ipFamilies:                    # <-- added this line for the test
    - IPv4                       # <-- added this line for the test
...

When the service was restarted with these additions, the ingress-nginx-controller pod stopped receiving traffic. The externalIP is set to the IP of docker-dev-ucsb-1, but the only running instance of ingress-nginx is currently on docker-dev-ucsb-2.

Restarting the service with externalTrafficPolicy commented out results in traffic being routed to the NIC again.

It's possible that packets are being dropped because ingress-nginx is not running on docker-dev-ucsb-1, as mentioned here:

The recommended way to preserve the source IP in a NodePort setup is to set the value of the externalTrafficPolicy field of the ingress-nginx Service spec to Local (example).

Warning

This setting effectively drops packets sent to Kubernetes nodes which are not running any instance of the NGINX Ingress controller. Consider assigning NGINX Pods to specific nodes in order to control on what nodes the NGINX Ingress controller should be scheduled or not scheduled.

I'll run a test during scheduled k8s dev downtime to confirm this. ingress-nginx will be configured to be scheduled on docker-dev-ucsb-1, and then we can see whether the externalTrafficPolicy=Local fix routes traffic to it.
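
One way to pin the controller to docker-dev-ucsb-1 for that test, sketched with the chart's nodeSelector value (the exact approach may differ when we actually schedule the downtime):

helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx \
  --reuse-values \
  --set controller.nodeSelector."kubernetes\.io/hostname"=docker-dev-ucsb-1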

The link above mentions other methods to preserve the source IP that may be easier to maintain than changing the traffic policy.

gothub commented 2 years ago

BTW - ports 80/443 are now active on both dev and prod k8s. I'll write up docs on how this was configured.

amoeba commented 2 years ago

Thanks for trying it out and for the report @gothub.