azavea / noaa-hydro-data

NOAA Phase 2 Hydrological Data Processing

Add Terraform EKS deployment #54

Closed jpolchlo closed 2 years ago

jpolchlo commented 2 years ago

This PR contributes Terraform code to set up a Kubernetes cluster via EKS. Setting up an EKS cluster involves many implementation details, so this should be considered a first swing at bat. There's an excellent chance that this contribution can be greatly simplified.

Notable features here include the use of Karpenter for cluster autoscaling. This is AWS-specific (for now), but more featureful than the baseline k8s autoscaler. Notably, Karpenter will select instance sizes to best match the deployed tasks.

At the time of initial submission, this PR is still pretty WIP-y. Tasks to be completed:

The fruits of this PR can be observed and interacted with at https://jupyter.noaa.azavea.com. Anyone with an Azavea Google login can sign in. External parties can be set up via Cognito if there is a legitimate request.

To limit the scope of this PR, I'm going to push off some of the objectives into other issues.

Closes #20 Closes #21 Closes #62 Closes #63 Closes #64 Closes #65

jpolchlo commented 2 years ago

Want to note that I saw a problem destroying this infrastructure. Things went very sideways when it turned out that there were nodes created by Karpenter that are not part of the managed node groups set up in the EKS cluster definitions. These nodes held on to network interfaces associated with a subnet and security group, which then could not be deleted. I'll need to seek a solution for this.

lewfish commented 2 years ago

I ran ./scripts/cibuild on macOS and got the following error:

$ ./scripts/cibuild
Building terraform
[+] Building 1.4s (9/9) FINISHED
 => [internal] load build definition from Dockerfile.terraform                                                                                                                                                   0.0s
 => => transferring dockerfile: 47B                                                                                                                                                                              0.0s
 => [internal] load .dockerignore                                                                                                                                                                                0.0s
 => => transferring context: 2B                                                                                                                                                                                  0.0s
 => [internal] load metadata for quay.io/azavea/terraform:1.0.0                                                                                                                                                  1.2s
 => [1/6] FROM quay.io/azavea/terraform:1.0.0@sha256:79b9e6abcd72d456eb54db4fec90e6a014ff13f4a0c0e078384627f6efeafa0b                                                                                            0.0s
 => CACHED [2/6] RUN apk add --update docker openrc                                                                                                                                                              0.0s
 => CACHED [3/6] RUN rc-update add docker boot                                                                                                                                                                   0.0s
 => CACHED [4/6] RUN cd /tmp &&     curl -LO "https://dl.k8s.io/release/v1.23.6/bin/linux/amd64/kubectl" &&     install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl &&     rm kubectl                 0.0s
 => CACHED [5/6] RUN cd /tmp &&     curl -LO "https://get.helm.sh/helm-v3.8.2-linux-amd64.tar.gz" &&     tar -zxf "helm-v3.8.2-linux-amd64.tar.gz" &&     install -o root -g root -m 0755 linux-amd64/helm /usr  0.0s
 => ERROR [6/6] RUN echo "USER=lewfish, GID=20, UID=501" &&     addgroup -g 20 lewfish &&     adduser -u 501 -G lewfish -D -s /bin/bash lewfish                                                                  0.2s
------
 > [6/6] RUN echo "USER=lewfish, GID=20, UID=501" &&     addgroup -g 20 lewfish &&     adduser -u 501 -G lewfish -D -s /bin/bash lewfish:
#9 0.176 USER=lewfish, GID=20, UID=501
#9 0.188 addgroup: gid '20' in use
------
executor failed running [/bin/sh -c echo "USER=$USER, GID=$GID, UID=$UID" &&     addgroup -g $GID $USER &&     adduser -u $UID -G $USER -D -s /bin/bash $USER]: exit code: 1
ERROR: Service 'terraform' failed to build : Build failed

I'm trying to debug this myself, but I'm not sure why we need to run the addgroup command.

jpolchlo commented 2 years ago

Depending on the system, the addgroup call might be necessary. On my Linux system, my GID is 1000, which doesn't exist in the container. Evidently, macOS is happy assigning a low-numbered group to a user, which conflicts with the system groups in the Alpine Linux-based container in use here. Ultimately, the solution would be to not create the group if it already exists. I pulled in @moradology on this, and we observed the same problem. Simple fixes weren't entirely successful, since they caused some problems with the subsequent adduser command. Still working on this.
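
For reference, a minimal sketch of the "don't create the group if it exists" idea. It assumes the conflict is only on the GID and that reusing whichever group already holds that GID is acceptable. The helper names group_name_for_gid and resolve_group are hypothetical and not part of this PR. This has not been tested against the adduser problems mentioned above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of this PR.

# Print the name of the group that already owns a GID in a group file
# (default /etc/group); print nothing if the GID is free.
group_name_for_gid() {
    gid="$1"
    group_file="${2:-/etc/group}"
    awk -F: -v gid="$gid" '$3 == gid { print $1; exit }' "$group_file"
}

# Print the group a new user should join: the existing group when the
# GID is already taken, otherwise the requested name.
resolve_group() {
    wanted_name="$1"
    gid="$2"
    group_file="${3:-/etc/group}"
    existing="$(group_name_for_gid "$gid" "$group_file")"
    if [ -n "$existing" ]; then
        echo "$existing"
    else
        echo "$wanted_name"
    fi
}

# Possible use in the image build (assumption: same BusyBox addgroup/adduser
# flags as the failing step in the log above):
#
#   if [ -z "$(group_name_for_gid "$GID")" ]; then
#       addgroup -g "$GID" "$USER"
#   fi
#   adduser -u "$UID" -G "$(resolve_group "$USER" "$GID")" -D -s /bin/bash "$USER"
```

With this approach a macOS user with GID 20 would be added to the existing system group instead of a new group named after the user, so files written to mounted volumes still carry the host GID. A UID collision would need separate handling.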

jpolchlo commented 2 years ago

This PR is probably in a reasonable state to merge. The cibuild/Docker image problems on macOS have not been solved, but they should be moved to a separate issue, since they haven't been preventing continued work. Also note that the Dockerfile and docker-compose here will possibly serve as a template for how to fix the issue. But because I have no access to a Mac, I can't push on incorporating these fixes and testing whether they solve the problem.

Summing up: this "works", the outstanding script issue needs more time, but it's not worth holding up this merge.

jpolchlo commented 2 years ago

I can put up an issue to update the README.