POC: Set up Lagoon on EKS and get a demo of the CMS up and running #6674

As a CMS engineer, I would like to validate that Lagoon will be sufficient for our needs so that we can begin to evaluate the value and cost-savings that Lagoon potentially offers.

Acceptance Criteria

Can we size this?

jefflbrauer commented 2 years ago

Lagoon Core

Screen Shot 2021-10-27 at 6 53 03 AM

I'm not sure what the ANSI art is trying to be. If I can get a source image, I'll try to create a clearer one.

Upon requesting (HTTPS doesn't work, because reasons), I get:

Screen Shot 2021-10-27 at 6 47 16 AM

Which is actually success (at this point)! That means that the routing and ingress are working properly. Next I'll fight with Keycloak, I guess.

I think that's fixable by overriding the keycloakAPIURL in the values file.

        - name: KEYCLOAK_API
          {{- if .Values.keycloakAPIURL }}
          value: {{ .Values.keycloakAPIURL | quote }}
          {{- else }}
          value: https://{{ index .Values.keycloak.ingress.hosts 0 "host" }}/auth
          {{- end }}

And it works:

Screen Shot 2021-10-27 at 8 19 45 AM

Well, sorta:

Screen Shot 2021-10-27 at 8 20 55 AM

Probably because this:

Screen Shot 2021-10-27 at 8 22 24 AM

So let's override this other URL:

        - name: GRAPHQL_API
          {{- if .Values.lagoonAPIURL }}
          value: {{ .Values.lagoonAPIURL | quote }}
          {{- else }}
          value: https://{{ index .Values.api.ingress.hosts 0 "host" }}/graphql
          {{- end }}
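Both overrides are read from top-level chart values (`.Values.keycloakAPIURL` and `.Values.lagoonAPIURL` in the templates above). A minimal sketch of generating an override file for them follows; the function name and the example.test URLs are hypothetical, not values from this deployment.

```shell
# Print a Helm values fragment that sets both URL overrides.
# $1: Keycloak URL, $2: GraphQL API URL (hypothetical placeholders in the usage below).
lagoon_url_overrides() {
  printf 'keycloakAPIURL: "%s"\n' "$1"
  printf 'lagoonAPIURL: "%s"\n' "$2"
}
```

Redirecting the output of `lagoon_url_overrides http://keycloak.example.test/auth http://api.example.test/graphql` to a file and passing that file with an extra `-f` to the Lagoon Core `helm upgrade --install` would apply both overrides in one release.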


Screen Shot 2021-10-27 at 8 28 32 AM

Harbor... just... kinda worked, I guess.

Screen Shot 2021-10-27 at 8 48 50 AM

I updated Lagoon-Core with the Harbor admin password, but obv didn't update the Gist.

Screen Shot 2021-10-27 at 9 11 31 AM

Lagoon Remote

I created a values.yaml file for Lagoon Remote and deployed the helm chart. It, uh, appears to have deployed successfully:

Screen Shot 2021-10-27 at 9 13 29 AM

I mean, who knows what it's actually doing, but I'll burn that bridge when I come to it.

ndouglas commented 2 years ago


SSH access to the lagoon-core-ssh service is required to access Lagoon through the CLI. I thought the service had launched correctly, but upon closer inspection found that it was in Pending state. After debugging some with Eric and Elijah, we found this page, which had the answer:

    # This annotation is only required if you are creating an internal facing ELB. Remove this annotation to create public facing ELB. "true"

After editing this into the service, the NLB seemed reachable via SSH from CMS-Test Dev:

sh-4.2$ telnet 22
Connected to
Escape character is '^]'.
telnet> quit
Connection closed.

The issue from here is that this isn't cleanly accessible from our local machines. A solution is probably straightforward for someone better versed in SOCKS and so forth. I'm currently messing with ProxyJump/ProxyCommand in SSH trying to get this working 🤔

This was the magic necessary to be able to connect (not log in) from my local machine.

Host lagoon
    User lagoon
    ProxyCommand ssh -q -A dsva@vetsgov-dev-jumpbox-govwest-1b  nc %h %p

From here I can generate a token. However, it appears that the Lagoon CLI doesn't use ~/.ssh/config but attempts to log in to the specified hostname directly, e.g. doing a DNS lookup and stuff. This might require upstream patches.

ndouglas commented 2 years ago


I thought Lagoon-CLI used the Go SSH client library, but upon closer inspection it seemed to use the SSH CLI. Then, upon still closer inspection, it only seemed to use the SSH CLI under certain circumstances.

After discussing this with Elijah, Eric, and Cameron, we figured that a good course of action would be to modify the Lagoon CLI to support SOCKS5 or ProxyJump/ProxyCommand or something. Elijah opened an issue.

This morning, I did some tentative work in that direction. Then I started getting itchy and changed the generated SSH command for the codepath that I was fairly sure was never executed, and -- it started working 😕

diff --git a/pkg/lagoon/ssh/main.go b/pkg/lagoon/ssh/main.go
index 3b6e013..23a1ee2 100644
--- a/pkg/lagoon/ssh/main.go
+++ b/pkg/lagoon/ssh/main.go
@@ -120,7 +120,7 @@ func RunSSHCommand(lagoon map[string]string, sshService string, sshContainer str

 // GenerateSSHConnectionString .
 func GenerateSSHConnectionString(lagoon map[string]string, service string, container string) string {
-   connString := fmt.Sprintf("ssh -t -o \"UserKnownHostsFile=/dev/null\" -o \"StrictHostKeyChecking=no\" -p %v %s@%s", lagoon["port"], lagoon["username"], lagoon["hostname"])
+   connString := fmt.Sprintf("ssh -o \"ProxyCommand=ssh -q -A dsva@vetsgov-dev-jumpbox-govwest-1b  nc %%h %%p\" -t -o \"UserKnownHostsFile=/dev/null\" -o \"StrictHostKeyChecking=no\" -p %v %s@%s", lagoon["port"], lagoon["username"], lagoon["hostname"])
    if service != "" {
        connString = fmt.Sprintf("%s service=%s", connString, service)


🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ ./lagoon-cli login
Error: Post "": dial tcp: lookup no such host

So we need that SOCKS5 proxy to cover everything.

But Go can import proxy information from an HTTP_PROXY environment variable, so:

🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ export HTTP_PROXY="socks5://"
🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ ./lagoon-cli login
Token fetched and saved.
🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ ./lagoon-cli whoami
ID                                      EMAIL                       FIRSTNAME   LASTNAME    SSHKEYS 
2015e338-4c55-44f5-8217-25f77af81937   Nathan      Douglas     2   
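
The thread does not show how the SOCKS5 proxy was started or what address was exported. One common approach, assumed here, is OpenSSH dynamic forwarding through the same jump host used in the ProxyCommand. In this sketch the local port 1080 and both function names are hypothetical.

```shell
# Jump host alias as it appears in the ssh config above.
JUMP_HOST="dsva@vetsgov-dev-jumpbox-govwest-1b"
# Hypothetical local port for the SOCKS5 listener; not taken from the thread.
SOCKS_PORT="${SOCKS_PORT:-1080}"

# Print the ssh invocation that opens a local SOCKS5 listener via the jump host.
# -N: run no remote command, -D: dynamic (SOCKS) forwarding on the local port.
socks_tunnel_command() {
  printf 'ssh -N -D %s %s\n' "$SOCKS_PORT" "$JUMP_HOST"
}

# Print the matching proxy URL. Go's default HTTP transport reads HTTP_PROXY
# for http:// requests and HTTPS_PROXY for https:// requests.
socks_proxy_url() {
  printf 'socks5://127.0.0.1:%s\n' "$SOCKS_PORT"
}
```

Running the printed ssh command in one terminal and `export HTTP_PROXY="$(socks_proxy_url)"` in another would reproduce the kind of setup used in the transcript above.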


So at this point my work is unblocked and I can go find some new obstacle to slam into at high speed.

But... why does it work? At this point in my engineering career, nothing makes me more suspicious than something that Just Works™. I did not bleed enough, I did not suffer enough for this to work.

So I git stashed my changes, rebuilt the CLI, and... it still worked. I changed to a new tab (without the exported HTTP_PROXY variable), re-ran it, and... it still worked. I removed the token completely, re-ran, and... it worked.

Something is rotten in the state of Denmark. – Shakespeare Hamlet 1.4.???

After some poking around, I think that the answer was just to change my SSH connection info for Lagoon:

current: lagoon-dev
default: lagoon-dev
    port: "32222"
    token: ""
    version: ""
    kibana: ""
    port: "22"
    token: <lemme 'lone>
    version: v2.1.0
updatecheckdisable: false
environmentfromdirectory: false

Then export the HTTP_PROXY. Then things seem to work and we can continue on our quest.

I still don't really understand why this works. resolves to 10.247.x.y via dig. This address can't be pinged. It can, however, be SSH'ed to. So this sounds like an OSI layer thing. I suspect that there's some SOCKS5 setting somewhere that's getting picked up, but I don't know where it is.

Fun With GraphQL

The next step is to play with Lagoon via GraphQL. Unfortunately:

Screen Shot 2021-10-28 at 10 16 27 AM

GraphiQL doesn't expose any sort of SOCKS proxy configuration.


Fortunately, this is precisely the sort of suffering I've come to expect in engineering.


With this command:

curl -g \
  --socks5-hostname \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <my-token>" \
  -d '{"query":"query allProjects {allProjects {name } }"}' \

I received the expected response:


With the following query:

curl -g \
  --socks5-hostname \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <lagoon-token>" \
  -d '{"query": "mutation addKubernetes {\r\n  addKubernetes(input:\r\n  {\r\n    name: \"lagoon-dev\",\r\n    consoleUrl: \"https:\/\/\",\r\n    token: \"<kubernetes-build-deploy-token>\",\r\n    routerPattern: \"${environment}.${project}\"\r\n  }){id}\r\n}"}' \

I got the following:


Which might also be a sign that things are working. I'm not 100% on the legitimacy of that build-deploy token, though. My kubectl isn't working for some reason, and so I looked up the token in Lens and base64 decoded it. If there's a permission failure after this point, I might need to set that token to the base64 encoded value instead, or something like that.
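
On the base64 question: Kubernetes stores Secret values base64-encoded under `.data`, so a value copied from the raw Secret needs exactly one decode before it is used as a bearer token. A minimal sketch of that decode step follows; the function name is hypothetical.

```shell
# Decode one base64-encoded Secret value read from stdin.
# tr strips any newlines picked up when copying the value out of a UI.
decode_secret_value() {
  tr -d '\n' | base64 -d
}
```

With working kubectl access, the encoded value could come from `kubectl -n <namespace> get secret <secret-name> -o jsonpath='{.data.token}'`; the placeholders are left unfilled because the thread does not name the secret. A correctly decoded service account token is a JWT, so it would normally start with `eyJ`.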

Next is creating the project:

🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ lagoon add project --gitUrl git:// --openshift 1 --productionEnvironment lagoon-dev --branches "^(master|main|VACMS-6674.*)$"
Result: success
Project Name: lagoon-dev

and it's visible upon login:

Screen Shot 2021-10-28 at 12 29 11 PM

I added this deploy key:

Screen Shot 2021-10-28 at 12 32 46 PM

and deployed:

Screen Shot 2021-10-28 at 12 36 30 PM

but alas:

Screen Shot 2021-10-28 at 12 37 17 PM Screen Shot 2021-10-28 at 12 37 25 PM

This might be failing because the logs pods are still in ImagePullBackOff:

Screen Shot 2021-10-28 at 12 38 07 PM

So it might be time for More Fun With Kubernetes™.

EDIT: Nope, just should've supplied a git:// URL instead of SSH. Sorry, hadn't read about the deploy key yet. Just making it up as I go.

That made it further:

Screen Shot 2021-10-28 at 12 47 20 PM

but without logs, my ability to figure out wut's going on is obv limited, so I probably need to fix the root issue there.

ndouglas commented 2 years ago

Fun With Lagoon, Kubernetes, Docker, RDS, IDK What

So why are the logs pods (and only the logs pods) in ImagePullBackOff?

The first obstacle along the way is that kubectl stopped working. After some poking around, it appears that the same HTTP_PROXY env var that lets lagoon-cli work actually breaks kubectl EKS access. I'll press on, switching back and forth between tabs.
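
One way to avoid juggling tabs is to scope the proxy variable to a single command instead of exporting it. A minimal sketch, assuming the SOCKS5 proxy listens on 127.0.0.1:1080 (a hypothetical address; the thread does not give the real one) and using hypothetical helper names:

```shell
# Hypothetical proxy address; override by setting LAGOON_PROXY_URL beforehand.
LAGOON_PROXY_URL="${LAGOON_PROXY_URL:-socks5://127.0.0.1:1080}"

# Run one command with HTTP_PROXY set, leaving the calling shell untouched.
with_proxy() {
  env HTTP_PROXY="$LAGOON_PROXY_URL" "$@"
}

# Run one command with any inherited proxy variables removed.
without_proxy() {
  env -u HTTP_PROXY -u HTTPS_PROXY -u http_proxy -u https_proxy "$@"
}
```

`with_proxy lagoon whoami` and `without_proxy kubectl get pods --all-namespaces` can then run side by side in the same shell.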

But now that I can kubectl, I can look at the failures a little more closely.

🔔nathan.douglas@Belmore:~/Projects/content-build$ kubectl get pods --all-namespaces | grep lagoon-build
lagoon-dev-master          lagoon-build-wl8fej                                            0/1     Error              0          24m
lagoon                     lagoon-remote-lagoon-build-deploy-bfb74bf4-mrf66               2/2     Running            0          28h
🔔nathan.douglas@Belmore:~/Projects/content-build$ kubectl describe pod -n lagoon-dev-master lagoon-build-wl8fej
Name:                 lagoon-build-wl8fej
Namespace:            lagoon-dev-master
  Type    Reason     Age   From                                                      Message
  ----    ------     ----  ----                                                      -------
  Normal  Scheduled  24m   default-scheduler                                         Successfully assigned lagoon-dev-master/lagoon-build-wl8fej to
  Normal  Pulling    24m   kubelet,  Pulling image "uselagoon/kubectl-build-deploy-dind:latest"
  Normal  Pulled     24m   kubelet,  Successfully pulled image "uselagoon/kubectl-build-deploy-dind:latest" in 1.14009574s
  Normal  Created    24m   kubelet,  Created container lagoon-build
  Normal  Started    24m   kubelet,  Started container lagoon-build
🔔nathan.douglas@Belmore:~/Projects/content-build$ kubectl logs -n lagoon-dev-master lagoon-build-wl8fej
Agent pid 33
Identity added: /home/.ssh/key (/home/.ssh/key)
+ set -eo pipefail
+ set -o noglob
++ cat /var/run/secrets/
+ NAMESPACE=lagoon-dev-master
+ REGISTRY_REPOSITORY=lagoon-dev-master
++ cat /lagoon/version
+ set +x
+ '[' false == true ']'
+ '[' branch == pullrequest ']'
+ /kubectl-build-deploy/scripts/ git:// origin/master
+ set -eo pipefail
+ REMOTE=git://
+ REF=origin/master
+ git init .
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint:   git config --global init.defaultBranch <name>
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:   git branch -m <name>
Initialized empty Git repository in /kubectl-build-deploy/git/.git/
+ git config remote.origin.url git://
+ git fetch --depth=10 --tags --progress git:// '+refs/heads/*:refs/remotes/origin/*'
fatal: unable to connect to[0:]: errno=Operation timed out

Hmm. So it looks kinda like there's an outgoing networking issue.

When consulted, Eric and Elijah nodded sadly and explained that outbound requests to the SSH port are dropped by the TIC. And although GitHub can be SSH'ed to on port 443, this violates the spirit of TIC law and would get me yelled at.

Two solutions are:

A decision on the latter probably isn't possible until Monday, so I'm kinda blocked here.

I think I'll go back and see if I can get HTTPS cloning to work. IDK why it wouldn't, but it didn't before.

EDIT: Yeah, no, definitely still doesn't work.

ndouglas commented 2 years ago

I'm blocked on moving much forward by the outgoing Git/SSH issue, but I can move forward with other things...


I created an EFS filesystem, dsva-vagov-lagoon-dev-cms-efs, and deployed the EFS provisioner for it with the following command:

helm upgrade --install --create-namespace --namespace lagoon-efs-provisioner -f efs-provisioner-values.yaml  lagoon-efs-provisioner stable/efs-provisioner

This created a storage class with the name lagoon-bulk. Easy enough.

That does nothing to unblock me with regard to Git/SSH, though, and I still need to figure out a couple things:

ndouglas commented 2 years ago

The ops team, in office hours, confirmed our suspicions that this restriction on outbound SSH is pretty legit. As such, this PoC is blocked.

We have a number of options for moving forward (h/t Cameron for typing them up):

ndouglas commented 2 years ago

Roundabout Approaches


A few of the options above could actually be addressed. I've addressed them, sorta, and will discuss.

Modifying the Lagoon Build Deploy Image

No one has responded yet to my discussion thread about git cloning via HTTPS. However, even if they had, it wouldn't work because the DHS is MITMing the TLS.

I forked the Lagoon service images, rebuilt the kubectl image, pushed it to Docker Hub, modified the derivative kubectl-build-deploy-dind image to insert the cert, rebuilt the image, and pushed it to Docker Hub.

The second half of that was to actually alter the Lagoon configuration to use the new Docker image. I injected the override into the remote-values.yaml and updated the lagoon-remote deployment, but unfortunately the keycloak pods went into ImagePullBackOff because we'd hit the Docker Hub pull rate limit.

Modifying the codebase

There are some changes that need to be made to the CMS codebase as part of a move to Lagoon. I made them in #6867, although I have no way of testing them.

ndouglas commented 2 years ago

Per this exchange:

Screen Shot 2021-11-02 at 12 52 22 PM

There is no way to move forward with this PoC.

ndouglas commented 2 years ago

So I got some responses on this discussion thread saying that although the build-deploy pipeline was implemented with Git/SSH in mind, that was mostly to accommodate GitHub deploy keys and that there was no real hard reason that HTTPS cloning should not work.
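
Switching a project to HTTPS cloning mostly comes down to rewriting its Git URL. A minimal sketch of that rewrite for the common scp-style and ssh:// forms follows; the function name and the example repository are hypothetical, and authentication for private repositories (which deploy keys handle in the SSH setup) is not covered.

```shell
# Rewrite git@host:org/repo.git or ssh://git@host/org/repo.git to https://host/org/repo.git.
# URLs that match neither form are printed unchanged.
# ssh:// URLs with an explicit port are not handled by this sketch.
ssh_to_https() {
  printf '%s\n' "$1" | sed -E 's#^(ssh://)?git@([^:/]+)[:/](.+)$#https://\2/\3#'
}
```

For example, `ssh_to_https git@github.com:example-org/example-repo.git` prints `https://github.com/example-org/example-repo.git`.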

Who's to blame?

The Tick

I mentioned that I'd injected the TIC TLS cert into a Docker container, at which point Toby pointed out that I was using an older Dockerfile -- a great catch which undoubtedly would save me some frustration.

So going back to Docker to build the new image:

#17 20.15   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#17 20.16                                  Dload  Upload   Total   Spent    Left  Speed
100 38.3M  100 38.3M    0     0  17.7M      0  0:00:02  0:00:02 --:--:-- 17.7M
#17 22.33   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
#17 22.33                                  Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host:

I ran into this issue which seems to plague Docker for Mac. I don't want to upgrade Docker for Mac because that's caused issues with Lando in the past. Fortunately, I have about sixty LXC containers with Docker installed, so I'll just SSH into one of them and build the image and push it from there.

Well, then SSH is hanging. I can't SSH into any of said containers, or anything else on my network. SSH works with everything else on my network... except my work computer.


I attempted to find a solution for a few minutes, but being pressed for time I ended up just switching computers, SSHing into my work computer from my personal computer, grabbing the updated Dockerfile, then SSHing into an LXC container to build the kubectl and kubectl-build-deploy-dind images. After adding my SSH pub key for that machine to GitHub. And running docker login.

The CMS project's URL is HTTPS, so I can attempt to deploy the branch PR to see where my PR (see #6867) fails:

🔔nathan.douglas@Belmore:~/Projects/lagoon-stuff$ lagoon deploy branch -p lagoon-dev -b VACMS-6674-lagoon
✔ Yes

Now I can log into the Lagoon UI because Keycloak is running again; it had been stuck in ImagePullBackOff because it didn't have Docker Hub credentials.


Screen Shot 2021-11-03 at 8 17 16 AM


It's taking longer to fail than it has before. Which is, technically, progress.

🔔nathan.douglas@Belmore:~/Projects/$ kubectl get pods --all-namespaces | grep lagoon-build

lagoon-dev-vacms-6674-lagoon   lagoon-build-vkn4ha                                            0/1     Error               0          3m12s
lagoon                         lagoon-remote-lagoon-build-deploy-758bd85997-vdkth             2/2     Running             0          18h
🔔nathan.douglas@Belmore:~/Projects/$ kubectl logs -n lagoon-dev-vacms-6674-lagoon lagoon-build-vkn4ha
HEAD is now at 040fe2b7 Fix webroot.
+ git submodule update --init --recursive --jobs=6
+ [[ -n '' ]]
+ '[' '!' -f .lagoon.yml ']'
++ cat .lagoon.yml
++ shyaml get-value environment_variables.git_sha false
+ '[' true == true ']'
++ git rev-parse HEAD
+ LAGOON_GIT_SHA=040fe2b79034c9f31832b64bd9281d9188df7973
+ set +x
User "lagoon/kubernetes.default.svc" set.
Cluster "kubernetes.default.svc" set.
Context "default/lagoon/kubernetes.default.svc" created.
Switched to context "default/lagoon/kubernetes.default.svc".
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get x509: certificate is valid for *, *, *,, not

So something is requesting Harbor, but doing so via HTTPS and not HTTP. Since I've not specified HTTPS anywhere, this would appear to be an issue with a script somewhere.

Looking at the build log, I don't see any commands issued after WARNING! Using --password via the CLI is insecure. Use --password-stdin., which is a Docker warning. So I think the HTTPS error is from Docker attempting to log in to Harbor and failing to do so because of the certificate.

Why? Well, Docker requires some additional configuration for insecure registries -- configuration that I don't believe the kubectl-build-deploy-dind scripts perform. So I gotta do that.
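
For reference, the additional configuration in question is usually the `insecure-registries` list in the Docker daemon's `daemon.json`. A minimal sketch that prints such a file follows; the function name and the registry host are hypothetical, since the Harbor hostname does not appear in the thread.

```shell
# Print a daemon.json that marks one registry as insecure.
# $1: registry host, optionally with :port (hypothetical placeholder in the usage below).
insecure_registry_config() {
  printf '{\n  "insecure-registries": ["%s"]\n}\n' "$1"
}
```

The output of `insecure_registry_config harbor.example.test` would go in the daemon's `daemon.json`, and the daemon has to be restarted or reloaded to pick it up.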

The problem, I think, is that since this is built around Docker-in-Docker, we're using insecure registries as specified by the host, not by the container. So I think this might be doomed to fail.

So the only way to move forward at this point, AFAICT, is to add the insecure registry for Harbor to the `/etc/docker/daemon.json` file, bake that into a custom AMI, and recreate the EKS cluster using that AMI. So I'm blocked again.

ndouglas commented 1 year ago

LOL, I remember none of this.

Summarizing significant issues that I encountered in this PoC:

  1. Lagoon attempts to clone repos via SSH, but SSH outbound is blocked by the TIC. We need to keep Lagoon Core in the vagov-utility VPC and corresponding EKS cluster, which means outgoing connections transit the TIC. This means Lagoon needs to support HTTPS cloning.

  2. Lagoon's Docker image builds target Harbor, which is viewed as "insecure" from the perspective of EKS at this time, and we don't have authority to modify the relevant settings in the EKS cluster. So we might conceivably need to either request and justify settings modifications to support using Harbor insecurely, or do the necessary work to make Harbor secure from the perspective of the EKS cluster.