Closed cweagans closed 2 years ago
Can we size this?
Hey team! Please add your planning poker estimate with ZenHub @ElijahLynn @indytechcook @ndouglas @olivereri @timcosgrove
Please add your planning poker estimate with ZenHub @cweagans
I'm not sure what the ANSI art is trying to be. If I can get a source image, I'll try to create a clearer one.
Upon requesting http://api.lagoon-dev.cms.va.gov/ (HTTPS doesn't work, because reasons), I get:
Which is actually success (at this point)! That means that the routing and ingress are working properly. Next I'll fight with Keycloak, I guess.
I was able to log in to Keycloak (:tada:) and set an email for the user (which I set to my A6 email).
I did not initially configure the email server settings -- hoping that might be beyond the scope of this PoC.
EDIT: Removed SES stuff -- unnecessary AFAICT.
I enabled the "Forgot Password" functionality and attempted to access the UI. However, I was greeted only with "Not Authenticated / Please wait while we log you in..." and nothing ever happened. I'm not sure if this is because of a lack of TLS (it shouldn't be).
Oh:
I think that's fixable by overriding the keycloakAPIURL in the values file.
- name: KEYCLOAK_API
{{- if .Values.keycloakAPIURL }}
value: {{ .Values.keycloakAPIURL | quote }}
{{- else }}
value: https://{{ index .Values.keycloak.ingress.hosts 0 "host" }}/auth
{{- end }}
And it works:
Well, sorta:
Probably because this:
So let's override this other URL:
- name: GRAPHQL_API
{{- if .Values.lagoonAPIURL }}
value: {{ .Values.lagoonAPIURL | quote }}
{{- else }}
value: https://{{ index .Values.api.ingress.hosts 0 "host" }}/graphql
{{- end }}
And:
LoadBalancer-type service, which should provision a Network Load Balancer, but since we're in dev and not utility it might not necessarily be reachable. So the Lagoon CLI might take a couple more steps to get operative. Indeed, I can't see any NLBs allocated for the dev cluster aside from a jumpbox NLB, and that makes me think I should stop and work on something else until someone more familiar with the system gets online.
Harbor... just... kinda worked, I guess.
I updated Lagoon-Core with the Harbor admin password, but obv didn't update the Gist.
I created a values.yaml file for Lagoon Remote and deployed the helm chart. It, uh, appears to have deployed successfully:
I mean, who knows what it's actually doing, but I'll burn that bridge when I come to it.
SSH access to the lagoon-core-ssh service is required to access Lagoon through the CLI. I thought the service had launched correctly, but upon closer inspection found that it was in Pending state. After debugging some with Eric and Elijah, we found this page, which had the answer:
# This annotation is only required if you are creating an internal facing ELB. Remove this annotation to create public facing ELB.
service.beta.kubernetes.io/aws-load-balancer-internal: "true"
After editing this into the service, the NLB seemed reachable via SSH from CMS-Test Dev:
sh-4.2$ telnet dsva-vagov-dev-jumpbox-nlb-d30d20cd3ae50f82.elb.us-gov-west-1.amazonaws.com 22
Trying 10.247.96.216...
Connected to dsva-vagov-dev-jumpbox-nlb-d30d20cd3ae50f82.elb.us-gov-west-1.amazonaws.com.
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4
^C^]
telnet> quit
Connection closed.
The issue from here is that this isn't cleanly accessible from our local machines. A solution is probably straightforward for someone better versed in SOCKS and so forth. I'm currently messing with ProxyJump/ProxyCommand in SSH trying to get this working 🤔
This was the magic necessary to be able to connect (not log in) from my local machine.
Host lagoon
HostName internal-a5db579a60ddc4d94bd3bdd6cde40ef9-1394069038.us-gov-west-1.elb.amazonaws.com
User lagoon
ProxyCommand ssh -q -A dsva@vetsgov-dev-jumpbox-govwest-1b nc %h %p
From here I can generate a token. However, it appears that Lagoon CLI doesn't use ~/.ssh/config but attempts to log in to the specified hostname directly, e.g. doing a DNS lookup and stuff. This might require upstream patches.
I thought Lagoon-CLI used the Go SSH client library, but upon closer inspection it seemed to use the SSH CLI. Then, upon still closer inspection, it only seemed to use the SSH CLI under certain circumstances.
After discussing this with Elijah, Eric, and Cameron, we figured that a good course of action would be to modify the Lagoon CLI to support SOCKS5 or ProxyJump/ProxyCommand or something. Elijah opened an issue.
This morning, I did some tentative work in that direction. Then I started getting itchy and changed the SSH generated command for the codepath that I was fairly sure was never executed, and -- it started working 😕
diff --git a/pkg/lagoon/ssh/main.go b/pkg/lagoon/ssh/main.go
index 3b6e013..23a1ee2 100644
--- a/pkg/lagoon/ssh/main.go
+++ b/pkg/lagoon/ssh/main.go
@@ -120,7 +120,7 @@ func RunSSHCommand(lagoon map[string]string, sshService string, sshContainer str
// GenerateSSHConnectionString .
func GenerateSSHConnectionString(lagoon map[string]string, service string, container string) string {
- connString := fmt.Sprintf("ssh -t -o \"UserKnownHostsFile=/dev/null\" -o \"StrictHostKeyChecking=no\" -p %v %s@%s", lagoon["port"], lagoon["username"], lagoon["hostname"])
+ connString := fmt.Sprintf("ssh -o \"ProxyCommand=ssh -q -A dsva@vetsgov-dev-jumpbox-govwest-1b nc %%h %%p\" -t -o \"UserKnownHostsFile=/dev/null\" -o \"StrictHostKeyChecking=no\" -p %v %s@%s", lagoon["port"], lagoon["username"], lagoon["hostname"])
if service != "" {
connString = fmt.Sprintf("%s service=%s", connString, service)
}
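A note on the doubled percent signs in the patched line: the string is a fmt.Sprintf format, so %%h and %%p are needed to emit the literal %h and %p that OpenSSH expands inside ProxyCommand. A less fragile variant, sketched below, passes the proxy command in as an argument instead of baking it into the format string. buildSSHCommand and its parameters are hypothetical names, not part of lagoon-cli, and the hosts are placeholders.

```go
package main

import "fmt"

// buildSSHCommand assembles an ssh command line similar to the one in the
// patch above. proxyCommand is optional; because it is passed as a Sprintf
// argument rather than written into the format string, its %h and %p
// tokens need no %% escaping.
func buildSSHCommand(proxyCommand, port, user, host string) string {
	proxyOpt := ""
	if proxyCommand != "" {
		// %q wraps the option in double quotes. It uses Go escaping rules,
		// which is adequate for simple commands but is not a general
		// shell-quoting routine.
		proxyOpt = fmt.Sprintf("-o %q ", "ProxyCommand="+proxyCommand)
	}
	return fmt.Sprintf(
		`ssh %s-t -o "UserKnownHostsFile=/dev/null" -o "StrictHostKeyChecking=no" -p %s %s@%s`,
		proxyOpt, port, user, host,
	)
}

func main() {
	// Placeholder jumpbox and SSH endpoint.
	fmt.Println(buildSSHCommand("ssh -q -A jump.example.test nc %h %p", "22", "lagoon", "ssh.example.test"))
	// With no proxy configured the command matches the unpatched shape.
	fmt.Println(buildSSHCommand("", "22", "lagoon", "ssh.example.test"))
}
```

Reading the proxy command from configuration like this would keep the jumpbox name out of the source, which is roughly what an upstream option for ProxyCommand support would need.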
Kinda:
🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ ./lagoon-cli login
Error: Post "http://api.lagoon-dev.cms.va.gov/graphql": dial tcp: lookup api.lagoon-dev.cms.va.gov: no such host
So we need that SOCKS5 proxy to cover everything.
But Go can import proxy information from an HTTP_PROXY environment variable, so:
🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ export HTTP_PROXY="socks5://127.0.0.1:2001/"
🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ ./lagoon-cli login
Token fetched and saved.
🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ ./lagoon-cli whoami
ID EMAIL FIRSTNAME LASTNAME SSHKEYS
2015e338-4c55-44f5-8217-25f77af81937 nathan.douglas@agile6.com Nathan Douglas 2
🎉
So at this point my work is unblocked and I can go find some new obstacle to slam into at high speed.
But... why does it work? At this point in my engineering career, nothing makes me more suspicious than something that Just Works™. I did not bleed enough, I did not suffer enough for this to work.
So I git stashed my changes, rebuilt the CLI, and... it still worked. I changed to a new tab (without the exported HTTP_PROXY variable), re-ran it, and... it still worked. I removed the token completely, re-ran, and... it worked.
Something is rotten in the state of Denmark. – Shakespeare Hamlet 1.4.???
After some poking around, I think that the answer was just to change my SSH connection info for Lagoon:
current: lagoon-dev
default: lagoon-dev
lagoons:
  amazeeio:
    graphql: https://api.lagoon.amazeeio.cloud/graphql
    hostname: ssh.lagoon.amazeeio.cloud
    ui: https://dashboard.amazeeio.cloud
    kibana: https://logs.amazeeio.cloud/
    port: "32222"
    token: ""
    version: ""
  lagoon-dev:
    graphql: http://api.lagoon-dev.cms.va.gov/graphql
    hostname: internal-a5db579a60ddc4d94bd3bdd6cde40ef9-1394069038.us-gov-west-1.elb.amazonaws.com
    ui: https://ui.lagoon-dev.cms.va.gov
    kibana: ""
    port: "22"
    token: <lemme 'lone>
    version: v2.1.0
updatecheckdisable: false
environmentfromdirectory: false
Then export the HTTP_PROXY. Then things seem to work and we can continue on our quest.
I still don't really understand why this works. internal-a5db579a60ddc4d94bd3bdd6cde40ef9-1394069038.us-gov-west-1.elb.amazonaws.com resolves to 10.247.x.y via dig. This address can't be pinged. It can, however, be SSH'ed to. So this sounds like an OSI layer thing. I suspect that there's some SOCKS5 setting somewhere that's getting picked up, but I don't know where it is.
The next step is to play with Lagoon via GraphQL. Unfortunately:
GraphiQL doesn't expose any sort of SOCKS proxy configuration.
Fortunately, this is precisely the sort of suffering I've come to expect in engineering.
With this command:
curl -g \
--socks5-hostname 127.0.0.1:2001 \
-X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <my-token>" \
-d '{"query":"query allProjects {allProjects {name } }"}' \
http://api.lagoon-dev.cms.va.gov/graphql
I received the expected response:
{"data":{"allProjects":[]}}
With the following query:
curl -g \
--socks5-hostname 127.0.0.1:2001 \
-X POST \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <lagoon-token>" \
-d '{"query": "mutation addKubernetes {\r\n addKubernetes(input:\r\n {\r\n name: \"lagoon-dev\",\r\n consoleUrl: \"https:\/\/4FE820642ABFA95BCB6854C69A1AF5A2.gr7.us-gov-west-1.eks.amazonaws.com\",\r\n token: \"<kubernetes-build-deploy-token>\",\r\n routerPattern: \"${environment}.${project}.lagoon-dev.cms.va.gov\"\r\n }){id}\r\n}"}' \
http://api.lagoon-dev.cms.va.gov/graphql
I got the following:
{"data":{"addKubernetes":{"id":1}}}
Which might also be a sign that things are working. I'm not 100% on the legitimacy of that build-deploy token, though. My kubectl isn't working for some reason, and so I looked up the token in Lens and base64 decoded it. If there's a permission failure after this point, I might need to set that token to the base64 encoded value instead, or something like that.
Next is creating the project:
🔔nathan.douglas@Belmore:~/Projects/lagoon-cli$ lagoon add project --gitUrl git://github.com/department-of-veterans-affairs/va.gov-cms.git --openshift 1 --productionEnvironment lagoon-dev --branches "^(master|main|VACMS-6674.*)$"
Result: success
Project Name: lagoon-dev
GitURL: https://github.com/department-of-veterans-affairs/va.gov-cms.git
and it's visible upon login:
I added this deploy key:
and deployed:
but alas:
This might be failing because the logs pods are still in ImagePullBackoff:
So it might be time for More Fun With Kubernetes™.
EDIT: Nope, just should've supplied a git:// URL instead of SSH. Sorry, hadn't read about the deploy key yet. Just making it up as I go.
That made it further:
but without logs, my ability to figure out wut's going on is obv limited, so I probably need to fix the root issue there.
So why are the logs (and only the logs) in ImagePullBackoff?
The first obstacle along the way is that kubectl stopped working. After some poking around, it appears that the same HTTP_PROXY env var that lets lagoon-cli work actually breaks kubectl EKS access. I'll press on, switching back and forth between Terminal.app tabs.
But now that I can kubectl, I can look at the failures a little more closely.
🔔nathan.douglas@Belmore:~/Projects/content-build$ kubectl get pods --all-namespaces | grep lagoon-build
lagoon-dev-master lagoon-build-wl8fej 0/1 Error 0 24m
lagoon lagoon-remote-lagoon-build-deploy-bfb74bf4-mrf66 2/2 Running 0 28h
🔔nathan.douglas@Belmore:~/Projects/content-build$ kubectl describe pod -n lagoon-dev-master lagoon-build-wl8fej
Name: lagoon-build-wl8fej
Namespace: lagoon-dev-master
<snip>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 24m default-scheduler Successfully assigned lagoon-dev-master/lagoon-build-wl8fej to ip-10-247-96-165.us-gov-west-1.compute.internal
Normal Pulling 24m kubelet, ip-10-247-96-165.us-gov-west-1.compute.internal Pulling image "uselagoon/kubectl-build-deploy-dind:latest"
Normal Pulled 24m kubelet, ip-10-247-96-165.us-gov-west-1.compute.internal Successfully pulled image "uselagoon/kubectl-build-deploy-dind:latest" in 1.14009574s
Normal Created 24m kubelet, ip-10-247-96-165.us-gov-west-1.compute.internal Created container lagoon-build
Normal Started 24m kubelet, ip-10-247-96-165.us-gov-west-1.compute.internal Started container lagoon-build
🔔nathan.douglas@Belmore:~/Projects/content-build$ kubectl logs -n lagoon-dev-master lagoon-build-wl8fej
Agent pid 33
Identity added: /home/.ssh/key (/home/.ssh/key)
+ set -eo pipefail
+ set -o noglob
+ REGISTRY=none.com
++ cat /var/run/secrets/kubernetes.io/serviceaccount/namespace
+ NAMESPACE=lagoon-dev-master
+ REGISTRY_REPOSITORY=lagoon-dev-master
++ cat /lagoon/version
+ LAGOON_VERSION=21.9.0
+ set +x
+ '[' false == true ']'
+ CI_OVERRIDE_IMAGE_REPO=
+ '[' branch == pullrequest ']'
+ /kubectl-build-deploy/scripts/git-checkout-pull.sh git://github.com/department-of-veterans-affairs/va.gov-cms.git origin/master
+ set -eo pipefail
+ REMOTE=git://github.com/department-of-veterans-affairs/va.gov-cms.git
+ REF=origin/master
+ git init .
hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: of your new repositories, which will suppress this warning, call:
hint:
hint: git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint:
hint: git branch -m <name>
Initialized empty Git repository in /kubectl-build-deploy/git/.git/
+ git config remote.origin.url git://github.com/department-of-veterans-affairs/va.gov-cms.git
+ git fetch --depth=10 --tags --progress git://github.com/department-of-veterans-affairs/va.gov-cms.git '+refs/heads/*:refs/remotes/origin/*'
fatal: unable to connect to github.com:
github.com[0: 192.30.255.112]: errno=Operation timed out
Hmm. So it looks kinda like there's an outgoing networking issue.
When consulted, Eric and Elijah nodded sadly and explained that outgoing requests to the SSH port are dropped by the TIC. And although GitHub can be SSH'ed to on port 443, this violates the spirit of TIC law and would get me yelled at.
Two solutions are:
A decision on the latter probably isn't possible until Monday, so I'm kinda blocked here.
I think I'll go back and see if I can get HTTPS cloning to work. IDK why it wouldn't, but it didn't before.
EDIT: Yeah, no, definitely still doesn't work.
I can't make much progress on the build because of the outgoing Git/SSH issue, but I can move forward with other things...
I created an EFS filesystem dsva-vagov-lagoon-dev-cms-efs and set up the EFS provisioner for it with the following command:
helm upgrade --install --create-namespace --namespace lagoon-efs-provisioner -f efs-provisioner-values.yaml lagoon-efs-provisioner stable/efs-provisioner
This created a storage class with the name lagoon-bulk. Easy enough.
That does nothing to unblock me with regard to Git/SSH, though, and I still need to figure out a couple things:
The ops team, in office hours, confirmed our suspicions that this restriction on outbound SSH is pretty legit. As such, this PoC is blocked.
We have a number of options for moving forward (h/t Cameron for typing them up):
A few of the options above could actually be addressed. I've addressed them, sorta, and will discuss.
No one has responded yet to my discussion thread about git cloning via HTTPS. However, even if they had, it wouldn't work because the DHS is MITMing the TLS.
I forked the Lagoon service images, rebuilt the kubectl image, pushed it to Docker Hub, modified the derivative kubectl-build-deploy-dind image to insert the cert, rebuilt the image, and pushed it to Docker Hub.
The second half of that was to actually alter the Lagoon configuration to use the new Docker image. I injected the override into the remote-values.yaml and updated the lagoon-remote deployment, but unfortunately the keycloak pods went into ImagePullBackoff because we'd hit the Docker pull limit.
There are some changes that need to be made to the CMS codebase as part of a move to Lagoon. I made them in #6867, although I have no way of testing them.
Per this exchange:
There is no way to move forward with this PoC.
So I got some responses on this discussion thread saying that although the build-deploy pipeline was implemented with Git/SSH in mind, that was mostly to accommodate GitHub deploy keys and that there was no real hard reason that HTTPS cloning should not work.
Who's to blame?
I mentioned that I'd injected the TIC TLS cert into a Docker container, at which point Toby pointed out that I was using an older Dockerfile -- a great catch which undoubtedly would save me some frustration.
So going back to Docker to build the new image:
#17 20.15 % Total % Received % Xferd Average Speed Time Time Time Current
#17 20.16 Dload Upload Total Spent Left Speed
100 38.3M 100 38.3M 0 0 17.7M 0 0:00:02 0:00:02 --:--:-- 17.7M
#17 22.33 % Total % Received % Xferd Average Speed Time Time Time Current
#17 22.33 Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (6) Could not resolve host: github.com
I ran into this issue which seems to plague Docker for Mac. I don't want to upgrade Docker for Mac because that's caused issues with Lando in the past. Fortunately, I have about sixty LXC containers with Docker installed, so I'll just SSH into one of them and build the image and push it from there.
Well, then SSH is hanging. I can't SSH into any of said containers, or anything else on my network. SSH works with everything else on my network... except my work computer.
I attempted to find a solution for a few minutes, but being pressed for time I ended up just switching computers, SSHing into my work computer from my personal computer, grabbing the updated Dockerfile, then SSHing into an LXC container to build the kubectl and kubectl-build-deploy-dind images. After adding my SSH pub key for that machine to GitHub. And docker login-ing.
The CMS project's URL is HTTPS, so I can attempt to deploy the branch to see where my PR (see #6867) fails:
🔔nathan.douglas@Belmore:~/Projects/lagoon-stuff$ lagoon deploy branch -p lagoon-dev -b VACMS-6674-lagoon
✔ Yes
success
Now I can log into the Lagoon UI because Keycloak is running again. It had been stuck in ImagePullBackoff because it didn't have Docker Hub credentials.
And:
🎉
It's taking longer to fail than it has before. Which is, technically, progress.
🔔nathan.douglas@Belmore:~/Projects/va.gov-cms$ kubectl get pods --all-namespaces | grep lagoon-build
lagoon-dev-vacms-6674-lagoon lagoon-build-vkn4ha 0/1 Error 0 3m12s
lagoon lagoon-remote-lagoon-build-deploy-758bd85997-vdkth 2/2 Running 0 18h
🔔nathan.douglas@Belmore:~/Projects/va.gov-cms$ kubectl logs -n lagoon-dev-vacms-6674-lagoon lagoon-build-vkn4ha
<snip>
HEAD is now at 040fe2b7 Fix webroot.
+ git submodule update --init --recursive --jobs=6
+ [[ -n '' ]]
+ '[' '!' -f .lagoon.yml ']'
++ cat .lagoon.yml
++ shyaml get-value environment_variables.git_sha false
+ INJECT_GIT_SHA=true
+ '[' true == true ']'
++ git rev-parse HEAD
+ LAGOON_GIT_SHA=040fe2b79034c9f31832b64bd9281d9188df7973
+ REGISTRY_SECRETS=()
+ PRIVATE_REGISTRY_COUNTER=0
+ PRIVATE_REGISTRY_URLS=()
+ PRIVATE_DOCKER_HUB_REGISTRY=0
+ PRIVATE_EXTERNAL_REGISTRY=0
+ set +x
User "lagoon/kubernetes.default.svc" set.
Cluster "kubernetes.default.svc" set.
Context "default/lagoon/kubernetes.default.svc" created.
Switched to context "default/lagoon/kubernetes.default.svc".
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get https://harbor.lagoon-dev.cms.va.gov/v2/: x509: certificate is valid for *.ci.cms.va.gov, *.demo.cms.va.gov, *.tugboat.vfs.va.gov, tugboat.vfs.va.gov, not harbor.lagoon-dev.cms.va.gov
So something is requesting Harbor, but doing so via HTTPS and not HTTP. Since I've not specified HTTPS anywhere, this would appear to be an issue with a script somewhere.
After doing so, I don't see any commands issued after "WARNING! Using --password via the CLI is insecure. Use --password-stdin.", which is a Docker error message. So I think the HTTPS error is from Docker attempting to log in to Harbor and failing to do so because of the certificate.
Why? Well, Docker requires some additional configuration for insecure registries -- configuration that I don't believe the kubectl-build-deploy-dind scripts perform. So I gotta do that.
The problem is that I think since this is built around Docker-in-Docker that we're using insecure registries as specified by the host, not by the container. So I think this might be doomed to fail.
LOL, I remember none of this.
Summarizing significant issues that I encountered in this PoC:
Lagoon attempts to clone repos via SSH, but SSH outbound is blocked by the TIC. We need to keep Lagoon Core in the vagov-utility VPC and corresponding EKS cluster, which means outgoing connections transit the TIC. This means Lagoon needs to support HTTPS cloning.
Lagoon's Docker image builds target Harbor, which is viewed as "insecure" from the perspective of EKS at this time, and we don't have authority to modify the relevant settings in the EKS cluster. So we might conceivably need to request and justify settings modifications to support using Harbor insecurely, or do the necessary work to make Harbor secure from the perspective of the EKS cluster.
Description
As a CMS engineer, I would like to validate that Lagoon will be sufficient for our needs so that we can begin to evaluate the value and cost-savings that Lagoon potentially offers.
Acceptance Criteria
CMS Team
Please leave only the team that will do this work selected. If you're not sure, it's fine to leave both selected.
Platform CMS Team
Sitewide CMS Team
Related #6673