Add support for Bind Module Reference

nickumia-reisys commented 2 years ago

Getting the error 'Module module.brokerpak-eks-terraform-bind contains provider configuration' in the SSB broker.

relates to https://github.com/gsa/datagov-deploy/issues/3696

mogul commented 2 years ago

Where we're at: As of the refactoring commits tonight in this branch and in the corresponding datagov-ssb repo branch, the modules no longer have any providers in them. Instead, the root module above them creates the appropriate providers for each module and injects them. terraform validate approves! However, when I run terraform apply it gets stuck because the deployments of the csi-ebs driver and the ingress-nginx helm release time out.
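A minimal sketch of that layout, for anyone following along. Every name here (module paths, aliases, variables) is hypothetical and not taken from the actual branches; the only point is the pattern of configuring providers in the root module and passing them down through the providers argument.

```hcl
# Root module: owns every provider configuration.
# All names below (aliases, module paths, variables) are hypothetical.

terraform {
  required_providers {
    aws        = { source = "hashicorp/aws" }
    kubernetes = { source = "hashicorp/kubernetes" }
  }
}

variable "region" { type = string }
variable "cluster_endpoint" { type = string }
variable "cluster_ca_certificate" { type = string }
variable "cluster_token" {
  type      = string
  sensitive = true
}

provider "aws" {
  alias  = "provision"
  region = var.region
}

provider "kubernetes" {
  alias                  = "bind"
  host                   = var.cluster_endpoint
  cluster_ca_certificate = base64decode(var.cluster_ca_certificate)
  token                  = var.cluster_token
}

module "provision" {
  source = "./provision"

  # Inject the root-level configuration. The child module contains no
  # provider blocks, only a required_providers entry for "aws".
  providers = {
    aws = aws.provision
  }
}

module "bind" {
  source = "./bind"

  # Same idea: the child only declares that it requires "kubernetes".
  providers = {
    kubernetes = kubernetes.bind
  }
}
```

With no provider blocks inside the child modules, Terraform no longer has a reason to raise the "contains provider configuration" error when a module call uses count, for_each, or depends_on.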

The timeouts are almost certainly because those workloads include DaemonSets, which cannot run on Fargate, and the managed node group (MNG) is not available to schedule them on. We'll need to review this series of commits tomorrow to see what I b0rked up:

mogul commented 2 years ago

Where we're at:

- Commenting out module.provision.module.eks' fargate_profiles configuration got everything working again.
  - We'll leave that commented out for now, though in future we may want it available, depending on whether we get everything we need from auto-scaling our managed node groups.
- Destroys are not clean.
  - For some reason, during destroy the helm provider uses the module.provision.kubernetes_service_account.admin service account we create in admin_account.tf, even though the helm provider is explicitly configured in provision-providers.tf not to use that account.
  - That service account loses the cluster-admin role binding almost immediately, so Helm fails because the account doesn't have access to secrets.
  - We could make the four helm_release resources depend on that binding to get around this, but that's not solving the real problem: Why isn't the helm provider using the configured AWS creds and generated tokens!?
  - If you work around that, destroy of the helm_release ingress_nginx sometimes gets stuck in the uninstalling state. 🙄 I discovered that the helm provider has a debug=true argument that will hopefully help us figure this out once and for all.
  - And if you get around that, of course, the NLB and corresponding target groups never get deleted by alb_controller, which means the certificate doesn't get destroyed, which means the VPC can't be deleted.
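For reference, the depends_on workaround and the debug flag mentioned above could look roughly like this. It is only a sketch: the binding's resource name, the namespace, and the chart settings are assumptions, and only kubernetes_service_account.admin and helm_release.ingress_nginx are names that appear in this thread.

```hcl
# Hypothetical sketch; names other than kubernetes_service_account.admin
# and helm_release.ingress_nginx are invented for illustration.

provider "helm" {
  # Ask the helm provider for verbose output to help diagnose stuck
  # uninstalls. Cluster connection settings are omitted here.
  debug = true
}

resource "kubernetes_service_account" "admin" {
  metadata {
    name      = "admin"
    namespace = "kube-system" # assumed namespace
  }
}

resource "kubernetes_cluster_role_binding" "admin" {
  metadata {
    name = "admin-cluster-admin" # assumed name
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "cluster-admin"
  }

  subject {
    kind      = "ServiceAccount"
    name      = kubernetes_service_account.admin.metadata[0].name
    namespace = kubernetes_service_account.admin.metadata[0].namespace
  }
}

resource "helm_release" "ingress_nginx" {
  name       = "ingress-nginx"
  repository = "https://kubernetes.github.io/ingress-nginx"
  chart      = "ingress-nginx"
  namespace  = "kube-system" # assumed namespace

  # Terraform destroys in reverse dependency order, so this release is
  # uninstalled before the role binding it depends on is removed.
  depends_on = [kubernetes_cluster_role_binding.admin]
}
```

This only controls ordering during destroy. It does not explain why the helm provider would pick up the service account in the first place, which is the real question raised above.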

mogul commented 2 years ago

The terraform development workspace (and corresponding AWS account for ssb-development) currently contains the result of running:

$ docker-compose --env-file=.env.${ENV_NAME}.secrets run --rm terraform apply -target=module.brokerpak-eks-terraform -var-file=terraform.${ENV_NAME}.tfvars

Remove the -target argument to let it run to completion (e.g., setting up the user-provided service and running the ssb-solrcloud app). If we're satisfied with how this works, then technically we can roll it out to the staging and production workspaces. We don't expect to tear it down in those environments any time soon, and we can clean up by hand if needed.
Whether or not we do that, we should figure out how to get smooth destroys working, because we have to do that in any case once we get back to wrapping that operation in the broker.

mogul commented 2 years ago

I think we may finally be unable to avoid the "provider configuration can't be deferred" problem, and it's that undefined provider behavior that we're seeing here. 😞

[Edit: Nahhh, I think we can hold out a little longer.]

mogul commented 2 years ago

> Destroys are not clean.
>
> For some reason during destroy the helm provider makes use of the module.provision.kubernetes_service_account.admin service account we create in admin_account.tf, despite the helm provider explicitly not using that account when it's configured in provision-providers.tf.
>
> That service account loses the cluster-admin role binding almost immediately, so Helm fails because the account doesn't have access to secrets.
>
> We could make the four helm_release resources depend on that binding to get around this, but that's not solving the real problem: Why isn't the helm provider using the configured AWS creds and generated tokens!?

We figured this problem out.

mogul commented 2 years ago

Lots of teensy bugs fixed along the way to this point:

$ make demo-run
Testing aws-eks-service:raw:instance-bmogilefsky3:binding
To work directly with the instance:
export KUBECONFIG=/tmp/tmp.pAidiG73VS
export DOMAIN_NAME=instance-bmogilefsky3.ssb-dev.data.gov
Running tests...
Deploying the test fixture...
deployment.apps/deployment-2048 created
service/service-2048 created
ingress.networking.k8s.io/ingress-2048 created
Waiting 3 minutes for the workload to start and the DNS entry to be created...
Testing that the ingress is resolvable via SSL, and that it's properly pointing at the 2048 app...curl: (52) Empty reply from server
make: *** [Makefile:109: demo-run] Error 1

That's the next thing to look into.

mogul commented 2 years ago

Oh hey, IT PASSED! Maybe there was just some cruft left in place from my last local invocation of test with an instance of the same name that prevented it from passing locally.

GSA-TTS / datagov-brokerpak-eks

Add support for Bind Module Reference #79