Create a CI/CD workflow for contributions and periodic testing of the default branch

displague commented 4 years ago

This project is very complex, spanning the creation and provisioning of physical hardware, virtual networking, operating systems, licensed software installations, with Anthos and Kubernetes clusters eventually coming out of the 1h+ provisioning process.

In order to allow this project to evolve quickly while encouraging users to safely depend on this module, we must introduce continuous integration and continuous delivery practices.

PRs should be verifiable using a CI/CD that does the following things:

[x] builds the full environment
[x] tears down the full environment (including, and especially, when the tests fail)
[x] tear down (sweepers) should be run at the start of each build job to ensure that deletions are successful
- sweepers look for a predefined resource name prefix and/or the presence of well-known tags
- we could benefit from a generic all-purpose sweeper script, possibly leveraging the packet-cli here
- an all-purpose sweeper could be one that removes all resources from a given project (by name), the packet api does not allow for the deletion of a project until all resources are deleted within it.
[ ] the build pipeline must be limited to one build at a time
- so sweepers do not remove infrastructure needed for new tests
- so provisioned hardware is kept to a minimum
[x] Merges are blocked until the CI has completed successfully (this can be skipped in cases, like documentation changes)

Errors may be transparent to users in an otherwise successful Terraform build. The build phase must verify that Terraform and the resulting environment are in working order:

[x] Verifies that terraform apply succeeded
[x] Verifies that all scripts succeeded (this could be configured within each script and terraform provisioner error handling)
[x] Verifies that the ESXi environment is healthy
[ ] Verifies that the Anthos host environment is healthy
[ ] Verifies that Anthos guest environments are created successfully

For now, CD means:

[x] the default branch is also tested when updated.
[x] the terraform module is tagged and published so that users may safely and easily pin to a previous version of this project

Additionally, this requires:

[x] Tagged releases
[x] Opinionated pinning of the project dependencies
- For example, the Packet Provider version should be pinned in each release. We can trust that pessimistic pinning on the project is possible, that all breaking changes will result in a new major version.
- [x] Renaming the project to publish it as a module (terraform-packet-anthos or terraform-packet-google-anthos)

displague commented 4 years ago

There are some light Packet TF / CI examples here:

dfong commented 4 years ago

how can i contribute to this enhancement?

displague commented 4 years ago

@dfong That would be great. Do you have experience configuring GitHub Workflows? There are a few other Terraform projects that I want to get wired up with this same level of testing, but none of them are at the level described in this issues, yet.

I plan to learn from those experiences, here or in the other projects, and take the learning to the other projects.

dfong commented 4 years ago

i have used travis once, in a simple project, that's it. i have a lot of general experience but not with github CI.

i have some preliminary questions and suggestions.

what build tool to use? i think travis plus some shell scripts would be sufficient.
how will you handle billing? will packet provide an account and org that we can use?
how will you handle secrets and keys? i would suggest having a master encryption/decryption key, to be stored in the keystore. then all other secrets could be stored in encrypted form, as ordinary files in git.
how will you handle packet provisioning failures? in my own CI efforts, i am seeing random failure rate of 25% and higher. i would suggest automatic retries.
how will you handle google-anthos "random" failures? again it seems retries will be needed. if the terraform scripts can be made more idempotent, that would help too.
how will you handle capacity issues? again i think retries (along with automatically trying different facilities) will be necessary.
how will you handle multiple configurations? testing against a single configuration will not be sufficient. i would suggest trying to run several trials in parallel, otherwise it'll take too long.
will you do a full teardown of the cluster each time? in a prior meeting, you packet folks suggested (i think) trying to keep one VM active so it wouldn't have to be re-provisioned each time. i myself lack sufficient knowledge about google-anthos to implement this.

displague commented 4 years ago

what build tool to use? i think travis plus some shell scripts would be sufficient.

We've been using Github Workflows in recent projects. I would prefer to see us take this route here.

There is a fairly detailed example here: https://www.terraform.io/docs/github-actions/setup-terraform.html

how will you handle billing? will packet provide an account and org that we can use?

Since this is in a Packet org, CI in this environment would have a secret associated with a Packet authorized testing account. This account would have reduced access to resources based on the expected usage and the risks for abuse.

I believe that vSphere and Anthos keys would also need to be added (the same tokens that are needed in the terraform variables).

I believe one of the features of Github Workflows is that you can run the Workflow with your own secrets from your fork of the project.

how will you handle secrets and keys? i would suggest having a master encryption/decryption key, to be stored in the keystore. then all other secrets could be stored in encrypted form, as ordinary files in git.

https://docs.github.com/en/actions/configuring-and-managing-workflows/creating-and-storing-encrypted-secrets

how will you handle packet provisioning failures? in my own CI efforts, i am seeing random failure rate of 25% and higher. i would suggest automatic retries.

In CI systems like this, there tend to be flows that can be run after the build, regardless of the success of that build.

For every run, a new project should be created (the packet_project and random_pet resources should enable this.

In this project, terraform destroy would need to be run.

I don't trust that terraform destroy will capture every mishap. I expect that what we will want in the end is some packet-cli or a stand-alone packngo client command that will delete every resource in a project. In terraform providers, this is called a sweeper, and it deletes everything matching a certain naming prefix, tag, or project id.

Let's assume that an outage will occur somewhere in the path during a CI test. It is important for some process to delete any remnant resources in a future invocation.

We may also benefit by using a terraform backend, since the state of failed environments is otherwise locked in Github Workflows. (I don't know how accessible this is. In Jenkins, for example, you can access the files of each build within some retention period.)

how will you handle google-anthos "random" failures? again it seems retries will be needed. if the terraform scripts can be made more idempotent, that would help too.

For real-world use cases, idempotency would help us resolve problems like random failures. In a CI test, a random failure is a failure. I don't think we should retry without surfacing the build failure.

The tests will need to confirm that Terraform succeeds and that the provisioned environments (vsphere and anthos) can perform the tasks we expect of them. I think it would be safe to keep vsphere testing in a separate vsphere modules. The anthos project would only pin to modules of the packet-vsphere module that passed their tests.

how will you handle capacity issues? again i think retries (along with automatically trying different facilities) will be necessary.

The preflight script that checks for availability could help our tests fail early, for example: https://github.com/packet-labs/google-anthos/compare/master...displague:check-capacity-preflight?expand=1

Some other resources would need to depend on the "preflight-checks" resource so that it could be run first.

how will you handle multiple configurations? testing against a single configuration will not be sufficient. i would suggest trying to run several trials in parallel, otherwise it'll take too long.

We can configure this with different build parameters. Once we have one job working, we can tell GitHub Workflows to run more instances of that build, each with different parameters.

will you do a full teardown of the cluster each time? in a prior meeting, you packet folks suggested (i think) trying to keep one VM active so it wouldn't have to be re-provisioned each time. i myself lack sufficient knowledge about google-anthos to implement this.

I wouldn't want to keep any bare metal active when the tests are not active. If we need persistent storage, we can use an object storage bucket whose secrets reside in Github workflows. The anthos terraform module tests can use a persistent bucket to store the vsphere ISO, and unique bucket keys to store terraform state for each test.

displague commented 4 years ago

I got a little test repo working, https://github.com/displague/terraform-random-petflow. I don't require any secrets or cloud infrastructure here. I just wanted something to experiment with.

displague commented 4 years ago

I'll start by creating a PR to add the basic workflow, validating and checking the format of terraform files. We can build on this.

rainleander commented 3 years ago

Is it possible to get this sorted in the next week? If not, how can I help?

rainleander commented 3 years ago

How can I help with this?

displague commented 3 years ago

Hey @rainleander!

This project has dependencies on GCP creds, VMWare images, EM creds, and perhaps other things.

After #113, I think we can take on #122. We would be in a better position to setup CI/CD on this project at that time.

I've got a model for GitHub Workflows that I'm pushing through some of the other terraform-metal- modules.

https://github.com/equinix/terraform-metal-multiarch-k8s/pull/67 includes a lot of the steps we'll need for this.

We should get CI/CD running on https://github.com/equinix/terraform-metal-vsphere first since that project will be a dependency of this project. https://github.com/equinix/terraform-metal-vsphere/issues/9

displague commented 3 years ago

Closing this as most of the checkboxes have been checked.

We can continue to iron out E2E test success in #125.

equinix / terraform-equinix-metal-anthos-on-vsphere

Create a CI/CD workflow for contributions and periodic testing of the default branch #91