cncf-tags / green-reviews-tooling

Project Repository for the WG Green Reviews which is part of the CNCF TAG Environmental Sustainability
https://github.com/cncf/tag-env-sustainability/tree/main/working-groups/green-reviews
Apache License 2.0

[Action] Bootstrap Kubernetes cluster with IaC tooling #1

Closed nikimanoledaki closed 7 months ago

nikimanoledaki commented 11 months ago

Cluster API

We may want to use the Equinix Metal Cluster API Provider (CAPEM) for our cluster bootstrapping on the community cluster. Alternatives such as Ansible or Firecracker microVMs are being considered, to work with Falco's setup: https://github.com/cncf/tag-env-sustainability/issues/182

Requirements

The cluster requirements are listed in the design doc.

Equinix infrastructure access

This issue will help us know more about the kind of access that will be needed for individual contributors to the infra. Please see this for some of the available options, and follow up in that thread with any questions/issues.

Documentation

We should document this process as we go.

Development environment

Dev environment setup tracked in this issue: https://github.com/cncf-tags/green-reviews-tooling/issues/3

rossf7 commented 11 months ago

@nikimanoledaki I've been investigating using CAPI / CAPEM, and I think our lack of access to a permanent management cluster is a problem.

In the Equinix docs they have an alternative approach intended for management clusters using K3s managed by Pulumi. https://deploy.equinix.com/developers/guides/k3s-management-plane/

I've added more detail in the design doc. PTAL

Do you think this is a good direction?

nikimanoledaki commented 11 months ago

Update: @rossf7 and I discussed this and wrote a summary here. We will have more information after today's WG Green Reviews meeting with the Falco maintainers.

rossf7 commented 10 months ago

Update following the WG meeting and discussions I had since with @nikimanoledaki

We think an important factor is how we isolate between test runs to have accurate results.

We think we’ll need an IaC tool to manage the management cluster if we use CAPEM or the whole cluster if we don’t use CAPEM. At the moment we’re leaning toward Ansible for that. https://github.com/equinix/ansible-collection-metal

However, we both think it's important to continue work on the pipeline design. We can then design the cluster topology to support the pipeline rather than the other way round.

Lastly I added some notes to the design doc for https://github.com/cncf-tags/green-reviews-tooling/issues/1#issuecomment-1749304100 on K3s / Pulumi that are now outdated and would have been better here. 🤦‍♂️

I've removed them and updated the CAPI / CAPEM section. Sorry about that!

AntonioDiTuri commented 10 months ago

WG-meeting-recap:

We had thoughtful discussions and actionable steps directed towards the main objective of developing an end-to-end proof of concept concerning the Working Group Green Review. The core focus is on manually (no automation for the moment) measuring Falco using Kepler, an initiative encapsulated under this milestone.

In the design doc you can find a first draft of the workflow:

We also thought of testing Falco in two ways:

The roadmap ahead is well-defined with practical next steps, which will be documented soon as issues under the designated milestone. Niki will check how to give contributors access to the cluster. After that, we will start testing and documenting the installation of every component of the workflow (Kepler, Falco, workload). Once the end-to-end PoC is done, we will think about how to add some automation.

Thanks to @rossf7 for his invaluable documentation on manual Equinix cluster creation, which is open for comments on his pull request. Thanks to @nikimanoledaki for her exceptional coordination efforts, driving this project forward.

The team can’t wait for the initial measurements, let’s continue to collaborate and innovate as always!

dipankardas011 commented 10 months ago

If any help is needed I am happy to contribute 👍🏼

rossf7 commented 10 months ago

I added some more detail in the design doc.

We want to use an IaC tool we can run in a GitHub action. Ansible and OpenTofu have both been discussed. I'd be fine with using Ansible (although full disclosure I don't have much experience with it).

We will need to provision the control plane and worker nodes as Equinix servers and they have integrations for Ansible and Terraform.

For each server we need to configure user_data that will bootstrap

For provisioning Kubernetes we could use Kubeadm. Unless anyone can suggest a better approach?

@dipankardas011 help with this would be much appreciated. I think we first need to agree on the design. Would you like to work on that?

cc @nikimanoledaki @guidemetothemoon @leonardpahlke @AntonioDiTuri

dipankardas011 commented 10 months ago

@rossf7 Sure, I will give it a try. My follow-up question: is this design in any way different from what you expected?

dipankardas011 commented 10 months ago

Also, I'm not sure what is meant by the design: is it deciding on the infrastructure code part, or is it just the diagrams?

rossf7 commented 10 months ago

@dipankardas011 If you would like to investigate how you would do the Infrastructure as Code part that would be great.

But please don't spend too much time on it until we've heard from the rest of the team.

I'm happy to help with the Equinix Metal integration as I've worked with their infra quite a bit.

dipankardas011 commented 10 months ago

Okay, I will create a basic diagram of the workflow.

dipankardas011 commented 10 months ago

Should I create it in Excalidraw or draw.io? Which one would be more comfortable for you all?

dipankardas011 commented 10 months ago

Here is my first iteration: https://gitlab.com/dipankardas011/draw.io/-/tree/main/CNCF%20WG%20Green%20Review

AntonioDiTuri commented 10 months ago

I cannot access it. It says I don't have the permissions.

dipankardas011 commented 10 months ago

fixed the link

rossf7 commented 10 months ago

Hi @dipankardas011, thanks, the diagram is looking good!

In the diagram OpenTofu (Terraform) is used to provision the Equinix servers and Ansible is used to provision Kubernetes with Kubeadm. Do you think we could use a single tool for both? Or are there advantages to using separate tools?

For the GitOps part you have this described as "GitOps for CNCF projects". I think this should be "GitOps for pipeline components". Could you update that?

This is because we want to use Flux to manage the components that should always be running, like Prometheus. The CNCF projects like Falco, and any test workloads specific to them, will be managed by the pipeline.

dipankardas011 commented 10 months ago

> In the diagram OpenTofu (Terraform) is used to provision the Equinix servers and Ansible is used to provision Kubernetes with Kubeadm. Do you think we could use a single tool for both? Or are there advantages to using separate tools?

In my experience, we can add the script in the user_data section when we provision the infra (with IaC tools) and then configure it using Ansible.

I think this method involving two tools is good when the infra needs to be reconfigured often. But in our case, as it's mostly a one-time declaration, we can add it to the user_data section.

Another issue I have seen is that if an error occurs in the user_data script, we don't get any signal that it occurred. Just wanted to point that out.

dipankardas011 commented 10 months ago

> For the GitOps part you have this described as "GitOps for CNCF projects". I think this should be "GitOps for pipeline components". Could you update that?

Yes

dipankardas011 commented 10 months ago

> > For the GitOps part you have this described as "GitOps for CNCF projects". I think this should be "GitOps for pipeline components". Could you update that?
>
> Yes

Updated!

rossf7 commented 10 months ago

> In my experience, we can add the script in the user_data section when we provision the infra (with IaC tools) and then configure it using Ansible.
>
> I think this method involving two tools is good when the infra needs to be reconfigured often.

Yes, exactly that, the script in the user_data can run the IaC tool. I agree using two tools makes sense, provided we can use the Equinix Terraform module with OpenTofu.

> Another issue I have seen is that if an error occurs in the user_data script, we don't get any signal that it occurred. Just wanted to point that out.

Good catch 👍 We will need to handle that. We have some contacts at Equinix, so we can ask them for guidance if needed.
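
One possible way to surface user_data failures, as a minimal sketch only: wrap the bootstrap script so it stops on the first error and records the outcome in a status file that a later check (for example over SSH) could read. The `wrap_user_data` helper and the `/var/run/bootstrap-status` path are hypothetical names used for illustration and are not taken from the PR.

```python
def wrap_user_data(body: str, status_file: str = "/var/run/bootstrap-status") -> str:
    """Return a bash user_data script that records success or failure.

    Assumption: something outside this script reads status_file later;
    the wrapper itself only writes the marker.
    """
    lines = [
        "#!/bin/bash",
        # Exit on errors, unset variables, and failures inside pipelines.
        "set -euo pipefail",
        # On any failing command, record the line number before exiting.
        f"trap 'echo \"failed at line $LINENO\" > {status_file}' ERR",
        body.rstrip("\n"),
        # Only reached if every command above succeeded.
        f"echo ok > {status_file}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder bootstrap command, purely for illustration.
    print(wrap_user_data("apt-get update"))
```

The marker file only helps if the pipeline checks it, so this complements rather than replaces any guidance from Equinix.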

rossf7 commented 10 months ago

As suggested by @nikimanoledaki we could use this directory structure with the IaC code under infrastructure and the Kubernetes manifests under clusters managed by Flux.

├── infrastructure
│   └── equinix-metal
├── clusters
│   └── production

See https://github.com/cncf-tags/green-reviews-tooling/issues/5#issuecomment-1787024446

rossf7 commented 10 months ago

I did a spike to investigate this and I've created a WIP PR to get feedback https://github.com/cncf-tags/green-reviews-tooling/pull/6

Dipankar, I think the original design you proposed, using OpenTofu / Terraform to manage the Equinix infra and Ansible to provision Kubernetes, is good. I don't see a benefit to using Ansible to manage both.

OpenTofu has a GitHub Action that works well and I think does everything we need: https://github.com/opentofu/setup-opentofu

I'm using an S3 bucket to store the state. It looks like we can request an S3 bucket and credentials via servicedesk?

@dipankardas011 Would you like to work on the Ansible playbook?

@nikimanoledaki @guidemetothemoon @leonardpahlke @AntonioDiTuri Please take a look at the PR when you have time.

Leo / Niki no worries if that is after KubeCon!

dipankardas011 commented 10 months ago

> @dipankardas011 Would you like to work on the Ansible playbook?

Okay then we can use the user_data section 👍

rossf7 commented 10 months ago

Discussed with @wrkode and @dipankardas011 in the WG slack channel. We think there may be some advantages to using K3s instead of Kubeadm.

It makes it easier to provision the cluster and we could run the K3s steps in the user_data of the TF code so we wouldn't need Ansible. It is also a lighter distribution meaning the energy consumption of the cluster should be reduced.

The main challenge is we need to get the K3S_TOKEN from the control plane node and pass it to the worker nodes. Dipankar has experience doing this from working on ksctl which supports k3s https://github.com/kubesimplify/ksctl
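
One way to sidestep fetching the token from the control plane, sketched under assumptions: generate the token up front and inject the same value into the user_data of both the server and the agents. This swaps "read the token from the control plane node" for a pre-shared token, which the thread does not confirm as the chosen approach. The function names and the `10.0.0.10` address are hypothetical; the install commands follow the documented `get.k3s.io` script with the `K3S_TOKEN` and `K3S_URL` environment variables.

```python
import secrets

INSTALL = "curl -sfL https://get.k3s.io"


def server_user_data(token: str) -> str:
    """user_data for the control plane node, started with a known token."""
    return f"#!/bin/bash\n{INSTALL} | K3S_TOKEN={token} sh -s - server\n"


def agent_user_data(token: str, server_ip: str) -> str:
    """user_data for a worker node joining the server with the same token."""
    return (
        "#!/bin/bash\n"
        f"{INSTALL} | K3S_URL=https://{server_ip}:6443 K3S_TOKEN={token} sh -\n"
    )


if __name__ == "__main__":
    # Assumption: in practice the token would come from a repository secret
    # rather than being generated on every run.
    token = secrets.token_hex(32)
    server_script = server_user_data(token)
    agent_script = agent_user_data(token, "10.0.0.10")  # placeholder address
    print(len(server_script), len(agent_script))
```

The token appears in plain text in user_data with this approach, so it would need to be treated as a secret wherever the rendered scripts are stored or logged.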

wrkode commented 10 months ago

> Discussed with @wrkode and @dipankardas011 in the WG slack channel. We think there may be some advantages to using K3s instead of Kubeadm.
>
> It makes it easier to provision the cluster and we could run the K3s steps in the user_data of the TF code so we wouldn't need Ansible. It is also a lighter distribution meaning the energy consumption of the cluster should be reduced.
>
> The main challenge is we need to get the K3S_TOKEN from the control plane node and pass it to the worker nodes. Dipankar has experience doing this from working on ksctl which supports k3s https://github.com/kubesimplify/ksctl

We can also use the k3s shell to bring up the cluster; this will pass the tokens and stand up the workers.

leonardpahlke commented 10 months ago

I can take a look at this early next week.

nikimanoledaki commented 9 months ago

This should be unblocked once we get AWS access to use an S3 bucket: https://github.com/cncf-tags/green-reviews-tooling/issues/8

rossf7 commented 9 months ago

The PR is updated with user_data to provision K8s with K3s, added by @dipankardas011. The next step is installing Cilium for CNI using its Helm chart.

Once we have the AWS credentials for S3 we can add the secrets to the repo. There is an extra secret needed for the K3S_AGENT_TOKEN.

dipankardas011 commented 9 months ago

> The PR is updated with user_data to provision K8s with K3s, added by @dipankardas011. The next step is installing Cilium for CNI using its Helm chart.
>
> Once we have the AWS credentials for S3 we can add the secrets to the repo. There is an extra secret needed for the K3S_AGENT_TOKEN.

Helm install script for Cilium: https://docs.cilium.io/en/stable/installation/k8s-install-helm/
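
As a rough sketch of what that Helm step could look like when scripted, the helper below builds the two basic commands from the linked guide (add the chart repository, then install the chart). The `cilium_helm_commands` name is hypothetical, no chart values are set, and any K3s-specific preparation described in the Cilium docs is not shown.

```python
import shlex


def cilium_helm_commands(version=None, namespace="kube-system"):
    """Build the Helm commands as argument lists (nothing is executed)."""
    install = ["helm", "install", "cilium", "cilium/cilium", "--namespace", namespace]
    if version:
        # Pinning a chart version is optional; none is assumed here.
        install += ["--version", version]
    return [
        ["helm", "repo", "add", "cilium", "https://helm.cilium.io/"],
        install,
    ]


if __name__ == "__main__":
    for cmd in cilium_helm_commands():
        # Print shell-quoted commands so they can be pasted into user_data.
        print(shlex.join(cmd))
```

Returning argument lists keeps the commands easy to test and to pass to `subprocess.run` later if the step is automated.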

kvendingoldo commented 5 months ago

By the way, you can also integrate tenv, which supports Terraform as well as OpenTofu (and Terragrunt :) ) in one tool. It allows you to simplify version management.