cncf-tags / green-reviews-tooling

Project Repository for the WG Green Reviews which is part of the CNCF TAG Environmental Sustainability
https://github.com/cncf/tag-env-sustainability/tree/main/working-groups/green-reviews
Apache License 2.0

[Automated/Tracking] Manage cluster components using a GitOps approach with Flux #5

Closed rossf7 closed 8 months ago

rossf7 commented 11 months ago

Cluster Management

We want to use a GitOps approach for the components running in the cluster using Flux. This is for the minimal set of components that should always be running to support the pipeline.

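This is so it is

  • Clear to all participants which components and versions are running in the cluster
  • Easier to contribute to technical tasks by submitting pull requests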

The pipeline is responsible for installing applications that are to be measured, e.g. Falco.

Requirements

The components to be installed are listed in the design doc

Phase 1: Base-level cluster components (MVP)

Phase 2: Gather idle metrics for Falco

Phase 3: Gather load-test metrics

More may be added as we continue to develop the pipeline.

Documentation

We should document this process as we go.

nikimanoledaki commented 11 months ago

This is great, thank you for opening the issue & listing the initial components!

As an example, I used Flux to deploy Kepler in this repo: https://github.com/nikimanoledaki/sustainability-journey-with-gitops

I ran the following bootstrap command to install and bootstrap Flux and specify that it should reconcile the repo's clusters/ dir:

curl -s https://fluxcd.io/install.sh | sudo bash
flux bootstrap github --owner=$GITHUB_USER --repository=green-reviews-tooling --path=clusters

Thankfully past me documented the steps in the README! 👍

At the time there wasn't a Helm Chart so I used Kustomize to deploy the k8s manifests but we should change that to use the Helm Chart, as you said :)
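As a rough sketch of what the Helm-based version could look like, the script below writes a Flux HelmRepository and HelmRelease for Kepler into clusters/. The file name kepler.yaml, the chart repository URL, the target namespace, and the API versions are assumptions, not something confirmed in this thread; the API versions in particular depend on the installed Flux release.

```shell
#!/usr/bin/env bash
# Sketch: write a Flux HelmRepository + HelmRelease for Kepler into clusters/.
set -euo pipefail

# REPO_ROOT is a hypothetical variable; it defaults to the current directory.
REPO_ROOT="${REPO_ROOT:-$PWD}"
mkdir -p "$REPO_ROOT/clusters"

cat > "$REPO_ROOT/clusters/kepler.yaml" <<'EOF'
---
apiVersion: source.toolkit.fluxcd.io/v1   # assumed; older Flux releases use a beta version
kind: HelmRepository
metadata:
  name: kepler
  namespace: flux-system
spec:
  interval: 1h
  # Assumed chart repository URL; confirm against the Kepler docs.
  url: https://sustainable-computing-io.github.io/kepler-helm-chart
---
apiVersion: helm.toolkit.fluxcd.io/v2     # assumed; older Flux releases use a beta version
kind: HelmRelease
metadata:
  name: kepler
  namespace: flux-system
spec:
  interval: 30m
  targetNamespace: kepler                 # assumed namespace
  install:
    createNamespace: true
  chart:
    spec:
      chart: kepler
      sourceRef:
        kind: HelmRepository
        name: kepler
EOF
```

Once committed under the path Flux reconciles, this would replace the Kustomize-based manifests for Kepler.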

Docs for bootstrapping Flux with a GitHub repo: https://fluxcd.io/flux/installation/bootstrap/github/

Note: We'll also need to export a GitHub token before running the bootstrap command - will create and send it to you privately!

export GITHUB_TOKEN=<gh-token>
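
A small wrapper can fail early when the variables used above are missing, instead of letting the bootstrap fail partway. This is only a sketch: the function names require_env and bootstrap_flux and the FLUX override variable are hypothetical helpers, while the flux bootstrap github flags are the ones from the command above.

```shell
#!/usr/bin/env bash
# Sketch: check credentials before bootstrapping Flux.
set -euo pipefail

# Returns non-zero (with a message) when the named variable is unset or empty.
require_env() {
  local name="$1"
  if [ -z "${!name:-}" ]; then
    echo "error: $name is not set" >&2
    return 1
  fi
}

# Runs the same bootstrap command as above.
# FLUX is a hypothetical override (e.g. FLUX=echo) to preview the arguments.
bootstrap_flux() {
  require_env GITHUB_USER || return 1
  require_env GITHUB_TOKEN || return 1
  "${FLUX:-flux}" bootstrap github \
    --owner="$GITHUB_USER" \
    --repository=green-reviews-tooling \
    --path=clusters
}
```

Calling bootstrap_flux after exporting both variables runs the bootstrap; prefixing the call with FLUX=echo only prints the arguments that would be passed.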

nikimanoledaki commented 11 months ago

We should also think about having multiple environments. I'm looking at the Flux docs on structuring repositories for guidance. Here are some ideas - they might not all be viable 🤔


Components/apps

Here is an initial idea that we can iterate on to deploy the individual components/apps:

├── apps
    ├── production
    └── development

apps/production would include:

apps/development could be for the manual pipeline that includes the above as well as Falco & the demo workload. In production, this would ideally be configured and maintained by the project maintainers.
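
To make the layout concrete, here is a minimal scaffold, assuming plain Kustomize overlays where development builds on top of production. The kustomization.yaml contents are placeholders (the actual component list is not decided here), and REPO_ROOT is a hypothetical variable.

```shell
#!/usr/bin/env bash
# Sketch: scaffold apps/production and apps/development as Kustomize directories.
set -euo pipefail

REPO_ROOT="${REPO_ROOT:-$PWD}"
mkdir -p "$REPO_ROOT/apps/production" "$REPO_ROOT/apps/development"

# Production: the components that should always be running.
cat > "$REPO_ROOT/apps/production/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Placeholder: list the base components here before reconciling.
resources: []
EOF

# Development: everything in production, plus the manual-pipeline extras.
cat > "$REPO_ROOT/apps/development/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../production
  # Placeholder: Falco and the demo workload would be added here.
EOF
```

Referencing ../production from development keeps the two environments from drifting apart, at the cost of coupling them; separate copies would be the alternative.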


Infrastructure & cluster provisioning

We could potentially add the cluster and/or infrastructure provisioning as well:

├── infrastructure
│   └── equinix-metal
├── clusters
│   └── production

I'm not sure how well that would work with Ansible and/or OpenTofu. Previously, Terraform worked with the Flux TF Controller, but I don't know if there is a similar integration with OpenTofu. I'm also not sure if Flux would be necessary with Ansible since that is already an IaC tool (but I have not worked with Ansible before so I'm not sure). Lots of questions here.


CNCF Projects

An idea for how we could deploy CNCF Projects:

├── cncf-projects
    ├── falco
    └── <next-project>

Each project could use Kustomize to point to the upstream configuration that is maintained by CNCF Project maintainers. However I'm not sure how/if that works with Ansible configuration. 🤔 The alternative would be to do the self-hosted GitHub Action runners that project maintainers can use directly.
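
One way the "point to upstream" idea could look is a Kustomize remote target. The sketch below assumes the Falco repository URL and ./kustomize path that come up later in this thread, and that remote bases are allowed in the cluster's Flux setup; the falco namespace and REPO_ROOT variable are also assumptions.

```shell
#!/usr/bin/env bash
# Sketch: a cncf-projects/falco overlay that pulls manifests from upstream.
set -euo pipefail

REPO_ROOT="${REPO_ROOT:-$PWD}"
mkdir -p "$REPO_ROOT/cncf-projects/falco"

cat > "$REPO_ROOT/cncf-projects/falco/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: falco
resources:
  # Remote target syntax: <repo-url>//<path>?ref=<branch>
  # Assumed upstream repository and path, maintained by the project.
  - https://github.com/falcosecurity/cncf-green-review-testing//kustomize?ref=main
EOF
```

With this shape the project maintainers own the manifests upstream and this repo only pins which ref is deployed.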

AntonioDiTuri commented 11 months ago

I would be up for taking this one on. Can I get it assigned to me? @nikimanoledaki Should I ask you for the GitHub token? I also wanted to ask what the final output should be: would a pull request with the needed folder structure and the steps followed to install Flux do?

rossf7 commented 11 months ago

@nikimanoledaki I like that directory structure with the environments and cncf-projects.

Also +1 for having the IaC code under infrastructure. I'll add a note to #1. The IaC code will need to bootstrap Flux, so we might run into a chicken-and-egg problem, but it would be nice to use Flux if we can.

@AntonioDiTuri Thanks, I think a pull request would be good and then depending on where we are with the IaC issue we can see how to integrate both workstreams.

nikimanoledaki commented 11 months ago

Should I ask you for the GitHub token?

This is a good question. I'm not sure how we should manage this! There are risks if we use our own personal access tokens since the token needs repo-wide access. Any leak or sharing with other folks could give access to private repos that the user has access to.

A bot account could be an option. We would need to request this from the CNCF. Maybe there is one already.

Do you have any other ideas? 🤔

leonardpahlke commented 10 months ago

We should also think about having multiple environments

I would advocate for making everything dev for now (there is no production yet).

leonardpahlke commented 10 months ago

We don't use personal access tokens in this project. We will go through the org instead. I will take a look at this after KubeCon.

nikimanoledaki commented 10 months ago

We should also think about having multiple environments

I would advocate for making everything dev for now (there is no production yet).

Regarding this: we currently have the manual testing workflow (dev), and we will have the automated process later (prod). If dev/prod is misleading, we could rename these environments to something like manual/automated. I think it's worth planning for both in our repository structure. What do you all think? :) Let me know if I may be missing or misunderstanding something.

nikimanoledaki commented 10 months ago

Created this issue to request a PAT and unblock this: https://github.com/cncf-tags/green-reviews-tooling/issues/7

nikimanoledaki commented 9 months ago

We have a fine-grained PAT - anyone who needs this can message @leonardpahlke or me (and the new leads soon!) 👍

nikimanoledaki commented 9 months ago

Heads-up that there is some progress on the Falco side, thanks to @incertum, on creating the repo that will contain the DaemonSet/ConfigMaps needed to deploy Falco: https://github.com/falcosecurity/evolution/issues/345

After that, we can add ./clusters/falco.yaml with the following:

---
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: falco-cncf-green-reviews-testing
  namespace: flux-system
spec:
  interval: 1m0s
  ref:
    branch: main
  url: https://github.com/falcosecurity/cncf-green-review-testing
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: falco-cncf-green-reviews-testing
  namespace: flux-system
spec:
  interval: 30m0s
  path: ./kustomize
  prune: true
  retryInterval: 2m0s
  sourceRef:
    kind: GitRepository
    name: falco-cncf-green-reviews-testing
  targetNamespace: falco
  timeout: 3m0s
  wait: true
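
After committing a file like this, the sync can be checked from the CLI rather than waiting for the next interval. A minimal sketch, assuming the resource name and namespace from the manifest above; check_falco_sync and the FLUX override variable are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Sketch: force a reconcile of the Falco source and Kustomization, then list status.
set -euo pipefail

# FLUX is a hypothetical override (e.g. FLUX=echo) to preview the commands.
check_falco_sync() {
  local flux="${FLUX:-flux}"
  local name="falco-cncf-green-reviews-testing"
  # Fetch the latest commit from the GitRepository source.
  "$flux" reconcile source git "$name" --namespace flux-system
  # Apply the manifests from that source.
  "$flux" reconcile kustomization "$name" --namespace flux-system
  # Show readiness and the applied revision for all Kustomizations.
  "$flux" get kustomizations --namespace flux-system
}
```

Running check_falco_sync against the cluster should show whether the Kustomization is ready and which revision was applied.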

dipankardas011 commented 8 months ago

Cluster Management

We want to use a GitOps approach for the components running in the cluster using Flux. This is for the minimal set of components that should always be running to support the pipeline.

This is so it is

  • Clear to all participants which components and versions are running in the cluster
  • Easier to contribute to technical tasks by submitting pull requests

The pipeline is responsible for installing applications that are to be measured, e.g. Falco.

Requirements

The components to be installed are listed in the design doc

Phase 1: Base-level cluster components (MVP)

Phase 2: Gather idle metrics for Falco

Phase 3: Gather load-test metrics

More may be added as we continue to develop the pipeline.

Documentation

We should document this process as we go.

@rossf7 you might want to update the issue description for Cilium.

nikimanoledaki commented 8 months ago

We can close this since it is mostly completed. We have the base cluster environment, which is our MVP.

There is an open PR for the microservice demo workload, but we're holding off on it since we're going to do idle measurements first. Lastly, we can revisit the need for a load-testing tool later on.