cncf / toc

⚖️ The CNCF Technical Oversight Committee (TOC) is the technical governing body of the CNCF Foundation.
https://cncf.io

LitmusChaos proposal for SandBox #390

Closed umamukkara closed 4 years ago

umamukkara commented 4 years ago

Hi,

We would like to propose donating LitmusChaos to the CNCF as a Sandbox project. We have been advised to follow the new process outlined here. The template we are following is outlined here.

Please consider this proposal and guide us through the process.

Authors:


Background

Link to TOC PR

Link to Presentation

Link to GitHub project

Project Goal

The goal of the LitmusChaos project is to provide an infrastructure toolset for end-to-end chaos engineering in Kubernetes environments, in a cloud-native way. By making chaos engineering easy to practice in highly scaled environments involving many applications, Litmus helps developers and SREs take control of the reliability of their systems. Litmus brings application developers and users (SREs) together through the chaos hub: developers submit the chaos experiments from their CI environments to the hub, and SREs bring those experiments to their deployments or operations.

Current Status

Chaos experiments are hosted on https://hub.litmuschaos.io. It is a central hub where application developers or vendors share their chaos experiments so that their users can use them to increase the resilience of their applications in production.

Completed Roadmap items include:

Future Plans

In-Progress (Near-term)

Backlog

Project Scope

Clear project definition

The project aims to provide a chaos engineering framework comprising a chaos orchestrator, off-the-shelf chaos experiments for standard cloud native applications, and a central hub to forge collaboration between practitioners of chaos (SREs, DevOps engineers, Kubernetes developers).

Value-add to the CNCF ecosystem

Does the project have a clear value add to the current project ecosystem? How does it relate to other projects with overlapping capabilities?

Litmus adheres to the “Cloud Native” principles (as explained in this blog post). In summary:

Alignment with other CNCF projects

Does the project align and actively collaborate with other CNCF projects?

Litmus provides full-featured chaos experiments for most Kubernetes resources. Currently, there are eleven generic ways to introduce and manage chaos on a Kubernetes cluster. In addition, through the chaos hub, we provide application-level chaos experiments for other ecosystem projects such as CoreDNS, Kafka, and OpenEBS. As part of its near-term roadmap, Litmus is also creating chaos charts for other sandbox, incubating, and graduated CNCF projects. We are working with the cncf-ci workgroup to include a chaos stage in the CNCF projects. A PR for the CoreDNS project is already in review.

Litmus chaos experiments are used extensively in the CI pipelines of OpenEBS, a CNCF sandbox project that provides a containerized storage solution for Kubernetes. Reference: https://openebs.ci/

Does the project require any specific versions of projects (or APIs) to interoperate? (e.g. K8s API, CSI, CNI, CRI)?

LitmusChaos mainly uses the Kubernetes API and, for some experiments, also the Docker and containerd APIs.

Does the project augment or benefit other CNCF projects?

The project benefits other CNCF projects by helping harden their resiliency under various deployment conditions. Chaos engineering for CNCF projects becomes easy.

Anticipated use cases

Staging & Pre-Prod: Random Chaos:

CI/CD:

Kubernetes Upgrade Testing:

Chaos Engineering in production:

Alignment with SIG Reference Model

Does the project align with the SIG CNCF reference model, and which capabilities does it require/provide at each level of the reference model?

High level architecture

Describe the overall architecture of the project. Feel free to add diagrams.

The LitmusChaos project comes with the following components.

Litmus Project Architecture diagrams

Detailed architecture of the project is here.

Formal Requirements

Document that the project fulfills the requirements as documented in the CNCF graduation criteria for sandbox

Are there any anticipated issues with any of the criteria? No.

Has the TOC been approached for sponsorship? If yes, which TOC members have agreed to sponsor the project? We have not approached any TOC members yet.

CNCF IP Policy

Becoming a sandbox project requires adoption of the CNCF IP Policy: https://github.com/cncf/foundation/blob/master/charter.md#11-ip-policy

The source code developed for the LitmusChaos Project is licensed under Apache 2.0.

FOSSA report for the project is maintained here.

Note: there is a grace period after becoming a sandbox project to enable projects to adopt the policy; however, some prep is required to ensure there are no major blockers.

Has the IP policy been reviewed?

List the repos for the project and their current license

| Repo Name | License |
| --- | --- |
| Litmus | Apache License Version 2.0 |
| Chaos Operator | Apache License Version 2.0 |
| Chaos Exporter | Apache License Version 2.0 |
| Chaos Charts | Apache License Version 2.0 |
| Documentation | Apache License Version 2.0 |
| Chaos Runner | Apache License Version 2.0 |
| Elves | Apache License Version 2.0 |
| Chaos Hub | Apache License Version 2.0 |
| Test Tools | Apache License Version 2.0 |
| E2E | Apache License Version 2.0 |

List any dependent repos (upstream/downstream) that are required to build the project (including but not limited to libraries, commercial tools, plugins): None

What actions are required to be compliant with the IP policy? None.

Other Considerations

Please note, these are not gating criteria but rather to:

Cloud Native

Does the project meet the definition of Cloud Native? The CNCF charter states:

“Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.

“These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.”

Yes. This project meets the definition of Cloud Native.

Project and Code Quality

Are there any metrics around code quality? Are there good examples of code reviews? Are there enforced coding standards?

Code quality is measured using static code check tools such as GolangCI and GoReport, with the BetterCodeHub score used as an indicator of overall code quality.

What are the performance goals and results? What performance tradeoffs have been made? What is the resource cost?

There are no performance tradeoffs. Litmus adopts a Kubernetes native approach to chaos with custom resources and controllers, thereby reusing the Kubernetes infrastructure itself, hence causing no additional resource costs.

What is the CI/CD system? Are there code coverage metrics? What types of tests exist?

Apart from unit tests, Litmus uses a Ginkgo-based BDD framework to create e2e suites for testing the chaos framework (chaos experiments, operator, runner, exporter). These are executed via GitLab pipelines at scheduled intervals, with Travis/CircleCI used for builds.

Is there documentation?

Litmus has documentation explaining both user and developer workflows at litmus-docs, along with individual experiment guides.

How is it deployed?

The Litmus chaos operator is installed as a K8s deployment, along with the custom resource definitions (CRDs) for the ChaosEngine, ChaosExperiment and ChaosResult resources. The out-of-the-box chaos experiments are installed as CustomResource YAML manifests.
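As a rough illustration of the resources described above, the sketch below builds a ChaosEngine-shaped manifest as a plain Python dictionary and checks that it has the top-level fields this sketch expects. The API group/version string, the `appinfo` and `experiments` field names, and both helper functions are assumptions made for illustration; the Litmus CRDs define the actual schema.

```python
import json

# The three custom resource kinds named in the proposal.
LITMUS_KINDS = {"ChaosEngine", "ChaosExperiment", "ChaosResult"}


def build_chaos_engine(name, namespace, app_label, experiments):
    """Return a dict shaped like a ChaosEngine custom resource.

    All spec field names below are hypothetical placeholders.
    """
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",  # assumed group/version
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "appinfo": {"applabel": app_label},  # hypothetical field
            "experiments": [{"name": e} for e in experiments],
        },
    }


def missing_fields(resource):
    """List top-level fields that are absent, plus "kind" if it is not a Litmus kind."""
    required = ("apiVersion", "kind", "metadata", "spec")
    problems = [f for f in required if f not in resource]
    if resource.get("kind") not in LITMUS_KINDS:
        problems.append("kind")
    return sorted(set(problems))


engine = build_chaos_engine("nginx-chaos", "default", "app=nginx", ["pod-delete"])
manifest = json.dumps(engine, indent=2)  # kubectl apply accepts JSON as well as YAML
```

Here `manifest` holds a JSON document that could be written to a file. Whether a real cluster accepts it depends on the CRDs actually installed, which this sketch does not model.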

How is it orchestrated?

Orchestrated by Kubernetes natively. Litmus has a chaos-runner that runs in a container, an operator that is built using the operator-sdk and chaos experiments which are defined as Kubernetes custom resources.
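The division of labour described above (an operator that reacts to engines, a runner that executes experiments, and results recorded as resources) can be sketched as a toy, in-memory simulation. This is not how Litmus is implemented: the class fields, the phase names, the verdict strings, and the experiment catalog are all hypothetical, and no Kubernetes API is involved.

```python
from dataclasses import dataclass


@dataclass
class ChaosEngine:
    """Minimal stand-in for a ChaosEngine resource (not the real schema)."""
    name: str
    experiments: list
    phase: str = "Pending"


@dataclass
class ChaosResult:
    """Minimal stand-in for a ChaosResult resource."""
    engine: str
    experiment: str
    verdict: str


def run_experiments(engine, catalog):
    """Play the chaos-runner role: execute each experiment named by the engine."""
    results = []
    for name in engine.experiments:
        check = catalog.get(name)
        if check is None:
            verdict = "NotFound"  # experiment is not installed
        else:
            verdict = "Pass" if check() else "Fail"
        results.append(ChaosResult(engine.name, name, verdict))
    return results


def reconcile(engines, catalog):
    """Play the operator role: pick up pending engines and hand them to the runner."""
    results = []
    for engine in engines:
        if engine.phase != "Pending":
            continue  # already handled, so a repeated call does nothing
        engine.phase = "Running"
        results.extend(run_experiments(engine, catalog))
        engine.phase = "Completed"
    return results


# Hypothetical catalog: each entry returns True if the application survived.
catalog = {"pod-delete": lambda: True, "container-kill": lambda: False}
engines = [ChaosEngine("nginx-chaos", ["pod-delete", "container-kill", "disk-fill"])]
results = reconcile(engines, catalog)
```

Calling `reconcile` a second time returns an empty list because the engine is no longer pending, which mirrors the idempotent behaviour expected of a controller loop.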

How will the project benefit from acceptance into the CNCF?

Has a security assessment by the security SIG been done? If not, what is the status/progress of the assessment?

No. We have to approach the security SIG.

Promotion to Incubation

Open Governance

How are committers chosen?

How are architectural and roadmap decisions made?

How many decision makers are outside the sponsoring organization?

Adoption

Who are the current maintainers?

The adopters file is managed here.

How long has the project been developed for?

Litmus started in May 2018; the litmuschaos org was created in April 2019.

Is there a commercial version of the project or a primary commercial sponsor?

Is the project used in production? If so, please list some of the accounts.

MayaData

Does the project participate in a CNCF User Group?

No.

Vendor Independence

Is the project reasonably independent from the sponsoring vendor?

Yes

Are all communication channels and project resources hosted just for this project or with other CNCF projects/resources?

We use the #litmus channel in the Kubernetes Slack.

Is all code that is part of the project hosted and part of the CNCF managed orgs and repos?

Yes. All the code will be under CNCF managed orgs and repos.

Are all defaults for upstream reporting either unset or community hosted infrastructure (i.e. doesn’t point to vendor hosted SaaS control plane or analytics server for usage data)? Is all project naming independent of vendors?

Relevant Assets regarding vendor independence

https://litmuschaos.io https://docs.litmuschaos.io https://hub.litmuschaos.io https://github.com/litmuschaos https://twitter.com/litmuschaos

caniszczyk commented 4 years ago

Thanks for the detailed proposal, well done! @amye can you schedule a review, most likely SIG App Delivery

amye commented 4 years ago

Also, and I hate to do this, but can this be a PR instead of an issue?

umamukkara commented 4 years ago

Hi @amye , the PR is #391