We would like to propose to donate LitmusChaos to CNCF as a SandBox project. We have been advised to follow the new process outlined here. The template we are following is outlined here.
Please consider this proposal and guide us through the process.
Authors:
Uma Mukkara, @umamukkara, MayaData, Litmus Maintainer
The goal of LitmusChoas project is to provide infrastructure toolset to do end to end chaos engineering in Kubernetes environments, in a cloud-native way. By making the practice of chaos engineering easy in a highly scaled environment involving many applications, Litmus helps developers and SREs take control of reliability of the systems. Litmus is bringing application developers and users (SREs) together through chaos hub. Developers submit their chaos experiments in the CI environments to the hub and SREs bring the chaos experiments to the deployments or operations.
Current Status
Project releases: Litmus achieved GA status or 1.0 release in January 2020. Last latest release is 1.2. The project has made 16 releases so far.
Community status: Litmus has monthly contributor calls. The community notes is here. Other stats on the community:
630+ stars on GitHub
50+ contributors, incl.
MayaData
Intuit
Wipro
ArgoAI
Zebrium
Chaos experiments are hosted on https://hub.litmuschaos.io. It is a central hub where the application developers or vendors share their chaos experiments so that their users can use them to increase the resilience of the applications in production.
Completed Roadmap items include:
Declarative Chaos Intent via custom resources
Chaos Operator to orchestrate chaos experiments
Off the shelf / ready chaos experiments for general Kubernetes chaos
The project aims to provide a chaos engineering framework - comprising of a chaos orchestrator, off-the-shelf chaos experiments for standard cloud native applications, and a central hub for forge collaboration between practitioners of chaos (SRE, DevOps engineers, Kubernetes developers).
Value-add to the CNCF ecosystem
Does the project have a clear value add to the current project ecosystem? How does it relate to other projects with overlapping capabilities?
Litmus adheres to the “Cloud Native” principles (as explained in this blog here). In summary:
The life cycle management of Litmus happens using a chaos operator
Developer or SRE interacts with Litmus through declarative YAMLs. The chaos workflow can be completely automated in a highly scaled environment using GitOps
Chaos-runner that manages each chaos job is in a container and can survive node reboots
Any chaos logic which can run inside a container can be consumed through Litmus YAML files. So, anyone with a docker image that contains their chaos can make use of Litmus project to practice chaos engineering in their Kubernetes environment.
Alignment with other CNCF projects
Does the project align and actively collaborate with other CNCF projects?
Litmus provides full-featured chaos experiments for most of the Kubernetes resources. Currently, there are eleven different generic ways to introduce and manage chaos on Kubernetes cluster. Apart from this through chaos hub, we bring application level chaos experients for other eco system projects such as CoreDNS, Kafka, OpenEBS.
As part of its near-term roadmap, Litmus also is in the process of creating chaos charts for other sandbox/incubating/graduated CNCF projects. We are working with cncf-ci workgroup to include chaos stage in the CNCF projects. A PR is already in review for CoreDNS project.
Litmus chaos experiments are being extensively used as part of CI pipelines of the OpenEBS, a CNCF sandbox project that provides containerized storage solution for Kubernetes. Reference: https://openebs.ci/
Does the project require any specific versions of projects (or APIs) to interoperate? (e.g. K8s API, CSI, CNI, CRI)?
LitmusChaos mainly uses the Kubernetes API and also makes use of the Docker, Containerd APIs in case of some experiments.
Does the project augment or benefit other CNCF projects?
The project benefits other CNCF projects by helping harden their resiliency under various deployment conditions. Chaos engineering for CNCF projects becomes easy.
Anticipated use cases
Staging & Pre-Prod: Random Chaos:
Gamedays: The chaos experiments are typically executed in deployment environments hosting several microservices (either staging clusters that mimic production environments or production itself) as part of “Gamedays”, where a hypothesis is tested out, with the results either confirming the hypothesis and leading to fixes either in the infrastructure or the application software.
Continuous Chaos: Chaos experiments (as a background service) are continuously and randomly executed against production-like environments with the appropriate visualization and monitoring/observability setup to gauge system resilience against different chained failures, occurring at random instances.
CI/CD:
Organizations can add "Kuberentes generic" and “application specific” chaos experiments as part of a “chaos stage” in their CI pipelines thereby enabling a left shift in improving fault tolerance and failure response.
Kubernetes Upgrade Testing:
Kubernetes upgrades can cause changes in behaviour / response to component failures which can be exposed by performing chaos experiments against them.
Chaos Engineering in production:
Litmus plays important role to orchestrate chaos in production using GitOps.
Alignment with SIG Reference Model
Does the project align with the SIG CNCF reference model and which capabilities does it require/provide at each level of the reference model.
High level architecture
Describe the overall architecture of the project. Feel free to add diagrams.
Litmus Project Architecture
LitmusChaos project comes with the following components.
Chaos Operator
Chaos CRDs
Chaos Metrics exporter and
Chaos Hub (Which is a list of Chaos Customer Resources)
Note: there is a grace period after becoming a sandbox period to enable projects to adopt the policy, however, some prep is required to ensure there are no major blockers.
Has the IP policy been reviewed?
List the repos for the project and their current license
List any dependent repos (upstream/downstream) that are required to build the project (including but not limited to libraries, commercial tools, plugins)
None
What actions are required to be compliant with the IP policy?
None.
Other Considerations
Please note, these are not gating criteria but rather to:
Collect a standard set of information for each project
Provides a point in time capture of the state of the project which makes it easier to track progress at future reviews and / or promotion
Help projects to prepare for SIG and TOC presentation
Allow the SIG to review the project and perform due diligence for incubation
Provide the TOC with the information required to accept sponsorship of a project and/or votes
Identify and rectify any significant issues / blockers prior to presenting to the TOC and acceptance as a CNCF project
Cloud Native
Does the project meet the definition of Cloud Native? The CNCF charter states:
“Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.
“These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.”
Yes. This project meets the definition of Cloud Native.
Project and Code Quality
Are there any metrics around code quality? Are there good examples of code reviews? Are there enforced coding standards?
The code quality is measured by the using static code check tools such as GolangCI, GoReport while using BetterCodeHub score as an indicator of overall code quality.
What are the performance goals and results? What performance tradeoffs have been made? What is the resource cost?
There are no performance tradeoffs. Litmus adopts a Kubernetes native approach to chaos with custom resources and controllers, thereby reusing the Kubernetes infrastructure itself, hence causing no additional resource costs.
What is the CI/CD system? Are there code coverage metrics? What types of tests exist?
Apart from unit-tests, Litmus makes use of GinkGo based BDD framework to create e2e suites for testing the chaos framework (chaos experiments, operator, runner, exporter). These are executed via Gitlab pipelines at scheduled intervals, with Travis/CircleCI being used for build purposes.
Is there documentation?
Litmus has documentation explaining both user as well as developer workflows at the litmus-docs along with individual experiment guides.
How is it deployed?
The Litmus chaos operator is installed as a K8s deployment, along with the custom resource definitions (CRDs) for the ChaosEngine, ChaosExperiment and ChaosResult resources. The out-of-the-box chaos experiments are installed as CustomResource YAML manifests.
How is it orchestrated?
Orchestrated by Kubernetes natively. Litmus has a chaos-runner that runs in a container, an operator that is built using the operator-sdk and chaos experiments which are defined as Kubernetes custom resources.
How will the project benefit from acceptance into the CNCF?
This project will have a vendor neutral home. The project will generate interest in many CNCF project members to contribute application level chaos experiments to the chaos hub when it is accepted into CNCF.
CNCF projects themselves may adopt Litmus more activly when accepted into CNCF.
Has a security assessment by the security SIG been done? If not, what is the status/progress of the assessment?
No. We have to approach the security SIG.
Promotion to Incubation
Open Governance
How are committers chosen?
Based on the new chaos charts being contributed.
Based on the review and contribution history
How are architectural and roadmap decisions made?
Community sync-ups where feedback is solicited and roadmap decisions made
ROADMAP items are reviewed by maintainers before acceptance.
How many decision makers are outside the sponsoring organization.
Multiple (Independent contributors, Intuit, Wipro) apart from the primary sponsor MayaData.
Is the project used in production? If so, please list some of the accounts.
MayaData
Does the project participate in a CNCF User Group?
No.
Vendor Independence
Is the project reasonably independent from the sponsoring vendor?
Yes
Are all communication channels and project resources hosted just for this project or with other CNCF projects/resources?
We use #litmus channel in Kubernetes Slack.
Is all code that is part of the project hosted and part of the CNCF managed orgs and repos?
Yes. All the code will be under CNCF managed orgs and repos.
Are all defaults for upstream reporting either unset or community hosted infrastructure (i.e. doesn’t point to vendor hosted SaaS control plane or analytics server for usage data)? Is all project naming independent of vendors?
Hi,
We would like to propose to donate LitmusChaos to CNCF as a SandBox project. We have been advised to follow the new process outlined here. The template we are following is outlined here.
Please consider this proposal and guide us through the process.
Authors:
Background
Link to TOC PR
Link to Presentation
Link to GitHub project
Project Goal
The goal of LitmusChoas project is to provide infrastructure toolset to do end to end chaos engineering in Kubernetes environments, in a cloud-native way. By making the practice of chaos engineering easy in a highly scaled environment involving many applications, Litmus helps developers and SREs take control of reliability of the systems. Litmus is bringing application developers and users (SREs) together through chaos hub. Developers submit their chaos experiments in the CI environments to the hub and SREs bring the chaos experiments to the deployments or operations.
Current Status
Project releases: Litmus achieved GA status or 1.0 release in January 2020. Last latest release is 1.2. The project has made 16 releases so far.
Community status: Litmus has monthly contributor calls. The community notes is here. Other stats on the community:
Chaos experiments are hosted on https://hub.litmuschaos.io. It is a central hub where the application developers or vendors share their chaos experiments so that their users can use them to increase the resilience of the applications in production.
Completed Roadmap items include:
Future Plans
In-Progress (Near-term)
Backlog
Project Scope
Clear project definition
The project aims to provide a chaos engineering framework - comprising of a chaos orchestrator, off-the-shelf chaos experiments for standard cloud native applications, and a central hub for forge collaboration between practitioners of chaos (SRE, DevOps engineers, Kubernetes developers).
Value-add to the CNCF ecosystem
Does the project have a clear value add to the current project ecosystem? How does it relate to other projects with overlapping capabilities?
Litmus adheres to the “Cloud Native” principles (as explained in this blog here). In summary:
Alignment with other CNCF projects
Does the project align and actively collaborate with other CNCF projects?
Litmus provides full-featured chaos experiments for most of the Kubernetes resources. Currently, there are eleven different generic ways to introduce and manage chaos on Kubernetes cluster. Apart from this through chaos hub, we bring application level chaos experients for other eco system projects such as CoreDNS, Kafka, OpenEBS. As part of its near-term roadmap, Litmus also is in the process of creating chaos charts for other sandbox/incubating/graduated CNCF projects. We are working with cncf-ci workgroup to include chaos stage in the CNCF projects. A PR is already in review for CoreDNS project.
Litmus chaos experiments are being extensively used as part of CI pipelines of the OpenEBS, a CNCF sandbox project that provides containerized storage solution for Kubernetes. Reference: https://openebs.ci/
Does the project require any specific versions of projects (or APIs) to interoperate? (e.g. K8s API, CSI, CNI, CRI)?
LitmusChaos mainly uses the Kubernetes API and also makes use of the Docker, Containerd APIs in case of some experiments.
Does the project augment or benefit other CNCF projects?
The project benefits other CNCF projects by helping harden their resiliency under various deployment conditions. Chaos engineering for CNCF projects becomes easy.
Anticipated use cases
Staging & Pre-Prod: Random Chaos:
CI/CD:
Kubernetes Upgrade Testing:
Chaos Engineering in production:
Alignment with SIG Reference Model
Does the project align with the SIG CNCF reference model and which capabilities does it require/provide at each level of the reference model.
High level architecture
Describe the overall architecture of the project. Feel free to add diagrams.
Litmus Project Architecture LitmusChaos project comes with the following components.
Detailed architecture of the project is here.
Formal Requirements
Document that the project fulfills the requirements as documented in the CNCF graduation criteria for sandbox
Are there any anticipated issues with any of the criteria ? No.
Has the TOC been approached for sponsorship? If yes, which TOC members have agreed to sponsor the project? We have not approached any TOC members yet.
CNCF IP Policy
Becoming a sandbox project requires adoption of the CNCF IP Policy: https://github.com/cncf/foundation/blob/master/charter.md#11-ip-policy
The source code developed for the LitmusChaos Project is licensed under Apache 2.0.
FOSSA report for the project is maintained here.
Note: there is a grace period after becoming a sandbox period to enable projects to adopt the policy, however, some prep is required to ensure there are no major blockers.
Has the IP policy been reviewed?
List the repos for the project and their current license
List any dependent repos (upstream/downstream) that are required to build the project (including but not limited to libraries, commercial tools, plugins) None
What actions are required to be compliant with the IP policy? None.
Other Considerations
Please note, these are not gating criteria but rather to:
Cloud Native
Does the project meet the definition of Cloud Native? The CNCF charter states:
“Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.
“These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.”
Yes. This project meets the definition of Cloud Native.
Project and Code Quality
Are there any metrics around code quality? Are there good examples of code reviews? Are there enforced coding standards?
The code quality is measured by the using static code check tools such as GolangCI, GoReport while using BetterCodeHub score as an indicator of overall code quality.
What are the performance goals and results? What performance tradeoffs have been made? What is the resource cost?
There are no performance tradeoffs. Litmus adopts a Kubernetes native approach to chaos with custom resources and controllers, thereby reusing the Kubernetes infrastructure itself, hence causing no additional resource costs.
What is the CI/CD system? Are there code coverage metrics? What types of tests exist?
Apart from unit-tests, Litmus makes use of GinkGo based BDD framework to create e2e suites for testing the chaos framework (chaos experiments, operator, runner, exporter). These are executed via Gitlab pipelines at scheduled intervals, with Travis/CircleCI being used for build purposes.
Is there documentation?
Litmus has documentation explaining both user as well as developer workflows at the litmus-docs along with individual experiment guides.
How is it deployed?
The Litmus chaos operator is installed as a K8s deployment, along with the custom resource definitions (CRDs) for the ChaosEngine, ChaosExperiment and ChaosResult resources. The out-of-the-box chaos experiments are installed as CustomResource YAML manifests.
How is it orchestrated?
Orchestrated by Kubernetes natively. Litmus has a chaos-runner that runs in a container, an operator that is built using the operator-sdk and chaos experiments which are defined as Kubernetes custom resources.
How will the project benefit from acceptance into the CNCF?
Has a security assessment by the security SIG been done? If not, what is the status/progress of the assessment?
No. We have to approach the security SIG.
Promotion to Incubation
Open Governance
How are committers chosen?
How are architectural and roadmap decisions made?
How many decision makers are outside the sponsoring organization.
Adoption
Who are the current maintainers?
The adoptors file is managed here.
How long has the project been developed for?
Litmus started in May 2018, Litmuschaos org created in Apr 2019
Is there a commercial version of the project or a primary commercial sponsor ?
Is the project used in production? If so, please list some of the accounts.
MayaData
Does the project participate in a CNCF User Group?
No.
Vendor Independence
Is the project reasonably independent from the sponsoring vendor?
Yes
Are all communication channels and project resources hosted just for this project or with other CNCF projects/resources?
We use #litmus channel in Kubernetes Slack.
Is all code that is part of the project hosted and part of the CNCF managed orgs and repos?
Yes. All the code will be under CNCF managed orgs and repos.
Are all defaults for upstream reporting either unset or community hosted infrastructure (i.e. doesn’t point to vendor hosted SaaS control plane or analytics server for usage data)? Is all project naming independent of vendors?
Relevant Assets regarding vendor independence
https://litmuschaos.io https://docs.litmuschaos.io https://hub.litmuschaos.io https://github.com/litmuschaos https://twitter.com/litmuschaos