
[DISCUSS] Introduce New CRD Chaos #290

Open moomman opened 1 year ago

moomman commented 1 year ago

Chaos CRD Design document

1、Background

We need to introduce an automated chaos experiment workflow into ShardingSphere (ss) to strengthen its resilience and failure-recovery capability.

2、Problem description

Chaos experiments should be automated so that preparing the experimental environment, injecting faults, and verifying the results do not have to be repeated by hand for every experiment.

2.1 Question 1: How to inject

How can specific failure scenarios be injected into ss?

2.2 Question 2: How to generate pressure

How can a large volume of specified requests be sent to ss-proxy during a failure, to simulate a real production workload?

2.3 Question 3: How to verify the Result

During the experiment, how do we collect the relevant information and define a steady state, so that we can prove whether the system stays in that steady state?

3、Technical research

Chaos Mesh and Litmus provide many kinds of chaos experiments that cover most usage scenarios, but they only inject faults: preparing the experimental environment and verifying the influence of the faults on the steady state still has to be repeated for every experiment. Therefore we need to define our own CRD to realize an automated experiment process for ss-proxy, and use kubebuilder to generate the skeleton code of the CRD (a sketch of that skeleton follows the table below).

| technology | address |
| --- | --- |
| Chaos Mesh API definition | https://github.com/chaos-mesh/chaos-mesh |
| kubebuilder | https://github.com/kubernetes-sigs/kubebuilder |
| Litmus chaos | https://litmuschaos.io/ |
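For reference, the skeleton kubebuilder generates for a new kind looks roughly like the sketch below. The package path, kind name, and placeholder spec/status types are illustrative only; the concrete fields are worked out in section 4.4.

```go
// Illustrative skeleton, similar to what `kubebuilder create api` scaffolds.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Placeholder spec/status types; concrete fields are defined in section 4.4.
type ShardingSphereChaosSpec struct{}

type ShardingSphereChaosStatus struct{}

// ShardingSphereChaos is the root object of the new CRD.
type ShardingSphereChaos struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ShardingSphereChaosSpec   `json:"spec,omitempty"`
	Status ShardingSphereChaosStatus `json:"status,omitempty"`
}
```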

4、Scheme design

4.1 Scheme summary

Injection:

To solve the problem of how to inject faults into ss, the commonly used solutions are PingCAP's open-source Chaos Mesh and Litmus Chaos, which provide a variety of common fault types. However, they cannot be used directly to build an automated chaos scenario flow for ss, because their configuration is complex and independent of ours. Chaos Mesh exposes an API for all of its CRD resource definitions, which makes it possible to simplify the operation: we can abstract our own chaos scenarios and interact with Chaos Mesh to obtain the experiment information. For the implementation of this interaction we can refer to Chaos Mesh's official Chaos Dashboard.
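As a rough illustration of that interaction, the controller could read a Chaos Mesh PodChaos object back through the Kubernetes API without importing the full Chaos Mesh type definitions, e.g. via an unstructured client. This is only a sketch; the object name and namespace come from the caller, and only the group/version/kind are taken from Chaos Mesh.

```go
package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getPodChaos reads a Chaos Mesh PodChaos object as unstructured data so the
// ssChaos controller can inspect its status without the Chaos Mesh type imports.
func getPodChaos(ctx context.Context, c client.Client, namespace, name string) (map[string]interface{}, error) {
	obj := &unstructured.Unstructured{}
	obj.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "chaos-mesh.org",
		Version: "v1alpha1",
		Kind:    "PodChaos",
	})
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, obj); err != nil {
		return nil, err
	}
	status, _, err := unstructured.NestedMap(obj.Object, "status")
	return status, err
}
```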

Generating pressure:

Regarding the environment setup and the pressure, DistSQL (or plain SQL) requests can be sent to ss-proxy to inject data into the environment; that data is later used as the evidence for verifying the steady state.
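A minimal pressure-job sketch in Go, assuming the go-sql-driver/mysql driver and a reachable ss-proxy endpoint. The DSN, the `car` table and its columns, and the request count/interval are placeholders inferred from the CR example in 4.5.1.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL protocol driver used to reach ss-proxy
)

func main() {
	// Placeholder DSN pointing at the ShardingSphere proxy endpoint.
	db, err := sql.Open("mysql", "root:password@tcp(ss-proxy:3307)/ds_0")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Send a fixed number of requests at a fixed interval as a simple stand-in
	// for the pressure job; failures are logged so the verification phase can
	// compare expectations against what actually landed in the database.
	for i := 0; i < 100; i++ {
		if _, err := db.Exec("INSERT INTO car (id, name) VALUES (?, ?)", i, "chaos-test"); err != nil {
			log.Printf("request %d failed: %v", i, err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```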

Verification:

For steady-state verification, we can scrape the monitoring logs to observe whether CPU and network I/O stay within the steady-state range, and use DistSQL to verify the correctness of the requests issued during the pressure phase.
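If the cluster metrics are exposed through Prometheus (an assumption; the design only mentions the monitoring log), the steady-state check could start from a simple query such as the sketch below. The Prometheus address, the PromQL expression, and the pod name pattern are placeholders.

```go
package verify

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// queryProxyCPU returns the current CPU usage samples for the ss-proxy pods,
// to be recorded during steady state and compared against the fault phase.
func queryProxyCPU(ctx context.Context, addr string) (model.Value, error) {
	client, err := api.NewClient(api.Config{Address: addr})
	if err != nil {
		return nil, err
	}
	value, _, err := promv1.NewAPI(client).Query(ctx,
		`rate(container_cpu_usage_seconds_total{pod=~"shardingsphere-proxy.*"}[1m])`,
		time.Now())
	return value, err
}
```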

4.2 Holistic design

[image: overall architecture diagram]

- ComputeNode: ss-proxy, the object that upstream services interact with; it in turn interacts with the downstream databases.
- StorageNode: the databases connected to ss, the nodes that actually store the data.
- Governance node: stores the status and configuration information of the ComputeNode, such as logical databases and logical tables.
- DistSQL: the operating language unique to Apache ShardingSphere. It is used in exactly the same way as standard SQL and provides SQL-level operational capabilities for incremental functionality.
- proxy-environment: a fully functional ss-proxy environment.
- Chaos APIs: provide the different kinds of chaos experiments and are responsible for the actual injection and execution of faults.
- ssChaos Controller: responsible for managing the created ssChaos resources.

4.3 Function design

Functionally it is divided into three parts: fault injection, pressure generation, and result verification; users access these functions by defining CR declaration files.

4.3.1 Feature list

4.4 CRD design

4.4.1 Spec

- namespaces: specify namespaces
- labelSelectors: select by label
- annotationSelectors: select by annotation
- nodes: specify nodes
- pods: specified as namespace -> pod names
- nodeSelectors: select nodes by label
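A possible Go shape for this selector block; the field names mirror the keys listed above, while the concrete types are assumptions for illustration.

```go
// Selector narrows down which pods a fault applies to.
type Selector struct {
	Namespaces          []string            `json:"namespaces,omitempty"`
	LabelSelectors      map[string]string   `json:"labelSelectors,omitempty"`
	AnnotationSelectors map[string]string   `json:"annotationSelectors,omitempty"`
	Nodes               []string            `json:"nodes,omitempty"`
	Pods                map[string][]string `json:"pods,omitempty"` // namespace -> pod names
	NodeSelectors       map[string]string   `json:"nodeSelectors,omitempty"`
}
```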

This part of the declaration lives under spec.podChaos; it defines pod-type faults, and the action field declares which fault is injected into the pod:

- action: the pod fault type, either podFailure or containerKill
- podFailure.duration: how long the PodFailure action stays in effect
- containerKill.containerNames: the containers to be killed
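Correspondingly, the pod-fault part of the spec could be sketched as follows, reusing the Selector sketch above; the type and constant names are illustrative and simply mirror the CR example in 4.5.1.

```go
type PodChaosAction string

const (
	PodFailure    PodChaosAction = "PodFailure"
	ContainerKill PodChaosAction = "ContainerKill"
)

// PodChaosSpec declares a pod-level fault.
type PodChaosSpec struct {
	Selector Selector       `json:"selector"`
	Action   PodChaosAction `json:"action"`
	Params   PodChaosParams `json:"params,omitempty"`
}

type PodChaosParams struct {
	PodFailure    *PodFailureParams    `json:"podFailure,omitempty"`
	ContainerKill *ContainerKillParams `json:"containerKill,omitempty"`
}

type PodFailureParams struct {
	Duration string `json:"duration,omitempty"` // how long the PodFailure action stays in effect
}

type ContainerKillParams struct {
	ContainerNames []string `json:"containerNames,omitempty"` // containers to be killed
}
```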

Define faults of the network type:

- action: the network chaos type, one of delay, duplicate, corrupt, partition, loss
- duration: how long the chaos lasts
- direction: the direction of the network failure, one of to (-> target), from (target <-), or both (<-> target); defaults to to when not specified
- target: selector used to select the target object
- source: selector used to select the source object
- delay.latency: the network latency
- delay.correlation: the correlation between the current latency and the previous one
- delay.jitter: the range of the network latency
- loss.loss: the probability of packet loss
- loss.correlation: the correlation between the current packet-loss probability and the previous one
- duplicate.duplicate: the probability of packet duplication
- duplicate.correlation: the correlation between the current packet-duplication probability and the previous one
- corrupt.corrupt: the probability of packet corruption
- corrupt.correlation: the correlation between the current packet-corruption probability and the previous one
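The network-fault part of the spec could be sketched similarly; again the type names are illustrative only.

```go
type NetworkChaosAction string

const (
	Delay     NetworkChaosAction = "delay"
	Duplicate NetworkChaosAction = "duplicate"
	Corrupt   NetworkChaosAction = "corrupt"
	Partition NetworkChaosAction = "partition"
	Loss      NetworkChaosAction = "loss"
)

// NetworkChaosSpec declares a network-level fault between a source and a target.
type NetworkChaosSpec struct {
	Action    NetworkChaosAction `json:"action"`
	Duration  string             `json:"duration,omitempty"`
	Direction string             `json:"direction,omitempty"` // to | from | both; defaults to "to"
	Source    Selector           `json:"source,omitempty"`
	Target    Selector           `json:"target,omitempty"`
	Params    NetworkParams      `json:"params,omitempty"`
}

type NetworkParams struct {
	Delay     *DelayParams     `json:"delay,omitempty"`
	Loss      *LossParams      `json:"loss,omitempty"`
	Duplicate *DuplicateParams `json:"duplicate,omitempty"`
	Corrupt   *CorruptParams   `json:"corrupt,omitempty"`
}

type DelayParams struct {
	Latency     string `json:"latency,omitempty"`
	Jitter      string `json:"jitter,omitempty"`
	Correlation string `json:"correlation,omitempty"`
}

type LossParams struct {
	Loss        string `json:"loss,omitempty"`
	Correlation string `json:"correlation,omitempty"`
}

type DuplicateParams struct {
	Duplicate   string `json:"duplicate,omitempty"`
	Correlation string `json:"correlation,omitempty"`
}

type CorruptParams struct {
	Corrupt     string `json:"corrupt,omitempty"`
	Correlation string `json:"correlation,omitempty"`
}
```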
- Chaos Mesh

Configuration fields of podchaos:

| ssChaos field | Chaos Mesh field |
| --- | --- |
| spec/mode | selector.mode |
| spec/value | selector.value |
| spec/pod/action | .action |
| spec/pod/gracePeriod | .gracePeriod |

Configuration fields of networkchaos:

| ssChaos field | Chaos Mesh field |
| --- | --- |
| spec/device | .device |
| spec/targetDevice | .targetDevice |
| spec/target/mode | .selector.mode |
| spec/target/value | .value |
| spec/network/action | .action |
| spec/network/rate | .bandwidth.rate |
| spec/network/limit | .bandwidth.limit |
| spec/network/buffer | .bandwidth.buffer |
| spec/network/peakrate | .bandwidth.peakrate |
| spec/network/minburst | .bandwidth.minburst |

- Litmus chaos

Configuration fields of podchaos:

| ssChaos field | Litmus field |
| --- | --- |
| spec/random (pod-delete) | RANDOMNESS |
| spec/signal (container-kill) | SIGNAL |
| spec/chaos_interval (container-kill) | CHAOS_INTERVAL |

Configuration fields of networkchaos (public fields):

| ssChaos field | Litmus field |
| --- | --- |
| spec/action | .spec.experiments.name |
| spec/ramp_time | RAMP_TIME |
| spec/duration | TOTAL_CHAOS_DURATION |
| spec/sequence | SEQUENCE |
| spec/lib_image | LIB_IMAGE |
| spec/lib | LIB |
| spec/force | FORCE |
[image: experiment flow diagram]

As shown in the picture above, the specific process is as follows:

Steady state:

  1. Create a pressure job.
  2. Collect the metrics of interest from the metrics log and record them for later comparison with the metrics collected during the fault.

Failure:

  3. Create a chaos fault.
  4. Collect the metrics logs, compare them with the steady state, and record the results in the status.

One pressure job runs during the steady state and another during the fault. After the chaos recovers, verify the execution result of the pressure job that ran while the fault was active and record it in the status.

4.4.2 Status

- Creating: chaos is in the creation stage and the injection has not yet completed.
- AllRecovered: the environment has recovered from the failure.
- Paused: the experiment is paused, possibly because a selected node does not exist; check whether there is a problem with the CRD definition.
- AllInjected: the fault has been successfully injected into the environment.
- Unknown: unknown status.
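These phases map naturally onto a small Go enum in the status types; the constant names below are a sketch mirroring the list above.

```go
// Phase describes the lifecycle stage of a ShardingSphereChaos experiment.
type Phase string

const (
	PhaseCreating     Phase = "Creating"
	PhaseAllInjected  Phase = "AllInjected"
	PhaseAllRecovered Phase = "AllRecovered"
	PhasePaused       Phase = "Paused"
	PhaseUnknown      Phase = "Unknown"
)
```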

4.4.3 Controller design

  1. Convert the ssChaos spec into the corresponding fault type in chaos-mesh and create it.
  2. Maintain the status; its main fields are:
ChaosCondition ChaosCondition `json:"chaosCondition"`
Phase          Phase          `json:"phase"`
Result         []Result       `json:"result"`
  3. Platform extension: when more chaos API backends need to be supported, the interfaces that need to be implemented for the pod and network types are as follows:

About the get/set interfaces of chaos:

type ChaosGetter interface {
   GetPodChaosByNamespacedName(context.Context, types.NamespacedName) (PodChaos, error)
   GetNetworkChaosByNamespacedName(context.Context, types.NamespacedName) (NetworkChaos, error)
}

type ChaosSetter interface {
}

About the new/create/update interfaces of chaos:

type ChaosHandler interface {
   NewPodChaos(ssChao *v1alpha1.ShardingSphereChaos) chaos.PodChaos
   NewNetworkPodChaos(ssChao *v1alpha1.ShardingSphereChaos) chaos.NetworkChaos
   UpdateNetworkChaos(ctx context.Context, ssChaos *v1alpha1.ShardingSphereChaos, r client.Client, cur chaos.NetworkChaos) error
   UpdatePodChaos(ctx context.Context, ssChaos *v1alpha1.ShardingSphereChaos, r client.Client, cur chaos.PodChaos) error
   CreatePodChaos(ctx context.Context, r client.Client, podChao chaos.PodChaos) error
   CreateNetworkChaos(ctx context.Context, r client.Client, networkChao chaos.NetworkChaos) error
}
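A rough reconcile flow built on these interfaces might look like the sketch below. Error handling and requeueing are elided, the reconciler's field layout and the v1alpha1 import path are assumptions, and the getter is assumed to surface the standard Kubernetes not-found error.

```go
package controller

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Import path for the CRD types is an assumption for illustration.
	"github.com/apache/shardingsphere-on-cloud/shardingsphere-operator/api/v1alpha1"
)

// ShardingSphereChaosReconciler wires the getter and handler defined above
// into a controller-runtime reconciler (field layout is an assumption).
type ShardingSphereChaosReconciler struct {
	Client  client.Client
	Getter  ChaosGetter
	Handler ChaosHandler
}

// reconcilePodChaos creates the backing pod chaos if it does not exist yet,
// otherwise brings it in line with the current ShardingSphereChaos spec.
func (r *ShardingSphereChaosReconciler) reconcilePodChaos(ctx context.Context, ssChaos *v1alpha1.ShardingSphereChaos) error {
	namespacedName := types.NamespacedName{Namespace: ssChaos.Namespace, Name: ssChaos.Name}

	cur, err := r.Getter.GetPodChaosByNamespacedName(ctx, namespacedName)
	if err != nil {
		if apierrors.IsNotFound(err) {
			return r.Handler.CreatePodChaos(ctx, r.Client, r.Handler.NewPodChaos(ssChaos))
		}
		return err
	}

	return r.Handler.UpdatePodChaos(ctx, ssChaos, r.Client, cur)
}
```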

4.5 Expected

4.5.1 Expected effect

Create a YAML definition file for the CR:

apiVersion: shardingsphere.apache.org/v1alpha1
kind: ShardingSphereChaos
metadata:
  labels:
    app.kubernetes.io/name: shardingsphereChaos
  name: shardingspherechaos-lala
  namespace: verify-lit
  annotations:
    selector.chaos-mesh.org/mode: all
spec:
  podChaos:
    selector:
      labelSelectors:
        app.kubernetes.io/component: zookeeper
      namespaces: [ "verify-lit" ]
    action: PodFailure
    params:
      podFailure:
        duration: 10s
  pressureCfg:
    ssHost: root:14686Ban@tcp(127.0.0.1:3306)/ds_0
    duration: 10s
    reqTime: 5s
    distSQLs:
      - sql: select * from car;
    concurrentNum: 1
    reqNum: 2

After applying it, the chaos object is created successfully and the following information can be seen:

5、Demo

6、References

#272

The state-change logic is as follows: only after all the chaos-mesh failures we currently track have entered the AllInjected phase do we change our state from Creating to AllInjected. For Paused, we should check whether the selected pods and containers are running properly while the fault is paused. When all faults are Recovered, we update our status to AllRecovered. As described in the chaos-mesh documentation, this also serves as the evaluation basis for updating the status.
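Expressed in code, that transition rule could look roughly like the sketch below. resourceCondition and its Injected/Recovered fields are hypothetical stand-ins for whatever per-fault state the controller records from chaos-mesh, and the Phase constants are the ones sketched in 4.4.2.

```go
// resourceCondition is a hypothetical per-fault summary recorded by the
// controller for each chaos-mesh resource it created.
type resourceCondition struct {
	Injected  bool
	Recovered bool
}

// nextPhase derives the ssChaos phase from the conditions of all tracked
// chaos-mesh faults, following the transition rules described above.
func nextPhase(current Phase, conds []resourceCondition) Phase {
	allInjected, allRecovered := len(conds) > 0, len(conds) > 0
	for _, c := range conds {
		if !c.Injected {
			allInjected = false
		}
		if !c.Recovered {
			allRecovered = false
		}
	}

	switch {
	case allRecovered:
		return PhaseAllRecovered
	case allInjected:
		return PhaseAllInjected
	case current == PhaseCreating:
		return PhaseCreating
	default:
		return PhaseUnknown
	}
}
```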

mlycore commented 1 year ago

How about changing this CRD name from ShardingSphereChaos to Chaos?

moomman commented 1 year ago

> How about changing this CRD name from ShardingSphereChaos to Chaos?

sure