chaosblade-io / chaosblade

An easy-to-use and powerful chaos engineering experiment toolkit, open-sourced by Alibaba.
https://chaosblade.io
Apache License 2.0

Expose ChaosBlade event into SkyWalking #495

Open wu-sheng opened 3 years ago

wu-sheng commented 3 years ago

@xcaspar After our quick talk, I want to submit this integration to the ChaosBlade community officially.

Background

Chaos engineering is a method for testing a system's robustness; it can be done in test or even production environments. During such testing, the owner of the system expects to observe how the system reacts to the failure or load injected by ChaosBlade. SkyWalking, as an APM system, aims to collect, analyze, and visualize system health from different angles; the widely known ones are traces, metrics, and logs. Recently, as SkyWalking has expanded, we introduced the Event concept:

In reality, a production system experiences many other events that may affect the performance of the system, such as upgrading, rebooting, chaos testing, etc. Although some of these events are reflected in the logs, many others are not. Hence, SkyWalking provides a more native way to collect these events.

Failure and load injections are clearly events of this kind.

See SkyWalking's documentation for more details: https://skywalking.apache.org/docs/main/latest/en/concepts-and-designs/event/

Solution

SkyWalking provides several friendly ways for other systems to integrate with us, such as:

  1. CLI event report: https://github.com/apache/skywalking-cli#event. You can call it from a shell.
  2. The k8s event channel: https://github.com/apache/skywalking-kubernetes-event-exporter. But this would confine you to k8s. Even though the project is in the CNCF, I feel we should not limit its scope, as chaos engineering is clearly not only about k8s environments.
  3. Use the gRPC protocol directly: https://github.com/apache/skywalking-data-collect-protocol/blob/master/event/Event.proto. This is easy to adopt and doesn't limit the language or environment. You just need to write a little code.
  4. If you go with (3) in Golang (this project is written in Go), this repo would be easier to integrate than the proto itself: https://github.com/apache/skywalking-goapi

@kezhenxu94 and I are willing to help if you face any issues in the integration process.

xcaspar commented 3 years ago

This is a really good idea. Let me read the documents you provided first. We will open-source Java and Golang SDKs for application-level chaos in the future. I think ChaosBlade can integrate with SkyWalking well.

wu-sheng commented 3 years ago

Note that this doesn't have to be an application-level event. We also have VM and pod monitoring, from the k8s and service mesh perspectives:

  1. VM, https://skywalking.apache.org/docs/main/latest/en/setup/backend/backend-infrastructure-monitoring/#vms-monitoring
  2. pod, https://skywalking.apache.org/docs/main/latest/en/setup/backend/backend-infrastructure-monitoring/#k8s-monitoring
  3. Service Mesh, https://skywalking.apache.org/docs/main/latest/en/setup/envoy/als_setting/. Even the control plane of the mesh, https://skywalking.apache.org/docs/main/latest/en/setup/istio/readme/

So, there are going to be various ways to integrate. We don't have to wait for the chaos SDK.

breakertt commented 3 years ago

@wu-sheng I discussed the details of exposing chaosblade with @xcaspar today, especially exposing JVM tracing to SkyWalking in the first stage.

Externally, communication and reporting will happen between SkyWalking and chaosblade-box rather than chaosblade directly. Using chaosblade-box has two main advantages: 1. It is non-invasive to chaosblade itself. 2. It offers great compatibility: the exposure is not limited to K8s or even to chaosblade. Exposing chaosmesh to SkyWalking could also be supported if chaosblade-box supports chaosmesh in the future. The protocol used will be gRPC. I assume you are also mainly focused on runtime tracing reports, so I will try to implement exposing JVM tracing to SkyWalking and ignore the CPU and network experiments.

Internally, the endpoint parameter shared among chaosblade, chaosblade-box, chaosblade-operator, and chaosblade-exec-jvm will be reused to report JVM tracing and events.

Lastly, support for tracing inspection in chaosblade-exec-jvm will be implemented. I believe there are already some great examples of this. Do you have any suggestions? I would appreciate it!

wu-sheng commented 3 years ago

> especially exposing JVM tracing to SkyWalking in the first stage.

I think the JVM level is fine, but it seems we don't have a real relationship with the tracing core, right? The relationship should rely on timestamps, right?

> I will try to implement exposing JVM tracing to SkyWalking and ignore the CPU and network experiments.

At stage 1, I am fine with ignoring CPU and network. Eventually, we should support these too. SkyWalking already supports VM monitoring (through the Prometheus node exporter or the Zabbix agent), so it would be great to have VM service-level events.

> Externally, communication and reporting will happen between SkyWalking and chaosblade-box rather than chaosblade directly.

As long as this is what your community recommends, we are totally fine with it.

> The protocol used will be gRPC

Does this mean we are going to use https://github.com/apache/skywalking-data-collect-protocol/blob/master/event/Event.proto (or the goapi repo) to report the event?

breakertt commented 3 years ago

@wu-sheng

> I think the JVM level is fine, but it seems we don't have a real relationship with the tracing core, right? The relationship should rely on timestamps, right?

Sorry, I don't really understand your point. What I had imagined is: once the relevant JVM receives a call from the user, e.g. an HTTP GET, the whole trace, like https://skywalking.apache.org/screenshots/8.4.0/trace.jpg, will be reported. Would you mind explaining this further?

> At stage 1, I am fine with ignoring CPU and network. Eventually, we should support these too. SkyWalking already supports VM monitoring (through the Prometheus node exporter or the Zabbix agent), so it would be great to have VM service-level events.

Yes, I can understand that. I also discussed exposing the VM status via node exporter or something else.

> Does this mean we are going to use https://github.com/apache/skywalking-data-collect-protocol/blob/master/event/Event.proto (or the goapi repo) to report the event?

Yes. As for the goapi repo: no, since we are going to implement this in chaosblade-box (written in Java).