KindlingProject / kindling

eBPF-based Cloud Native Monitoring Tool
http://kindling.harmonycloud.cn
Apache License 2.0
1.12k stars 181 forks

Add a real-time circuit breaker and downgrade mechanism to avoid high cardinality metrics data and excessive trace data #173

Open NeJan2020 opened 2 years ago

NeJan2020 commented 2 years ago

Is your feature request related to a problem? Please describe.

Scenario 1: A service running on the server depends on an external database. When the database fails while the server keeps receiving a steady stream of requests, a large amount of meaningless abnormal trace data is generated and recorded for a specific topology.

Scenario 2: In a large-scale k8s cluster, URLs and SQL statements that cannot be merged may produce excessive metric data. When downstream processing components such as Receiver/Prometheus cannot bear the pressure, unforeseen data loss occurs.

Describe the solution you'd like

In Scenario 1, we need a mechanism that turns on automatically under high pressure, so that meaningless and repetitive traces are not saved in large numbers and more valuable data (such as topology) gets enough resources for processing. This consists of two parts: judging whether the current trace pressure affects the operation of other parts of the system, and judging whether a newly generated trace is worth processing and recording.

In Scenario 2, we need controllable service degradation logic that gradually reduces the impact of a divergent dimension (such as URL or SQL) on the entire system. Degradation can start with a divergent dimension of a single service and extend to one or more dimensions across the entire monitored cluster. We also need to decide which dimensions carry more valuable data, in order to determine the order of degradation.

Describe alternatives you've considered

In Scenario 2, we can also add logic to converge dimensions that have already diverged.
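As an illustration of what converging a divergent URL dimension could look like, the sketch below replaces path segments that look like identifiers with a placeholder. The function names and the heuristic (all-digit segments, or hex-like segments of 16 or more characters) are assumptions for this example and do not describe existing Kindling behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// isVariableSegment is a hypothetical heuristic: a path segment is treated as
// variable if it is all digits, or is a long hex-like string such as a UUID.
func isVariableSegment(s string) bool {
	if s == "" {
		return false
	}
	allDigits, allHex := true, true
	for _, r := range s {
		if r < '0' || r > '9' {
			allDigits = false
		}
		isHex := (r >= '0' && r <= '9') || (r >= 'a' && r <= 'f') ||
			(r >= 'A' && r <= 'F') || r == '-'
		if !isHex {
			allHex = false
		}
	}
	return allDigits || (allHex && len(s) >= 16)
}

// ConvergeURL drops the query string and replaces variable segments with "*",
// so that many distinct URLs collapse into one label value.
func ConvergeURL(raw string) string {
	if i := strings.IndexByte(raw, '?'); i >= 0 {
		raw = raw[:i]
	}
	parts := strings.Split(raw, "/")
	for i, p := range parts {
		if isVariableSegment(p) {
			parts[i] = "*"
		}
	}
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(ConvergeURL("/users/12345/orders?page=2"))
	fmt.Println(ConvergeURL("/files/550e8400-e29b-41d4-a716-446655440000"))
	fmt.Println(ConvergeURL("/api/v1/health"))
}
```

A similar normalization for SQL would replace literals with placeholders; that is left out here because it needs a tokenizer to be done safely.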

NeJan2020 commented 2 years ago

Adding a dynamic configuration to manually adjust the head-based sampling rate for different services on the collection side is also a solution for Scenario 1. Once the user knows that a service is abnormal, they can disable or reduce error and slow trace collection for the related services until the repair is complete, while other services in the cluster, which are running normally, are unaffected.
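A per-service sampling rate that can be changed at runtime might be sketched as follows. The `SamplingConfig` type and its methods are hypothetical names for this example, and hashing the trace ID with FNV-1a is just one simple way to get a deterministic sampling decision.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// SamplingConfig holds a default rate plus per-service overrides (hypothetical type).
type SamplingConfig struct {
	mu          sync.RWMutex
	defaultRate float64
	perService  map[string]float64
}

func NewSamplingConfig(defaultRate float64) *SamplingConfig {
	return &SamplingConfig{defaultRate: defaultRate, perService: make(map[string]float64)}
}

// SetRate overrides the rate for one service at runtime; the rate is clamped to [0, 1].
func (c *SamplingConfig) SetRate(service string, rate float64) {
	if rate < 0 {
		rate = 0
	}
	if rate > 1 {
		rate = 1
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.perService[service] = rate
}

// Rate returns the override for the service, or the default rate if none is set.
func (c *SamplingConfig) Rate(service string) float64 {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if r, ok := c.perService[service]; ok {
		return r
	}
	return c.defaultRate
}

// ShouldSample decides from a hash of the trace ID, so the same trace
// always gets the same decision for a given rate.
func (c *SamplingConfig) ShouldSample(service, traceID string) bool {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return float64(h.Sum32()%10000) < c.Rate(service)*10000
}

func main() {
	cfg := NewSamplingConfig(1.0)
	cfg.SetRate("order-service", 0) // operator turns off collection for the broken service
	fmt.Println(cfg.ShouldSample("order-service", "trace-1"))
	fmt.Println(cfg.ShouldSample("user-service", "trace-1"))
}
```

In this sketch only the overridden service is affected, which matches the intent of leaving healthy services untouched while the faulty one is being repaired.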