Add a real-time circuit breaker and downgrade mechanism to avoid high cardinality metrics data and excessive trace data

Is your feature request related to a problem? Please describe.

Scenario 1: A service running on the server depends on an external database. When the database fails while the server keeps receiving a steady stream of requests, a large volume of meaningless abnormal trace data is generated and recorded on a specific topology.

Scenario 2: In a large-scale k8s cluster, URLs and SQL statements that cannot be merged may produce excessive metric data. When downstream processing components such as the Receiver or Prometheus cannot bear the load, unforeseen data loss occurs.

Describe the solution you'd like

In Scenario 1, we need a mechanism that switches on automatically under high pressure, so that meaningless, repetitive traces are not saved in large numbers and more valuable data (such as topology) gets enough resources for processing. It consists of two parts: judging whether the current trace pressure affects the operation of other parts of the system, and judging whether a newly generated trace is worth processing and recording.

In Scenario 2, we need controllable service degradation logic that gradually reduces the impact of a divergent dimension (such as URL or SQL) on the entire system. Degradation can proceed from a divergent dimension of a single service up to one or more dimensions across the entire monitored cluster. We also need to determine which dimensions of data are more valuable, in order to decide the order of degradation.
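To make the two requests more concrete, here is a minimal Python sketch of what they could look like. It is an illustration only, not Kindling's design: the names `TraceBreaker`, `DimensionLimiter`, the trace "signature" string, and all thresholds are hypothetical. The breaker treats "more traces than a fixed budget in the current time window" as the pressure signal and, once open, keeps only the first few traces per signature. The limiter keeps the first N distinct values of a dimension and collapses the rest into a placeholder.

```python
import time
from collections import defaultdict


class TraceBreaker:
    """Hypothetical sketch: drops repetitive traces while under pressure."""

    def __init__(self, max_traces_per_window, max_repeats_when_open,
                 window_seconds=1.0, clock=time.monotonic):
        self.max_traces_per_window = max_traces_per_window
        self.max_repeats_when_open = max_repeats_when_open
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.total_in_window = 0
        self.seen_in_window = defaultdict(int)

    def _roll_window(self):
        # Start a fresh window (and close the breaker) once the old one expires.
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.total_in_window = 0
            self.seen_in_window.clear()

    def is_open(self):
        # Part 1: is the trace pressure high enough to hurt other processing?
        # Assumption: pressure is approximated by a per-window trace budget.
        return self.total_in_window > self.max_traces_per_window

    def should_record(self, signature):
        # Part 2: is this newly generated trace worth recording?
        self._roll_window()
        self.total_in_window += 1
        self.seen_in_window[signature] += 1
        if not self.is_open():
            return True
        # Under pressure, keep only the first few traces per signature.
        return self.seen_in_window[signature] <= self.max_repeats_when_open


class DimensionLimiter:
    """Hypothetical sketch: caps the distinct values of one dimension."""

    def __init__(self, max_distinct_values, placeholder="other"):
        self.max_distinct_values = max_distinct_values
        self.placeholder = placeholder
        self.seen = set()

    def degrade(self, value):
        # Values already admitted keep their own label.
        if value in self.seen:
            return value
        # Admit new values until the cap is reached.
        if len(self.seen) < self.max_distinct_values:
            self.seen.add(value)
            return value
        # Beyond the cap, collapse the value so cardinality stops growing.
        return self.placeholder
```

A real implementation would probably derive the pressure signal from queue depth or resource usage rather than a fixed count, and would order dimensions by value before degrading them. Both of those decisions are left open in this sketch.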

Describe alternatives you've considered

In Scenario 2, we could also add logic to converge dimensions that have already diverged.
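As a rough illustration of what converging a divergent dimension could mean for URLs, the Python sketch below replaces path segments that look like identifiers with a placeholder. The function name `converge_url`, the `{id}` placeholder, and the two heuristics (all-digit segments, long hex or UUID-like segments) are assumptions made for this example, not an existing Kindling feature.

```python
import re

# Heuristics for "this segment is probably an identifier" (assumptions).
_NUMERIC = re.compile(r"^\d+$")
_HEXLIKE = re.compile(r"^[0-9a-fA-F-]{16,}$")


def converge_url(path):
    """Replace ID-like path segments with a placeholder and drop the query."""
    # The query string is a common source of divergence, so it is discarded.
    path = path.split("?", 1)[0]
    converged = []
    for segment in path.split("/"):
        if segment and (_NUMERIC.match(segment) or _HEXLIKE.match(segment)):
            converged.append("{id}")
        else:
            converged.append(segment)
    return "/".join(converged)
```

With these heuristics, `/users/12345/orders/987?page=2` and `/users/67890/orders/1` both map to `/users/{id}/orders/{id}`, so they share one metric label. SQL statements would need a different approach, such as replacing literals with placeholders.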

KindlingProject / kindling, issue #173