apache / skywalking

APM, Application Performance Monitoring System
https://skywalking.apache.org/
Apache License 2.0

Add debug services(GraphQL) in query protocol #7112

Closed wu-sheng closed 3 years ago

wu-sheng commented 3 years ago

@kezhenxu94 @hanahmily We have been suffering from runtime debugging issues in the prod env for months. I want to discuss whether we should consider adding a new type of GraphQL service, named debug.

The debug catalog should be a group of interactive services provided at the OAP instance level. Because of this, no query should go through the UI webapp or a load balancer.

A demo command sequence for the debug service could look like this (ignore the format, I haven't written it in GraphQL grammar):

  debug start --target als-k8s --duration 5m
  debug statistic
  debug clear

We should work on a statistic API, very similar to the internal observability APIs, except that it only works after receiving the start command with a certain duration. Then, we could get the result as a statistic table until one of the following happens:

  1. The next debug start is issued.
  2. debug clear is called.
  3. 20 min (or another threshold value) have passed.
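
The lifecycle above (collect only after a start command, keep the table readable until the next start, a clear, or a timeout) could be sketched roughly as follows. This is only an illustration under assumptions, not a proposed implementation: the class `DebugSession`, its method names, and the injectable clock are all hypothetical, and the retention is assumed to be measured from the start command.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

/** Hypothetical per-OAP-instance debug session; every name here is illustrative. */
final class DebugSession {
    private final Supplier<Instant> clock;   // injectable so the timeout can be tested
    private final Duration retention;        // e.g. the 20 min threshold from the list above
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();
    private volatile String target;
    private volatile Instant collectUntil = Instant.MIN;
    private volatile Instant expireAt = Instant.MIN;

    DebugSession(Supplier<Instant> clock, Duration retention) {
        this.clock = clock;
        this.retention = retention;
    }

    /** "debug start": drops the previous table and collects for the given duration. */
    synchronized void start(String target, Duration duration) {
        counters.clear();
        Instant now = clock.get();
        Instant end = now.plus(duration);
        this.target = target;
        this.expireAt = now.plus(retention);
        // Never collect past the point where the table expires.
        this.collectUntil = end.isBefore(expireAt) ? end : expireAt;
    }

    /** Called from analysis code; does nothing unless this target is being collected. */
    void record(String target, String key) {
        if (target.equals(this.target) && clock.get().isBefore(collectUntil)) {
            counters.computeIfAbsent(key, k -> new LongAdder()).increment();
        }
    }

    /** "debug statistic": returns a copy of the table, empty once the retention has passed. */
    synchronized Map<String, Long> statistic() {
        if (!clock.get().isBefore(expireAt)) {
            counters.clear();
        }
        Map<String, Long> table = new TreeMap<>();
        counters.forEach((key, count) -> table.put(key, count.sum()));
        return table;
    }

    /** "debug clear": drops the table and stops collecting. */
    synchronized void clear() {
        counters.clear();
        target = null;
        collectUntil = Instant.MIN;
        expireAt = Instant.MIN;
    }
}
```

Outside an active session, `record` costs only a comparison, which fits the idea that the debug API is paid for on demand only.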

FYI @apache/skywalking-committers You may notice we could also use this for the agent trace/meter/log analysis statistics.

The major difference between the debug API and the telemetry APIs is not technical; it lies in the usage scenario. The debug API implementation could be more aggressive and provide more context for on-demand debugging.

kezhenxu94 commented 3 years ago

Can you provide some more example statistics? I'm not sure how much the statistics can help. Also, if we need to sort out the important statistics, why not just add them to the s11y metrics? If, by any chance, you mean to include some log-like details in the statistics, I'd rather provide an API to switch the log level at runtime.

wu-sheng commented 3 years ago

why not just add them to the s11y metrics?

Because it costs too much.

Can you provide some more example statistics?

For example, we could:

  1. Get the list of IPs that can't be successfully analyzed by the ALS IP mapping.
  2. Provide the IP list of services that don't provide metadata exchange and fall back to IP mapping.

wu-sheng commented 3 years ago

Unlike the telemetry API, the debug API is not about performance. It focuses on status, and it has a chance to list all unexpected statuses because the resource cost required (mostly memory) is limited.

hanahmily commented 3 years ago

@wu-sheng and I discussed this solution a few weeks ago. After that, I realized that an auto-config log4j would help more than dynamic statistics. We should support changing the logging level on the fly, since rebooting OAP would cover up the problems.

The reasons are:

  1. Logging has more details than metrics, which will give our users more flexible paths to investigate issues.
  2. Logging would use fewer resources than specific metrics with some turn-on/off switches.

For this issue, if ALS-related debug logging is active, users can leverage grep and wc to filter and count the objects they want. We might provide some instructions on how to compute the key statistics, or enhance swctl to handle logs.
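
As a rough illustration of that kind of filter-and-count step, here is a minimal sketch that tallies IPs from debug log lines. The message format `unmapped ip=...` and the class name `UnmappedIpCounter` are made up for this example; they are not actual OAP log output or an existing tool.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Counts how often each IP shows up in matching debug log lines. */
final class UnmappedIpCounter {
    // Hypothetical message format, NOT an actual OAP log line.
    private static final Pattern UNMAPPED =
        Pattern.compile("unmapped ip=(\\d{1,3}(?:\\.\\d{1,3}){3})");

    /** Roughly what grep plus a per-IP count would produce, sorted by IP string. */
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher matcher = UNMAPPED.matcher(line);
            if (matcher.find()) {
                counts.merge(matcher.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The input could be, for instance, the result of `java.nio.file.Files.readAllLines` on an OAP log file; lines that do not match the pattern are simply skipped.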

Furthermore, log4j provides some ways to implement auto-config. We will introduce a mutation API to change the logging level for a particular logger.
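
To make the idea concrete, the core of such a mutation could look like the sketch below. To keep the example dependency-free it uses `java.util.logging` as a stand-in rather than log4j; with Log4j 2 the analogous call would be `Configurator.setLevel(loggerName, level)` from log4j-core. The class `LogLevelSwitch` and the logger name in the usage note are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Stand-in for the backend of a "change log level" mutation (java.util.logging, not log4j). */
final class LogLevelSwitch {
    // LogManager only keeps weak references to loggers, so pin the ones we
    // touched; otherwise a changed level could be lost after garbage collection.
    private static final Map<String, Logger> PINNED = new ConcurrentHashMap<>();

    /** Sets the level of one named logger and returns the previous level (null means inherited). */
    static Level setLevel(String loggerName, Level newLevel) {
        Logger logger = PINNED.computeIfAbsent(loggerName, Logger::getLogger);
        Level previous = logger.getLevel();
        logger.setLevel(newLevel);
        return previous;
    }
}
```

A call such as `LogLevelSwitch.setLevel("hypothetical.als.Analyzer", Level.FINE)` would then take effect without restarting the process, which is the property that matters here; returning the previous level makes it easy to switch back afterwards.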

wu-sheng commented 3 years ago

If dynamic log config helps, feel free to rename this issue and rewrite the comments. The point of submitting this is that we need a way to help identify the problems.

kezhenxu94 commented 3 years ago

I agree. Logs have more details, and switching the logging level is a more common debugging approach.

wu-sheng commented 3 years ago

@kezhenxu94 I am going to close this. Please file another one for changing the log level at runtime.

kezhenxu94 commented 3 years ago

In favor of https://github.com/apache/skywalking/issues/7114