Closed wu-sheng closed 3 years ago
Can you provide some more example statistics? I'm not sure how much the statistics can help. Also, if we need to sort out the important statistics, why not just add them to the s11y metrics? If, by any chance, you mean to include some log-like details in the statistics, I'd rather provide an API to switch the log level at runtime.
why not just add them to the s11y metrics?
Because it costs too much.
Can you provide some more example statistics?
Such as we could
The `debug` API is not about performance, unlike the telemetry API. It focuses on status, and has a chance to list all unexpected statuses, because the resource cost required (mostly memory) is limited.
@wu-sheng and I discussed this solution a few weeks ago. After that, I realized an auto-config log4j would help more than dynamic statistics. We should guarantee that the logging level can be changed on the fly, since rebooting OAP would cover up the problems.
The reason is: for this issue, if the ALS-related debug logging is active, the user can leverage `grep` and `wc` to filter and count the objects they want. We might provide some instructions on how to compute the key metrics, or enhance `swctl` to handle logs.
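As a rough illustration of the `grep` plus `wc` style of analysis mentioned above, here is a minimal Python sketch that filters debug log lines and counts them per key. The log lines, the `reason=` field, and the `count_matches` helper are all hypothetical; the real OAP debug log format may look quite different.

```python
import re
from collections import Counter


def count_matches(lines, pattern, group=0):
    """Count lines matching `pattern`, keyed by one capture group.

    Roughly what `grep -o ... | sort | uniq -c` would give on the shell.
    """
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        match = regex.search(line)
        if match:
            counts[match.group(group)] += 1
    return counts


# Hypothetical debug log lines, made up for this sketch only.
sample = [
    "10:00:01 DEBUG [hypothetical-als-logger] dropped entry: reason=unknown_ip",
    "10:00:02 DEBUG [hypothetical-als-logger] dropped entry: reason=no_metadata",
    "10:00:03 INFO  [hypothetical-server] heartbeat",
    "10:00:04 DEBUG [hypothetical-als-logger] dropped entry: reason=unknown_ip",
]

by_reason = count_matches(sample, r"reason=(\w+)", group=1)
total = sum(by_reason.values())  # equivalent of `grep -c "reason="`
```

The per-key `Counter` is the part a plain `wc -l` does not give directly, which is probably where extra tooling would help most.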
Furthermore, log4j already provides some ways to implement auto-config. We will introduce a mutation API to change the logging level for a particular logger.
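To make the idea of a per-logger level mutation concrete, here is a small sketch. It uses Python's standard `logging` module purely as a stand-in for log4j, so it is not the OAP implementation; the `set_logger_level` function and the logger names are assumptions for illustration.

```python
import logging

# Level names accepted by this sketch; the real mutation API may differ.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERROR": logging.ERROR,
}


def set_logger_level(logger_name: str, level_name: str) -> int:
    """Change the level of one named logger at runtime, without a restart."""
    try:
        level = _LEVELS[level_name.upper()]
    except KeyError:
        raise ValueError(f"unknown level: {level_name!r}") from None
    logging.getLogger(logger_name).setLevel(level)
    return level


# Hypothetical logger name: turn on debug output for one component only.
set_logger_level("hypothetical.als", "debug")
als_logger = logging.getLogger("hypothetical.als")
debug_enabled = als_logger.isEnabledFor(logging.DEBUG)

# Switch it back once the investigation is over.
set_logger_level("hypothetical.als", "warn")
debug_disabled = not als_logger.isEnabledFor(logging.DEBUG)
```

Scoping the change to a single logger name keeps the extra log volume limited to the component under investigation.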
If dynamic log config helps, feel free to rename this issue and rewrite the comments. The point of submitting this is that we need a way to help identify problems.
I agree. Logs have more details and switching logging level is a more common debugging approach.
@kezhenxu94 I am going to close this. Please file another one for changing the log level at runtime.
In favor of https://github.com/apache/skywalking/issues/7114
@kezhenxu94 @hanahmily We have been suffering from runtime debugging issues in the prod env for months. I want to discuss whether we should consider adding a new type of GraphQL service, named `debug`. The `debug` catalog should be a group of interactive services, provided at the OAP instance level. Because of this, no query should be done through the UI webapp or a load balancer. A demo command of the debug service should look like this (ignore the format, I didn't write it in GraphQL grammar)
We should work on a statistic API, very similar to the internal observability APIs, except that it only works after receiving the `start` command with a certain duration. Then, we could get the result as a statistic table until `debug start` / `debug clear`.
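The start-with-duration and clear behavior described above could be sketched roughly as follows. This is a language-neutral illustration in Python, not the proposed GraphQL service; the `DebugSession` class, its method names, and the status strings are all hypothetical.

```python
import time
from collections import Counter


class DebugSession:
    """Collects status counts only inside a window opened by start()."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the window can be tested
        self._deadline = None    # None means no window is open
        self._stats = Counter()

    def start(self, duration_seconds):
        """Open a fresh collection window lasting `duration_seconds`."""
        self._stats.clear()
        self._deadline = self._clock() + duration_seconds

    def record(self, status):
        """Count `status` if the window is open; return whether it was kept."""
        if self._deadline is not None and self._clock() < self._deadline:
            self._stats[status] += 1
            return True
        return False

    def table(self):
        """Return (status, count) rows, most frequent first."""
        return sorted(self._stats.items(), key=lambda kv: (-kv[1], kv[0]))

    def clear(self):
        """Drop collected results and close the window."""
        self._stats.clear()
        self._deadline = None


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
session = DebugSession(clock=lambda: now[0])
ignored_before_start = session.record("unknown_ip")  # no window open yet
session.start(60)
session.record("unknown_ip")
session.record("no_metadata")
session.record("unknown_ip")
now[0] = 61.0
ignored_after_window = session.record("late_status")  # window has expired
rows = session.table()  # results stay readable until clear()
session.clear()
```

Keeping collection off until `start` is called is what would let such an API gather more context than always-on telemetry, since the cost is only paid on demand.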
FYI @apache/skywalking-committers You may notice we could also use this for the agent trace/meter/log analysis statistics.
The major difference between the `debug` API and the `telemetry` APIs is not from a tech perspective; it is more about the usage scenario. The `debug` API implementation could be more aggressive and provide more context in on-demand debugging.