Open tim-o opened 6 years ago
Potentially related: #18845
@tschottdorf @petermattis who would be the PM to look at this? which project does this fall in?
Sounds like webui to me. cc @piyush-singh. I also don't think this is feasible because our errors aren't structured. Plus, our logs are subject to change as we clean them up (I think cleaning them up is more important than trying to build tools that deal with the fact that they're messy).
I'm not entirely sure this should be displayed in the CockroachDB web UI though, because perhaps by the time these logs are collected the web UI is inoperative.
Having some kind of text processor on the log files that does the frequency analysis suggested by Tim, using fuzzy matching (text distance below some threshold) would probably work.
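To make the fuzzy-matching idea concrete, here is a minimal sketch using Python's standard-library `difflib`. The function name `group_similar` and the 0.8 threshold are illustrative assumptions, not anything that exists in CockroachDB. It greedily groups messages whose similarity ratio to a group's first member meets the threshold, then sorts the groups by frequency.

```python
from difflib import SequenceMatcher


def group_similar(messages, threshold=0.8):
    """Greedily cluster log messages by text similarity.

    Each message joins the first group whose representative (the first
    message seen in that group) has a similarity ratio >= threshold;
    otherwise it starts a new group. The threshold is a guess and would
    need tuning against real logs.
    """
    groups = []  # each entry is [representative_message, count]
    for msg in messages:
        for group in groups:
            if SequenceMatcher(None, group[0], msg).ratio() >= threshold:
                group[1] += 1
                break
        else:
            # No existing group was close enough.
            groups.append([msg, 1])
    # Most frequent groups first.
    return sorted(groups, key=lambda g: g[1], reverse=True)
```

This is quadratic in the number of distinct message shapes, so a real tool would likely normalize messages first (for example, by stripping numbers) before comparing them.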
This is really an issue requesting the creation of new tooling, either as a new cockroach sub-command or a separate tool. I don't think there is any extant component in CockroachDB that goes in this direction already.
Agreed that @piyush-singh could prioritize this, although I suspect that @kannanlakshmi would like to prioritize this too as it will greatly aid the troubleshooting of managed clusters.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
As either a support team member or an operator of CRDB, it's difficult to use CockroachDB's log files. Any time an issue is encountered, a significant amount of time is spent simply collecting and collating log files. This is time-consuming and error-prone. It'd be very helpful to create a script or a web application that could:
1) Take a cockroach debug zip as input.
2) Parse the logs and create a visualization of the frequency of errors and warnings over time, total and by node.
3) Show a list of errors by their frequency in the logs, total and by node, with the first and last occurrence.
4) Allow the user to drill down to see the distribution of a particular error over time.
... there are probably other things that we'd want to visualize, but this would be a very helpful start and would avoid a lot of manual hunting and pecking, and false starts.
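As a rough illustration of steps 2 and 3, the sketch below counts warnings and errors per node per hour and records the first and last occurrence of each error site. It rests on several assumptions: the log lines are already extracted from the debug zip and grouped by node, each line looks like `E210116 21:49:17.073282 14 server/node.go:464 [n1] message`, and errors are keyed by source file and line rather than by message text. The names `LINE_RE` and `summarize` are hypothetical.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Assumed line shape (not verified against every log format version):
#   "E210116 21:49:17.073282 14 server/node.go:464 [n1] message"
LINE_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<ts>\d{6} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"\d+ (?P<src>\S+:\d+)\s+(?P<msg>.*)$"
)


def summarize(lines_by_node):
    """Summarize warnings, errors, and fatals from {node: [log lines]}.

    Returns (per_hour, sites):
      per_hour: Counter keyed by (node, hour) for frequency over time
      sites:    dict keyed by (node, "file:line") with count/first/last
    """
    per_hour = Counter()
    sites = defaultdict(lambda: {"count": 0, "first": None, "last": None})
    for node, lines in lines_by_node.items():
        for line in lines:
            m = LINE_RE.match(line)
            # Skip unparseable lines and info-level entries.
            if not m or m["sev"] not in "WEF":
                continue
            ts = datetime.strptime(m["ts"], "%y%m%d %H:%M:%S.%f")
            hour = ts.replace(minute=0, second=0, microsecond=0)
            per_hour[(node, hour)] += 1
            entry = sites[(node, m["src"])]
            entry["count"] += 1
            entry["first"] = ts if entry["first"] is None else min(entry["first"], ts)
            entry["last"] = ts if entry["last"] is None else max(entry["last"], ts)
    return per_hour, dict(sites)
```

Keying on the source location sidesteps unstructured message text, at the cost of breaking whenever the logging call moves to a different line between versions.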
Jira issue: CRDB-4877