Open samsp-msft opened 1 year ago
I think a pause button on all monitoring screens has value: metrics, logging, tracing, console logs. Data is still gathered, and hitting resume updates the UI to the latest data. It wouldn't be too difficult to add.
A question is: is the pause button local to one screen, or does it apply across all of them? That would be more difficult.
If we're looking at other tools, I use the "Clear" button in Chrome/Wireshark a lot. Similar question to above: Would "Clear" clear telemetry for one page or all?
The simplest is to not collect new data while pause is on, and drop the grpc records so they don't affect any of the views. I would make it global so it affects all windows onto that dashboard process.
Do we do the same for streaming logs?
The simplest is to not collect new data while pause is on, and drop the grpc records so they don't affect any of the views. I would make it global so it affects all windows onto that dashboard process.
Well, that would stop new data being added, but the graph would keep moving in time but with no new data. Not what you want.
The graph moves forward in time because time is moving forward, not because there is new data for a time. It's simpler to freeze the UI at a point in time and not respond to updates.
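To make the "freeze the UI, keep gathering data" option concrete, here is a minimal sketch. `TelemetryStore` and `PausableView` are hypothetical names invented for illustration; they are not the dashboard's actual types, and the real storage is certainly more involved.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TelemetryStore:
    """Hypothetical in-memory store; it keeps collecting regardless of pause state."""
    records: List[str] = field(default_factory=list)

    def add(self, record: str) -> None:
        self.records.append(record)


@dataclass
class PausableView:
    """Hypothetical UI-side view that can freeze at a point in the record stream."""
    store: TelemetryStore
    frozen_count: Optional[int] = None  # None means the view is live

    def pause(self) -> None:
        # Remember how many records existed at the moment of pausing.
        self.frozen_count = len(self.store.records)

    def resume(self) -> None:
        # Going live again exposes everything collected while paused.
        self.frozen_count = None

    def visible(self) -> List[str]:
        if self.frozen_count is None:
            return list(self.store.records)
        return self.store.records[: self.frozen_count]


store = TelemetryStore()
view = PausableView(store)
store.add("span-1")
view.pause()
store.add("span-2")  # still collected, but hidden from the paused view
paused = view.visible()
view.resume()
resumed = view.visible()
```

With this shape nothing is lost while paused, and resume simply catches the view up to the latest data.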
The fundamental question is, what does a pause button do:
Can you pause the graphs from moving forward too? They would have to be fed the time range to show: rather than using Now(), they would have a field that holds the last time to show.
Stopping the graph from moving forward in time is easy. It's work to move them. It's just a matter of not doing that.
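A rough sketch of the "feed the chart an end time instead of Now()" idea. `ChartWindow` is a hypothetical helper, not an existing dashboard class, and the caller passes in the current time so the behaviour is easy to see.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


class ChartWindow:
    """Hypothetical helper that computes the time range a chart should display."""

    def __init__(self, duration: timedelta) -> None:
        self.duration = duration
        self.paused_end: Optional[datetime] = None  # set while paused

    def pause(self, now: datetime) -> None:
        # Pin the right-hand edge of the chart to the moment of pausing.
        self.paused_end = now

    def resume(self) -> None:
        self.paused_end = None

    def range(self, now: datetime) -> Tuple[datetime, datetime]:
        # While paused, ignore the wall clock and reuse the pinned end time.
        end = self.paused_end if self.paused_end is not None else now
        return end - self.duration, end


t0 = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
window = ChartWindow(timedelta(minutes=5))
live = window.range(t0)
window.pause(t0)
frozen = window.range(t0 + timedelta(minutes=10))  # clock moved on, chart did not
window.resume()
after_resume = window.range(t0 + timedelta(minutes=10))
```

The chart only moves when it is asked to render a later end time, so pausing is a matter of not advancing that value.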
I'm trying to nail down exactly how the button should work.
What does the pause button do:
What pages is the pause button on:
Is the pause button global:
What does the UI look like? A button that is selected/unselected, a checkbox, a toggle, etc?
These are important details and should be thought about beforehand to avoid wasting effort from re-work.
What pages is the pause button on:
Metrics
Metrics, Tracing, Structured logs (all the OTEL pages)
Metrics, Tracing, Structured logs, streaming logs (all monitoring pages)
You could even potentially consider the Projects/Containers/Executables pages to be monitoring pages as well since they're watching and responding to state changes.
Maybe "pause" is the wrong term. It's a toggle for whether to "record" data or not. Existing examples:
Chrome F12 tools:
Wireshark:
VS Telemetry Explorer:
The default is recording mode. When not recording, it's ??? paused? The example apps above all have some notion of whether they are capturing or not. They also all have the ability to clear the results, so you can capture again without the old data interfering with what you are collecting this time.
"Pause" should:
Once paused, I should be able to switch between views and select traces, log entries, metrics, etc. It should only resume data collection when you tell it to. "Pause" should be global across the data sources (don't have a different state per data source/type).
Resuming collection will leave a hole in the timeline where no data has been collected. Ideally that should be opaque to the user (except maybe for metrics, as the graph will likely scroll again to the new time).
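For comparison, a minimal sketch of the "record / don't record" behaviour described here: one global flag, incoming data dropped while paused, and a clear operation. `Recorder` and its fields are hypothetical. Tracking the gap intervals is an extra assumption, added only so a UI could decide whether to show or hide the hole.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Recorder:
    """Hypothetical global recorder shared by all telemetry types."""
    recording: bool = True  # recording is the default mode
    items: List[Tuple[float, str]] = field(default_factory=list)
    gaps: List[Tuple[float, float]] = field(default_factory=list)
    _paused_at: Optional[float] = None

    def pause(self, now: float) -> None:
        if self.recording:
            self.recording = False
            self._paused_at = now

    def resume(self, now: float) -> None:
        if not self.recording:
            self.recording = True
            # Remember the interval in which nothing was collected.
            self.gaps.append((self._paused_at, now))
            self._paused_at = None

    def ingest(self, timestamp: float, payload: str) -> bool:
        # Returns False when the record was dropped because recording is off.
        if not self.recording:
            return False
        self.items.append((timestamp, payload))
        return True

    def clear(self) -> None:
        # Equivalent of the "Clear" button: start a fresh capture.
        self.items.clear()
        self.gaps.clear()


rec = Recorder()
rec.ingest(1.0, "log A")
rec.pause(2.0)
dropped = rec.ingest(3.0, "log B")  # arrives while paused, so it is lost
rec.resume(4.0)
rec.ingest(5.0, "log C")
```

Unlike a frozen view, anything that arrives while paused is gone for good, which is the trade-off to weigh.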
Pause should be at least for the OTEL pages, and probably the logs streaming. I don't consider projects/executables/containers to be collecting data so much as a control plane for the services that you are going to collect data from.
Ideally it would be a top-level control next to settings in the title bar (following precedent from other similar apps)
The OTEL and streaming log pages all have toolbars. It could go there.
It would take a bit of work to implement (UI changes in 6 pages, OTLP collector, log streaming infrastructure). Need the overall team to confirm that it's something we want and prioritized.
Container logs have timestamps; executable logs don't. So what exactly does it mean to stop collecting data for them? If you stop collecting at 1pm but the app hasn't written to the stdout file yet, or we haven't read the logs yet, you would miss logs that occurred before 1pm, which are potentially the most important ones if you stopped collecting in order to debug an issue. And because there are no timestamps, nothing in the view shows that logs from 1pm to 2pm were not collected. The current log pages auto-scroll to the end, but if the user scrolls to a particular section we stop auto-scrolling.
I am not entirely sure that not collecting data is the right thing to do here (it could be different for the OTEL pages). All the tools mentioned above are intercepting tools: data is being transferred and the tool captures part of the transfer. You open the tool when things are going wrong, to intercept the transfer so you can debug the issue. Our dashboard monitors everything by default, and there is no easy way to look into logs. Pausing the UI so that you can focus on the current data may be more appropriate.
A global pause would work if all the pages collected data in one form and through one service. But whenever you look at project/container logs, we go to Docker or read the stdout file. So if you haven't opened the logs page, we start the service to read logs from scratch, and at best we can filter the output for containers (there are no timestamps for projects). We don't have a way to tell Docker to give us logs only up to a certain point, so implementing a global switch would involve a lot more stateful processing than just pausing data updates.
We do coerce the projects to show timestamps now, so that part is available. But even with that, pausing with data loss is an odd experience for logs, in my opinion. And that's not even getting into how the logs are implemented (file and stdout reading).
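If log lines do carry timestamps, one way around the "paused at 1pm but hadn't read the file yet" problem is to filter on the line's own timestamp instead of the time it was read. The sketch below assumes each line starts with a timezone-aware ISO 8601 timestamp followed by a space; that format, the helper names, and the choice to keep lines with no timestamp are all assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple


def split_timestamp(line: str) -> Tuple[Optional[datetime], str]:
    """Return (timestamp, message), or (None, line) if no usable timestamp prefix."""
    prefix, _, rest = line.partition(" ")
    try:
        stamp = datetime.fromisoformat(prefix)
    except ValueError:
        return None, line
    if stamp.tzinfo is None:
        # Treat timestamps without an offset as unusable for comparison.
        return None, line
    return stamp, rest


def lines_up_to(lines: Iterable[str], cutoff: datetime) -> List[str]:
    """Keep lines stamped at or before the cutoff, whenever they happen to be read."""
    kept = []
    for line in lines:
        stamp, _ = split_timestamp(line)
        # Lines without a timestamp cannot be placed in time; this sketch keeps them.
        if stamp is None or stamp <= cutoff:
            kept.append(line)
    return kept


cutoff = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
lines = [
    "2024-01-01T12:59:58+00:00 request failed",
    "2024-01-01T13:00:05+00:00 retrying",
    "no timestamp here",
]
kept = lines_up_to(lines, cutoff)
```

This still means reading the whole stream and discarding part of it, which is the extra stateful processing mentioned above.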
I'm inclined to say that this ask should focus on the metrics (or metrics and traces?) and if we get feedback that it'd be useful to have logs pause also, we can tackle that (and find a cohesive implementation for the UI).
Unclear if we want to do this, I would say no, but I'll mark this untriaged so the dashboard crew can look.
Instead of a default live/real-time view of the traces, can we let the user select what projects/resources he wants to see traces for and then click a button to see the most recent traces in the last 5/10/30/60 mins? Or for any specific time range?
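As a sketch of what that query could look like, assuming traces are kept somewhere with a resource name and a start time (the `Trace` type and `recent_traces` function are hypothetical, not an existing dashboard API):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Trace:
    """Hypothetical trace summary."""
    resource: str
    started: datetime
    name: str


def recent_traces(
    traces: Iterable[Trace],
    resources: Set[str],
    now: datetime,
    window: timedelta,
) -> List[Trace]:
    """Traces for the selected resources that started within the window, newest first."""
    earliest = now - window
    matches = [
        t for t in traces
        if t.resource in resources and earliest <= t.started <= now
    ]
    return sorted(matches, key=lambda t: t.started, reverse=True)


now = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
traces = [
    Trace("api", now - timedelta(minutes=2), "GET /orders"),
    Trace("api", now - timedelta(minutes=45), "GET /old"),
    Trace("worker", now - timedelta(minutes=1), "process job"),
]
recent = recent_traces(traces, {"api"}, now, timedelta(minutes=30))
```

Because the result is computed once per button click, the list does not shift while you are reading it.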
I bet currently there is nothing in the backend collecting data for this, but the current experience is essentially useless in the cloud when you enable health checks and probes:
https://github.com/user-attachments/assets/96e75d26-0da0-4c58-9a0e-e90b77427ed3
I'm mainly interested in the cloud experience. But I bet local development with tons of health-check-enabled microservices is equally bad.
Those health probes should be excluded from telemetry.
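A minimal sketch of that kind of exclusion, assuming spans are available as attribute dictionaries with a `url.path` attribute and that probes hit `/health` or `/alive`. Both the span shape and the probe paths are assumptions; adjust them to whatever the services actually expose.

```python
from typing import Dict, Iterable, List

# Assumed probe endpoints; not taken from any particular configuration.
HEALTH_PATHS = {"/health", "/alive"}


def is_health_probe(span: Dict[str, str]) -> bool:
    """True when the span's request path is one of the assumed probe endpoints."""
    return span.get("url.path", "") in HEALTH_PATHS


def without_health_probes(spans: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop probe spans so they don't crowd out real requests."""
    return [span for span in spans if not is_health_probe(span)]


spans = [
    {"name": "GET /health", "url.path": "/health"},
    {"name": "GET /orders", "url.path": "/orders"},
    {"name": "GET /alive", "url.path": "/alive"},
    {"name": "background job"},  # no path attribute, so it is kept
]
filtered = without_health_probes(spans)
```

The same predicate could run at collection time or only as a display filter; which one is appropriate is an open question here.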
Similar to Chrome's network tracing, or other network scanning tools, so you can snapshot the current values and they don't get lost to time or scroll out of view while working on the code/problems