Closed joanneesteban closed 11 months ago
@kfrz for the analytics team to validate the data we're getting in the SLO report, it would be helpful to see when the systems are actually down. Is there a way to include in the SLO report a graph that shows when the downtime occurs, the way we have graphs for the latency?
Thanks team! We will review upon receiving the SLO report. Thanks in advance
This is the best I can come up with in an hour. This additional graph shows the general availability of the service at a 4-hour resolution (the screenshot has a typo). It aligns with the dips in the panel above, which doesn't allow for adding time ranges.
I'd be interested to know how the data correlates with the Google Analytics data, too - to see how the backend downtime affects the MDT submission project. Going to generate the report with this additional panel, happy to revisit it and tweak it if you have feedback.
External SLO report here - happy to iterate on this a bit more.
We may be able to investigate getting data from Google Analytics directly into Grafana to make the correlation easier.
@jonwehausen @bmcgrady-ep
As Keifer noted, we'll need to see if backend downtime correlates with a lack of submissions. So, are MDT submissions generally lower during the downtime hours? One of the things we'll need to take into account is whether submissions during these times are lower in general.
Please prioritize any dashboarding or GTM work that needs to go out this sprint. We can pull this into next sprint.
@kfrz thanks for the quick turnaround. So to clarify the 4 hour intervals for each day:
From the report, it's a bit unclear which time ranges are missing. But also, let us know if we can just see that more easily on Grafana!
@joanneesteban - I'm not seeing any evidence in Google Analytics that the backend downtime is correlating with a lack of submissions. I used this hourly GA report and looked at each day.
GA - 9/17
SLO report shows that there was downtime between 8pm-12am on the 17th, but there were forms submitted during that time.
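To judge whether submissions during a reported downtime window are unusually low, one option is to compare each downtime hour against the typical count for the same hour of day on other days. The sketch below is only an illustration: the data shape (hourly counts keyed by a datetime truncated to the hour) and the function name `baseline_ratio` are assumptions, not the actual GA export format.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def baseline_ratio(hourly_counts, downtime_hours):
    """Compare submissions in each downtime hour with the median for that hour of day.

    hourly_counts: dict mapping datetime (truncated to the hour) -> submission count
                   (hypothetical shape, e.g. built from an hourly GA export)
    downtime_hours: set of datetimes (truncated to the hour) the SLO report marks as down
    Returns a dict mapping each downtime hour -> (count, baseline median, ratio or None).
    """
    # Build the baseline from hours that were NOT reported as down.
    by_hour_of_day = defaultdict(list)
    for ts, count in hourly_counts.items():
        if ts not in downtime_hours:
            by_hour_of_day[ts.hour].append(count)

    result = {}
    for ts in sorted(downtime_hours):
        count = hourly_counts.get(ts, 0)
        samples = by_hour_of_day.get(ts.hour)
        if not samples:
            # No comparable hours available, so no baseline can be computed.
            result[ts] = (count, None, None)
            continue
        base = median(samples)
        ratio = count / base if base else None
        result[ts] = (count, base, ratio)
    return result
```

A ratio near 1 for a downtime hour would suggest submissions were about normal for that time of day, while a ratio near 0 would suggest the downtime blocked them.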
@drorva
@kfrz can we dig into this and see why there's a discrepancy?
@alexpappasoddball can we schedule an investigation of this? We'll need @kfrz to collaborate with @bmcgrady-ep or someone else on the Analytics team.
@drorva Yes, we will get this into the pipeline to look at after we wrap up the Sentry work we are currently trying to get across the finish line.
@alexpappasoddball any updates with this? We're looking to close out this ticket (I'm assuming the work will be continued with Datadog).
I'd like to dig a bit more into this since it looks like it'll take longer to get Datadog. @alexpappasoddball can we do a quick single-day look at this in sprint 40 and see if we can at least figure out the discrepancy? If it's a simple, one-day fix, let's do it, but anything more involved can wait.
Thanks! We'll put this in our backlog until we hear back.
Here is the task for BE Tools for sprint 40. We will update this ticket with findings once completed!
The breakers board in Grafana shows general "uptime" of an external service, which may correlate with the Google Analytics data. This metric is calculated over a 5m interval, i.e. if the service is down for 5m or more it will show as unavailable.
Beyond something like the above-linked board, there's not a great way to correlate this data directly - only so far as we already have, leveraging Grafana to visualize outage trends.
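As a rough illustration of how the 5m samples relate to the 4-hour panel in the SLO report, the sketch below rolls per-interval up/down samples into coarser buckets and reports the fraction of each bucket that was up. The input shape (one `(datetime, bool)` pair per 5m interval) and the function name `availability_by_bucket` are assumptions for illustration; this is not how Grafana computes the panel.

```python
from collections import defaultdict
from datetime import datetime


def availability_by_bucket(samples, bucket_hours=4):
    """Roll 5-minute up/down samples into coarser buckets.

    samples: iterable of (datetime, bool) pairs, one per 5m interval (hypothetical shape)
    bucket_hours: bucket width in hours; should divide 24 evenly
    Returns a dict mapping bucket start -> fraction of samples in that bucket that were up.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket start -> [up count, sample count]
    for ts, is_up in samples:
        # Snap the timestamp down to the start of its bucket.
        bucket_start = ts.replace(
            hour=ts.hour - ts.hour % bucket_hours, minute=0, second=0, microsecond=0
        )
        totals[bucket_start][0] += 1 if is_up else 0
        totals[bucket_start][1] += 1
    return {start: up / n for start, (up, n) in sorted(totals.items())}
```

A 4-hour bucket with a fraction below 1 shows up as a dip, but the service may have been down for only part of that window, so per-bucket fractions like these could make it easier to line up outages against hourly GA submissions.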
Issue Description
The VSP backend tools team has been sharing an SLO report weekly. How might we validate that the reported availability of these systems accurately portrays when each system is up or down?
Tasks
Acceptance Criteria
@drorva