department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit the complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Validate SLO report using GA #14146

Closed joanneesteban closed 10 months ago

joanneesteban commented 4 years ago

Issue Description

The VSP backend tools team has been sharing an SLO report weekly. How might we validate that the reported availability of these systems accurately reflects when each system is up or down?


Tasks

Acceptance Criteria


@drorva

drorva commented 4 years ago

@kfrz, for the analytics team to validate the data we're getting in the SLO report, it would help to see when the systems are actually down. Is there a way to include a graph in the SLO report that shows when downtime occurs, the way we have graphs for latency?

jonwehausen commented 4 years ago

Thanks, team! We will review upon receiving the SLO report. Thanks in advance.

kfrz commented 4 years ago

image

This is the best I can come up with in an hour. This additional graph shows the general availability of the service at a 4-hour resolution (the screenshot has a typo). It aligns with the dips in the panel above, which doesn't allow for adding time ranges.

I'd be interested to know how the data correlates with the Google Analytics data, too - to see how the backend downtime affects the MDT submission project. Going to generate the report with this additional panel, happy to revisit it and tweak it if you have feedback.

kfrz commented 4 years ago

External SLO report here - happy to iterate on this a bit more.

We may be able to investigate getting data from Google Analytics directly into Grafana to make the correlation easier.

joanneesteban commented 4 years ago

@jonwehausen @bmcgrady-ep

As Keifer noted, we'll need to see whether backend downtime correlates with a lack of submissions. Are MDT submissions generally lower during the downtime hours? One thing we'll need to take into account is whether submissions during these times are lower in general.
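One way to check this, as a minimal sketch: compare submissions during the reported downtime hours with the same hours on other days. The function name, the `(day, hour)` keying, and the input shape are assumptions for illustration, not the format of the actual GA export.

```python
from statistics import mean


def compare_downtime_to_baseline(hourly_counts, downtime_day, downtime_hours):
    """Compare submissions in a downtime window with the same hours on other days.

    hourly_counts: dict mapping (day, hour) -> number of submissions
                   (hypothetical shape, e.g. built from an hourly GA export).
    downtime_day:  the day the SLO report flags as having downtime.
    downtime_hours: the hours of that day covered by the flagged window.
    """
    # Submissions during the flagged window; hours with no rows count as zero.
    during = [hourly_counts.get((downtime_day, h), 0) for h in downtime_hours]

    # The same hours of day on every other day act as the baseline.
    baseline = [
        count
        for (day, hour), count in hourly_counts.items()
        if day != downtime_day and hour in downtime_hours
    ]
    if not baseline:
        raise ValueError("no baseline days available for these hours")

    during_avg = mean(during)
    baseline_avg = mean(baseline)
    return {
        "during_avg": during_avg,
        "baseline_avg": baseline_avg,
        # A ratio well below 1 would suggest submissions dropped in the window.
        "ratio": during_avg / baseline_avg if baseline_avg else None,
    }
```

A ratio close to 1 would suggest submissions were not noticeably affected during the flagged window; what threshold counts as a meaningful drop is a judgment call this sketch does not make.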

Please prioritize any dashboarding or GTM work that needs to go out this sprint. We can pull this into next sprint.

@kfrz thanks for the quick turnaround. So to clarify the 4 hour intervals for each day:

From the report, it's a bit unclear which time ranges are missing. Also, let us know if we can see that more easily in Grafana!

bmcgrady-ep commented 4 years ago

@joanneesteban - I'm not seeing any evidence in Google Analytics that the backend downtime is correlating with a lack of submissions. I used this hourly GA report and looked at each day.

joanneesteban commented 4 years ago

Example:

GA - 9/17 image

The SLO report shows that there was downtime between 8pm and 12am on the 17th, but there were forms submitted during that time.

@drorva

drorva commented 4 years ago

@kfrz can we dig into this and see what is causing the discrepancy?

drorva commented 3 years ago

@alexpappasoddball can we schedule an investigation of this? We'll need @kfrz to collaborate with @bmcgrady-ep or someone else on the Analytics team.

alexpappasoddball commented 3 years ago

@drorva Yes, we will get this into the pipeline to look at after we wrap up the Sentry work we are currently trying to get across the finish line.

joanneesteban commented 3 years ago

@alexpappasoddball any updates on this? We're looking to close out this ticket (I'm assuming the work will be continued with Datadog).

drorva commented 3 years ago

I'd like to dig a bit more into this since it looks like it'll take longer to get Datadog. @alexpappasoddball can we do a quick single-day look at this in sprint 40 and see if we can at least figure out the discrepancy? If it's a simple one-day fix, let's do it, but anything more involved can wait.

joanneesteban commented 3 years ago

Thanks! We'll put this in our backlog until we hear back.

alexpappasoddball commented 3 years ago

Here is the task for BE Tools for sprint 40. We will update this ticket with findings once completed!

kfrz commented 3 years ago

The breakers board in Grafana shows general "uptime" of an external service, which may correlate with the Google Analytics data. This metric is calculated over a 5-minute interval, i.e. if the service is down for 5 minutes or more it will show as unavailable.
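As a rough illustration of the 5-minute rule described above, the sketch below buckets health samples into 5-minute windows and flags a window as unavailable only when every sample in it failed. This is an assumption about how such a metric could be computed, not the actual Grafana query; the sample format and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def unavailable_windows(samples, window=timedelta(minutes=5)):
    """Return the start times of windows in which every sample was down.

    samples: iterable of (timestamp, is_up) pairs with naive datetimes
             (hypothetical input; the real board reads from its own data source).
    window:  bucket size, 5 minutes by default to match the interval above.
    """
    epoch = datetime(1970, 1, 1)
    buckets = {}
    for ts, is_up in samples:
        # Floor the timestamp to the start of its window.
        start = epoch + ((ts - epoch) // window) * window
        buckets.setdefault(start, []).append(is_up)

    # Assumption: a window counts as unavailable only if no sample in it was up.
    return sorted(start for start, ups in buckets.items() if not any(ups))
```

Under this reading, an outage that does not cover a whole window would not be flagged at all, which is one reason a coarser report panel and hourly GA data may not line up exactly.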

image

Beyond something like the above-linked board, there's not a great way to correlate this data directly. We can only go as far as we already have, leveraging Grafana to visualize outage trends.