UserOfficeProject / issue-tracker

Shared place for features and bugs from all collaborators.

Improve how we monitor API performance #1087

Open ACLay opened 1 month ago

ACLay commented 1 month ago

What is the problem and why is it a problem

At STFC we currently forward telemetry to Apollo Studio. However, we're stuck on a plan that only gives 1 day of data retention and doesn't allow full inspection. The cost of upgrading seems reasonable, but doing so would require migrating to an Apollo supergraph, which seems like it'd complicate our architecture.

There are alternative systems we can look into that can be self-hosted and won't require architecture changes, such as OpenTelemetry.

Steps to reproduce (if it's a bug).

ACLay commented 1 month ago

Apollo Studio

A paid Serverless plan should run us $15/month per 1M operations (above 10M), and gives 90 days' access to performance and trace data. The free plan gives 1 day.

I'm skeptical about it, as the UI and pricing pages don't seem to suggest this, but when inquiring about how a low usage account on the paid plan might work with retention levels, @srconway was told:

The 90 day data retention is active immediately on the Serverless paid plan. However, this only applies to cloud supergraphs

GraphQL Hive

A paid plan is $10/month per 1M operations (above 1M) and gives 90 days' access to performance data (no traces). The free plan gives 7 days. It has a self-hosting option, but that needs lots of other systems and feels like it'd be a hassle to run reliably.

OpenTelemetry -> Jaeger -> Opensearch

We can set up OpenTelemetry to send traces to a Jaeger instance, which allows full trace inspection and can even hook into Elasticsearch/OpenSearch to provide analytics. The storage requirements need investigating, as I do worry they might be large. Maybe 10 minutes of poking about in UOP over 1 day's local testing produced 500 traces and 9.3MB of indexes on Elasticsearch. These traces cover GraphQL calls, incoming requests and external calls, like those to the UOWS, so I don't know how much storage we'd actually need for a real deployment.
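As a rough way to turn that local sample into a storage figure, here is a back-of-the-envelope sketch. It assumes one trace per request at the same average size as the local sample, which is a guess: real traces also cover incoming requests and external calls, so the true ratio may be quite different. The function name and the retention parameter are hypothetical, purely for illustration.

```python
# Local sample from the comment above: 500 traces produced 9.3 MB of indexes.
TRACES_SAMPLED = 500
INDEX_MB_SAMPLED = 9.3

# Average index size per trace in MB (about 0.0186 MB, i.e. ~19 KB).
MB_PER_TRACE = INDEX_MB_SAMPLED / TRACES_SAMPLED


def estimate_storage_gb(traces_per_month: int, retention_months: int = 1) -> float:
    """Extrapolate index storage in GB (1 GB = 1000 MB).

    ASSUMPTION: every trace is as large as the local-testing average,
    and storage grows linearly with the number of retained traces.
    """
    return traces_per_month * MB_PER_TRACE * retention_months / 1000


if __name__ == "__main__":
    # ASSUMPTION: one trace per GraphQL request, using the April request
    # total (601,539) quoted under "Operation count context" in this comment.
    april_requests = 601_539
    print(f"1 month:  {estimate_storage_gb(april_requests):.1f} GB")
    print(f"3 months: {estimate_storage_gb(april_requests, 3):.1f} GB")
```

Under those assumptions a month at April's volume comes to roughly 11 GB of indexes, and about three times that for a 90-day window. Treat it as an order-of-magnitude figure until it is measured on a real deployment.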

Operation count context

Looking over the April proxy logs (which covered a major ISIS direct round), we had 516,421 GraphQL requests on prod and 85,118 on dev, for a total of 601,539.
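To put that count next to the pricing above, here is a minimal sketch of the usage charge. It assumes "above 10M" and "above 1M" mean the first 10M and 1M operations per month are included and only the excess is billed, pro rata per operation. That reading is an assumption, not something either vendor has confirmed here, and it ignores any base fee. The function name is hypothetical.

```python
def monthly_usage_cost_usd(operations: int, included_operations: int,
                           usd_per_million: float) -> float:
    """Usage charge for one month.

    ASSUMPTION: operations up to `included_operations` are free and the
    excess is billed pro rata at `usd_per_million` per 1M operations.
    """
    billable = max(0, operations - included_operations)
    return billable / 1_000_000 * usd_per_million


APRIL_OPERATIONS = 601_539  # prod + dev total from the April proxy logs

# Rates and thresholds as quoted earlier in this comment.
apollo_cost = monthly_usage_cost_usd(APRIL_OPERATIONS, 10_000_000, 15.0)
hive_cost = monthly_usage_cost_usd(APRIL_OPERATIONS, 1_000_000, 10.0)

if __name__ == "__main__":
    print(f"Apollo Studio usage charge: ${apollo_cost:.2f}")  # $0.00
    print(f"GraphQL Hive usage charge:  ${hive_cost:.2f}")    # $0.00
```

At April's volume we stay under both thresholds, so under this reading neither plan would bill for usage. For a hypothetical month of 12M operations the same function gives $30 for Apollo Studio and $110 for GraphQL Hive.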

joshhdawes commented 3 weeks ago

Sonia's going to get the paid Apollo Studio plan for a month so we can see which improvements we get.

srconway commented 2 weeks ago

Done. As of this afternoon we're on 'Serverless' rather than 'Serverless (Free)'. I'll get the first bill at the end of June, so let's see if it makes any difference.

@ACLay @joshhdawes

ACLay commented 2 weeks ago

This is looking good. Old data's there and browsable, and their reported operation count for April is 594,501, which is within about 1.2% of my prior estimate of 601,539 from the proxy logs: https://studio.apollographql.com/org/isis-proposals-and-allocations/settings/billing

joshhdawes commented 1 week ago

Alex is going to put in a meeting to review Apollo Studio, along with other options that have been investigated.