Metrics and SLOs - Githubissues

kahest commented 9 months ago

Context

We currently use symbols fetched with https://github.com/getsentry/apple-system-symbols-upload as the primary symbol source, with symx being the secondary/fallback source. The intention is to make symx the primary source. We need information to decide when that switch can/should happen, and also for ongoing validation that the pipeline is working as expected.

Sidenote: it is likely that we will keep the old store around as a fallback for older symbols, but deactivate its symbols fetching/processing.

Questions we want to answer

- How well does our Apple symbolication pipeline perform? What's the coverage and the time until symbols are available? How often do we fail?
- How well does symx work, and when can we switch symx to become primary?
  - How often do we fall back from primary to symx, i.e. symbols are not available on primary?
  - How often is symx successful (i.e. has system symbols for the event)?
  - How often do we fall back once we have symx as primary? Can we freeze/remove the old symbols pipeline?

SLOs/Metrics

| Name | Description | Granularity | Target |
| --- | --- | --- | --- |
| Coverage | What's the coverage of Apple events we have system symbols for? | per platform, total (TBD: can we also do per store?) | 95% (TBD) |
| Hit rate per store | How often (absolute/%) is the primary/secondary store hit with a symbols request? (assumption: 1 - Coverage is the fail rate of the secondary) | per platform, total | - |
| Time to availability | How long does it take for symbols to become available on symx after being released? (impacts Coverage) | per platform | < 24 hrs (TBD) |
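
As a rough illustration of how the Coverage and Time to availability rows could be evaluated, here is a minimal sketch. The `LookupResult` record, the store labels and the helper names are all hypothetical and not taken from symx, Symbolicator or Sentry; the 95% and 24-hour defaults simply mirror the (TBD) targets in the table.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class LookupResult:
    """Hypothetical record of one symbol lookup for an Apple event."""
    platform: str              # e.g. "ios", "macos"
    found_in: FrozenSet[str]   # labels of the stores that had the symbol


def coverage(results: Iterable[LookupResult]) -> Dict[str, float]:
    """Share of lookups with a symbol in at least one store, per platform and in total."""
    seen: Counter = Counter()
    covered: Counter = Counter()
    for result in results:
        for key in (result.platform, "total"):
            seen[key] += 1
            if result.found_in:
                covered[key] += 1
    return {key: covered[key] / seen[key] for key in seen}


def meets_target(cov: Dict[str, float], target: float = 0.95) -> Dict[str, bool]:
    """Compare each coverage value against the (TBD) target."""
    return {key: value >= target for key, value in cov.items()}


def available_in_time(released_at: datetime, available_at: datetime,
                      limit: timedelta = timedelta(hours=24)) -> bool:
    """True if symbols became available within the (TBD) time-to-availability limit."""
    return available_at - released_at < limit
```

Per-store hit rates could be derived from the same records by counting how often each label appears in `found_in`.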

### Tasks
- [ ] https://github.com/getsentry/sentry/pull/63091
- [ ] apply "`in_app`" heuristics
- [ ] ...TBD

kahest commented 9 months ago

@jernejstrasner @supervacuus as discussed FYI @philipphofmann @brustolin this is a discussion basis for SLOs and target values

philipphofmann commented 9 months ago

I think the target for coverage should be close to 100% for GA releases.

I have a couple of questions:

1. Does symx include beta release?
2. Does symx work for watchOS and tvOS arm64e because I still upload them manually?
3. I don't understand what the hit rate per store is. What is the store?

kahest commented 9 months ago

@philipphofmann

1. That's to be defined - IMO we should, but maybe with a different SLO.
2. Maybe @supervacuus knows.
3. It's where we store symbols, e.g. the GCS bucket for symx; hit rate is how often we try to find a symbol on a specific store.

supervacuus commented 9 months ago

> Does symx include beta release?

Yes!

> Does symx work for watchOS and tvOS arm64e because I still upload them manually?

While we do mirror and process OTAs and IPSWs for [watch|tv]OS, only some of the OTAs currently process correctly, and none of the IPSWs. The latter is because ipsw (the CLI tool) fails when confronted with IPSWs containing two system-restore DMGs. I already know how to fix this, but I haven't come around to implementing it (also, with ipsw, I would have to understand how the maintainer wants this to be integrated).

philipphofmann commented 9 months ago

Beta releases 💯

> While we do mirror and process OTAs and IPSWs for [watch|tv]OS, only some of the OTAs currently process correctly, and none of the IPSWs. The latter is because ipsw (the CLI tool) fails when confronted with IPSWs containing two system-restore DMGs. I already know how to fix this, but I haven't come around to implementing it (also, with ipsw, I would have to understand how the maintainer wants this to be integrated).

@supervacuus, so this means you're still figuring it out, and it seems like it could? If yes, 🥇.

supervacuus commented 9 months ago

Yes, @philipphofmann, this is primarily a question of "when" and definitely on the short-term fixes agenda. These two platforms are the main reasons I am interested in the success/failure metrics of the current symx/secondary store.

jernejstrasner commented 9 months ago

This is going to be tough based on what @Swatinem told me. We currently don't know in Symbolicator if it's a system or an app symbol so that immediately makes this impossible right now. We just know the platform.

supervacuus commented 9 months ago

> We currently don't know in Symbolicator if it's a system or an app symbol so that immediately makes this impossible right now. We just know the platform.

@jernejstrasner and @Swatinem, can you elaborate on which of the metrics this affects?

There are two aspects to the hit rate of the stores from my limited POV:

1. Counting how many symbolication requests can/can't be resolved against the primary or secondary store (this should be independent of the event meta-data). Grouping this by platform is already a huge win compared to what we currently have.
2. Figuring out what kind of event is associated with a particular failed symbolication attempt (or, for us, an aggregate answer that would provide grouping towards platform/version/build/etc.). This is only nice to have compared with the first topic.

Right now, we can only say what kind of coverage we can provide concerning the downloadable artifacts (i.e., the ratio of artifacts with successful symbol extraction vs. all artifacts mirrored). But this is very coarse and doesn't show how many symbols are required/missing among all the Apple-related events currently being symbolicated.

Which metrics could the symbolicator provide that connect to the symbols required by Sentry's users?

Swatinem commented 9 months ago

I had a brief talk with @jernejstrasner on Monday about this, where I explained that such SLOs are extremely difficult to create.

One reason, as I explained and as was mentioned above, is that Symbolicator has no concept of "system symbol", as every symbol is treated the same.

Another thing to clarify here would be that Symbolicator also treats every symbol source the same. There is no concept of "primary" vs "fallback". There is a fan-out and every single configured symbol source will be queried for every single symbol. Heavily cached of course. Then afterwards, the best result is picked, and all the results are written into the candidates list.
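
The fan-out described above can be pictured with a small sketch. This is not Symbolicator's implementation (which is not written in Python); the `Source` callable, the numeric score used to decide what "best" means, and the candidate dictionary layout are all assumptions made for illustration, and caching is left out entirely.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical: a source maps a debug ID to a quality score (higher is better),
# or returns None when it does not have the file.
Source = Callable[[str], Optional[int]]


def fan_out(sources: Dict[str, Source], debug_id: str) -> Tuple[Optional[str], List[dict]]:
    """Query every configured source, record all outcomes, then pick the best hit."""
    names = list(sources)
    # Every source is queried; none is treated as "primary" or "fallback".
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        scores = list(pool.map(lambda name: sources[name](debug_id), names))

    # All results end up in the candidates list, found or not.
    candidates = [
        {"source": name, "status": "found" if score is not None else "missing", "score": score}
        for name, score in zip(names, scores)
    ]

    found = [c for c in candidates if c["status"] == "found"]
    best = max(found, key=lambda c: c["score"])["source"] if found else None
    return best, candidates
```

The point relevant to this discussion is that the candidates list keeps one entry per source, so "which store had the file" can be read from the response after the fact.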

I tried to put together some form of metrics based on Symbolicator's internal "download" metrics here: https://app.datadoghq.com/notebook/7167407/apple-symbol-sources?view=view-mode However, this is very heavily skewed towards missing symbols, because not-found symbols are periodically retried, whereas found symbols are kept in cache and do not go through this mechanism.

It should be possible to add such a metric when the symbolication response comes back to Sentry. There we have all the results from the candidates and there is already some form of rudimentary classification of "in-app" symbols, although I wouldn’t trust that. Collecting the metrics in the Sentry worker also means we might be able to dogfood DDM for that ;-)

kahest commented 9 months ago

From an internal conversation on how we can integrate DDM into Sentry worker to answer “how good is symx in terms of coverage and time to availability”, info provided by @Swatinem:

starting from here: https://github.com/getsentry/sentry/blob/0132f2bf5b0219655234b3bbceaa84a9947d5f69/src/sentry/lang/native/sources.py#L692-L711 you can build some logic that compares symx vs the existing buckets, and either raise internal Sentry errors with context, or just emit metrics

and there is some rudimentary classification for system images here: https://github.com/getsentry/sentry/blob/42c254b9b5bf25539042ce8253a88a4a73afa3fd/src/sentry/lang/native/processing.py#L101-L121 (or rather in the functions called from there)

so it might also be possible to reuse that logic, although it's just a bunch of heuristics that drive the "debug file missing" error in the UI

so it's possible to gather some metrics there, though I don’t vouch for the accuracy of the "system symbol" classification :wink:

but you can definitely measure:

1. both have the file (= best case for system symbols)
2. neither has the file (= most likely app symbol)
3. symx has the symbol (= an improvement)
4. symx does not have the symbol (= a regression)
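
A minimal sketch of how these four cases could be counted from the per-source results. The store labels `"old"` and `"symx"` and the candidate dictionary shape are hypothetical, not the actual structure of the symbolication response; cases 3 and 4 are read here as "only symx has it" and "only the old store has it".

```python
from collections import Counter
from enum import Enum
from typing import Iterable, List


class Scenario(Enum):
    BOTH_FOUND = 1     # best case for system symbols
    BOTH_MISSING = 2   # most likely an app symbol
    SYMX_ONLY = 3      # an improvement
    OLD_ONLY = 4       # a regression


def classify(old_found: bool, symx_found: bool) -> Scenario:
    """Map the two per-store outcomes onto the four cases listed above."""
    if old_found and symx_found:
        return Scenario.BOTH_FOUND
    if symx_found:
        return Scenario.SYMX_ONLY
    if old_found:
        return Scenario.OLD_ONLY
    return Scenario.BOTH_MISSING


def count_scenarios(per_module_candidates: Iterable[List[dict]]) -> Counter:
    """Tally scenarios; each item is the (hypothetical) candidates list of one module."""
    counts: Counter = Counter()
    for candidates in per_module_candidates:
        found = {c["source"] for c in candidates if c["status"] == "found"}
        counts[classify("old" in found, "symx" in found)] += 1
    return counts
```

Each counter could then be emitted as a metric tagged with the platform.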

My thoughts:

- 1/3/4 already indicate that it should be a system frame because either the old store or symx (or both) have the symbol, IIUC
- for 2 we can use the existing logic and see how far that gets us
- this is a good start to know how good symx is in terms of coverage
- for time to availability we need of course a separate approach - for now o11y stats about scheduled runs can suffice

supervacuus commented 9 months ago

Sorry that I didn't respond to @Swatinem earlier. Thanks ❤️! Doing this in the worker is fine if we can identify and track the sources. Suggesting to do this in symbolicator is only due to my missing insight into how the processing pipeline elements interact ;-)

I have to think out loud, so please bear with me. Let's take a minor step back here and re-evaluate what we want to achieve (short-term and over time). If we understand the above cases of combinations between the two stores (Roman numerals for the referenced scenario above)...

| Scenario | old | symx |
| --- | --- | --- |
| 1. (ii) | missing | missing |
| 2. (iii) | missing | found |
| 3. (iv) | found | missing |
| 4. (i) | found | found |

...then the following seems to apply:

1. This doesn't tell us much in the short term: these could be app symbols, but they could also be missing system symbols in both stores. In the longer term (when the old store is no longer updated), this case will only be helpful if we have more event metadata that allows us to investigate. This means that this is not an actionable counter/gauge on its own (even with app/system classification), but that path would have to defer to a more detailed metric that at least provides a histogram of a single identifiable dimension (what context/metadata is available during the processing of the source candidates? how much of the mentioned context can we put in a metric? maybe sending an error with event context when a symbol is missing from symx + discover is indeed the right approach for further investigation).
2. This would primarily validate that symx data is in use and it should be the only strictly increasing metric.
3. This could be a regression, but it could also be a symbol not inside the index window of symx. With meta-data, it could be actionable (because we could say whether this is a module from a recent OTA/IPSW image). This metric should strictly decline. But in the long term, it could be a signal to extend the IPSW mirror if we want to remove the old store from the sources. But as far as I understood, we could host both stores "indefinitely" (or try to merge them).
4. This is also primarily useful for the short term. The metric should also strictly decline and can be helpful together with 2.) and 3.) to have a somewhat complete picture of relative symbol coverage.

From the above, I am currently missing how we can match any SLOs with these metrics, since coverage is only relative to a baseline that we seek to replace (i.e. useful for development of that replacement but hardly a guarantee for customers). Ideally symx would have 0 missing modules, but relative to what absolute number when we consider the old store to be out of date at some point? 0 (or some other low and decreasing threshold) missing of those classified as system? That puts a massive burden on the classifier in order for us not to ignore false negatives. If we push

- both missing but classified as system
- old: found, but symx: missing

as errors with event meta-data, we could use this to identify missing images in the mirror or symbol-store in the short term. Via Discover I could start to query, filter and export the data, which would allow us to act on missing symbols rather than just having a gauge to follow. Is this realistic in terms of the data that we have available in the worker and also in terms of the data we are producing? Or could we use DDM metrics the same way?
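
To make the two reportable cases concrete, here is a minimal sketch of the decision. Everything in it is hypothetical: the path-prefix check is only a stand-in for whatever system/app classification the worker already has, and the reason strings are invented for illustration.

```python
from typing import Optional

# Assumption: a crude stand-in classifier based on common Apple system image paths.
SYSTEM_PREFIXES = ("/System/Library/", "/usr/lib/")


def looks_like_system_image(code_file: Optional[str]) -> bool:
    """Heuristic only; it will produce false negatives and false positives."""
    return bool(code_file) and code_file.startswith(SYSTEM_PREFIXES)


def report_reason(old_found: bool, symx_found: bool, code_file: Optional[str]) -> Optional[str]:
    """Return a reason string for the two cases worth reporting, else None."""
    if old_found and not symx_found:
        return "missing_in_symx"
    if not old_found and not symx_found and looks_like_system_image(code_file):
        return "system_symbol_missing_everywhere"
    return None
```

A non-`None` reason would be the trigger for sending an error (or a tagged metric) carrying whatever event metadata is available, such as platform, OS version and build.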

Did I get this right or am I missing a significant part of the big picture? I am just trying to get a better understanding of the data available below any metrics and how we can use those right now to guide prioritization. SLOs are only secondary for me, but I think we should already have an idea which things we track in absolute rather than relative terms. I have no clue how we would implement (and test!) this in the workers either... who would/should tackle this?

kahest commented 9 months ago

You got me good there by switching the order of scenarios ;) Thanks for your thoughtful reply ❤️

Some quick notes:

- your scenario 1 (which is scenario 2 in the previous post) is tricky - I don't fully understand how useful this is yet, definitely something to discuss further
- we'd want to use Sentry Developer Metrics (old codename DDM - I'm still guilty of using that sometimes) for this
- we'll figure out who can tackle this (and when) separately :)

supervacuus commented 9 months ago

> You got me good there by switching the order of scenarios ;)

I am sorry, I followed the classic 00, 01, 10, 11 combination pattern. Will edit.

getsentry / symx

Metrics and SLOs #22