getsentry / symx

Apple symbol manager
MIT License

Metrics and SLOs #22

Open kahest opened 9 months ago

kahest commented 9 months ago

Context

We currently use symbols fetched with https://github.com/getsentry/apple-system-symbols-upload as primary symbols source with symx being secondary/fallback. The intention is to make symx the primary source. We need information to decide when that switch can/should happen, and also for ongoing validation that the pipeline is working as expected.

Sidenote: it is likely that we will keep the old store around as a fallback for older symbols, but deactivate its symbols fetching/processing.

Questions we want to answer

SLOs/Metrics

| Name | Description | Granularity | Target |
|---|---|---|---|
| Coverage | What's the coverage of Apple events we have system symbols for? | per platform, total (TBD can we also do per store?) | 95% (TBD) |
| Hit rate per store | How often (absolute/%) is the primary/secondary store hit with a symbols request? (assumption: 1-Coverage is the fail rate of the secondary) | per platform, total | - |
| Time to availability | How long does it take for symbols to become available on symx after being released? impacts Coverage | per platform | <24hrs (TBD) |
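
For illustration only, the "Time to availability" row could be checked with something like the sketch below. It assumes we can obtain a release timestamp and a "first available on symx" timestamp per artifact; both inputs and all function names are hypothetical, and the 24-hour value is just the TBD target from the table.

```python
from datetime import datetime, timedelta, timezone

# The "<24hrs (TBD)" target from the table above; not a settled value.
TARGET = timedelta(hours=24)


def time_to_availability(released_at: datetime, available_at: datetime) -> timedelta:
    """Delay between Apple's release and the symbols being available on symx."""
    return available_at - released_at


def meets_target(released_at: datetime, available_at: datetime,
                 target: timedelta = TARGET) -> bool:
    """True if the symbols became available within the target window."""
    return time_to_availability(released_at, available_at) < target


if __name__ == "__main__":
    released = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    available = released + timedelta(hours=6)
    print(time_to_availability(released, available), meets_target(released, available))
```
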
### Tasks
- [ ] https://github.com/getsentry/sentry/pull/63091
- [ ] apply "`in_app`" heuristics
- [ ] ...TBD
kahest commented 9 months ago

@jernejstrasner @supervacuus as discussed FYI @philipphofmann @brustolin this is a discussion basis for SLOs and target values

philipphofmann commented 9 months ago

I think the target for coverage should be close to 100% for GA releases.

I have a couple of questions:

  1. Does symx include beta release?
  2. Does symx work for watchOS and tvOS arm64e because I still upload them manually?
  3. I don't understand what the hit rate per store is. What is the store?
kahest commented 9 months ago

@philipphofmann

  1. that's to be defined - IMO we should but maybe with different SLO
  2. maybe @supervacuus knows
  3. it's where we store symbols, e.g. GCS bucket for symx; hit rate is how often we try to find a symbol on a specific store
supervacuus commented 9 months ago
  1. Does symx include beta release?

Yes!

  2. Does symx work for watchOS and tvOS arm64e because I still upload them manually?

While we do mirror and process OTAs and IPSWs for [watch|tv]OS, only some of the OTAs currently process correctly, and none of the IPSWs. The latter is because ipsw (the CLI tool) fails when confronted with IPSWs containing two system-restore DMGs. I already know how to fix this, but I haven't come around to implementing it (also, with ipsw, I would have to understand how the maintainer wants this to be integrated).

philipphofmann commented 9 months ago

Beta releases 💯

While we do mirror and process OTAs and IPSWs for [watch|tv]OS, only some of the OTAs currently process correctly, and none of the IPSWs. The latter is because ipsw (the CLI tool) fails when confronted with IPSWs containing two system-restore DMGs. I already know how to fix this, but I haven't come around to implementing it (also, with ipsw, I would have to understand how the maintainer wants this to be integrated).

@supervacuus, so this means you're still figuring it out, and it seems like it could? If yes, 🥇.

supervacuus commented 9 months ago

Yes, @philipphofmann, this is primarily a question of "when" and definitely on the short-term fixes agenda. These two platforms are the main reasons I am interested in the success/failure metrics of the current symx/secondary store.

jernejstrasner commented 9 months ago

This is going to be tough based on what @Swatinem told me. We currently don't know in Symbolicator if it's a system or an app symbol so that immediately makes this impossible right now. We just know the platform.

supervacuus commented 9 months ago

We currently don't know in Symbolicator if it's a system or an app symbol so that immediately makes this impossible right now. We just know the platform.

@jernejstrasner and @Swatinem, can you elaborate on which of the metrics this affects?

There are two aspects to the hit rate of the stores from my limited POV:

Right now, we can only say what kind of coverage we can provide concerning the downloadable artifacts (i.e., the ratio of artifacts with successful symbol extraction vs. all artifacts mirrored). But this is very coarse and doesn't show how many symbols are required or missing in the events currently being symbolicated, out of all Apple-related events.
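
To make the artifact-level ratio concrete, here is a minimal sketch. The `Artifact` record and its field names are made up for illustration and are not symx's actual metadata schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Artifact:
    # Hypothetical record; field names are illustrative only.
    platform: str            # e.g. "ios", "watchos", "tvos"
    kind: str                # "ota" or "ipsw"
    symbols_extracted: bool  # did symbol extraction succeed?


def extraction_coverage(artifacts: Iterable[Artifact]) -> Dict[str, float]:
    """Per platform: artifacts with successful extraction / all mirrored artifacts."""
    totals: Dict[str, int] = {}
    extracted: Dict[str, int] = {}
    for artifact in artifacts:
        totals[artifact.platform] = totals.get(artifact.platform, 0) + 1
        if artifact.symbols_extracted:
            extracted[artifact.platform] = extracted.get(artifact.platform, 0) + 1
    return {platform: extracted.get(platform, 0) / count
            for platform, count in totals.items()}
```

As said, this only describes the mirror, not what events actually need.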

Which metrics could the symbolicator provide that connect to the symbols required by Sentry's users?

Swatinem commented 9 months ago

I had a brief talk with @jernejstrasner on Monday about this, where I explained that such SLOs are extremely difficult to create.

One reason as I explained and was mentioned above is that Symbolicator has no concept of "system symbol", as every symbol is treated the same.

Another thing to clarify here would be that Symbolicator also treats every symbol source the same. There is no concept of "primary" vs "fallback". There is a fan-out and every single configured symbol source will be queried for every single symbol. Heavily cached of course. Then afterwards, the best result is picked, and all the results are written into the candidates list.
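
A toy sketch of that fan-out, purely to illustrate the shape of the data. This is not Symbolicator's code: caching is omitted, the `Source` type and `lookup_all` are hypothetical names, and "best result" is simplified here to "first source that found something".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

# A "source" is modeled as a function from a debug ID to symbol data (or None).
Source = Callable[[str], Optional[bytes]]


def lookup_all(debug_id: str,
               sources: Dict[str, Source]) -> Tuple[Optional[str], List[dict]]:
    """Query every configured source; no notion of primary vs. fallback."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(fetch, debug_id)
                   for name, fetch in sources.items()}
        results = {name: future.result() for name, future in futures.items()}

    # Every source ends up in the candidates list, found or not.
    candidates = [{"source": name, "found": data is not None}
                  for name, data in results.items()]
    # Simplified "best" pick: the first source (in config order) with a hit.
    best = next((name for name, data in results.items() if data is not None), None)
    return best, candidates
```

The point is that the candidates list carries a per-source found/not-found result for every lookup, which is what any per-store metric would have to be built from.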

I tried to put together some form of metrics based on Symbolicator's internal "download" metrics here: https://app.datadoghq.com/notebook/7167407/apple-symbol-sources?view=view-mode This is very heavily skewed towards missing symbols, though, because not-found symbols are periodically retried, whereas found symbols are kept in cache and do not go through this mechanism.

It should be possible to add such a metric when the symbolication response comes back to sentry. There we have all the results from the candidates and there is already some form of rudimentary classification of "in-app" symbols, although I wouldn’t trust that. Collecting the metrics in the Sentry worker also means we might be able to dogfood DDM for that ;-)

kahest commented 9 months ago

From an internal conversation on how we can integrate DDM into Sentry worker to answer “how good is symx in terms of coverage and time to availability”, info provided by @Swatinem:

My thoughts:

supervacuus commented 9 months ago

Sorry that I didn't respond to @Swatinem earlier. Thanks ❤️! Doing this in the worker is fine if we can identify and track the sources. Suggesting to do this in symbolicator is only due to my missing insight into how the processing pipeline elements interact ;-)

I have to think out loud, so please bear with me. Let's take a minor step back here and re-evaluate what we want to achieve (short-term and over time). If we understand the above cases of combinations between the two stores (Roman numerals for the referenced scenario above)...

| # | Scenario | old | symx |
|---|---|---|---|
| 1 | (ii) | missing | missing |
| 2 | (iii) | missing | found |
| 3 | (iv) | found | missing |
| 4 | (i) | found | found |

...then the following seems to apply:

  1. This doesn't tell us much in the short term: these could be app symbols, but they could also be missing system symbols in both stores. In the longer term (when the old store is no longer updated), this case will only be helpful if we have more event metadata that allows us to investigate. This means that this is not an actionable counter/gauge on its own (even with app/system classification), but that path would have to defer to a more detailed metric that at least provides a histogram of a single identifiable dimension (what context/metadata is available during the processing of the source candidates? how much of the mentioned context can we put in a metric? maybe sending an error with event context when a symbol is missing from symx + discover is indeed the right approach for further investigation).
  2. This would primarily validate that symx data is in use and it should be the only strictly increasing metric.
  3. This could be a regression, but it could also be a symbol not inside the index window of symx. With meta-data, it could be actionable (because we could say whether this is a module from a recent OTA/IPSW image). This metric should strictly decline. But in the long term, it could be a signal to extend the IPSW mirror if we want to remove the old store from the sources. But as far as I understood, we could host both stores "indefinitely" (or try to merge them).
  4. This is also primarily useful for the short term. The metric should also strictly decline and can be helpful together with 2.) and 3.) to have a somewhat complete picture of relative symbol coverage.
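
To make the four cases concrete, here is a rough sketch of how a worker-side counter could bucket modules. It assumes each module comes with a list of per-source candidates shaped like `{"source": ..., "found": ...}`; that shape, the source names `"old"` and `"symx"`, and the counter names are assumptions for illustration, not the actual response format.

```python
from collections import Counter
from typing import Dict, Iterable, List

# (found in old store, found in symx) -> counter name, in the 00, 01, 10, 11 order.
CASES = {
    (False, False): "both_missing",  # case 1
    (False, True): "symx_only",      # case 2
    (True, False): "old_only",       # case 3
    (True, True): "both_found",      # case 4
}


def classify(candidates: List[Dict], old: str = "old", symx: str = "symx") -> str:
    """Map one module's candidates onto one of the four cases."""
    found = {c["source"]: bool(c["found"]) for c in candidates}
    return CASES[(found.get(old, False), found.get(symx, False))]


def tally(modules: Iterable[List[Dict]]) -> Counter:
    """Count modules per case."""
    return Counter(classify(candidates) for candidates in modules)


def relative_symx_coverage(counts: Counter) -> float:
    """Share of modules found in either store that symx also has."""
    found_anywhere = counts["symx_only"] + counts["old_only"] + counts["both_found"]
    if found_anywhere == 0:
        return 0.0
    return (counts["symx_only"] + counts["both_found"]) / found_anywhere
```

Note that `relative_symx_coverage` is only relative to the union of both stores; it says nothing about case 1.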

From the above, I am currently missing how we can match any SLOs with these metrics, since coverage is only relative to a baseline that we seek to replace (i.e. useful for development of that replacement but hardly a guarantee for customers). Ideally symx would have 0 missing modules, but relative to what absolute number when we consider the old store to be out of date at some point? 0 (or some other low and decreasing threshold) missing of those classified as system? That puts a massive burden on the classifier in order for us not to ignore false negatives. If we push

as errors with event meta-data, we could use this to identify missing images in the mirror or symbol store in the short term. Via Discover I could start to query, filter and export the data, which would allow us to act on missing symbols rather than just having a gauge to follow. Is this realistic in terms of the data that we have available in the worker and also in terms of the data we are producing? Or could we use DDM metrics the same way?

Did I get this right or am I missing a significant part of the big picture? I am just trying to get a better understanding of the data available below any metrics and how we can use those right now to guide prioritization. SLOs are only secondary for me, but I think we should already have an idea which things we track in absolute rather than relative terms. I have no clue how we would implement (and test!) this in the workers either... who would/should tackle this?

kahest commented 9 months ago

You got me good there by switching the order of scenarios ;) Thanks for your thoughtful reply ❤️

Some quick notes:

supervacuus commented 9 months ago

You got me good there by switching the order of scenarios ;)

I am sorry, I followed the classic 00, 01, 10, 11 combination pattern. Will edit.