getsentry / team-sdks

A meta repository for tracking work across all SDK teams.

[Starfish] Collect more data on mobile spans to support drill-downs on starfish #22

Open shruthilayaj opened 11 months ago

shruthilayaj commented 11 months ago

Project Board

Description

To support the flows on mobile starfish, we've identified a list of data requirements for the mobile SDKs. I'm splitting the requests into "Must Have" (they power some of the core flows) and "Nice To Have" (they augment the user experience).

Must Have

  1. Distinguish between app start page loads and regular page loads. One suggestion would be to make this distinction with a different transaction.op for each kind of transaction. Right now, they are all ui.load transactions.

    In the product, we'd do something like:

    • App Warm Start: TTID extracted from transaction.op: app.warm.start
    • App Cold Start: TTID extracted from transaction.op: app.cold.start
    • TTID: TTID extracted from transaction.op: ui.load

Edit: The different transaction.op will also allow us to filter sample spans, span metrics and sample transactions based on the flow the user is taking. So if they're debugging slow cold starts, we can show span metrics, sample spans and transactions of only the app.cold.start transactions.
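As a rough illustration of the proposal above, here is a minimal sketch of how distinct transaction.op values could drive both the product labels and the per-flow filtering. The dict-shaped transactions, the `ttid_ms` field, and the helper names are assumptions made up for this example, not an existing Sentry API.

```python
# Hypothetical mapping from transaction.op to the product label,
# following the three ops suggested in this issue.
TTID_LABEL_BY_OP = {
    "app.warm.start": "App Warm Start",
    "app.cold.start": "App Cold Start",
    "ui.load": "TTID",
}


def label_for(transaction):
    """Return the product label for a transaction, or None for other ops."""
    return TTID_LABEL_BY_OP.get(transaction.get("op"))


def filter_by_op(transactions, op):
    """Keep only the transactions that belong to one flow, e.g. cold starts."""
    return [t for t in transactions if t.get("op") == op]


# Made-up sample data; ttid_ms is an illustrative field name.
transactions = [
    {"op": "app.cold.start", "ttid_ms": 1800},
    {"op": "app.warm.start", "ttid_ms": 600},
    {"op": "ui.load", "ttid_ms": 250},
    {"op": "app.cold.start", "ttid_ms": 2100},
]

# Someone debugging slow cold starts would only see these two transactions.
cold_starts = filter_by_op(transactions, "app.cold.start")
```

The same filter could be applied to sample spans and span metrics, provided they carry the op of their parent transaction.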

  2. Collect # slow and frozen frames per span.

    In the product, this will facilitate UI jank workflows and help narrow jank down to work done by specific spans.

  3. Tag spans with if they ran on the main thread or not. Blocked by https://github.com/getsentry/rfcs/pull/75

  4. Tag spans that ended before TTID and spans that ended before TTFD.

    This one is the lowest priority of the Must Haves since we can currently do this on ingest with the transactions model. However, this work might have to move to the SDK layer with span streaming.
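For the TTID/TTFD tagging, a minimal sketch of the comparison involved could look like the following, whether it runs on ingest or in the SDK. The span shape and the `ended_before_ttid` / `ended_before_ttfd` keys are hypothetical names chosen for this example.

```python
def tag_spans_by_display_milestones(spans, ttid_ts, ttfd_ts=None):
    """Tag each span with whether it ended before TTID and before TTFD.

    Timestamps are seconds on a shared clock. The tag names are
    hypothetical; the issue does not prescribe a naming scheme.
    """
    for span in spans:
        data = span.setdefault("data", {})
        end = span["end_timestamp"]
        data["ended_before_ttid"] = end <= ttid_ts
        # TTFD is optional because not every screen reports full display.
        if ttfd_ts is not None:
            data["ended_before_ttfd"] = end <= ttfd_ts
    return spans


# Made-up spans for illustration.
spans = [
    {"op": "ui.layout", "end_timestamp": 10.2},
    {"op": "http.client", "end_timestamp": 10.9},
    {"op": "db.query", "end_timestamp": 12.5},
]
tag_spans_by_display_milestones(spans, ttid_ts=10.5, ttfd_ts=11.0)
```

With span streaming, the SDK would need to know the TTID/TTFD timestamps before a span is sent, which is one reason this work might have to move to the SDK layer.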

Nice To Have

  1. More auto-instrumented spans

    This one is a little more open-ended! Can we auto-instrument more layout and draw spans? Or spans that capture other CPU-bound work like compression, encryption, serialization, etc.? Can we get more useful app start spans?

### Nice To Have Tasks
- [ ] https://github.com/getsentry/sentry-cocoa/issues/3345
- [ ] https://github.com/getsentry/sentry-java/pull/2979
### Platforms that need to be supported
- [ ] Android
- [ ] iOS
- [ ] React Native
- [ ] Flutter

RFC

No response

Slack-Channel

discuss-starfish

Notion Document(s)

Mobile Starfish V0 Milestone Doc (includes an excalidraw that details the flows we want to support)

Data Requirements (scroll to the bottom to see mobile sdk parts)

Stakeholder(s)

@getsentry/team-starfish @alexjillard

Team(s)

Mobile

AbhiPrasad commented 11 months ago

Collect # slow and frozen frames per span.

Right now we don't have a spec for measurements on a span, so this will just be an integer that is collected and put under a key in span.data. The mobile team can decide naming for the span data key, but we need to make sure it is consistent.

Tag spans that ended before TTID and spans that ended before TTFD.

This can also live on span.data if it is implemented in the SDK.

stefanosiano commented 11 months ago

Tag spans with if they ran on the main thread or not.

Can you specify better the meaning of “ran on the main thread”? There are 3 possibilities:

  1) The span started on the main thread (easily doable)
  2) The span finished on the main thread (easily doable)
  3) The span refers to code executed on the main thread (probably not doable)

We may even detect if a span started AND finished in the main thread, but there’s no way to know if the code actually executed on the main thread or not, afaik

markushi commented 11 months ago

Nice summary, thank you! Could you elaborate a bit more on the following:

  1. Collect # slow and frozen frames per span.

What kind of spans would you expect to see slow/frozen frames being attached to? I guess it makes sense for TTID/TTFD spans or "UI" spans in general, but for a DB query span running in the background thread we probably shouldn't collect any of this right?

markushi commented 11 months ago

On another note with multi screen / foldables becoming more popular we should also consider the use case of having multiple screens open at the same time (also relates to https://github.com/getsentry/sentry-cocoa/issues/1768)

shruthilayaj commented 11 months ago

Can you specify better the meaning of “ran on the main thread”? There are 3 possibilities:

The use case is that we want to identify spans that did work on the main thread, so we can narrow down the scope of spans to investigate for specific concerns. For example, if you're investigating slow TTID, we could just show you the spans that ran on the main thread during that time.

But if we can't capture that, what would be the next best estimate in your opinion? If we do

if a span started AND finished in the main thread

will we miss out on a lot of potential spans that might have done work and either only started or ended on the main thread? I think we'd want to minimize false negatives in this case!

AbhiPrasad commented 11 months ago

re: span on main thread, we had an RFC with some prior thought here: https://github.com/getsentry/rfcs/pull/75. @jonasba, not sure if there is appetite to merge that.

Copying my comment from that RFC: OpenTelemetry's general approach here is to track the thread that started a signal (metric, span), and I think that's a reasonable approach for us to take too, so we would track whether the span was created from the main thread. This is the simplest solution, has the least performance burden in terms of data collection, and matches what OTEL does as well.

If a span was kicked off from the main thread, it's also easy to understand causation (we know why the span started). We can maybe use profiling to fill in the other details.
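To make the "thread that created the span" approach concrete, here is a minimal sketch that records the thread once, at span creation. The `Span` class and the `started_on_main_thread` key are hypothetical stand-ins, not the SDK's actual types or naming.

```python
import threading

# Hypothetical span.data key; the real name would come out of the RFC.
MAIN_THREAD_KEY = "started_on_main_thread"


class Span:
    """Toy span that records which kind of thread created it."""

    def __init__(self, op):
        self.op = op
        # Captured once at creation time, so there is no ongoing cost
        # while the span is running.
        self.data = {
            MAIN_THREAD_KEY: threading.current_thread() is threading.main_thread()
        }


def start_span_in_worker(op):
    """Create a span from a background thread and hand it back."""
    created = []
    worker = threading.Thread(target=lambda: created.append(Span(op)))
    worker.start()
    worker.join()
    return created[0]


ui_span = Span("ui.load")                  # created on the calling thread
db_span = start_span_in_worker("db.query")  # created on a worker thread
```

This only captures where the span started. Work that a main-thread span hands off to a background thread would still be tagged as main thread, which is the gap profiling could fill.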

What kind of spans would you expect to see slow/frozen frames being attached to?

@shruthilayaj, maybe it's reasonable for us to collect this on spans that are running on the UI thread. It feels like it's not needed otherwise.

shruthilayaj commented 11 months ago

What kind of spans would you expect to see slow/frozen frames being attached to? I guess it makes sense for TTID/TTFD spans or "UI" spans in general, but for a DB query span running in the background thread we probably shouldn't collect any of this right?

Yup, I agree! It would make sense that they are attributed to UI spans; basically, we want to capture this for spans doing work that can cause frame drops! And if IO-bound work doesn't cause dropped frames, we probably don't want to collect it there. Follow-up question for you though @markushi: if you have multiple UI spans running when we detect frame drops, would we increment the frame drop counter on all of them?

markushi commented 11 months ago

if you have multiple UI spans running when we detect frame drops, would we increment the frame drop counter on all of them?

Yes, I think that's the best approach! As any of the spans could be the culprit.
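A minimal sketch of that attribution rule, where every UI span that is active when a bad frame is detected gets its counter incremented. The 16 ms and 700 ms thresholds and the `frames.slow` / `frames.frozen` keys are assumptions for this example; the thread leaves the span.data key naming to the mobile team.

```python
SLOW_FRAME_MS = 16.0     # assumed threshold for a slow frame
FROZEN_FRAME_MS = 700.0  # assumed threshold for a frozen frame


class FrameTracker:
    """Toy tracker that attributes bad frames to all active UI spans."""

    def __init__(self):
        self.active_ui_spans = []

    def start_span(self, name):
        span = {"name": name, "data": {"frames.slow": 0, "frames.frozen": 0}}
        self.active_ui_spans.append(span)
        return span

    def finish_span(self, span):
        # Remove by identity so spans with equal contents are not confused.
        self.active_ui_spans = [s for s in self.active_ui_spans if s is not span]

    def on_frame_rendered(self, duration_ms):
        if duration_ms > FROZEN_FRAME_MS:
            key = "frames.frozen"
        elif duration_ms > SLOW_FRAME_MS:
            key = "frames.slow"
        else:
            return  # normal frame, nothing to record
        # Any active UI span could be the culprit, so all of them are charged.
        for span in self.active_ui_spans:
            span["data"][key] += 1


tracker = FrameTracker()
layout = tracker.start_span("layout")
tracker.on_frame_rendered(30.0)   # slow frame, only layout is active
draw = tracker.start_span("draw")
tracker.on_frame_rendered(800.0)  # frozen frame, both spans are active
tracker.finish_span(layout)
tracker.on_frame_rendered(20.0)   # slow frame, only draw is active
```

Background spans such as DB queries would never be registered with the tracker, so they collect no frame data.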

shruthilayaj commented 11 months ago

On another note with multi screen / foldables becoming more popular we should also consider the use case of having multiple screens open at the same time

I don't want to include this in the scope for V0 mobile starfish, but yeah definitely something to tackle for V1+!

shruthilayaj commented 11 months ago

@markushi Do we have access/are we able to extract the slow and frozen frame durations - ie the actual time it took to render a frame when we detect a slow or frozen frame? In the product, a sum of all the slow & frozen frame durations on a span would provide a better signal of which spans to prioritize rather than just a raw count of slow or frozen frames (these metrics are important too! we still want to keep them).
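To illustrate why summed durations can rank spans differently from raw counts, here is a small sketch. The thresholds, field names, and sample frame durations are made up for this example.

```python
def summarize_frames(frame_durations_ms, slow_ms=16.0, frozen_ms=700.0):
    """Return slow/frozen counts plus the summed duration of those frames.

    The thresholds are assumptions for illustration only.
    """
    slow = [d for d in frame_durations_ms if slow_ms < d <= frozen_ms]
    frozen = [d for d in frame_durations_ms if d > frozen_ms]
    return {
        "slow_count": len(slow),
        "frozen_count": len(frozen),
        "slow_frozen_duration_ms": sum(slow) + sum(frozen),
    }


# Span A: five mildly slow frames. Span B: a single frozen frame.
span_a = summarize_frames([20.0] * 5 + [10.0] * 20)
span_b = summarize_frames([900.0] + [10.0] * 20)
```

By raw count, span A looks worse (5 bad frames against 1), but by summed duration span B dominates (900 ms against 100 ms), which is the prioritization signal described above.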

markushi commented 10 months ago

@shruthilayaj Yes, using the Frame Metrics API we can retrieve detailed information on a per frame level. Let me check on other mobile platforms 👀

marandaneto commented 10 months ago

Profiling is able to narrow down the underlying problem of slow and frozen frames. Android Studio has this built in already, I guess this would be the best solution (when combined with profiling).

philipphofmann commented 8 months ago

Do we have access/are we able to extract the slow and frozen frame durations - ie the actual time it took to render a frame when we detect a slow or frozen frame?

Yes, we do on Cocoa. We could also add timestamps for the frames, so you could highlight exactly in the UI when these happen. We already do that for profiling.

romtsn commented 8 months ago

FYI, this is how otel/splunk track the current screen, which can be helpful for grouping txs by screen:

Android - https://github.com/open-telemetry/opentelemetry-android/blob/main/instrumentation/src/main/java/io/opentelemetry/android/instrumentation/activity/VisibleScreenTracker.java

iOS - https://github.com/signalfx/splunk-otel-ios/blob/cc1bba73a0dd86975311202e901c9aca3923ea0a/SplunkRumWorkspace/SplunkRum/SplunkRum/UIInstrumentation.swift#L190-L208