getsentry / team-mobile

Meta issues for the Mobile team
4 stars 1 forks source link

[EPIC] Add slow and frozen frames to span data #156

Open markushi opened 1 year ago

markushi commented 1 year ago

[!NOTE] This feature should ship in the same release in combination with frame delay, because only then Mobile Starfish can provide meaningful statistics for our users and we don't want anybody to use a release with only one of the two features.

Add frames.total, frames.slow, and frames.frozen to span data to all spans. The value describes the number of slow/frozen frames that happened while the span was running.

Notion doc: https://www.notion.so/sentry/Data-Requirements-b4f616bac8bc449096a765f915e0a209

### Tasks
- [ ] https://github.com/getsentry/sentry-cocoa/issues/3448
- [ ] https://github.com/getsentry/sentry-java/pull/3081
- [ ] https://github.com/getsentry/develop/pull/1099
- [ ] https://github.com/getsentry/sentry-java/pull/3086
- [ ] https://github.com/getsentry/sentry-java/issues/3106
- [ ] https://github.com/getsentry/sentry-react-native/issues/3566
philipphofmann commented 1 year ago

Instead of adding the number of slow and frozen frames, I would add a list of slow and frozen frames containing the start and end times of every slow and frozen frame. On Cocoa, we can get this information. Can we also get this on Android, @markushi?

philipphofmann commented 1 year ago

As discussed in our weekly sync yesterday, it's not possible yet to add a list of timestamps to spans or transactions. It will take some time until that feature is available. Instead, we plan on adding total_frames, slow_frames, and frozen_frames to spans started on the main thread.

Furthermore, we plan on adding a new feature called frame delay see https://github.com/getsentry/team-mobile/issues/160.

stefanosiano commented 1 year ago

total_frames have some difficulties. Basically, the Android callback we rely on isn't called when the frame doesn't change. Since we are going to add frame delay, it could be a good moment to just drop the total frames? So we could have frame delay relative to span duration, as that's what users should care about. Otherwise, explaining the total_frames refers to frames with changed content only, could be harder for the user to understand. Especially because iOS and Android have different behaviours.

philipphofmann commented 1 year ago

@stefanosiano, the backend needs total_frames to calculate slow and frozen frame rates. Frame delay could replace these. WDYT, @stefanosiano and @markushi?

stefanosiano commented 1 year ago

if the total_frames is needed, we could try to update how it's calculated, by taking into account the refresh rate changes and aligning on the two platforms. I don't think we can send a time series of refresh rates to the backend to calculate the total frames correctly. So I'd prefer to introduce frame delay and drop total_frames entirely 😄

philipphofmann commented 1 year ago

As stated, we could drop total_frames once we add frame_delay. A big argument for dropping total frames and using frame delay is that it would work for app start transactions. Currently, we don't add frame numbers because we can't correctly measure the total frames. When having frame delay, slow and frozen frames, and frame delay would be accurate for app start transactions.

philipphofmann commented 1 year ago

As decided in a call, we can't drop total frames because it's required for the existing performance product. We might be able to drop it for starfish, but we should try to get proper numbers for total frames on Android for now.

philipphofmann commented 12 months ago

Decision to keep sending slow and frozen frames for both spans and transaction

@markushi, @kahest, and @philipphofmann decided on call on November 28th, 2023, that we should send total, slow, and frozen frames for both transactions and spans. While there are problems with slow and frozen frames, as pointed out below, these problems are inconsequential when looking at the aggregate data of spans and transactions, especially when we add frame delay. Total, slow, and frozen frames aim to provide enough data to give our users a high-level view of which transactions or spans cause animation glitches. They don't tell users exactly where these glitches occur in a span or transaction, which is not that important when looking at the aggregate data. In future iterations, we can add the exact timestamps of delayed frames to visualize them, but that's not the current goal.

Problems with slow and frozen frames

@brustolin pointed out the following problems with our current approach of capturing slow and frozen frames for both transactions and spans. The following scenarios can be applied to transactions and spans cause they use the same strategy.

ff = frozen frame
nf = normal frame
s = span

Scenario 1
[ ---- ff ---- ][nf][nf][nf]
             [----- s ------]

When the span starts, there are 0 ff and 0 nf, although an ff is ongoing.
When the span finishes there are 1 ff and 2 nf. 
Then the SDK adds 1 ff and 2 nf to the span, although the ff mostly occurred before the span.

Scenario 2
[nf][ ---- ff ---- ][nf]
   [----- s ------]

When the span starts, there are 0 nf. 
During the span something blocks the UI thread, causing a ff.
When the span finishes, the ff is still ongoing. It could take a millisecond until the next frame is drawn.
But the SDK adds 0 ff and 0 nf to the span, although a ff was ongoing and the span lasted for 1000 ms.

Scenario 3
[ ---- ff -------------------]
             [----- s ------]

When the span starts, there is an ongoing ff and there are 0 ff.
The work captured with the span on the main thread blocks the renderer.
When the span finishes, the ff is still ongoing.
Then the SDK reports 0 nf and 0 ff although a ff was ongoing, and the ff
finishes after the blocking work of the span when the renderer has time again
to render a new frame.

The above two problems are relatively minor for transactions as they last longer. Spans instead have a shorter duration, and these two problems will happen more often, leading to wrong frame information on spans.

Again, this is a use case to prove that better approaches exist than slow and frozen frames, such as frame delay. When a frozen frame spreads across two spans or transactions, assigning it to the correct one is tricky. We should keep a time series of delayed frames in our frames tracker. For every transaction/span, we can ask how many ms frames were delayed.

Scenario 1
[ ---- ff ---- ][nf][nf][nf]
             [----- s ------]

When the span finishes the SDK queries the frame tracker for the frame delay with start and
end timestamp of the span.
The frame tracker knows there was a delayed frame when the span started, but only a part of
the whole duration occurred during the span's lifetime. Therefore it states that the frame
delay was around 50 ms.

Scenario 2
[nf][ ---- ff ---- ][nf]
   [----- s ------]

When the span finishes the SDK queries the frame tracker for the frame delay with start and
end timestamp of the span.
The frame tracker knows that there was a delayed frame after the span started and is still ongoing.
Therefore it states that the frame delay was almost the whole duration of the span.