Node ANR tracking not grouping well with GraphQL + Apollo

AbhiPrasad commented 10 months ago

Is there an existing issue for this?

[X] I have checked for existing issues https://github.com/getsentry/sentry-javascript/issues
[X] I have reviewed the documentation https://docs.sentry.io/
[X] I am using the latest SDK release https://github.com/getsentry/sentry-javascript/releases

How do you use Sentry?

Sentry Saas (sentry.io)

Which SDK are you using?

@sentry/node

SDK Version

7.74.0

Framework Version

No response

Link to Sentry event

No response

SDK Setup

No response

Steps to Reproduce

Set up graphql + apollo app

Enable Sentry SDK with ANR tracking
Throw ANR error from graphql resolver

Expected Result

Everything is grouped well

Actual Result

Something that didn't work so well was that we got a new issue every time the problem with big GraphQL response occurred. It is probably related to the fact that the stack trace just contains Apollo server related code. I had to create a custom fingerprint rule to group the issues.

From @ollipa, raised as part of https://github.com/getsentry/sentry-javascript/issues/9307

ApplicationNotResponding: Application Not Responding for at least 1000 ms
  File "app:///../node_modules/graphql/execution/execute.js", line 412, col 5, in executeFields
  File "app:///../node_modules/graphql/execution/execute.js", line 925, col 10, in completeObjectValue
  File "app:///../node_modules/graphql/execution/execute.js", line 646, col 12, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 707, col 25
  File "app:///../node_modules/graphql/execution/execute.js", line 687, col 34, in completeListValue
  File "app:///../node_modules/graphql/execution/execute.js", line 618, col 12, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 500, col 19, in executeField
  File "app:///../node_modules/graphql/execution/execute.js", line 414, col 22, in executeFields
  File "app:///../node_modules/graphql/execution/execute.js", line 925, col 10, in completeObjectValue
  File "app:///../node_modules/graphql/execution/execute.js", line 646, col 12, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 707, col 25
  File "app:///../node_modules/graphql/execution/execute.js", line 687, col 34, in completeListValue
  File "app:///../node_modules/graphql/execution/execute.js", line 618, col 12, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 500, col 19, in executeField
  File "app:///../node_modules/graphql/execution/execute.js", line 414, col 22, in executeFields
  File "app:///../node_modules/graphql/execution/execute.js", line 925, col 10, in completeObjectValue
  File "app:///../node_modules/graphql/execution/execute.js", line 806, col 10, in completeAbstractValue
  File "app:///../node_modules/graphql/execution/execute.js", line 635, col 12, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 707, col 25
  File "app:///../node_modules/graphql/execution/execute.js", line 687, col 34, in completeListValue
  File "app:///../node_modules/graphql/execution/execute.js", line 618, col 12, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 500, col 19, in executeField
  File "app:///../node_modules/graphql/execution/execute.js", line 414, col 22, in executeFields
  File "app:///../node_modules/graphql/execution/execute.js", line 925, col 10, in completeObjectValue
  File "app:///../node_modules/graphql/execution/execute.js", line 646, col 12, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 595, col 23, in completeValue
  File "app:///../node_modules/graphql/execution/execute.js", line 497, col 9
  File "node:internal/process/task_queues", line 95, col 5, in processTicksAndRejections

The recursion might be causing issues here (varying amount of frames even though root cause is the same).

timfish commented 9 months ago

Is there something we can add/improve in the server-side fingerprinting to help with this?

Or maybe this stack trace rule is causing this? https://github.com/getsentry/sentry/blob/62a16a2dcb1a5e31314036bae969988ce69dcd94/src/sentry/grouping/enhancer/enhancement-configs/newstyle%402023-01-11.txt#L70-L76

# Android ANR: Exception stack is a snapshot of the UI/main thread. The
# *outermost* in-app frame is most indicative of which user action has led to ANR,
# and that's what we want to group by. (innermost=crashing frame)
#
# Note: Newer Android SDKs send the snapshot flag with ANRs, so this rule is
# not strictly necessary.
error.mechanism:ANR invert-stacktrace=1

AbhiPrasad commented 9 months ago

Is there something we can add/improve in the server-side fingerprinting to help with this

Not sure, I'm not too familiar with our grouping code here. Let me ping and ask.

armenzg commented 8 months ago

Are there any known error types that should not use the stack trace for grouping? Is this more a node issue than graphql?

Is this custom fingerprint sufficient to fix the problem? error.type:RangeError runtime.name:node -> some_fingerprint

@ollipa @timfish does this work for you?

As a fix on Sentry, we could add an internal fingerprint (we do not have support for this) rather than expecting customers to do so.

timfish commented 8 months ago

This is a node issue but it might affect other platforms since it's entirely related to how App Not Responding errors are captured. When we detect that the main event loop is blocked, we pause via the debugger interface and capture a stack trace. The exact stack trace will depend on where exactly the debugger pauses but there will usually be a common part of the stack trace.

For example, take this code:

function longRunningProcess1 {
  // blocking that takes 50ms to complete
}

function longRunningProcess12{
  // blocking that takes 25ms to complete
}

function longWork() {
  for (let i = 0; i < 20; i++) {
    longRunningProcess1();
    longRunningProcess2();
  }
}

longWork();

longWork will block for 1500ms and our ANR detection might capture a stack trace somewhere between 1000-1050ms. Depending on the timing we will pause in longRunningProcess1 maybe 66% of the time, longRunningProcess2 33% of the time and very occasionally we'll pause in longWork. This causes the events to be in 3 separate groups but these should probably all be grouped under longWork as this is the common "ancestor".

The GraphQL + Apollo grouping issue above appears to be caused by the fact that the long running blocking process we stop and capturing a stack trace from is recursive. The stack traces do have a common ancestor but it's distance to the top of the stack varies.

getsentry / sentry-javascript