Open AbhiPrasad opened 10 months ago
Is there something we can add/improve in the server-side fingerprinting to help with this?
Or maybe this stack trace rule is causing this? https://github.com/getsentry/sentry/blob/62a16a2dcb1a5e31314036bae969988ce69dcd94/src/sentry/grouping/enhancer/enhancement-configs/newstyle%402023-01-11.txt#L70-L76
# Android ANR: Exception stack is a snapshot of the UI/main thread. The
# *outermost* in-app frame is most indicative of which user action has led to ANR,
# and that's what we want to group by. (innermost=crashing frame)
#
# Note: Newer Android SDKs send the snapshot flag with ANRs, so this rule is
# not strictly necessary.
error.mechanism:ANR invert-stacktrace=1
Is there something we can add/improve in the server-side fingerprinting to help with this?
Not sure, I'm not too familiar with our grouping code here. Let me ping and ask.
Are there any known error types that should not use the stack trace for grouping? Is this more of a Node issue than a GraphQL one?
Is this custom fingerprint sufficient to fix the problem?
error.type:RangeError runtime.name:node -> some_fingerprint
@ollipa @timfish does this work for you?
As a fix on Sentry, we could add an internal fingerprint (we do not have support for this) rather than expecting customers to do so.
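Until something like that exists server-side, a client-side workaround can be sketched with the SDK's `beforeSend` hook. This mirrors the proposed rule above; `some_fingerprint` is a placeholder, and the `RangeError` check is an assumption, not a confirmed fix:

```javascript
// Hypothetical client-side equivalent of the proposed server-side rule
// `error.type:RangeError runtime.name:node -> some_fingerprint`.
// "some_fingerprint" is a placeholder value, as above.
function anrFingerprint(event, hint) {
  const error = hint && hint.originalException;
  if (error instanceof RangeError && process.release.name === "node") {
    event.fingerprint = ["some_fingerprint"];
  }
  return event;
}

// Wired into the SDK it would look like:
// Sentry.init({ dsn: "...", beforeSend: anrFingerprint });
```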
This is a node issue but it might affect other platforms since it's entirely related to how App Not Responding errors are captured. When we detect that the main event loop is blocked, we pause via the debugger interface and capture a stack trace. The exact stack trace will depend on where exactly the debugger pauses but there will usually be a common part of the stack trace.
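The detection half of that can be sketched as a simple heartbeat. This is a simplified illustration only; the real SDK runs the watchdog off the main thread and uses the debugger interface to pause and capture the stack:

```javascript
// Simplified sketch of blocked-event-loop detection via a heartbeat.
// The actual SDK does this from a separate thread so the watchdog itself
// isn't blocked; here everything is in-process for illustration only.
const THRESHOLD_MS = 1000; // how long the loop may stall before we call it an ANR
let lastTick = Date.now();

// Heartbeat: this callback can only run while the event loop is free.
setInterval(() => {
  lastTick = Date.now();
}, 50).unref();

// In the real setup a watchdog thread polls this; if it returns true,
// the debugger is asked to pause and a stack trace is captured.
function eventLoopBlocked() {
  return Date.now() - lastTick > THRESHOLD_MS;
}
```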
For example, take this code:
function longRunningProcess1() {
  // blocking work that takes 50ms to complete
}

function longRunningProcess2() {
  // blocking work that takes 25ms to complete
}

function longWork() {
  for (let i = 0; i < 20; i++) {
    longRunningProcess1();
    longRunningProcess2();
  }
}

longWork();
`longWork` will block for 1500ms, and our ANR detection might capture a stack trace somewhere between 1000-1050ms. Depending on the timing, we will pause in `longRunningProcess1` maybe 66% of the time, in `longRunningProcess2` 33% of the time, and very occasionally we'll pause in `longWork` itself. This causes the events to end up in 3 separate groups, but these should probably all be grouped under `longWork`, as this is the common "ancestor".
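The percentages above follow directly from how the 1500ms of blocking is split between the two functions (a worked check of the numbers, nothing SDK-specific):

```javascript
// Worked numbers for the example above: 20 iterations of 50ms + 25ms of blocking.
const iterations = 20;
const totalBlockedMs = iterations * (50 + 25); // 1500ms in total
// A pause at a random instant during the blockage lands in whichever
// function is executing, so the odds are proportional to time spent:
const pProcess1 = 50 / (50 + 25); // ~0.67 -> "maybe 66% of the time"
const pProcess2 = 25 / (50 + 25); // ~0.33 -> "33% of the time"
```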
The GraphQL + Apollo grouping issue above appears to be caused by the fact that the long-running blocking code we pause and capture a stack trace from is recursive. The stack traces do have a common ancestor, but its distance from the top of the stack varies.
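One way such recursive traces could be made to group together is to collapse consecutive repeats of the same frame before fingerprinting. This is a hypothetical normalization, not something the SDK or server currently does; frame names are illustrative and ordered outermost-first:

```javascript
// Hypothetical normalization: collapse consecutive duplicate frames so two
// ANR stacks that differ only in recursion depth share a fingerprint.
function collapseRecursion(frames) {
  const out = [];
  for (const frame of frames) {
    // Keep a frame only if it differs from the one just kept.
    if (out[out.length - 1] !== frame) out.push(frame);
  }
  return out;
}

function stackFingerprint(frames) {
  return collapseRecursion(frames).join(" > ");
}
```

With this, `["longWork", "resolveField", "resolveField", "resolveField"]` and `["longWork", "resolveField"]` produce the same key, so the varying recursion depth no longer splits the group.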
Is there an existing issue for this?
How do you use Sentry?
Sentry Saas (sentry.io)
Which SDK are you using?
@sentry/node
SDK Version
7.74.0
Framework Version
No response
Link to Sentry event
No response
SDK Setup
No response
Steps to Reproduce
Expected Result
Everything is grouped well
Actual Result
From @ollipa, raised as part of https://github.com/getsentry/sentry-javascript/issues/9307
The recursion might be causing issues here (a varying number of frames even though the root cause is the same).