Open nebsta opened 6 months ago
Hey @nebsta,
I tried to reproduce the issue, but couldn't. I see that you have your own wrapper around SenrySpan. Can you share more details on how to initialize your instance of SentrySpan so we can reproduce the issue? A small sample project that reproduces the issue would even be the best. Furthermore, it would be great if you could share the crash report with us.
It seems like the culprit is #3892, which moved capturing the transaction to a background thread.
Hey @nebsta,
I tried to reproduce the issue, but couldn't. I see that you have your own wrapper around SenrySpan. Can you share more details on how to initialize your instance of SentrySpan so we can reproduce the issue? A small sample project that reproduces the issue would even be the best. Furthermore, it would be great if you could share the crash report with us.
It seems like the culprit is #3892, which moved capturing the transaction to a background thread.
@philipphofmann sorry for the late reply. Didn't get a notification about you responding. Here is a snippet of what the wrapper looks like. We mainly have this so that we can swap out reporting libraries in the future if we want to. As you can see it's pretty simple, nothing too fancy.
// Copyright © 2023 Trustpilot. All rights reserved.
import Foundation
import Sentry
class SentrySpan: SentrySpanProtocol {
private let sentrySpan: Span
init(span: Span) {
self.sentrySpan = span
}
func startChild(operation: String) -> SentrySpanProtocol {
let sentrySpan = self.sentrySpan.startChild(operation: operation)
return SentrySpan(span: sentrySpan)
}
func finish(status: TrustpilotSpanStatus) {
sentrySpan.finish(status: status.toSentrySpanStatus())
}
func finish() {
sentrySpan.finish()
}
}
extension TrustpilotSpanStatus {
func toSentrySpanStatus() -> SentrySpanStatus {
switch self {
case .internalError: return .internalError
}
}
}
// Copyright © 2023 Trustpilot. All rights reserved.
import Foundation
protocol SentrySpanProtocol {
func startChild(operation: String) -> SentrySpanProtocol
func finish(status: TrustpilotSpanStatus)
func finish()
}
enum TrustpilotSpanStatus {
case internalError
}
I'll see if I can get a full crash report from the sentry dashboard and post it here soon.
Thanks for the update, @nebsta. The full crash report would be helpful. I think I see a couple of occurrences of this in our internal SDK crash detection.
@philipphofmann classically I can't seem to replicate the issue anymore, so I'm unable to get the crash log for you. But by the sounds of it, it seems like it's a known issue if you have seen occurrences of it.
Yes, I think we should have enough information to fix it. Thanks for the update.
@philipphofmann any update on a fix?
@nebsta, I'm sorry no, not yet. We didn't get to this yet.
@philipphofmann are there any updates on this issue? We are currently blocked with upgrading to the latest sentry, as if we do, it will cause most (if not all) of our users to experience a crash on start up.
Hi @nebsta, still no updates. But we will keep this issue up to date.
Hey @krystofwoldrich @philipphofmann - Just to update on this, this issue is currently blocking us from upgrading to Xcode 16 as 8.20.0 won't compile with Xcode 16 and upgrading to 8.36.0 (the latest version) still causes this crash. Were you able to reproduce this issue?
Hello @P-Foley-Trustpilot can you share the error Xcode 16 gives with version 8.20.0?
Hi @brustolin, i've attached the errors list and where they occur
Oh yes. We have fixes for those missing references. And Im sorry, but we dont have any feedback on the Span issue, we couldn't reproduce it.
Oh yes. We have fixes for those missing references. And Im sorry, but we dont have any feedback on the Span issue, we couldn't reproduce it.
This seem to be at odds with what @philipphofmann was saying earlier in the thread. He mentioned that you are seeing these issues and that you have all the information you need to resolve it
philipphofmann on May 29 Thanks for the update, @nebsta. The full crash report would be helpful. I think I see a couple of occurrences of this in our internal SDK crash detection.
philipphofmann on May 29 Yes, I think we should have enough information to fix it. Thanks for the update.
I see. Yeah, maybe @philipphofmann knows something that I dont. Let me align with him.
We fixed the issue of the missing reference mentioned in https://github.com/getsentry/sentry-cocoa/issues/3977#issuecomment-2358020031 with https://github.com/getsentry/sentry-cocoa/issues/4050 shipped in 8.29.0.
Now, we only need to fix this error. I bumped the priority for this, but I can't give you an ETA yet.
We fixed the issue of the missing reference mentioned in #3977 (comment) with #4050 shipped in 8.29.0.
Now, we only need to fix this error. I bumped the priority for this, but I can't give you an ETA yet.
Thanks for the response @philipphofmann. We'll wait for an ETA from you 👍
@nebsta, unfortunately, all the crashes I see in our internal SDK crash detection are before 8.13.1, in which we fixed a crash in SentryTracer.finishInternal
with #3333. I can't reproduce the issue, and I don't have a crash report. Therefore, I can't fix the underlying problem here. Do you have a repro or a crash report, @nebsta? Otherwise, I can't do anything at the moment.
@nebsta, unfortunately, all the crashes I see in our internal SDK crash detection are before 8.13.1, in which we fixed a crash in
SentryTracer.finishInternal
with #3333. I can't reproduce the issue, and I don't have a crash report. Therefore, I can't fix the underlying problem here. Do you have a repro or a crash report, @nebsta? Otherwise, I can't do anything at the moment.
Hey @philipphofmann interestingly, I can only seem to replicate the issue on a simulator, but not on device. If we upgrade from 8.20.0 -> 8.36.0, then we see the same exception as I posted originally on simulator, but it seems fine on device. This might just be due to the limited testing I'm doing. We are going to be doing more rigorous regression testing on our Swift 6/XCode 16 update, so we will report back if we can see the exception happening then on devices.
@nebsta, thanks for the update. Can you share a stacktrace of a crash on a simulator?
@nebsta, thanks for the update. Can you share a stacktrace of a crash on a simulator?
I can't seem to export the detailed stack trace. The best I can do is the screenshot that I originally attached to the ticket
@philipphofmann I had a thought the other day. Are the span operations thread safe? Because we're currently calling span.finish within an actor, so it's being executed in the threadpool and not on the main actor. Could that potentially be why we're seeing this problem?
Hello @nebsta, span operations are thread safe, finishing it outside the main actor should not cause a problem.
Do you have Sentry profiling enabled? If so, can you test without it?
Hello @nebsta, span operations are thread safe, finishing it outside the main actor should not cause a problem.
Do you have Sentry profiling enabled? If so, can you test without it?
Yeah we do, I can turn it off and see if that resolves the issue
@brustolin I tried setting the profiling sample rate to 0, and it doesn't seem to be happening anymore, though I also noticed before changing the same rate, that the exception was happening on my device via XCode even on a fresh install. Which is different to how we originally reported it, where it would only happen when we updated from an older version of sentry.
@brustolin @philipphofmann I did notice one thing that may be related to this issue. Would like to get your thoughts on it.
Basically we use the manually profiling transactions/spans for timing how long our bootstrapping process takes in the app. Our Bootstrapping system is broken up into steps, with each step being a single span within the overall bootstrap transaction.
I just noticed that there is actually a bug whereby if one of the bootstrapping items fails, we actually never finish the running span, or the running transaction. This would also makes sense, as I have also seen our profiling sometimes showing ridiculously large bootstrapping times, which this is probably the reason for.
The question is that if we don't finish a span/transaction, could there be issue here if we try to finish the same span/transaction when a previous one wasn't finished correctly?
The question is that if we don't finish a span/transaction, could there be issue here if we try to finish the same span/transaction when a previous one wasn't finished correctly?
Yes, that could cause problems, but he SDK shouldn't crash. If you can confirm that this fixes the problem, it's easier for us to investigate the root cause.
The question is that if we don't finish a span/transaction, could there be issue here if we try to finish the same span/transaction when a previous one wasn't finished correctly?
Finishing or refinishing a span should not matter; we have safeguards for these scenarios.
If you could provide us with a reproduction of your flow and how you use transactions/spans, it would help us a lot in identifying whether we have a bug in the said safeguards.
@brustolin I've outlined an explanation of how we're using them below, including some code examples as well. Hopefully that gives you a better picture.
The main class where we use the spans and where this exception is happening is called the Bootstrapper
. This is an actor
, which goes through a list of BootstrapItem
instances (which are each themselves actor
instances as well), and executes some setup work that we need to perform on the startup of our application.
You can see a full gist of what the Boostrapper
looks like here, and what one of the BootstrapItem
implementations look like here
If you look within the Bootstrapper.loadAll
method, you'll see where we are creating/finishing all the spans for the bootstrap process. We start 1 span for the entire bootstrap process, and then spin off child spans for each item so that we can get a better breakdown of where our bottlenecks are.
let transaction = dependencies.sentry.startTransaction(name: "Bootstrap", operation: "Start") // HERE
for item in items {
let itemName = String(describing: type(of: item))
TLog.bootstrap.info("Loading item: \(itemName)")
dependencies.sentry.addBreadcrumb(BootstrapItemBreadcrumb.started(name: itemName))
var span: SentrySpanProtocol?
do {
span = transaction?.startChild(operation: itemName) // HERE
try await item.load()
try Task.checkCancellation()
dependencies.sentry.addBreadcrumb(BootstrapItemBreadcrumb.finished(name: itemName))
span?.finish()
} catch {
TLog.bootstrap.error("Error encountered on item: \(itemName), error: \(error)")
span?.finish(status: .internalError) // HERE
transaction?.finish() // HERE
dependencies.sentry.addBreadcrumb(BootstrapItemBreadcrumb.failed(name: itemName))
/// Some bootstrap items may not need to raise errors up to Sentry, so we log them here instead.
guard item.shouldRaiseFailureToSentry else {
TLog.bootstrap.error("Error encountered on item: \(itemName), error: \(error)")
return
}
throw BootstrapError.itemFailed(itemName: String(itemName), error: error)
}
}
transaction?.finish() // HERE
As I mentioned, there is a bug in this implementation, as if there is an exception in any of the bootstrap items, then we don't properly finish the span for the entire bootstrap process, which is something we're going to fix on our end. This is why I was wondering if perhaps this could be the reason for our crash, as I mentioned that we do sometimes see some instances in the 99th percentile that are reported to be going for as long as a couple of days, so I wonder if there is an issue in the mechanism for ensure that spans are always cleaned up.
Hopefully that's enough to give you a better picture.
I should also mention that the spans we're using the Bootstrapper
are actually our own span class (which you can see here, which is just a light wrapper around the Sentry Span
class. We do this so that our reporting manager is SDK agnostic, and so we can more easily write unit testing.
I looked at the diff in SentryTracer from 8.20.0 to 8.25.0 and couldn't find anything suspicious that could cause this crash.
@nebsta, can you try to enable the Sentry Cocoa SDK debug logs with options.debug = true
and share them with us here please? That could help us identify the problem.
Also, can you try to update to the latest version 8.36.0, and check if the problem persists? I know you currently can't update to a higher version than 8.25.0, but if you could check that in debug and it works, we know the problem was fixed.
There is a small chance that https://github.com/getsentry/sentry-cocoa/pull/4380 fixes the problem, but I doubt it because canceling the deadline timer for the manual transaction should be a NoOp as the deadline timer is nil.
I looked at the diff in SentryTracer from 8.20.0 to 8.25.0 and couldn't find anything suspicious that could cause this crash.
@nebsta, can you try to enable the Sentry Cocoa SDK debug logs with
options.debug = true
and share them with us here please? That could help us identify the problem.Also, can you try to update to the latest version 8.36.0, and check if the problem persists? I know you currently can't update to a higher version than 8.25.0, but if you could check that in debug and it works, we know the problem was fixed.
There is a small chance that #4380 fixes the problem, but I doubt it because canceling the deadline timer for the manual transaction should be a NoOp as the deadline timer is nil.
@philipphofmann We have been attempting to upgrade to 8.36.0 instead since it came out, but we're still getting the same issue. I'll try enabling the logs and see how that goes
@philipphofmann I've attached the logs that appear when enabling debug leading up to the crash. One thing I also noticed actually looking at the code, which you'll also see in the logs, is that we actually start sentry on boot, then we stop it after the bootstrapper and then call start agains straight after. I can't remember exactly why we do this, but I wonder if this could also be causing issues.
The last log message is
[Sentry] [debug] [SentrySDK:224] SDK main thread init started...
The logs don't show more log messages for installing all the integrations. Do you start your transaction leading to a crash directly after initializing the SDK a second time? I tried to reproduce your problem by starting the SDK, stopping it, starting it again and then directly capturing a transaction, but it works flawlessly.
is that we actually start sentry on boot, then we stop it after the bootstrapper and then call start agains straight after. I can't remember exactly why we do this, but I wonder if this could also be causing issues.
Can you maybe try not stopping and starting the SDK and check if the crash still occurs? If it disappears, it would make it easier to identify the problem.
@philipphofmann
Do you start your transaction leading to a crash directly after initialising the SDK a second time?
The finish on the transaction that triggers the crash is before we stop and start the system.
Can you maybe try not stopping and starting the SDK and check if the crash still occurs? If it disappears, it would make it easier to identify the problem.
I'll give this a go today and get back to you
It looks like this has been resolved when updating from 8.20.0 to 8.39.0! Was there a change in 8.39.0 that was expected to fix this?
@P-Foley-Trustpilot, we made so many improvements between these versions for that part of the SDK that it's hard to pin it down. It's great that it is fixed for you. Maybe others could retry with 8.39.0 and tell us if it's fixed.
Platform
iOS
Environment
Develop
Installed
Swift Package Manager
Version
8.25.0
Did it work on previous versions?
8.20.0
Steps to Reproduce
Expected Result
We expect the crash not to occur
Actual Result
Few other notes:
Are you willing to submit a PR?
No response