getsentry / sentry-unity

Development of Sentry SDK for Unity
https://docs.sentry.io/platforms/unity/
MIT License
207 stars 51 forks source link

Intermittent splash screen hang on Android #286

Closed adamsoutar closed 2 years ago

adamsoutar commented 3 years ago

Environment

How do you use Sentry? Sentry SaaS (sentry.io)

Which SDK and version? Sentry Unity 0.4.3 (https://github.com/getsentry/unity.git#0.4.3 via UPM)

Steps to Reproduce

  1. Add Sentry to the project and enter the DSN
  2. Build an Android APK
  3. Run the Android app

Result

When running the built Android app, sometimes it will work, but roughly 1 in 10 starts of the app will hang indefinitely on the splash screen. This was a baffling issue, and it is resolved by removing sentry-unity from the project.

Looking at adb logcat for one of the hangs, it seems to print this:

08-20 17:17:07.104 27021 27036 I zygote  : Rejecting re-init on previously-failed class java.lang.Class<com.unity3d.player.AssetPackManagerWrapper$b>: java.lang.NoClassDefFoundError: Failed resolution of: Lcom/google/android/play/core/assetpacks/AssetPackStateUpdateListener;
08-20 17:17:07.104 27021 27036 I zygote  :   at boolean com.unity3d.player.UnityPlayer.nativeRender() ((null):-2)
08-20 17:17:07.104 27021 27036 I zygote  :   at boolean com.unity3d.player.UnityPlayer.access$300(com.unity3d.player.UnityPlayer) ((null):-1)
08-20 17:17:07.104 27021 27036 I zygote  :   at boolean com.unity3d.player.UnityPlayer$e$1.handleMessage(android.os.Message) ((null):-1)
08-20 17:17:07.104 27021 27036 I zygote  :   at void android.os.Handler.dispatchMessage(android.os.Message) (Handler.java:102)
08-20 17:17:07.104 27021 27036 I zygote  :   at void android.os.Looper.loop() (Looper.java:164)
08-20 17:17:07.104 27021 27036 I zygote  :   at void com.unity3d.player.UnityPlayer$e.run() ((null):-1)
08-20 17:17:07.104 27021 27036 I zygote  : Caused by: java.lang.ClassNotFoundException: Didn't find class "com.google.android.play.core.assetpacks.AssetPackStateUpdateListener" on path: DexPathList[[zip file "/data/app/me.overflo.mygame-Jtekv7VX5bBOly1L6sqzfA==/base.apk"],nativeLibraryDirectories=[/data/app/me.overflo.mygame-Jtekv7VX5bBOly1L6sqzfA==/lib/arm, /data/app/me.overflo.mygame-Jtekv7VX5bBOly1L6sqzfA==/base.apk!/lib/armeabi-v7a, /system/lib, /system/vendor/lib]]
08-20 17:17:07.104 27021 27036 I zygote  :   at java.lang.Class dalvik.system.BaseDexClassLoader.findClass(java.lang.String) (BaseDexClassLoader.java:125)
08-20 17:17:07.104 27021 27036 I zygote  :   at java.lang.Class java.lang.ClassLoader.loadClass(java.lang.String, boolean) (ClassLoader.java:379)
08-20 17:17:07.104 27021 27036 I zygote  :   at java.lang.Class java.lang.ClassLoader.loadClass(java.lang.String) (ClassLoader.java:312)
08-20 17:17:07.104 27021 27036 I zygote  :   at boolean com.unity3d.player.UnityPlayer.nativeRender() ((null):-2)
08-20 17:17:07.104 27021 27036 I zygote  :   at boolean com.unity3d.player.UnityPlayer.access$300(com.unity3d.player.UnityPlayer) ((null):-1)
08-20 17:17:07.104 27021 27036 I zygote  :   at boolean com.unity3d.player.UnityPlayer$e$1.handleMessage(android.os.Message) ((null):-1)
08-20 17:17:07.104 27021 27036 I zygote  :   at void android.os.Handler.dispatchMessage(android.os.Message) (Handler.java:102)
08-20 17:17:07.104 27021 27036 I zygote  :   at void android.os.Looper.loop() (Looper.java:164)
08-20 17:17:07.104 27021 27036 I zygote  :   at void com.unity3d.player.UnityPlayer$e.run() ((null):-1)

I wasn't able to make much sense of that, but it does seem to only be printed on the launches that hang. No logs are printed from my C# code on these launches, implying that it hasn't started to properly load the game yet.

The issue is not reproducible on iOS or PC. There isn't a consistent gap between launches that hang - sometimes they're only 1 or 2 apart, sometimes you can go for 10 or more without seeing anything go wrong. I had to sit and open and close the app over and over haha

Potentially useful info

Unity version - 2021.1.17f1 Scripting Backend - IL2CPP API Compatibility Level - .NET Standard 2.0 C++ Compiler Configuration - Master

Please do let me know if there's any other info that would be helpful for debugging this issue - I'd love to be able to use Sentry in my project. I might come back and attach a reproducible example project if I have the time this weekend.

bruno-garcia commented 3 years ago

Thanks for the details. The logcat logs don't seem to have anything related to Sentry, just google play stuff.

How long is the hang for? Does it get stuck on it forever? Or does it eventually go through the splash screen?

Did you have a chance to enable Sentry's SDK debug flag? You can do it through the Sentry window:

image

Make sure to unselect: Only In Editor so that the logs also happen in the Android player. And the level: Debug so we get as much info as possible.

bitsandfoxes commented 3 years ago

Hey @adamsoutar! We just released 0.5.0 today that addresses several issues, one of which was code-stripping related and could cause the SDK to fail to initialize at startup. (which would also lead to a lack of logs in logcat) It would be very appreciated if you could give this new version a try and let us know if it now works in your project.

adamsoutar commented 3 years ago

Hey, sorry I've not come back to this issue, I've been away for a few days. Thanks so much for trying to help! I've recompiled my project today with v0.5.0, switching between Release/Master compilation modes in case there were any more code stripping issues, but the issue still occurs.

Looking deeper into the hangs, they do eventually seem to go through. I've recorded the logcat of one of them here with Sentry debug output.

You can see the hang in the timestamps - I opened the app at 17:15:58, but the first log from my game code is at 17:17:29 - a minute and a half of splash screen.

Currently I have the game open on my Android test phone, and it's been sitting on the splash for >15 minutes. I assume it would eventually go through though.

This is why I thought it was an absolute freeze, I just never waited more than a minute on the splash. It seems every splash is delayed a bit by Sentry (they're all noticeably longer than comparable starts on iOS), but some are crazy long.

I can also reproduce the issue in Airplane mode (and even airplane mode all the way from boot), which to me suggests the variable hang lengths are not due to waiting on network activity. This issue has me a bit stumped 🤔

bruno-garcia commented 3 years ago

Based on your description it sounds like the issue is due to the InitCacheFlushTimeout option but that would only make sense if it was set to 2 minutes instead of 2 seconds as we defined the default:

https://github.com/getsentry/sentry-unity/blob/bc8a90833899e4df68a4d30717d3890701382be3/src/Sentry.Unity/ScriptableSentryUnityOptions.cs#L93

This is configured via the Sentry Window too: image

The motivation from this feature, as taken from the docs:

When caching is enabled (that is, CacheDirectoryPath is set), this option controls the timeout that limits how long the SDK will attempt to flush existing cache during initialization. Note that flushing the cache involves sending the payload to Sentry in a blocking operation. Setting this option to zero means that Sentry will not attempt to flush the cache during initialization, but instead will do so when the next payload is queued up.

If that value was 120000 instead of 2000 and the device was offline, and the HttpClient timeout was not smaller than that, it's possible this could happen.

This feature made more sense when the only envelope item type was of error events. Now that the SDK sends sessions, we should reconsider this. We should write a 'crash' trigger file on disk and only invoke the Flush On Init Blocking call in that case, to avoid any sort of slow down in case the app didn't crash the execution before.

Even though this doesn't explain the 2 minute splash screen delay, until we reconsider what triggers the blocking flush on init, we should disable this feature altogether in the Unity SDK.

To confirm this could be related to this option, could you set it to 0 and see if the issue happens again?

adamsoutar commented 3 years ago

Hey Bruno, I think you might be on to something!

I had not messed with the Transport section, here's how it looked while I was writing the earlier comments on this issue:

I tried setting the timeout to 0:

And that didn't fix it, it was still acting like before.

However, after disabling offline caching:

The game does not hang!

When I mentioned before that I'd had the splash screen open for >15 minutes, that actually lasted an hour and a half before I gave up and closed it, so definitely longer than the timeout.

It might be worth noting that I had the phone connected to stable WiFi for the whole time - it wasn't in Airplane mode and I would expect any uploads it was doing to succeed.

Seems like something in the offline caching feature is getting stuck for variable amounts of time, while not respecting the timeout?

bruno-garcia commented 3 years ago

Thanks for the additional context. It's good to know it's not related to the init cache timeout feature or to the fact the device was offline.

So seems like something going on with offline caching only, but I can't imagine what would block for such a huge amount of time (as opposed to deadlock, simply never completing loading).

It would be great if we a repro so we can debug this here. I haven't experience such case yet on my tests. Would you be able to share a project we can run on our end that (with offline caching enabled) reproduces the issue?

Offline caching is not only useful for mobile to capture errors when the devices is not connected to the Internet but also to help track sessions (release health) we can tell if the session exited gracefully or in an abnormal way (and with native support: Crashed free sessions)

adamsoutar commented 3 years ago

Absolutely. I'll work on trying to produce a reproducible example project. No guarantee I'll be able to, as I have no idea what's causing the crash, but we'll try 😃

adamsoutar commented 3 years ago

Hey, I was able to reproduce the issue very easily - just a blank new project with Sentry added does the trick. I was able to make repros for

Unity 2020.3.12f1: https://drive.google.com/file/d/1BhU0RFRtCsRN0wLl-6-xDep5Xq22g4C4/view?usp=sharing And Unity 2021.1.17f1 https://drive.google.com/file/d/1ky3v17HH_t4l4gEA4kyTAoYnZ_PBRXyd/view?usp=sharing

These are just the Unity versions I had lying around installed on my machine. I use 2021.1.17 for my game (the main project I filed this issue for).

To make the repros smaller to download, I removed the Library, Temp and Logs folders from the projects, but let me know if you need them - I can upload them. I also removed my Sentry DSN, leaving that field blank.

Just in case you're unable to reproduce, I've uploaded a video of what the issue looks like, and the amount of attempts it sometimes takes (happens at 2 minutes 20 seconds).

Hope these help! And thanks for your help so far

bruno-garcia commented 3 years ago

Thank you so much for the video and the repro. We'll definitely take a look at this really soon and get back to you.

Tanek81 commented 3 years ago

Hi, I found this thread and I am adding my 2 cents as we are having the same issue but we do not have the Sentry Unity plugin. I think this is Unity-related, it looks like the build is not including some Android classes on some specific devices, I am trying to understand if it depends on the Android version, currently it's crashing on two Android 7 devices, @adamsoutar what device/Android version are you using? I am investigating to see if there is already an open bug (here: https://forum.unity.com/threads/upload-aab-then-downloading-from-android-store-crashes-at-launch-but-apk-works-fine.1157939/ ) for their team, @bruno-garcia have you already tried contacting Unity team for this?

Thank you, I hope this helps :)

bruno-garcia commented 3 years ago

Thank you again for the repro, clearly it took some effort to do do it. Now through @Tanek81's comment it seem this isn't related to Sentry's SDK (and we couldn't really repro on our end) so I removed the bug label. Please keep the updates coming in this issue.

@bruno-garcia have you already tried contacting Unity team for this?

We did not. But thanks for giving us an update here about it

bruno-garcia commented 3 years ago

We managed to reproduce this and with further investigation from @bitsandfoxes it seems to only happen if Auto Session Tracking is turned on: image

We only managed to repro after many attempts so it's not very clear yet if this is an SDK bug related to sessions or something with Unity as @Tanek81 experienced. But we'll keep investigating and for the time being we suggest disabling Auto Session Tracking.

bitsandfoxes commented 3 years ago

So I've investigated this further but the lack of actual reproducibility makes it really really hard. If the debugger is to be trusted SDK sometimes (tm) gets stuck when creating a session when it runs DateTimeOffset.Now. I'm going out on a limb here and point the finger at IL2CPP magic because what else should or even could interfere with that?

bruno-garcia commented 2 years ago

@adamsoutar please let us know if this is still an issue for you. We haven't been able to repro this anymore

bruno-garcia commented 2 years ago

We had some Android Smoke Tests hang and we suspect it might be related to this issue so reopening:

image

bruno-garcia commented 2 years ago

Doesn't seem to be the case that the failing smoke test is due to splashscreen hang up. No more reports of this and no one in the team has experienced this in months.