customerio / customerio-reactnative

MIT License
25 stars 13 forks source link

Android ApplicationNotResponding - addTask in io.customer.sdk.queue.QueueImpl #266

Closed florin-bc closed 5 months ago

florin-bc commented 6 months ago

SDK version: 3.7.0

Environment: Production

Are logs available? No customer.io logs available.

Describe the bug We've been using Sentry to track bugs in production and since installing your SDK we've been observing some ANR (Application Not Responding) crashes/hangs, both in background (background ANR) and in foreground, for our Android users. It seems to get stuck at addTask from io.customer.sdk.queue.QueueImpl. It seems to happen on:

To Reproduce Can't seem to be able to reproduce it.

Expected behavior Not to receive ANR crashes/hangs anymore.

Screenshots

Additional context

Is there any way we can treat this case and either force flush the information somehow, retry later or even drop it if we get to this state?

mrehan27 commented 6 months ago

Hey @florin-bc. Thanks for reaching out and providing great details. Sorry to hear about the unpleasant experience and the ANR issues caused by the SDK in your app.

I understand that reproducing production issues and gathering more details can be challenging, but I'd still like to ask a few questions that might help us identify the root cause:

For your question

Is there any way we can treat this case and either force flush the information somehow, retry later or even drop it if we get to this state?

Unfortunately, there is currently no way to flush and skip this state if it occurs. The issue seems to happen on fetching device token, and we currently do this every time the SDK is initialized.

I understand you may not have all the answers or only have limited information, but it's difficult for us to implement a fix without pinpointing the root cause. Appreciate your patience as we work through this together. Please share any additional information you think might help us. Have a great day.

florin-bc commented 6 months ago

Hey @florin-bc. Thanks for reaching out and providing great details. Sorry to hear about the unpleasant experience and the ANR issues caused by the SDK in your app.

I understand that reproducing production issues and gathering more details can be challenging, but I'd still like to ask a few questions that might help us identify the root cause:

  • Can you confirm if your app is initializing the SDK only once throughout its lifecycle?
  • Can you share the approximate frequency of these issues? e.g. is it happening for most of your users or just a few?
  • Is there any user behavior, screen tracking, or other methods that might help us determine a pattern for these issues?
  • Is there any trace or data that might indicate the number of unsent events around the time the ANR occurred?
  • Are there any other similarities, such as issues occurring for specific device manufacturers, network conditions, or other settings?
  • Lastly, can you think of anything your app might be doing differently that could lead us to the root cause?

For your question

Is there any way we can treat this case and either force flush the information somehow, retry later or even drop it if we get to this state?

Unfortunately, there is currently no way to flush and skip this state if it occurs. The issue seems to happen on fetching device token, and we currently do this every time the SDK is initialized.

I understand you may not have all the answers or only have limited information, but it's difficult for us to implement a fix without pinpointing the root cause. Appreciate your patience as we work through this together. Please share any additional information you think might help us. Have a great day.

Also good to mention is that Sentry is pointing to Suspect Root Cause of "The main thread is blocked/waiting, trying to acquire lock 0x09d97279 (io.customer.sdk.queue.QueueImpl) held by the suspect frame of this thread."

image image

mrehan27 commented 6 months ago

Thanks a lot @florin-bc. This information is really helpful. Unfortunately, it's hard for me to confirm if any fix we apply will definitively resolve the problem without first being able to reproduce it. To help me in this, can you please tell us where exactly in your app's lifecycle you are updating profile attributes? This information will help me add similar code and enforce a prolonged lock to help identify the root cause. Additionally, if there is anything specific you are doing with profile attributes that might help us identify the problem, such as updating with objects or in a loop, or anything else that might be specific to those few affected users, that would be very useful.

Regarding the other points you raised, they are indeed very helpful. It seems like the lock is blocking, but it shouldn't wait too long and should exit soon, allowing other events to proceed. As for the push notifications being returned multiple times, it seems very unlikely because we fetch the token from FCM only once using its task, so it's hard to see how it could run multiple times. I suspect something else might be blocking the queue, possibly due to a failed operation or a prolonged task, but I need more clues to confirm this.

We haven’t encountered many similar cases before, so it's a bit challenging for me to pinpoint the exact root cause at this moment. However, I can assure you that we have taken note of this issue and will work on the fix as soon as we identify the root cause. I appreciate your continued support while we debug this together. Please keep your ideas and suggestions coming. Have a great day.

florin-bc commented 6 months ago

Hi again. Sorry for the late reply, but this issue has recently lost priority internally. After more investigation on our side and your response, we have concluded that it is just a small hang sometimes :slightly_smiling_face:, usually when our app starts up. We've managed to minimize these issue by simply spacing out the events sent and basically work-around the ANR.

There are a couple of things I left out / was not fully clear about:

  1. We applied a patch on the expo customerio plugin library to make sure "config.autoFetchDeviceToken = false" (everywhere where ios/PushService.swift template exists, so it appears after expo prebuild command runs)

    • reasoning: since the native versions of the library do not support notifications via FCM, and only support APNS, leaving autoFetchDeviceToken on true would mess with Firebase swizzling and would override a lot of functionality, rendering flows like push notifications non-functional, as SDKs would not forward the APNS to firebase and the rest of our libraries :slightly_smiling_face: this took a bit to debug and was a bit of a hassle
    • what we did after was take the firebase token from RN and send it to CIO, basically creating multiple devices, with FCM -> that is why we are saying that the token may be sent multiple times since RN, Firebase and our flows inside the app of "check permissions allowed, if so, update token on cio and in database" potentially running multiple times at start, or onForeground etc.
    • the result makes sure we have the tokens CIO + only having FCM notifications set up in configuration => only 1 notification per device :wink:
  2. We sometimes send around 50-70 events one after another. It is possible that they do not get to flush immediately - i think this should not impact anything overall but thought to mention it

  3. When our app starts up, there are "a few" endpoints called, that each can update the user traits or send events when done ... these updates may be all triggering around the same time, at a ms level, hence my input on them blocking each other + sentry logs

    • because of how everything is organized we have to send traits individually and cannot always bulk commit
  4. Similar to point 3, we also have some polling events (every 30s or earlier if buttons are pressed) that could trigger many user traits or send events simultaneously.

  5. There is something that I also recently fixed related to event properties not being Encodable and failing to send the events completely (according to the native logs). The value was of course null :slightly_smiling_face:. I'm thinking maybe if this happens or any other error appears when the queue chooses what type of event to prioritize and triggers a longer delay than needed

I'm afraid that's all the info I can provide on the subject, but I hope it is helpful.

P.S. Userful tags for others :slightly_smiling_face: push notifications do not work customer.io, APNs, firebase, not working, push token

mrehan27 commented 5 months ago

Hey @florin-bc. Thanks for great details and confirmation about the behavior. I'm glad it isn't a major issue, but we'll still take a closer look to see if we can eliminate the small hang as well.

For your other queries, here are the responses:

  1. It looks like you are using our Expo plugin, but since it doesn't support FCM on iOS, you had to make additional tweaks to make it work. If this is correct, can you consider utilizing our React Native package instead? If not, can you share more about your experience while integrating so we can note this down internally for future improvements? Also, it would be helpful to know if you are using the bare or managed workflow on Expo. If you still prefer using Expo and registering the token yourself, you can try skipping the registration for devices running on Android using platform check. Android devices should automatically be registered for you, which might help.

  2. This is helpful. We currently do not batch requests, so they may take some time. This is noted, and we'll consider improvements for this in the future.

  3. This is helpful too. I understand it may not be possible for you to combine traits, but having something like a store that caches and combines all those traits and sends them after a short wait might help reduce the lag. I understand this requires some effort on your end as well. We'll take a closer look to see if we can avoid locking the thread in such cases.

  4. If it is possible and you can combine and wait to send user traits, that might reduce some ANRs.

  5. Ideally, this shouldn't happen. Even if the event is not encodable, it should just throw a proper error or be skipped, but it shouldn't block or crash the app. If you notice anything related in the logs, please share it with us so we can investigate further.

Thank you for all the great details. This is really helpful and has been noted. While the improvements might not be reflected very soon, I can assure you we are keeping a close eye on this and will do our best to minimize such cases and improve the user experience.

Since there are no pending issues at the moment and all your feedback is noted internally, I'll go ahead and close this issue. Even if the issue is closed, please feel free to ask any questions or share any feedback you have. Thanks again. Have a great day!

florin-bc commented 5 months ago

Got it, much appreciated!

For 1.: We use it already, the problem is if the swizzling is broken :slightly_smiling_face: well, nothing works. Our patch just allows RN to receive the APNS token, the rest will still work as the SDK or the expo plugin already have listeners in place if swizzling is not active on the expo-plugin side :slightly_smiling_face:. I feel like the code that we disabled via this patch could be removed.

"customerio-expo-plugin": "1.0.0-beta.15",
"customerio-reactnative": "3.7.0",

customerio-expo-plugin+1.0.0-beta.15.patch

diff --git a/node_modules/customerio-expo-plugin/lib/commonjs/helpers/native-files/ios/PushService.swift b/node_modules/customerio-expo-plugin/lib/commonjs/helpers/native-files/ios/PushService.swift
index 0a0db95..047ddf7 100644
--- a/node_modules/customerio-expo-plugin/lib/commonjs/helpers/native-files/ios/PushService.swift
+++ b/node_modules/customerio-expo-plugin/lib/commonjs/helpers/native-files/ios/PushService.swift
@@ -22,6 +22,7 @@ public class CIOAppPushNotificationsHandler : NSObject {
     }
     MessagingPushAPN.initialize { config in
       config.showPushAppInForeground = {{SHOW_PUSH_APP_IN_FOREGROUND}}
+      config.autoFetchDeviceToken = false
     }
   }

diff --git a/node_modules/customerio-expo-plugin/lib/module/helpers/native-files/ios/PushService.swift b/node_modules/customerio-expo-plugin/lib/module/helpers/native-files/ios/PushService.swift
index 0a0db95..047ddf7 100644
--- a/node_modules/customerio-expo-plugin/lib/module/helpers/native-files/ios/PushService.swift
+++ b/node_modules/customerio-expo-plugin/lib/module/helpers/native-files/ios/PushService.swift
@@ -22,6 +22,7 @@ public class CIOAppPushNotificationsHandler : NSObject {
     }
     MessagingPushAPN.initialize { config in
       config.showPushAppInForeground = {{SHOW_PUSH_APP_IN_FOREGROUND}}
+      config.autoFetchDeviceToken = false
     }
   }

diff --git a/node_modules/customerio-expo-plugin/src/helpers/native-files/ios/PushService.swift b/node_modules/customerio-expo-plugin/src/helpers/native-files/ios/PushService.swift
index 0a0db95..047ddf7 100644
--- a/node_modules/customerio-expo-plugin/src/helpers/native-files/ios/PushService.swift
+++ b/node_modules/customerio-expo-plugin/src/helpers/native-files/ios/PushService.swift
@@ -22,6 +22,7 @@ public class CIOAppPushNotificationsHandler : NSObject {
     }
     MessagingPushAPN.initialize { config in
       config.showPushAppInForeground = {{SHOW_PUSH_APP_IN_FOREGROUND}}
+      config.autoFetchDeviceToken = false
     }
   }
mrehan27 commented 5 months ago

Thanks a lot for the details @florin-bc. I'm checking this with team so we can decide whether to include this patch. If not, I'll get back with more information. Appreciate your input and continuous support and details on the issue. Have a great day.