airsdk / Adobe-Runtime-Support

Report, track and discuss issues in Adobe AIR. Monitored by Adobe - and HARMAN - and maintained by the AIR community.
197 stars 11 forks source link

Input dispatching timed out ANRs & ANR rate survey #29

Open neuronix opened 6 years ago

neuronix commented 6 years ago

Hi guys,

I am mirorring my post from the Starling forum here with Daniel's permission.

We have ANRs on Android in all our apps and would like to know if it's maybe our apps particularly, or if it's AIR related.

Could you please check your Google Dev Console and go to "Android Vitals" > "Overview".

The first section is "Bad Behaviors".

Please share the following values (no need to mention the app, just pick the biggest app you have or the average of all your apps please):

The main ANR we are experiencing is "Input dispatching timed out" triggered by "com.adobe.air.customHandler.callTimeoutFunction". There are heaps of topics on it on the web, and they only proper response by Adobe is that we should avoid doing heavy calculations on user input.

But all our apps use Starling, and Starling defers all user inputs to next frame so this should not be an issue.

skolesnyk commented 6 years ago

App with Air 27, Starling 1.8, Feathers 2.x ANR: 0.20% Crash rate: 0.23%

Seb-AS commented 6 years ago

Hi,

Our App(Starling 1.8-AIR 26/21):

ANR Rate : 4% Crash Rate: 5%

39 reports last 7 days:

Broadcast of Intent { act=android.intent.action.SCREEN_OFF flg=0x50000010 (has extras) }

"main" prio=5 tid=1 Native | group="main" sCount=1 dsCount=0 obj=0x74de5000 self=0xb7d81f78 | sysTid=21001 nice=-6 cgrp=bg_non_interactive sched=0/0 handle=0xb6f01bec | state=S schedstat=( 883042412234 69541281861 500913 ) utm=83394 stm=4910 core=1 HZ=100 | stack=0xbe568000-0xbe56a000 stackSize=8MB | held mutexes=

00 pc 000000000000f9b0 /system/lib/libc.so (syscall+28)

01 pc 00000000000138ab /system/lib/libc.so (pthread_join+90)

02 pc 00000000001fa2a5 /data/app/air.com.app1-1/lib/arm/libCore.so (???)

03 pc 00000000001fa34b /data/app/air.com.app1-1/lib/arm/libCore.so (???)

04 pc 00000000001c98d5 /data/app/air.com.app1-1/lib/arm/libCore.so (???)

05 pc 00000000001c3901 /data/app/air.com.app1-1/lib/arm/libCore.so (???)

06 pc 00000000003477df /data/app/air.com.app1-1/lib/arm/libCore.so (???)

07 pc 00000000003cee7b /data/app/air.com.app1-1/lib/arm/libCore.so (???)

08 pc 00000000003d10ed /data/app/air.com.app1-1/lib/arm/libCore.so (???)

09 pc 000000000012ebeb /data/app/air.com.app1-1/lib/arm/libCore.so (???)

10 pc 0000000000130b27 /data/app/air.com.app1-1/lib/arm/libCore.so (???)

11 pc 00000000001349b1 /data/app/air.com.app1-1/lib/arm/libCore.so (???)

12 pc 00000000001215e3 /data/app/air.com.app1-1/lib/arm/libCore.so (Java_com_adobe_air_customHandler_callTimeoutFunction+10)

13 pc 00000000001bbc1f /data/dalvik-cache/arm/data@app@air.com.app1-1@base.apk@classes.dex (Java_com_adobe_air_customHandler_callTimeoutFunction__II+90)

at com.adobe.air.customHandler.callTimeoutFunction (Native method) at com.adobe.air.customHandler.handleMessage (customHandler.java:22) at android.os.Handler.dispatchMessage (Handler.java:102) at android.os.Looper.loop (Looper.java:135) at android.app.ActivityThread.main (ActivityThread.java:5912) at java.lang.reflect.Method.invoke! (Native method) at java.lang.reflect.Method.invoke (Method.java:372) at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run (ZygoteInit.java:1405) at com.android.internal.os.ZygoteInit.main (ZygoteInit.java:1200)

10 reports last 7 days:

Input dispatching timed out (Waiting because the touched window has not finished processing the input events that were previously delivered to it.)

"main" prio=5 tid=1 NATIVE | group="main" sCount=1 dsCount=0 obj=0x415a4ca8 self=0x414de570 | sysTid=22564 nice=-6 sched=0/0 cgrp=apps handle=1074331988 | state=S schedstat=( 0 0 0 ) utm=4477 stm=515 core=2

00 pc 0000000000021910 /system/lib/libc.so (__futex_syscall3+8)

01 pc 000000000000ef94 /system/lib/libc.so (__pthread_cond_timedwait_relative+48)

02 pc 000000000000eff4 /system/lib/libc.so (__pthread_cond_timedwait+64)

03 pc 0000000000013007 /system/lib/libc.so (pthread_join+74)

04 pc 00000000001fa2a5 /data/app-lib/air.com.app1-1/libCore.so

at com.adobe.air.customHandler.callTimeoutFunction (Native Method) at com.adobe.air.customHandler.handleMessage (customHandler.java:22) at android.os.Handler.dispatchMessage (Handler.java:102) at android.os.Looper.loop (Looper.java:136) at android.app.ActivityThread.main (ActivityThread.java:5017) at java.lang.reflect.Method.invokeNative (Native Method) at java.lang.reflect.Method.invoke (Method.java:515) at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run (ZygoteInit.java:779) at com.android.internal.os.ZygoteInit.main (ZygoteInit.java:595) at dalvik.system.NativeStart.main (Native Method)

neuronix commented 6 years ago

Thanks to both of you! Since both apps have such different rates, could you give a little extra info?

Seb-AS commented 6 years ago

Do you have little or a lot of user inputs ? Yes, we have. Do you have any input intense situations like tapping or dragging on the screen? Yes, we have. Do you have ANEs included in the app (notably ads) ? Ads: AdCel, AdColony, SuperAwesome Some District Anes.(Contacts/mediaPlayer/flurry/more) Goviral Custom Anes.

neuronix commented 6 years ago

Thanks @Seb-AS, will be interesting to compare with @skolesnyk's app regarding the amount of user inputs.

skolesnyk commented 6 years ago

Scrolling lists, tapping on Sound Player play/pause buttons, text entry. There's FacebookAPI ANE integration, so some users switch to Facebook native app or browser and then back to the app. Using about 14 ANEs, mostly Distriqt, but also BranchIO, Fyber (with 2 ad networks): showing interstitials without user's action and showing Fyber Offerwall after user's interaction with UI.

MathiasFabian commented 6 years ago

App with Air 25, Starling 2.2 ANR: 7.5% Crash rate: 4.0%

Broadcast of Intent { act=android.intent.action.SCREEN_ON flg=0x50000010 } Broadcast of Intent { act=android.intent.action.SCREEN_OFF flg=0x50000010 } Broadcast of Intent { act=org.herozero.NOTIFICATION_DELAYED and a lot of Input dispatching timed out errors

App with Air 25, Starling 1.8 ANR: 2.6% Crash rate: 2.2%

Broadcast of Intent { act=android.intent.action.SCREEN_ON flg=0x50000010 } Broadcast of Intent { act=android.intent.action.SCREEN_OFF flg=0x50000010 } and a lot of Input dispatching timed out and keyDispatchingTimedOut errors

Do you have little or a lot of user inputs ? Yes, in both apps. Do you have any input intense situations like tapping or dragging on the screen? First app has more taps, second a lot of swiping. Do you have ANEs included in the app (notably ads) ? Both Apps use a lot of distriqt ANEs (Application, Billing, Facebook, ...) and for ads the Appodeal ANE.

neuronix commented 6 years ago

Thank you @MathiasFabian for sharing!

@PrimaryFeather this issue seems to be plaguing a lot if not just about all AIR apps. Could you get someone from the AIR team to look into this?

calibrae commented 6 years ago

App with Air 22 and Starling 1.9 ANR: 2.79% Crash 0.51%

Input dispatching timed out (Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago. Wait queue length: 5. Wait queue head age: 5502.9ms.)

Do you have little or a lot of user inputs ? No. Do you have any input intense situations like tapping or dragging on the screen? Yes, the app is an educational game for kids Do you have ANEs included in the app (notably ads) ? Distriqt Extension Files for Android app only.

chichilatte commented 6 years ago

EDIT: The Google Play Console was being extremely flakey – our most common ANR error was in fact Broadcast of Intent { act=android.intent.action.SCREEN_OFF ... They accounted for 77% of all ANRs, the rest being Input dispatching timed out. One thing to note is that the stack traces for both errors mention com.adobe.air.customHandler.callTimeoutFunction EDIT2: The Play Console is definitely flakey for us – using the same filters and time period as i used previously it's saying 90% of our ANR errors are Input dispatching timed out, 10% Broadcast of Intent { act=android.intent.action.SCREEN_OFF ....
Could it be these are the same error? We've spoken to Google and they say their tech team are having a look.


We can replicate the ANRs but it takes a while and they seem to happen randomly. We're completely flummoxed. The ANRs can happen on a completely static screen with only very occasional input. I suspected a memory leak but Scout says no.

marchbold commented 6 years ago

Concerned our ANEs are mentioned here, if this is related to us we will chase down the issue immediately, though I'm fairly confident it's not.

Is anyone having this issue without distriqt's ANEs?

neuronix commented 6 years ago

@marchbold we use Milkman ANEs, I don’t think it’s related to your ANEs in particular, my gut feeling is an issue in the Android/AIR input handling.

marchbold commented 6 years ago

Yeah, that's my feeling too, just was a little concerned when saw our extensions mentioned several times. We did actually do a thorough review of our ANEs to identify any potential causes when we encountered this issue in our apps and couldn't find anything.

Our largest app:

ANRs are mainly due to two causes:

Similar logs to the ones above.

We use Fyber Ads ANE

Crashes are mainly related to Fyber:

chichilatte commented 6 years ago

Our most common crashes over last 60 days:

dcampano commented 6 years ago

We are also seeing very similar ANR rates and similar causes:

Here are some stats on our app:

We also don't think that these ANRs are extension related. When we have had some other ANRs caused by extensions we were able to see a clear stacktrace to the extension code.

Do you have little or a lot of user inputs? Typically 1 or 2 taps every few seconds Do you have any input intense situations like tapping or dragging on the screen? No Do you have ANEs included in the app (notably ads) ? Answered above

darriuk commented 6 years ago

We're seeing ANRs and crashes above the bad behaviour thresholds in our app. • ANR rate: 0.73% • Crash rate: 1.47% • Air 23.0, Starling 1.8.1, Feathers 2.2

Do you have little or a lot of user inputs ? Not that much user input.

Do you have any input intense situations like tapping or dragging on the screen? Tapping, yes, but never really intense tapping. Also dragging the screen on some pages. ANRs have happened when neither tapping nor dragging was being done by the user, and no intense computation was being done by our app.

Do you have ANEs included in the app (notably ads) ? Yes, but no ads: • Milkman GoViral, RateBox • Distriqt – androidsupport.V4, Core, Firebase, Crash, GoogleIdentity, Notifications, playservices.Auth, playservices.Base, playservices.Identity, Resources • pl.mateuszmackowiak.nativeANE.properties.SystemProperties.ane

Typical ANR reports:

Input dispatching timed out (Waiting to send non-key event because the touched window has not finished processing certain input events that were delivered to it over 500.0ms ago

Input dispatching timed out (Waiting because the touched window has not finished processing the input events that were previously delivered to it.)

Typical crash traces:

java.lang.NullPointerException: 
  at com.adobe.air.AIRWindowSurfaceView.onKeyUp (AIRWindowSurfaceView.java:395)
  at android.view.KeyEvent.dispatch (KeyEvent.java:2907)
  at android.view.View.dispatchKeyEvent (View.java:7543)
java.lang.NullPointerException: 
  at com.adobe.air.AIRWindowSurfaceView.surfaceChanged (AIRWindowSurfaceView.java:735)
  at android.view.SurfaceView.updateWindow (SurfaceView.java:731)
  at android.view.SurfaceView.onWindowVisibilityChanged (SurfaceView.java:322)
Ankit-Adobe commented 6 years ago

Hi, Could you please a log a bug at http://tracker.adobe.com/ and attach a sample application where this issue is reproducible.

Thanks, Ankit | Adobe AIR Engineering

neuronix commented 6 years ago

Hi @Ankit-Adobe, Thanks for joining the conversation. There is already a bug in the tracker here: https://tracker.adobe.com/#/view/AIR-4198519

We have tried to reproduce the bug but without success so far. Could you give us more information on when/how this error is triggered? It always seems to originate from com.adobe.air.customHandler.callTimeoutFunction

cgascons commented 6 years ago

Hi guys, we are having the exact same errors on all our apps, we have also updated the forum topic (https://forum.starling-framework.org/topic/anrs-crash-rate-survey?replies=5#post-109884) and voted and commented on the tracker url provided.

ANR rate: 3.60% • Crash rate: 0.90% • Air 26.0, Starling 2.2 (2017-06-27), Feathers 3.1.2 (January 2017)

Thanks,

chichilatte commented 6 years ago

Just a note to say i made an edit to my original post above (the main ANR error was wrong)

PrimaryFeather commented 6 years ago

Just a thought (actually an idea of Robby Campano): could it be that the ANRs we are seeing are the result of users hitting the screen while the apps are restoring from a context loss? After all, this takes some time, and the textures are restored synchronously, i.e. all input will be blocked.

I could switch the texture restoration to being asynchronous, but it's been reported that this is not entirely reliable on some Android hardware. Could be worth a shot, though — this requires only a small change inside the AssetManager (I think). Then I could make this an "opt-in" flag.

Any comments?

cgascons commented 6 years ago

As a matter of fact I was just testing that, these are the results:

Turns out that, -when the user locks the device- the app has a harder time recovering from a context loss when the app is in foreground rather than in background. At least it took significantly more time on my device (Samsung Galaxy S6).

If I send the app to background, then lock the device, unlock it back, and bring it to foreground again the restoration is still visible, but it takes none to very short time.

When I try it the other way around (app was in foreground when the device got locked), while the app is restoring, I can still play around and click stuff when, according with what you just said, all the input should be blocked. I have to say though, that I can't really tell an exact restoration time for each example, since sometimes my device takes longer because, I don't know, it doesn't like my face :)

I'm not entirely sure if I'm missing something but I can definitely mess around with the app when recovering from the context loss. That said, I still haven't been able to reproduce the ANR itself on my phone while debugging.

PrimaryFeather commented 6 years ago

Thanks for the additional info, Christian!

I'm not entirely sure if I'm missing something but I can definitely mess around with the app when recovering from the context loss.

You're totally right: it depends on the size of the textures being restored. I imagine it takes a while loading a 4k*4k texture, for example; and during that time, input will be blocked. Loading several smaller ones should not be that much of an issue, because there's a gap for AS3 to process stuff between each upload process.

cgascons commented 6 years ago

I see, perhaps what I'm about to suggest isn't doable, but what if the whole Stage was left untouchable as soon as the restoration begins, and it gets back to touchable as soon as the TEXTURES_RESTORED event gets dispatched? Could that be an option? I understant that during that restoration time no ANR's could be generated since the touch events would not be triggered, right?

PrimaryFeather commented 6 years ago

Hm, no, that wouldn't help. Since it's a synchronous operation, all of AS3 stops accepting input — and that includes the mouse events of the classic Flash (which Starling otherwise queues). The only solution would be to avoid that synchronous (time consuming) operation.

Again, I don't know if that's really the issue — it's just a possibility.

neuronix commented 6 years ago

I do not think this is related to context loss. ANRs occur during gameplay periods according to our user reports. However it seems to occur after a period without user inputs:

Case 1: match 3 game, users report freezing often after waiting to think on a move (5s maybe) Case 2: timer based game, users report freezing always when about 4-5s left out of 15, so again probably a short pause between inputs.

PrimaryFeather commented 6 years ago

ANRs occur during gameplay periods according to our user reports.

Aha, that's useful information — thanks!

chichilatte commented 6 years ago

Update on Google Play Console showing unreliable figures for Broadcast of Intent { act=android.intent.action.SCREEN_OFF ... and Input dispatching timed out ANR errors: Google got back to us saying it's a bug in their system and the two errors are the same.

cgascons commented 6 years ago

@chichilatte Did Google say anything else in specific? I mean, did they really assume these ANR's are they fault? If so, did they give you an ETA on how to fix them? I don't quite get how this is Google's fault though, it would make sense if these ANR's only appeared on the last Android version, but at least on our case it's happening from Android 5.0 to 7.1 so how is this their fault? Any info would be really appreciated since all our games are starting to lose visibility because of this high ANR % index.

neuronix commented 6 years ago

@cgascons I think there is a misunderstanding here, the issue chichilatte is talking about is Google not merging two identical ANR reports, not that the ANRs are Google’s fault.

cgascons commented 6 years ago

Oh I see now, @neuronix you are totally right, sorry about it!

soccerob commented 6 years ago

Here's a thought, we tested creating 1 million objects at once, which causes our app to hang for about 5 seconds, and we're getting an ANR once the app starts responding again... as soon as GC kicks off. We pulled the logs from the android device and noticed the same line in the error log as we're seeing from the Google Play Dashboard. The following line is from our device log at the very bottom of the stack trace: "HeapTaskDaemon" daemon prio=5 tid=6 Blocked

Here's our code that allowed us to generate the ANR: var temp:Array = new Array(); for (var i:int = 0; i < 1000000; i++) { var s:Sprite = new Sprite(); temp.push( s ); }

neuronix commented 6 years ago

Awesome find @soccerob ! I hadn't thought of the GC, I only tried hanging the app with heavy calculations.

@Ankit-Adobe please check out this reproductible case!

soccerob commented 6 years ago

Although our app doesn't ever construct this many objects at once, the thought is that the ANR might still be caused during garbage collection. If that's the case, we might look into object pooling some of our more robust classes to be re-used and never be marked for garbage collection. We're running through Scout and noticed that a context loss also triggers a significant task for garbage collection. Still investigating... but would love some additional input on this.

chichilatte commented 6 years ago

Just checked our Google ANR reports for the last week: 83% have HeapTaskDaemon daemon prio=5 tid=6 Blocked as the last line in the stacktrace. The others have GC or GCDaemon in the stacktrace (usually the first line after main).

cgascons commented 6 years ago

We also have HeapTaskDaemon daemon prio=5 tid=6 Blocked at the end of some our ANR stacktraces, the others have Signal Catcher. Not sure what to think, we do know for sure though, that these ANRs started increasing right after publishing an update on our client which "theoretically" was meant to increase overall performance because there were lots of objects we were using in some classes that weren't being properly destroyed. Now that all of those are being properly flagged for collection, the GC seems to be having a hard time.

neuronix commented 6 years ago

Also have HeapTaskDaemon daemon prio=5 tid=6 Blocked in most ANR stack traces, hope @Ankit-Adobe can use this to implement a fix! :)

soccerob commented 6 years ago

@cgascons if that indeed caused your ANR rate to go up, this may very well be the culprit. ironic that by cleaning up your objects properly, you've now managed to cause a worse issue than when the objects were just leaking memory

cgascons commented 6 years ago

Indeed @soccerob, it's quite a shame. We are getting ready a new version with "better" GC management than before. What we are doing is attempting to "force-calling" the gc more often by making use of a very dirty hack we found around (even though we are not sure if it's still working). Basically some people claim that making this call:

try { new LocalConnection().connect( "foo" ); new LocalConnection().connect( "foo" ); } catch ( e:Error ) {}

forces the GC to kick in. We are about to create the release version with advanced telemetry enabled to test on Scout and make sure this still works.

If it does work, we will publish this new version hoping to decrease the ANR's until Adobe comes back with a solution for this.

neuronix commented 6 years ago

Does System.gc() not work? We also added gc calls in strategic end points to our app as a temporary workaround. In Scout I can see the gc spikes that match the calls in the flash version, haven't checked AIR yet though.

cgascons commented 6 years ago

Hi neuronix, as far as I know System.gc() only works on a Debugger

neuronix commented 6 years ago

Would have been to good.. Well there is this, but not guaranteed to actually trigger GC: https://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/flash/system/System.html#pauseForGCIfCollectionImminent()

cgascons commented 6 years ago

I'm pretty sure we used that call a long time ago. Quite ashamed to be honest I don't quite remember why we did stop using it.. Anyways, will give it a try as well and share it here if it also works fine and it's noticeable in Scout.

laFunkhh commented 6 years ago

Hi guys! I'm also affected by this ANRs.

I've been using System.pauseForGCIfCollectionImminent() since the beginning, I can't tell if it makes a difference. With Air + Flex SDK 26 and after making many performance improvements (reduced draw calls from ~150 to 20) I managed to drop the rate from 2.5% to 1.5%. In my latest build I tried using Air 28 with the ASC compiler and the ANRs jumped again to 2.5%.

Still clueless about what's the actual trigger for the ANRs.

cgascons commented 6 years ago

As of today I can also confirm that, after even adding some more performance tweaks and switching from System.gc() to System.pauseForGCIfCollectionImminent(), the ANR's are still present, always preceded by the GC error block.

chichilatte commented 6 years ago

I wonder, would it help if you did System.pauseForGCIfCollectionImminent() with a low figure, e.g. System.pauseForGCIfCollectionImminent(0.2)? It defaults is 0.75 otherwise. With a smaller number GC would happen more frequently and take less time. Course it all depends on how often you call the function.

neuronix commented 6 years ago

Indeed chicilatte, one should use a very small number. We'll be testing with System.pauseForGCIfCollectionImminent(0.1) and see how things go.

cgascons commented 6 years ago

Uhmm, I'm not quite sure if that's going to make a huge difference to be honest, but it's worth a try. It's just exasperating to be in this situation...

PrimaryFeather commented 6 years ago

I'm forwarding a question from Adobe:

Is it possible for [you] to share the details of the devices along with OS version where they are seeing ANR’s and Crashes?

Is this information available somewhere with the ANR stats?

neuronix commented 6 years ago

Here is an report from our Google Dev Console. BroadCast of intent is the same as the Input dispatch ANR.

anrs