coronalabs / corona

Solar2D Game Engine main repository (ex Corona SDK)
https://solar2d.com/
MIT License
2.49k stars 267 forks source link

Android: Runtime bug - ANR in com.ansca.corona.Controller - new insight from Google Play #643

Open appfamily opened 10 months ago

appfamily commented 10 months ago

Describe the bug Play Store has been reporting ANRs deep inside Solar2d for years. It has been discussed in forums, such as here but without conclusive results.

There is a new built-in analysis of ANRs in the Play Store and perhaps it can shed light on the root cause? See below

To Reproduce Steps to reproduce the behavior:

Target platform and where build was made:

Additional context image

whole test project or build.settings Available on demand

vegasfun commented 10 months ago

Similar issue: ANR-img-1 ANR-img-2

pouwelsjochem commented 10 months ago

Not sure if it's 100% related, but definitely could be: https://discuss.cocos2d-x.org/t/is-there-any-solution-of-anr/43080.

pouwelsjochem commented 10 months ago

Another interesting topic: https://issuetracker.google.com/issues/263307511 it's from a native Android app, but a bit below is an example project can be found & a video showing what happens when the ANR pops up

clang-clang-clang commented 9 months ago

I have the demo, steps, and screen recording to reproduce this ANR. @Shchvova

Use the Lua code as below or modified sample - Fishies to reproduction the Controller.stop() ANR with Solar2D 3699:

  --[[
      This is a demonstration for reproducing an ANR issue with
      Android Controller.stop()/start().
      Triggering the Android lifecycle while Lua is processing heavy 
      work (as shown below) can replicate the problem.
  ]]

  local sum = 0
  local co = coroutine.create(function ()
      for i = 1, 100000 do

          sum = sum + i

          local start_time = os.clock()
          while os.clock() - start_time < 0.1  do
          end

          coroutine.yield(sum)
      end
  end)

  timer.performWithDelay(10, function()
      for i = 1, 100 do
          if co and coroutine.status(co) ~= "dead" then
              coroutine.resume(co)
          end
      end
  end, 0)

The steps are as follows (Using the timeline of screen recording):

  1. 00:00:01 Start by testing the app as usual.
  2. 00:00:13 Enter the game scenes. Due to heavy computation, there is only one frame update at 22 and 32 seconds, which is not significant.
  3. 00:00:33 Exit or kill the app.
  4. 00:00:35 Manually restart the app from the desktop. Android Studio's first re-run won't reproduce this ANR.
  5. 00:00:37 Lock the screen by pressing the power button before fully entering the game scenes. Wait for at least 5 seconds.
  6. 00:00:44 Unlock the screen.
  7. 00:00:45 Seeing the ANR pop-up, use adb bugreport to retrieve it from the phone. This reproduced Controller.stop() ANR.
  8. 00:00:50 Exit or kill the app.
  9. 00:01:00 Manually restart the app from the desktop.
  10. 00:01:01 Exit to the desktop before fully entering the game scenes.
  11. 00:01:04 Re-enter the app.
  12. 00:01:06 Lock the screen by pressing the power button before fully entering the game scenes. Wait for at least 5 seconds.
  13. 00:01:16 See whether the ANR pop-up appears or not, as the app may have been killed by the system. This reproduced Controller.start() ANR (sometimes together with Controller.stop() ANR).

The screen recording:

https://github.com/coronalabs/corona/assets/53489032/23696249-eb74-4277-b242-061214caf791

clang-clang-clang commented 9 months ago

I tested 3699 with a simple patch (either a draft or proof of concept) and the ANR disappeared. However, I'm not familiar with the entire locking process, so I'm not sure about potential effects.

platform/android/sdk/src/com/ansca/corona/graphics/opengl/GLSurfaceView.java

        public void surfaceDestroyed() {
            synchronized(sGLThreadManager) {
                if (LOG_THREADS) {
                    Log.i("GLThread", "surfaceDestroyed tid=" + getId());
                }
                mHasSurface = false;
                sGLThreadManager.notifyAll();
-                while((!mWaitingForSurface) && (!mExited)) {
-                    try {
-                        sGLThreadManager.wait();
-                    } catch (InterruptedException e) {
-                        Thread.currentThread().interrupt();
-                    }
-                }
+//                while((!mWaitingForSurface) && (!mExited)) {
+//                    try {
+//                        sGLThreadManager.wait();
+//                    } catch (InterruptedException e) {
+//                        Thread.currentThread().interrupt();
+//                    }
+//                }
            }
        }
platform/android/sdk/src/com/ansca/corona/CoronaActivity.java

    private void requestResumeCoronaRuntime() {
...
        // Start/resume the Corona runtime.
-       fController.start();
+       new Thread(fController::start).start();

...
    private void requestSuspendCoronaRuntime() {
        // Suspend the Corona runtime.
        if (fController != null) {
-           fController.stop();
+           new Thread(fController::stop).start();
        }

I hope the information helpful. Thank you.

pouwelsjochem commented 9 months ago

Interesting @clang-clang-clang, is there a reason why you've use coroutines for the example? Wouldn't the ANR trigger otherwise? That would explain why not everyone is seeing the high ANR rate.

I'm not quite sure this is the right solution though. With my limited knowledge of locking in Java, simply removing the wait in surfaceDestroyed() could result in possible crashes in other methods which think everything is correctly destroyed, while the "destroy" is not actually finished yet.

I did spend some of last week on the ANR issue as well, and have actually come up with a different solution by looking at the libgdx library which also uses Android's GLSurfaceView.java. They had similar issues on the onPause state and avoid it by using two different methods.

  1. Killing the app when onPause takes longer than 4000ms (ANR triggers at 5000ms)
  2. Using an optimised updateRuntimeState() where not everything is locked/synchronized all of the time, just the moments when controller is actually used.

Start of last weekend I've integrated these solutions in my own fork of Solar2D (https://github.com/pouwelsjochem/solar2d/blob/40f0134d8ee7447373d0900257909f36fd863ba3/platform/android/sdk/src/com/ansca/corona/Controller.java#L222) and slowly rolled that out to my users. Our ANR rate of this last build decreased by roughly 90%, from ~1% to ~0.1% (over 14K sessions in the last 24 hours). I believe Vlad is going to incorporate my changes into the main repo soon.

clang-clang-clang commented 9 months ago

Great news. Glad to hear there's a solution.

Since I can't reproduce the online situation, I can only approximate the ANR by simulating the calculated pressure. Just keeping the GLThread busy and triggering the lifecycle can cause an ANR.

As for the 'patch' (it's a draft or proof of concept), it can't been production yet. I suspect that the sGLThreadManager also has a locking issue.