flutter / flutter

Flutter makes it easy and fast to build beautiful apps for mobile and beyond
https://flutter.dev
BSD 3-Clause "New" or "Revised" License
165.15k stars 27.23k forks source link

Mac_android microbenchmarks is 3.24% flaky #83939

Closed keyonghan closed 3 years ago

keyonghan commented 3 years ago

The post-submit test builder mac_android_microbenchmarks had a flaky ratio 3.24% for the past 15 days, which is above our 2% threshold.

The test shows infra failure, however it is indeed test failures (https://github.com/flutter/flutter/issues/77193)

Please follow https://github.com/flutter/flutter/wiki/Reducing-Test-Flakiness#fixing-flaky-tests to fix the flakiness and enable the test back after validating the fix (internal dashboard to validate: go/flutter_test_flakiness).

(Creating this issue from infra team after the gardener rotation's one-time scan.)

jmagman commented 3 years ago

I started filing this and then I noticed every test failure seemed completely different. Which doesn't mean it shouldn't be investigated, but it's a weird one.

jmagman commented 3 years ago

Also I ignored this because they really were infra failures: https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1367 https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1379 https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1390

Upload metrics failure: https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1366

keyonghan commented 3 years ago

Also I ignored this because they really were infra failures: https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1367 https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1379 https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1390

These are due to timeout on Waiting for observatory port to be available

Upload metrics failure: https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1366

This was caused by cocoon new deployment propagating (multiple tests failed around that time).

jmagman commented 3 years ago

These are due to timeout on Waiting for observatory port to be available

I missed the blue test timeout, only saw the purple part.

zanderso commented 3 years ago

Do we know why these failures don't include any logs from the device?

zanderso commented 3 years ago

@bdero Could you take a crack at this one?

bdero commented 3 years ago

@zanderso Sure, I'll try to repro this on one of my devices.

bdero commented 3 years ago

+1 on https://github.com/flutter/flutter/issues/83939#issuecomment-854192435

So far (100 runs in) I haven't been able to reproduce this with a Pixel 3 XL. Could someone with access to the Moto G4s run the test a couple hundred times, observing ADB output with flutter logs when it hangs? /cc @godofredoc, @keyonghan

pushd dev/benchmarks/microbenchmarks; ../../../bin/flutter run -v --profile --no-publish-port -d [device_id] lib/stocks/layout_bench.dart; popd

Otherwise, I can order a couple of aftermarket Moto G4s -- they're pretty cheap but it'll take a few days to ship. Might be best for me to get a hold of one either way.

The failure cases themselves have device ID variation:

dnfield commented 3 years ago

Do we know why these failures don't include any logs from the device?

Logcat is starting, but we're only getting a beginning of main log and something about packagemanager for our process.

There are no other logs. It seems like the app is hanging and doing nothing and we just can't connect to the observatory.

dnfield commented 3 years ago

@godofredoc @keyonghan - do we have any health checks that run on these devices? e.g. something that tells us how much free space they have, when their last reboot was?

We should add something to take an ADB screenshot if/when this times out too.

dnfield commented 3 years ago

We also need to bump the execution timeout for these guys. We let the tests take up to 30 minutes before timing them out, but once that happens no subsequent steps are allowed to run.

dnfield commented 3 years ago

https://github.com/flutter/infra/pull/467 increases the expiration time for these shard.

https://flutter-review.googlesource.com/c/recipes/+/14400 will take a screenshot after the test, which should help us know e.g. whether frames even rendered, whether we're stuck on some other system UI for some reason, etc.

bdero commented 3 years ago

There are no other logs. It seems like the app is hanging and doing nothing and we just can't connect to the observatory.

Although the are small logcat snaps with -t 1 before activity launches, it doesn't look like the log stream being read to collect the observatory URI as output by flutter at runtime is ever printed/forwarded to the test stdout -- otherwise there'd be a line similar to the following for each time the flutter activity runs, right?:

[+6318 ms] I/flutter ( 7514): Observatory listening on http://127.0.0.1:37037/epPy3QczJHY=/

Although there's one such line mysteriously present at the top of each test run's stdout regardless of the number of activity runs. Not sure what's going on there.

Since waiting on this stream synchronously blocks anyways, maybe we should toss a print in there just in case there's useful engine output that clues to the cause of hanging next time it flakes.

dnfield commented 3 years ago

That linked code is to avoid printing out the enormous amount of logspam that comes from logcat. It's filtering based on our process ID.

We could try to make it more verbose if we really need to, but it's likely that it will just give us lots of system based garbage.

It may be better to instead have the recipe just get the full logcat and add it to the logs dir separately so we could analyze that, rather than clogging up what's meant to be application specific logging.

dnfield commented 3 years ago

And the initial observatory line you're seeing is from the test running harness on the host itself, e.g.:

Executing: /opt/s/w/ir/x/w/recipe_cleanup/tmpjowsyt/flutter sdk/bin/cache/dart-sdk/bin/dart --disable-dart-dev --enable-vm-service=0 --no-pause-isolates-on-exit bin/tasks/microbenchmarks.dart in /opt/s/w/ir/x/w/recipe_cleanup/tmpjowsyt/flutter sdk/dev/devicelab with environment {BOT: true, LANG: en_US.UTF-8}
 [microbenchmarks] [STDOUT] Observatory listening on http://127.0.0.1:52664/KB4dbBmnho4=/

So dart --disable-dart-dev --enable-vm-service=0 --no-pause-isolates-on-exit bin/tasks/microbenchmarks.dart is causing that line to print. It's not coming from logcat.

bdero commented 3 years ago

We could try to make it more verbose if we really need to, but it's likely that it will just give us lots of system based garbage.

This seems essential for getting a grip on flakes like this one, as none of the streaming log output is making it to the tests at activity launch right now, and so AFAICT we don't actually get output produced by the flutter pid recorded anywhere during this moment.

+1 to possibly having completely unfiltered logcat output as a separate test artifact as well, that seems like a great idea and would probably also have the benefit of making many device resource contention issues more obvious (OOMs, FD limits, no disk space, etc). 👍

godofredoc commented 3 years ago

@bdero @dnfield did you take a look at #77193? it has a few repros and more logs. The "waiting for observatory to be open" happens across all the android tests.

Regarding the timeouts would it be possible to add a timeout to the code blocking on getting output from observatory? microbenchmarks is the reason we added timeouts to devicelab tests to prevent a single test holding the test bed for a very long time.

I do have an android phone I can try to repro, can you please share the commands to collect additional logs?

dnfield commented 3 years ago

Thanks @godofredoc .

I can see in the logs there that we definitely filtered logs we shouldn't have.

It'd be good to always have the devicelab bots get all logs separately, and it'd also be good to try to get the tool to pull logs. Last time I was looking at this I tried ot make sure timelines always got uploaded, but I didn't go as far as the devicelogs. Filed https://github.com/flutter/flutter/issues/84000

chinmaygarde commented 3 years ago

The linked PR has been reviewed. We can hopefully get better logs now.

chinmaygarde commented 3 years ago

The linked PR was merged. Are we still tracking the reliability of this? Not sure how many times this has happened over the past week. Let's add more instances of failures here (now with better logs). Removing Zach's assignment as he is OOO.

keyonghan commented 3 years ago

This is one instance happened on 06/11: https://ci.chromium.org/ui/p/flutter/builders/prod/Mac_android%20microbenchmarks/1496/overview

dnfield commented 3 years ago

The full(er) device log showed:

06-11 20:41:39.938 31115 31115 F libc    : Fatal signal 11 (SIGSEGV), code 1, fault addr 0x4 in tid 31115 (microbenchmarks)
06-11 20:41:40.021 31143 31143 F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
06-11 20:41:40.021 31143 31143 F DEBUG   : Build fingerprint: 'motorola/athene/athene:7.0/NPJS25.93-14-8.1-9/4:user/release-keys'
06-11 20:41:40.021 31143 31143 F DEBUG   : Revision: 'p2a0'
06-11 20:41:40.021 31143 31143 F DEBUG   : ABI: 'arm'
06-11 20:41:40.021 31143 31143 F DEBUG   : pid: 31115, tid: 31115, name: microbenchmarks  >>> com.yourcompany.microbenchmarks <<<
06-11 20:41:40.021 31143 31143 F DEBUG   : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x4
06-11 20:41:40.021 31143 31143 F DEBUG   :     r0 ad1b3ebc  r1 9a47a0c0  r2 9aa0eba1  r3 00000000
06-11 20:41:40.022 31143 31143 F DEBUG   :     r4 9c23d238  r5 00000007  r6 9c23d230  r7 aa85fe00
06-11 20:41:40.022 31143 31143 F DEBUG   :     r8 99e9ee36  r9 beb7d718  sl abb2e800  fp 9bff438c
06-11 20:41:40.022 31143 31143 F DEBUG   :     ip 9c170d6c  sp b6bfbbd8  lr 9bcfd6b4  pc 9c0c07e8  cpsr 800f0030
06-11 20:41:40.022 31143 31143 F DEBUG   : 
06-11 20:41:40.022 31143 31143 F DEBUG   : backtrace:
06-11 20:41:40.023 31143 31143 F DEBUG   :     #00 pc 003d17e8  /data/app/com.yourcompany.microbenchmarks-1/lib/arm/libflutter.so (offset 0x501000)
06-11 20:41:40.023 31143 31143 F DEBUG   :     #01 pc 0000e6b0  /data/app/com.yourcompany.microbenchmarks-1/lib/arm/libflutter.so (offset 0x501000)
dnfield commented 3 years ago

@flutter-symbolizer-bot https://github.com/flutter/flutter/issues/83939#issuecomment-860892433

flutter-symbolizer-bot commented 3 years ago
There were failures symbolizing some of the crashes I found When processing https://github.com/flutter/flutter/issues/83939#issuecomment-860892433 I found crash ``` Crash(engineVariant: EngineVariant(os: android, arch: arm, mode: null), frames: [CrashFrame.android(no: 00, pc: 4003816, binary: /data/app/com.yourcompany.microbenchmarks-1/lib/arm/libflutter.so, rest: (offset 0x501000), buildId: null), CrashFrame.android(no: 01, pc: 59056, binary: /data/app/com.yourcompany.microbenchmarks-1/lib/arm/libflutter.so, rest: (offset 0x501000), buildId: null)], format: native, androidMajorVersion: 7) ``` but failed to symbolize it with the following notes: * Unknown engine hash See [my documentation](https://github.com/flutter-symbolizer-bot/flutter-symbolizer-bot/blob/main/README.md#commands) for more details.
dnfield commented 3 years ago

@flutter-symbolizer-bot engine#b807dfde90a1f9409ddc3a396635320ec345ff68 https://github.com/flutter/flutter/issues/83939#issuecomment-860892433

flutter-symbolizer-bot commented 3 years ago

crash from https://github.com/flutter/flutter/issues/83939#issuecomment-860892433 symbolized using symbols for b807dfde90a1f9409ddc3a396635320ec345ff68 android-arm-release

#00 003d17e8 <...>/lib/arm/libflutter.so (offset 0x501000)
                                         jpeg_idct_16x16
                                         third_party/libjpeg-turbo/jidctint.c:2435:10
#01 0000e6b0 <...>/lib/arm/libflutter.so (offset 0x501000)
                                         __sync_xor_and_fetch_4
                                         /buildbot/src/android/gcc/toolchain/build/../gcc/gcc-4.9/libgcc/config/arm/linux-atomic.c:132:0

(Defaulted to release engine because build-id is unavailable or unreliable)

dnfield commented 3 years ago

@flutter-symbolizer-bot engine#b807dfde90a1f9409ddc3a396635320ec345ff68 profile arm64 https://github.com/flutter/flutter/issues/83939#issuecomment-860892433

flutter-symbolizer-bot commented 3 years ago

crash from https://github.com/flutter/flutter/issues/83939#issuecomment-860892433 symbolized using symbols for b807dfde90a1f9409ddc3a396635320ec345ff68 android-arm64-profile

#00 00000000003d17e8 <...>/lib/arm/libflutter.so (offset 0x501000)
                                                 dart::Zone::AllocUnsafe(long)
                                                 third_party/dart/runtime/vm/zone.h:219:5
                                                 long* dart::Zone::Alloc<long>(long)
                                                 third_party/dart/runtime/vm/zone.h:250:41
                                                 long* dart::Zone::Realloc<long>(long*, long, long)
                                                 third_party/dart/runtime/vm/zone.h:274:27
                                                 dart::BaseGrowableArray<long, dart::ZoneAllocated, dart::Zone>::Resize(long)
                                                 third_party/dart/runtime/platform/growable_array.h:235:30
                                                 dart::BaseGrowableArray<long, dart::ZoneAllocated, dart::Zone>::Add(long const&)
                                                 third_party/dart/runtime/platform/growable_array.h:85:5
                                                 dart::Library::LookupReExport(dart::String const&, dart::ZoneGrowableArray<long>*) const
                                                 third_party/dart/runtime/vm/object.cc:12508:10
#01 000000000000e6b0 <...>/lib/arm/libflutter.so (offset 0x501000)
                                                 dart::RuntimeEntry::RuntimeEntry(char const*, void (*)(dart::NativeArguments), long, bool, bool, bool)
                                                 third_party/dart/runtime/vm/runtime_entry.h:43:9
                                                 __cxx_global_var_init.57
                                                 third_party/dart/runtime/vm/runtime_entry.cc:247:1
                                                 _GLOBAL__sub_I_runtime_entry.cc
                                                 third_party/dart/runtime/vm/runtime_entry.cc:0:0
dnfield commented 3 years ago

@a-siva @rmacnak-google @aam - looks like we got a segfault in the VM here. We don't have a consistent reproduction yet unfortunately.

mraleph commented 3 years ago

@flutter-symbolizer-bot engine#b807dfde90a1f9409ddc3a396635320ec345ff68 profile arm https://github.com/flutter/flutter/issues/83939#issuecomment-860892433

flutter-symbolizer-bot commented 3 years ago

crash from https://github.com/flutter/flutter/issues/83939#issuecomment-860892433 symbolized using symbols for b807dfde90a1f9409ddc3a396635320ec345ff68 android-arm-profile

#00 003d17e8 <...>/lib/arm/libflutter.so (offset 0x501000)
                                         dart::CodePtr::CodePtr(dart::ObjectPtr const&)
                                         third_party/dart/runtime/vm/tagged_pointer.h:330:1
                                         dart::Code::ptr() const
                                         third_party/dart/runtime/vm/object.h:6676:3
                                         dart::Code::IsUnknownDartCode(dart::CodePtr)
                                         third_party/dart/runtime/vm/object.cc:17348:47
                                         dart::Code::PayloadSizeOf(dart::CodePtr)
                                         third_party/dart/runtime/vm/object.h:6189:9
                                         dart::Code::Size() const
                                         third_party/dart/runtime/vm/object.h:6186:31
                                         dart::StubCode::JumpToFrameSize()
                                         third_party/dart/runtime/vm/stub_code.h:61:3
                                         dart::StubCode::InJumpToFrameStub(unsigned int)
                                         third_party/dart/runtime/vm/stub_code.cc:134:16
                                         dart::Profiler::SampleThread(dart::Thread*, dart::InterruptedThreadState const&)
                                         third_party/dart/runtime/vm/profiler.cc:1271:41
                                         dart::ThreadInterrupterAndroid::ThreadInterruptSignalHandler(int, siginfo*, void*)
                                         third_party/dart/runtime/vm/thread_interrupter_android.cc:48:5
#01 0000e6b0 <...>/lib/arm/libflutter.so (offset 0x501000)
                                         __emutls_get_address
                                         /buildbot/src/android/gcc/toolchain/build/../gcc/gcc-4.9/libgcc/emutls.c:199:0
dnfield commented 3 years ago

Ahh were these arm32 devices? Sorry

godofredoc commented 3 years ago

Those are Moto G4, I believe they are arm64

mraleph commented 3 years ago

You can see the actual ABI in the crash log, it is 32bit ARM

rmacnak-google commented 3 years ago

Oh, I suspect the profiler is racing with stub code initialization. After the stub code entries were changed to use handles, the check for stub code being initialized could return a false positive between handle allocation and storing into the handle.

dnfield commented 3 years ago

Another one here: https://storage.googleapis.com/flutter_logs/flutter/3794884fbbac70c94813d787a9dff7114cfe28ce/microbenchmarks/c52ef745-9eba-4c36-b5e0-561aa8cad662/ZY223VMBPX.log

zanderso commented 3 years ago

Bumping to P2 to surface in triage.

chinmaygarde commented 3 years ago

In the VM team meeting this morning, Ryan mentioned he is going to be working on a speculative fix.

chinmaygarde commented 3 years ago

Update from VM team: Ryan understands the problem and is looking into prospective fixes.

ghostgzt commented 3 years ago

huawei nova6 and lenovo moba also crash

signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x4

2.0.6 2.2.2 stable and master ,all crash!

chinmaygarde commented 3 years ago

Marking this as fixed after https://github.com/dart-lang/sdk/commit/7ebd051a06184069d3e8f97979be454e3af6f8e5.

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of flutter doctor -v and a minimal reproduction of the issue.