Closed keyonghan closed 3 years ago
I started filing this and then I noticed every test failure seemed completely different. Which doesn't mean it shouldn't be investigated, but it's a weird one.
Also I ignored this because they really were infra failures: https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1367 https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1379 https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1390
Upload metrics failure: https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1366
Also I ignored this because they really were infra failures: https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1367 https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1379 https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1390
These are due to timeout on
Waiting for observatory port to be available
Upload metrics failure: https://ci.chromium.org/p/flutter/builders/prod/Mac_android%20microbenchmarks/1366
This was caused by cocoon new deployment propagating (multiple tests failed around that time).
These are due to timeout on
Waiting for observatory port to be available
I missed the blue test timeout, only saw the purple part.
Do we know why these failures don't include any logs from the device?
@bdero Could you take a crack at this one?
@zanderso Sure, I'll try to repro this on one of my devices.
+1 on https://github.com/flutter/flutter/issues/83939#issuecomment-854192435
So far (100 runs in) I haven't been able to reproduce this with a Pixel 3 XL. Could someone with access to the Moto G4s run the test a couple hundred times, observing ADB output with flutter logs
when it hangs? /cc @godofredoc, @keyonghan
pushd dev/benchmarks/microbenchmarks; ../../../bin/flutter run -v --profile --no-publish-port -d [device_id] lib/stocks/layout_bench.dart; popd
Otherwise, I can order a couple of aftermarket Moto G4s -- they're pretty cheap but it'll take a few days to ship. Might be best for me to get a hold of one either way.
The failure cases themselves have device ID variation:
Do we know why these failures don't include any logs from the device?
Logcat is starting, but we're only getting a beginning of main log and something about packagemanager for our process.
There are no other logs. It seems like the app is hanging and doing nothing and we just can't connect to the observatory.
@godofredoc @keyonghan - do we have any health checks that run on these devices? e.g. something that tells us how much free space they have, when their last reboot was?
We should add something to take an ADB screenshot if/when this times out too.
We also need to bump the execution timeout for these guys. We let the tests take up to 30 minutes before timing them out, but once that happens no subsequent steps are allowed to run.
https://github.com/flutter/infra/pull/467 increases the expiration time for these shard.
https://flutter-review.googlesource.com/c/recipes/+/14400 will take a screenshot after the test, which should help us know e.g. whether frames even rendered, whether we're stuck on some other system UI for some reason, etc.
There are no other logs. It seems like the app is hanging and doing nothing and we just can't connect to the observatory.
Although the are small logcat snaps with -t 1
before activity launches, it doesn't look like the log stream being read to collect the observatory URI as output by flutter at runtime is ever printed/forwarded to the test stdout -- otherwise there'd be a line similar to the following for each time the flutter activity runs, right?:
[+6318 ms] I/flutter ( 7514): Observatory listening on http://127.0.0.1:37037/epPy3QczJHY=/
Although there's one such line mysteriously present at the top of each test run's stdout regardless of the number of activity runs. Not sure what's going on there.
Since waiting on this stream synchronously blocks anyways, maybe we should toss a print in there just in case there's useful engine output that clues to the cause of hanging next time it flakes.
That linked code is to avoid printing out the enormous amount of logspam that comes from logcat. It's filtering based on our process ID.
We could try to make it more verbose if we really need to, but it's likely that it will just give us lots of system based garbage.
It may be better to instead have the recipe just get the full logcat and add it to the logs dir separately so we could analyze that, rather than clogging up what's meant to be application specific logging.
And the initial observatory line you're seeing is from the test running harness on the host itself, e.g.:
Executing: /opt/s/w/ir/x/w/recipe_cleanup/tmpjowsyt/flutter sdk/bin/cache/dart-sdk/bin/dart --disable-dart-dev --enable-vm-service=0 --no-pause-isolates-on-exit bin/tasks/microbenchmarks.dart in /opt/s/w/ir/x/w/recipe_cleanup/tmpjowsyt/flutter sdk/dev/devicelab with environment {BOT: true, LANG: en_US.UTF-8}
[microbenchmarks] [STDOUT] Observatory listening on http://127.0.0.1:52664/KB4dbBmnho4=/
So dart --disable-dart-dev --enable-vm-service=0 --no-pause-isolates-on-exit bin/tasks/microbenchmarks.dart
is causing that line to print. It's not coming from logcat.
We could try to make it more verbose if we really need to, but it's likely that it will just give us lots of system based garbage.
This seems essential for getting a grip on flakes like this one, as none of the streaming log output is making it to the tests at activity launch right now, and so AFAICT we don't actually get output produced by the flutter pid recorded anywhere during this moment.
+1 to possibly having completely unfiltered logcat output as a separate test artifact as well, that seems like a great idea and would probably also have the benefit of making many device resource contention issues more obvious (OOMs, FD limits, no disk space, etc). 👍
@bdero @dnfield did you take a look at #77193? it has a few repros and more logs. The "waiting for observatory to be open" happens across all the android tests.
Regarding the timeouts would it be possible to add a timeout to the code blocking on getting output from observatory? microbenchmarks is the reason we added timeouts to devicelab tests to prevent a single test holding the test bed for a very long time.
I do have an android phone I can try to repro, can you please share the commands to collect additional logs?
Thanks @godofredoc .
I can see in the logs there that we definitely filtered logs we shouldn't have.
It'd be good to always have the devicelab bots get all logs separately, and it'd also be good to try to get the tool to pull logs. Last time I was looking at this I tried ot make sure timelines always got uploaded, but I didn't go as far as the devicelogs. Filed https://github.com/flutter/flutter/issues/84000
The linked PR has been reviewed. We can hopefully get better logs now.
The linked PR was merged. Are we still tracking the reliability of this? Not sure how many times this has happened over the past week. Let's add more instances of failures here (now with better logs). Removing Zach's assignment as he is OOO.
This is one instance happened on 06/11: https://ci.chromium.org/ui/p/flutter/builders/prod/Mac_android%20microbenchmarks/1496/overview
The full(er) device log showed:
06-11 20:41:39.938 31115 31115 F libc : Fatal signal 11 (SIGSEGV), code 1, fault addr 0x4 in tid 31115 (microbenchmarks)
06-11 20:41:40.021 31143 31143 F DEBUG : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
06-11 20:41:40.021 31143 31143 F DEBUG : Build fingerprint: 'motorola/athene/athene:7.0/NPJS25.93-14-8.1-9/4:user/release-keys'
06-11 20:41:40.021 31143 31143 F DEBUG : Revision: 'p2a0'
06-11 20:41:40.021 31143 31143 F DEBUG : ABI: 'arm'
06-11 20:41:40.021 31143 31143 F DEBUG : pid: 31115, tid: 31115, name: microbenchmarks >>> com.yourcompany.microbenchmarks <<<
06-11 20:41:40.021 31143 31143 F DEBUG : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x4
06-11 20:41:40.021 31143 31143 F DEBUG : r0 ad1b3ebc r1 9a47a0c0 r2 9aa0eba1 r3 00000000
06-11 20:41:40.022 31143 31143 F DEBUG : r4 9c23d238 r5 00000007 r6 9c23d230 r7 aa85fe00
06-11 20:41:40.022 31143 31143 F DEBUG : r8 99e9ee36 r9 beb7d718 sl abb2e800 fp 9bff438c
06-11 20:41:40.022 31143 31143 F DEBUG : ip 9c170d6c sp b6bfbbd8 lr 9bcfd6b4 pc 9c0c07e8 cpsr 800f0030
06-11 20:41:40.022 31143 31143 F DEBUG :
06-11 20:41:40.022 31143 31143 F DEBUG : backtrace:
06-11 20:41:40.023 31143 31143 F DEBUG : #00 pc 003d17e8 /data/app/com.yourcompany.microbenchmarks-1/lib/arm/libflutter.so (offset 0x501000)
06-11 20:41:40.023 31143 31143 F DEBUG : #01 pc 0000e6b0 /data/app/com.yourcompany.microbenchmarks-1/lib/arm/libflutter.so (offset 0x501000)
@flutter-symbolizer-bot https://github.com/flutter/flutter/issues/83939#issuecomment-860892433
@flutter-symbolizer-bot engine#b807dfde90a1f9409ddc3a396635320ec345ff68 https://github.com/flutter/flutter/issues/83939#issuecomment-860892433
crash from https://github.com/flutter/flutter/issues/83939#issuecomment-860892433 symbolized using symbols for b807dfde90a1f9409ddc3a396635320ec345ff68
android-arm-release
#00 003d17e8 <...>/lib/arm/libflutter.so (offset 0x501000)
jpeg_idct_16x16
third_party/libjpeg-turbo/jidctint.c:2435:10
#01 0000e6b0 <...>/lib/arm/libflutter.so (offset 0x501000)
__sync_xor_and_fetch_4
/buildbot/src/android/gcc/toolchain/build/../gcc/gcc-4.9/libgcc/config/arm/linux-atomic.c:132:0
(Defaulted to release engine because build-id is unavailable or unreliable)
@flutter-symbolizer-bot engine#b807dfde90a1f9409ddc3a396635320ec345ff68 profile arm64 https://github.com/flutter/flutter/issues/83939#issuecomment-860892433
crash from https://github.com/flutter/flutter/issues/83939#issuecomment-860892433 symbolized using symbols for b807dfde90a1f9409ddc3a396635320ec345ff68
android-arm64-profile
#00 00000000003d17e8 <...>/lib/arm/libflutter.so (offset 0x501000)
dart::Zone::AllocUnsafe(long)
third_party/dart/runtime/vm/zone.h:219:5
long* dart::Zone::Alloc<long>(long)
third_party/dart/runtime/vm/zone.h:250:41
long* dart::Zone::Realloc<long>(long*, long, long)
third_party/dart/runtime/vm/zone.h:274:27
dart::BaseGrowableArray<long, dart::ZoneAllocated, dart::Zone>::Resize(long)
third_party/dart/runtime/platform/growable_array.h:235:30
dart::BaseGrowableArray<long, dart::ZoneAllocated, dart::Zone>::Add(long const&)
third_party/dart/runtime/platform/growable_array.h:85:5
dart::Library::LookupReExport(dart::String const&, dart::ZoneGrowableArray<long>*) const
third_party/dart/runtime/vm/object.cc:12508:10
#01 000000000000e6b0 <...>/lib/arm/libflutter.so (offset 0x501000)
dart::RuntimeEntry::RuntimeEntry(char const*, void (*)(dart::NativeArguments), long, bool, bool, bool)
third_party/dart/runtime/vm/runtime_entry.h:43:9
__cxx_global_var_init.57
third_party/dart/runtime/vm/runtime_entry.cc:247:1
_GLOBAL__sub_I_runtime_entry.cc
third_party/dart/runtime/vm/runtime_entry.cc:0:0
@a-siva @rmacnak-google @aam - looks like we got a segfault in the VM here. We don't have a consistent reproduction yet unfortunately.
@flutter-symbolizer-bot engine#b807dfde90a1f9409ddc3a396635320ec345ff68 profile arm https://github.com/flutter/flutter/issues/83939#issuecomment-860892433
crash from https://github.com/flutter/flutter/issues/83939#issuecomment-860892433 symbolized using symbols for b807dfde90a1f9409ddc3a396635320ec345ff68
android-arm-profile
#00 003d17e8 <...>/lib/arm/libflutter.so (offset 0x501000)
dart::CodePtr::CodePtr(dart::ObjectPtr const&)
third_party/dart/runtime/vm/tagged_pointer.h:330:1
dart::Code::ptr() const
third_party/dart/runtime/vm/object.h:6676:3
dart::Code::IsUnknownDartCode(dart::CodePtr)
third_party/dart/runtime/vm/object.cc:17348:47
dart::Code::PayloadSizeOf(dart::CodePtr)
third_party/dart/runtime/vm/object.h:6189:9
dart::Code::Size() const
third_party/dart/runtime/vm/object.h:6186:31
dart::StubCode::JumpToFrameSize()
third_party/dart/runtime/vm/stub_code.h:61:3
dart::StubCode::InJumpToFrameStub(unsigned int)
third_party/dart/runtime/vm/stub_code.cc:134:16
dart::Profiler::SampleThread(dart::Thread*, dart::InterruptedThreadState const&)
third_party/dart/runtime/vm/profiler.cc:1271:41
dart::ThreadInterrupterAndroid::ThreadInterruptSignalHandler(int, siginfo*, void*)
third_party/dart/runtime/vm/thread_interrupter_android.cc:48:5
#01 0000e6b0 <...>/lib/arm/libflutter.so (offset 0x501000)
__emutls_get_address
/buildbot/src/android/gcc/toolchain/build/../gcc/gcc-4.9/libgcc/emutls.c:199:0
Ahh were these arm32 devices? Sorry
Those are Moto G4, I believe they are arm64
You can see the actual ABI in the crash log, it is 32bit ARM
Oh, I suspect the profiler is racing with stub code initialization. After the stub code entries were changed to use handles, the check for stub code being initialized could return a false positive between handle allocation and storing into the handle.
Bumping to P2 to surface in triage.
In the VM team meeting this morning, Ryan mentioned he is going to be working on a speculative fix.
Update from VM team: Ryan understands the problem and is looking into prospective fixes.
huawei nova6 and lenovo moba also crash
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x4
2.0.6 2.2.2 stable and master ,all crash!
Marking this as fixed after https://github.com/dart-lang/sdk/commit/7ebd051a06184069d3e8f97979be454e3af6f8e5.
This thread has been automatically locked since there has not been any recent activity after it was closed. If you are still experiencing a similar issue, please open a new bug, including the output of flutter doctor -v
and a minimal reproduction of the issue.
The post-submit test builder mac_android_microbenchmarks had a flaky ratio 3.24% for the past 15 days, which is above our 2% threshold.
The test shows infra failure, however it is indeed test failures (https://github.com/flutter/flutter/issues/77193)
Please follow https://github.com/flutter/flutter/wiki/Reducing-Test-Flakiness#fixing-flaky-tests to fix the flakiness and enable the test back after validating the fix (internal dashboard to validate: go/flutter_test_flakiness).
(Creating this issue from infra team after the gardener rotation's one-time scan.)