dotnet / xharness

C# command line tool for running tests on Android / iOS / tvOS devices and simulators
MIT License
155 stars 49 forks source link

Install / Uninstall apk Sometimes Fails on Android #354

Closed steveisok closed 3 years ago

steveisok commented 3 years ago

There are some runs indicated in https://github.com/dotnet/runtime/issues/44306 that show either the apk could not be installed or removed. Good examples of this are in:

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-master-0d80c385864846a9b0/JIT/console.17d80b24.log?sv=2019-07-07&se=2020-11-22T18%3A21%3A05Z&sr=c&sp=rl&sig=7poXAdaraL1BwYAj7GxDrIhZLK%2F9Uo6PZg8P022tEmM%3D

    [18:48:43] info: Attempting to remove apk 'net.dot.JIT_SIMD': 
      [18:48:43] fail: Error: Exit code: 224
                       Standard Output:
                       cmd: Failure calling service package: Broken pipe (32)

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-43740-merge-1f95bf9c506d4bcb9e/JIT/console.8114e152.log?%3F%253F%25253Fsv%25253D2019-07-07%252526se%25253D2020-11-23T13%2525253A30%2525253A16Z%252526sr%25253Dc%252526sp%25253Drl%252526sig%25253Dyr2NQCCzTeXQKU3kTZHet06tu32Jnp1%2525252B6y3AUwdMIzM%2525253D

My guess is that adb is somehow corrupt and either resolves itself via workitem completion or an internal timeout. We should look into capturing the error state or states and try to correct it. We should try to avoid failing the tests the first time.

akoeplinger commented 3 years ago

This is most likely the adb server process getting wrecked somehow. We might be able to fix it by running adb kill-server and adb start-server to reboot it.

akoeplinger commented 3 years ago

/cc @MattGal

tkapin commented 3 years ago

@MattGal do you want to look into this or you'd prefer us to investigate? Cheers!

MattGal commented 3 years ago

I'm happy to have you all investigate... my thoughts here is if adb install fails it should perhaps time out earlier (5 minutes is the default in Xharness) and retry, requesting an infra retry and machine reboot if it fails. If you need me to be involved though I can be.

akoeplinger commented 3 years ago

@MattGal in this case there's no timeout, adb fails immediately with the "Broken pipe" error (see the timestamps in the log).

MattGal commented 3 years ago

Merged PR that does a single retry here, and moved the issue to validate. If this doesn't make the problem go away I want to teach XHarness how to ask for reboot / retry when it detects it's running in Helix.

steveisok commented 3 years ago

@MattGal I have noticed that there are other, seemingly similar failures with different exit codes.

See https://github.com/dotnet/runtime/issues/42548#issuecomment-729597494 for the latest example.

What does xharness exit code 78 mean? I've found multiple rows in kusto with that exit code.

premun commented 3 years ago

@steveisok 78 seems to be PACKAGE_INSTALLATION_FAILURE

https://github.com/dotnet/xharness/blob/master/src/Microsoft.DotNet.XHarness.Common/CLI/ExitCode.cs

MattGal commented 3 years ago

I'm happy to have you all investigate... my thoughts here is if adb install fails it should perhaps time out earlier (5 minutes is the default in Xharness) and retry, requesting an infra retry and machine reboot if it fails. If you need me to be involved though I can be.

This failure mode is not what I made a PR for, is not the log at the top, but it is the one I was talking about in this comment. I will investigate after I finish my inbox this morning, but my sense here is if adb uninstall fails, the emulator / device is in a broken state and this is the case where an infra retry / machine reboot makes good sense.

mattleibow commented 3 years ago

This may or may not be related, but it also seems to be that it tries to install to early.

I have a script that creates and boots the emulator, and then runs xharness. It has been failing on CI and locally (sometimes). However, if I boot up the emulator and then wait a bit, it installs packages just fine. It is also deceiving in that subsequent boots seem to work, but this is due to the fact that it is just restoring the emulator from a save point.

If I also wait for the boot to complete using adb shell getprop sys.boot_completed, then it also works.

dbug: Executing command: 'C:\Users\mattl\.dotnet\tools\.store\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20575.2\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20575.2\runtimes\any\native\adb\windows\adb.exe -s emulator-5554 wait-for-device'
dbug:
dbug: Executing command: 'C:\Users\mattl\.dotnet\tools\.store\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20575.2\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20575.2\runtimes\any\native\adb\windows\adb.exe -s emulator-5554 logcat -c'
dbug: Executing command: 'C:\Users\mattl\.dotnet\tools\.store\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20575.2\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20575.2\runtimes\any\native\adb\windows\adb.exe -s emulator-5554 version'
dbug: Working with Android Debug Bridge version 1.0.41
      Version 29.0.6-6198805
      Installed as C:\Users\mattl\.dotnet\tools\.store\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20575.2\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20575.2\runtimes\any\native\adb\windows\adb.exe

info: Attempting to remove apk 'com.microsoft.azure.devicetests':
dbug: Executing command: 'C:\Users\mattl\.dotnet\tools\.store\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20575.2\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20575.2\runtimes\any\native\adb\windows\adb.exe -s emulator-5554 uninstall com.microsoft.azure.devicetests'
fail: Error: Exit code: 20
      Standard Output:

      Standard Error:

info: Attempting to install C:/Projects/azure-mobile-apps-net-client/e2etests/DeviceTests.Android/bin/Debug/com.microsoft.azure.devicetests-Signed.apk:
dbug: Executing command: 'C:\Users\mattl\.dotnet\tools\.store\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20575.2\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20575.2\runtimes\any\native\adb\windows\adb.exe -s emulator-5554 install "C:/Projects/azure-mobile-apps-net-client/e2etests/DeviceTests.Android/bin/Debug/com.microsoft.azure.devicetests-Signed.apk"'
fail: Error:
      Exit code: 1
      Standard Output:
      Performing Streamed Install

      Standard Error:
      Performing Streamed Install

crit: Install failure: Test command cannot continue
XHarness exit code: 78
akoeplinger commented 3 years ago

@mattleibow interesting, I'd have thought wait-for-device would only return once the device is ready ¯_(ツ)_/¯

I guess we can add adb shell getprop sys.boot_completed to make sure.

akoeplinger commented 3 years ago

or maybe add -delay-adb to the emulator command line: https://androidstudio.googleblog.com/2019/05/emulator-2907-canary.html ?

btw. when you say "it has been failing on CI" you don't mean the Helix queues right? the emulator should always be running there.

mattleibow commented 3 years ago

"it has been failing on CI" refers to the default Azure hosted bots. I have to manually download, install, create and then spin them up.

akoeplinger commented 3 years ago

So looks like the https://github.com/dotnet/xharness/pull/388 is working for fixing the issue we saw during apk installation on some of the physical machines, I saw one machine got autofixed with the new xharness by rebooting the devices:

C:\h\w\A40208F8\w\AD510979\e>dotnet exec "C:\h\w\A40208F8\p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.20601.1/tools/netcoreapp3.1/any/Microsoft.DotNet.XHarness.CLI.dll" android test --app "Microsoft.Extensions.FileProviders.Physical.Tests.apk" --output-directory "C:\h\w\A40208F8\w\AD510979\uploads" --timeout 1800 -p="net.dot.Microsoft.Extensions.FileProviders.Physical.Tests" -v -i=net.dot.MonoRunner  
XHarness command issued: android test --app Microsoft.Extensions.FileProviders.Physical.Tests.apk --output-directory C:\h\w\A40208F8\w\AD510979\uploads --timeout 1800 -p=net.dot.Microsoft.Extensions.FileProviders.Physical.Tests -v -i=net.dot.MonoRunner
[05:17:02] dbug: Android Test command called: App = C:\h\w\A40208F8\w\AD510979\e\Microsoft.Extensions.FileProviders.Physical.Tests.apk
                 Instrumentation Name = net.dot.MonoRunner
[05:17:02] dbug: Output Directory:C:\h\w\A40208F8\w\AD510979\uploads
                 Timeout = 1800 seconds.
[05:17:02] dbug: Arguments to instrumentation:
[05:17:02] dbug: ADBRunner using ADB.exe supplied from C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\tools\netcoreapp3.1\any\..\..\..\runtimes\any\native\adb\windows\adb.exe
[05:17:02] dbug: Full resolved path:'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe'
[05:17:02] info: Will attempt to run device on detected architecture: 'arm64-v8a'
[05:17:02] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe  start-server'
[05:17:02] dbug: 
[05:17:02] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe  devices -l'
[05:17:02] dbug: Evaluating output line for device serial: 017AY1S22P             device product:sargo model:Pixel_3a device:sargo transport_id:1
[05:17:02] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe  -s 017AY1S22P shell getprop ro.product.cpu.abi'
[05:17:03] info: Using first-found compatible device of 1 total- serial: '017AY1S22P' - Arch: arm64-v8a
[05:17:03] info: Active Android device set to serial '017AY1S22P'
[05:17:03] info: Waiting for device to be available (max 5 minutes)
[05:17:03] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P wait-for-device'
[05:17:03] dbug: 
[05:17:03] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P shell getprop sys.boot_completed'
[05:17:03] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P logcat -c'
[05:17:03] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P version'
[05:17:03] dbug: Working with Android Debug Bridge version 1.0.41
                 Version 29.0.6-6198805
                 Installed as C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe

[05:17:03] info: Attempting to remove apk 'net.dot.Microsoft.Extensions.FileProviders.Physical.Tests': 
[05:17:03] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P uninstall net.dot.Microsoft.Extensions.FileProviders.Physical.Tests'
[05:17:03] info: APK 'net.dot.Microsoft.Extensions.FileProviders.Physical.Tests' not on device.
[05:17:03] info: Attempting to install C:\h\w\A40208F8\w\AD510979\e\Microsoft.Extensions.FileProviders.Physical.Tests.apk: 
[05:17:03] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P install "C:\h\w\A40208F8\w\AD510979\e\Microsoft.Extensions.FileProviders.Physical.Tests.apk"'
[05:17:06] warn: It seems the package installation cache may be full on the device.  We'll try to reboot it before trying one more time.
                 Output:Exit code: 1
                 Standard Output:
                 Performing Streamed Install

                 Standard Error:
                 adb: failed to install C:\h\w\A40208F8\w\AD510979\e\Microsoft.Extensions.FileProviders.Physical.Tests.apk: Failure [INSTALL_FAILED_INSUFFICIENT_STORAGE: Scanning Failed.: Package net.dot.Microsoft.Extensions.FileProviders.Physical.Tests could not be assigned a valid UID]

[05:17:06] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P reboot'
[05:17:10] info: Waiting for device to be available (max 5 minutes)
[05:17:10] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P wait-for-device'
[05:17:19] dbug: 
[05:17:19] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P shell getprop sys.boot_completed'
[05:17:19] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P shell getprop sys.boot_completed'
[05:17:19] dbug: sys.boot_completed = ''
[05:17:29] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P shell getprop sys.boot_completed'
[05:17:29] dbug: sys.boot_completed = '1'
[05:17:39] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P install "C:\h\w\A40208F8\w\AD510979\e\Microsoft.Extensions.FileProviders.Physical.Tests.apk"'
[05:17:43] info: Successfully installed C:\h\w\A40208F8\w\AD510979\e\Microsoft.Extensions.FileProviders.Physical.Tests.apk.
[05:17:43] info: Killing all running processes for 'net.dot.Microsoft.Extensions.FileProviders.Physical.Tests': 
[05:17:43] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P shell am kill --user all net.dot.Microsoft.Extensions.FileProviders.Physical.Tests'
[05:17:43] dbug: Success!

[05:17:43] info: Starting instrumentation class 'net.dot.MonoRunner' on net.dot.Microsoft.Extensions.FileProviders.Physical.Tests
[05:17:43] dbug: Raw command: 'shell am instrument  -w net.dot.Microsoft.Extensions.FileProviders.Physical.Tests/net.dot.MonoRunner'
[05:17:43] dbug: Executing command: 'C:\h\w\A40208F8\p\microsoft.dotnet.xharness.cli\1.0.0-prerelease.20601.1\runtimes\any\native\adb\windows\adb.exe -s 017AY1S22P shell am instrument  -w net.dot.Microsoft.Extensions.FileProviders.Physical.Tests/net.dot.MonoRunner'
[05:18:03] info: Running instrumentation class net.dot.MonoRunner took 20.1445738 seconds
[05:18:03] dbug: Exit code: 0
                 Standard Output:
                 INSTRUMENTATION_RESULT: return-code=0
                 INSTRUMENTATION_RESULT: test-results-path=/storage/emulated/0/Android/data/net.dot.Microsoft.Extensions.FileProviders.Physical.Tests/files/Documents/testResults.xml
                 INSTRUMENTATION_CODE: 0

@mattleibow that fix also includes your suggestion of using sys.boot_completed for ensuring the device is booted so please try xharness >= 1.0.0-prerelease.20601.1.

premun commented 3 years ago

Looks like XHarness 20601.1 which should contain the fix from https://github.com/dotnet/xharness/issues/354 is still failing: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-45486-merge-3c2f97f16f154752a9/Microsoft.Extensions.Configuration.EnvironmentVariables.Tests/console.74bfb490.log?sv=2019-07-07&se=2020-12-23T10:26:30Z&sr=c&sp=rl&sig=ImcNQ%2BZX5lvEc%2FHKcQ1cjkfhiIu5go%2F%2F6dcPcPw2aWw%3D

[02:29:12] fail: Error:
                 Exit code: 1
                 Standard Output:
                 Performing Streamed Install

                 Standard Error:
                 adb: failed to install C:\h\w\AB7309B6\w\BD3C0A3B\e\Microsoft.Extensions.Configuration.EnvironmentVariables.Tests.apk: cmd: Failure calling service package: Broken pipe (32)

[02:29:12] crit: Install failure: Test command cannot continue
XHarness exit code: 78
akoeplinger commented 3 years ago

We're checking for the exit code 224 for the "Broken pipe" error, looks like it can happen with exit code 1 as well :D

MattGal commented 3 years ago

Bummer, I am doing other Xharness tweaks now and will fold this in.

mattleibow commented 3 years ago

I don't think the current delay for booting is long enough on the DevOps agents. I get some more greens, but I still am having the same errors. I am adding back my 10 min wait and will confirm.

However, maybe this can be configurable via the CLI so that we never have to worry. If it is a really slow machine, then we can bump the number.

MattGal commented 3 years ago

I don't think the current delay for booting is long enough on the DevOps agents. I get some more greens, but I still am having the same errors. I am adding back my 10 min wait and will confirm.

However, maybe this can be configurable via the CLI so that we never have to worry. If it is a really slow machine, then we can bump the number.

Can you clarify what you mean by the devops agents? I'm unaware of any Android emulators on the Azure DevOps agents, this could be an unusual use case.

https://github.com/dotnet/xharness/pull/392/files is up for this newest broken pipe issue.

mattleibow commented 3 years ago

I install them myself using the android sdk manager and then use avd to create the emulators. It very well could be that they are not accelerated, but it does now work after extending the delay.

MattGal commented 3 years ago

Thanks for the details. In the case of most of the "Broken pipe" and "stuck offline" machines I've seen they stayed in this state until acted on externally (like restarting the adb server for the former and restarting the emulator / machine for the latter). If you have more suggestions about how to be reliable here I'd love to hear them (or see PRs)

mattleibow commented 3 years ago

With booting, I have seen some emulators (mostly the API 30) take about 240 seconds to boot up properly.

akoeplinger commented 3 years ago

Yeah I think waiting up to 5mins for sys.boot_completed is fine as a default and ideally it'd be overridable.

MattGal commented 3 years ago

I'm doing PR feedbacks after lunch, will stick that in.

mattleibow commented 3 years ago

I see there is a timeout arg for ios, maybe check to see if there is something we can do with that for consistency.

MattGal commented 3 years ago

I believe this is merged and hasn't been a problem, closing (feel free to reactivate)