Open zeptonaut opened 8 years ago
/cc @randalnephew
/cc @jbudorick @perezju @petrcermak
Fix up here: https://codereview.chromium.org/2236493003#
Expecting a decent number of broken tests.
Updated bug title to clarify that this is specific to telemetry.
Ah, sorry. Good call.
https://codereview.chromium.org/2236493003# is failing with a bunch of "[305/1094] telemetry.internal.browser.tab_unittest.TabTest.testTabBrowserIsRightBrowser failed unexpectedly 0.3019s: Traceback (most recent call last): File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/testing/browser_test_case.py", line 86, in setUpClass raise Exception('No browser found, cannot continue test.')"
I suspect that the code for pushing the reference browser's APK to the remote device doesn't work well in parallel, as the catapult android host is connected to 7 phones.
@Apeliotes can you help with investigating this bug?
Parallel installation should work, at least from devil's perspective.
It should work from the dependency manager's perspective as well. I think this went through the CQ when the reference builds were updated (the Windows bots are failing in the same way), so these might be unrelated failures. I'm going to send the CL through a dry run and see if the tests are still broken.
Thanks Kari for looking into this!
The problem appears to be that we are not prefetching the devil binaries. This isn't an issue on desktop, but it means we'll always fail to prefetch on android.
Interesting! Anything we can do to remedy the problem?
Looks like https://codereview.chromium.org/2236493003/ still failed to submit, although it may have been due to ADB flakiness. Retrying now.
Thanks for the quick fix in my absence, Kari. I'll look into a less temporary solution.
Looks like it's still failing :-/
Today's interesting failure of the day: the device_forwarder binary is an asan build for some reason.
AdbShellCommandFailedError: (device: 06b3b31f003bf3cb) shell command run via adb failed on the device:
command: LD_LIBRARY_PATH=/data/local/tmp/forwarder/ /data/local/tmp/forwarder/device_forwarder --kill-server
exit status: 1
output:
- CANNOT LINK EXECUTABLE: could not load library "libclang_rt.asan-arm-android.so" needed by "/data/local/tmp/forwarder/device_forwarder"; caused by library "libclang_rt.asan-arm-android.so" not found
I think that the devices on that bot are in a bad state left over from previous runs, which means it's time to bring parts of the device provisioning logic up from chromium into devil.
... apparently telemetry has its own copies of the host and device binaries for both md5sum and the forwarder (e.g.). This seems wrong, but somehow telemetry is calling them anyway.
edit: never mind, those are a red herring.
I think that is due to legacy reasons. Telemetry should delegate the usage of those binaries to devil's API calls.
After some more investigation, I'm fairly certain this is a devil bug involving how we update the forwarder binary (and maybe md5sum binary) on the device. I've filed this as a separate issue.
(This isn't a problem on chromium bots because they run the wiping provision.)
John: would you mind pinging this bug when you've fixed that bug?
ping :) the fix landed last night.
Great - thanks! Running through the CQ again.
Looks like it's still having problems: https://codereview.chromium.org/2236493003/
yep, new problem though: it's failing to install chrome. Not yet sure why the package manager is being killed.
File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 643, in Install
reinstall=reinstall, permissions=permissions)
File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 693, in _InstallInternal
device_apk_paths = self._GetApplicationPathsInternal(package_name)
File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 488, in _GetApplicationPathsInternal
['pm', 'path', package], check_return=should_check_return)
File "/b/build/slave/catapult/build/catapult/devil/devil/android/decorators.py", line 51, in timeout_retry_wrapper
return impl()
File "/b/build/slave/catapult/build/catapult/devil/devil/android/decorators.py", line 47, in impl
return f(*args, **kwargs)
File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 898, in RunShellCommand
output = handle_large_output(cmd, large_output)
File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 876, in handle_large_output
return handle_large_command(cmd)
File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 858, in handle_large_command
return handle_check_return(cmd)
File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 849, in handle_check_return
return run(cmd)
File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 845, in run
return self.adb.Shell(cmd)
File "/b/build/slave/catapult/build/catapult/devil/devil/android/sdk/adb_wrapper.py", line 492, in Shell
command, output, status=status, device_serial=self._device_serial)
AdbShellCommandFailedError: (device: 06ad8e47003b6c40) shell command run via adb failed on the device:
command: pm path com.google.android.apps.chrome
exit status: 137
output:
- Killed
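For reference, a shell exit status above 128 conventionally means the process was terminated by a signal, so 137 corresponds to signal 9 (SIGKILL), which matches the "Killed" output. A minimal sketch of that decoding in Python 3, where `describe_exit_status` is a hypothetical helper and not part of devil:

```python
import signal


def describe_exit_status(status):
    """Interpret a shell exit status using the 128 + signal-number convention."""
    if status > 128:
        signum = status - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            # The number is not a named signal on this host platform.
            name = "unknown signal"
        return "terminated by signal %d (%s)" % (signum, name)
    return "exited with status %d" % status


print(describe_exit_status(137))
```

This only explains what the status means; it does not say which component on the device sent the signal to the package manager.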
/cry
@jbudorick, is there any chance you could be a lifesaver and help with this one too?
yeah, see 2695. working on it now.
(I'm not sure that that'll fix the issue, but if it doesn't, it'll narrow down the possible causes.)
Thank you!
@zeptonaut the android trybot runs device provisioning now, so this might work. Give it another shot when you get a chance.
... although I had to disable the existing telemetry suite on the Android trybot, which started misbehaving last night. (I don't think that was related to the addition of provisioning, though.)
Hi John,
Just to clarify: is this good to try again now?
@zeptonaut I think so.
Great. Trying now.
That's the linux bot; this is the android one.
More new problems:
Traceback (most recent call last):
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/testing/browser_test_case.py", line 87, in setUpClass
raise Exception('No browser found, cannot continue test.')
Exception: No browser found, cannot continue test.
@zeptonaut there were issues w/ the weekend's reference build update that has since been rolled back. I'm not sure if those issues were the cause of your tryjob failure, but you could kick another tryjob to see?
Retrying now
It failed again, although it looks like this failure was on Android, not Linux (like the last one). Here's a link to the STDIO.
Good news: the reference build revert appears to have resolved the "No browser found" issue. There are now a lot of interesting errors in that log:
Traceback (most recent call last):
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/testing/browser_test_case.py", line 41, in WrappedMethod
method(self)
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/actions/seek_unittest.py", line 53, in testSeekWithAllSelector
action.RunAction(self._tab)
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/actions/seek.py", line 48, in RunAction
self._timeout_in_seconds)
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/actions/media_action.py", line 34, in WaitForEvent
timeout=timeout_in_seconds)
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/core/util.py", line 94, in WaitFor
(timeout, GetConditionString()))
TimeoutException: Timed out while waiting 5s for util.WaitFor(lambda:
self.HasEventCompletedOrError(tab, selector, event_name),
timeout=timeout_in_seconds).
Traceback (most recent call last):
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/testing/browser_test_case.py", line 41, in WrappedMethod
method(self)
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/actions/tap_unittest.py", line 22, in testTapSinglePage
self.assertEqual(1, self._tab.EvaluateJavaScript('valueToTest'))
AssertionError: 1 != 0
Traceback (most recent call last):
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/testing/browser_test_case.py", line 41, in WrappedMethod
method(self)
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/backends/chrome/tab_list_backend_unittest.py", line 49, in testTabIdStableAfterTabCrash
self.assertRaises(KeyError, lambda: self.tabs.GetTabById(tabs[0].id))
AssertionError: KeyError not raised
Traceback (most recent call last):
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/platform/android_device_unittest.py", line 70, in testAdbNoDevicesReturnsNone
self.assertIsNone(android_device.GetDevice(finder_options))
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/platform/android_device.py", line 107, in GetDevice
if android_platform_options.android_blacklist_file:
AttributeError: 'NoneType' object has no attribute 'android_blacklist_file'
Traceback (most recent call last):
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/platform/profiler/android_profiling_helper_unittest.py", line 170, in setUp
self._device = browser_backend.device()
TypeError: 'DeviceUtils' object is not callable
atrace_tracing_agent.py not finding unqualified adb:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/b/build/slave/catapult/build/catapult/systrace/profile_chrome/atrace_tracing_agent.py", line 98, in _CollectData
self._RunATraceCommand('async_start')
File "/b/build/slave/catapult/build/catapult/systrace/profile_chrome/atrace_tracing_agent.py", line 88, in _RunATraceCommand
return self._RunAdbShellCommand(cmd)
File "/b/build/slave/catapult/build/catapult/systrace/profile_chrome/atrace_tracing_agent.py", line 84, in _RunAdbShellCommand
return cmd_helper.GetCmdOutput(cmd)
File "/b/build/slave/catapult/build/catapult/devil/devil/utils/cmd_helper.py", line 137, in GetCmdOutput
(_, output) = GetCmdStatusAndOutput(args, cwd, shell)
File "/b/build/slave/catapult/build/catapult/devil/devil/utils/cmd_helper.py", line 172, in GetCmdStatusAndOutput
args, cwd=cwd, shell=shell)
File "/b/build/slave/catapult/build/catapult/devil/devil/utils/cmd_helper.py", line 197, in GetCmdStatusOutputAndError
shell=shell, cwd=cwd)
File "/b/build/slave/catapult/build/catapult/devil/devil/utils/cmd_helper.py", line 97, in Popen
preexec_fn=lambda: signal.signal(signal.SIGPIPE, signal.SIG_DFL))
File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
Traceback (most recent call last):
<module> at /b/build/slave/catapult/build/catapult/systrace/bin/adb_profile_chrome:14
sys.exit(main.main())
main at /b/build/slave/catapult/build/catapult/systrace/profile_chrome/main.py:162
write_json=options.json)
CaptureProfile at /b/build/slave/catapult/build/catapult/systrace/profile_chrome/profiler.py:130
return _GetResults(agents, output, compress, write_json, interval)
_GetResults at /b/build/slave/catapult/build/catapult/systrace/profile_chrome/profiler.py:75
f.write(trace_results[0].raw_data)
TypeError: must be string or buffer, not None
Traceback (most recent call last):
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/testing/browser_test_case.py", line 41, in WrappedMethod
method(self)
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/platform/profiler/android_systrace_profiler_unittest.py", line 26, in testSystraceProfiler
result = profiler.CollectProfile()[0]
File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/platform/profiler/android_systrace_profiler.py", line 64, in CollectProfile
self._browser_backend.StopTracing(trace_result_builder)
TypeError: StopTracing() takes exactly 1 argument (2 given)
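On the atrace failure above: `OSError: [Errno 2] No such file or directory` is what `subprocess` raises when the executable it is asked to launch cannot be resolved, which is consistent with an unqualified `adb` that is not on the bot's `PATH`. A hedged sketch of resolving the binary up front, written for Python 3 (the bots in these logs run Python 2.7); `resolve_executable` and its `fallback_path` parameter are hypothetical and are not existing devil or systrace APIs:

```python
import shutil


def resolve_executable(name, fallback_path=None):
    """Return a usable path for `name`, preferring whatever is on PATH.

    fallback_path is a hypothetical, caller-supplied location to use when
    the binary is not on PATH (for example, a checked-in SDK copy).
    """
    path = shutil.which(name)
    if path is not None:
        return path
    if fallback_path is not None:
        return fallback_path
    # Failing here gives a clearer message than a bare Errno 2 from Popen.
    raise RuntimeError("%r not found on PATH and no fallback given" % name)
```

Whether the right fix is a fully qualified path or a `PATH` change on the bot is not settled in this thread; the sketch only shows how the lookup failure could be surfaced earlier.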
Charlie: are you looking into any of those errors, or should I?
Unfortunately I just don't have the bandwidth to look into the problems ATM :-/
@jbudorick @Apeliotes do either of you know where we landed with this?
@zeptonaut I haven't done anything with this since my previous comment.
@zeptonaut: do you think you would be able to:
That way we can get coverage on the tests that are passing, and enable the failures as we have time (and possibly in parallel).
Yeah, we are close to finishing this bug. I think a bunch of the failing tests are failing because we have never had integration test coverage against actual Android devices on the CQ. So disabling the failing ones and getting this landed to stop the bleeding SGTM.
I'll do my best to prioritize this.
The weirdest thing about this is that it seems like the retry attempts just stop in the middle. Both in the old tryserver attempts (from a couple months ago) and last night's tryserver attempts, about 50 or so Telemetry tests fail the first time through. Then, on either the first or second retry of those tests, output just seems to abruptly end in the middle of the tests. I think my inclination is to err on the side of disabling things and getting this up and running, and making sure that we swing back around later to look at what's wrong with the tests that need disabling.
All the error messages seem to be "WebSocketException: Handshake Status 500". I think this is worth further debugging.
Just to give an update: this is currently blocked on fixing a bug where systrace unit tests fail to shut down Chrome after the test is complete. The result of this is that Chrome instances from one test interfere with instances in future tests.
We identified this after seeing that all Telemetry unit tests pass when running with a local Android device. @jbudorick then tested and saw that Telemetry unit tests also pass when you rearrange the test steps so that Telemetry unit tests run first. He then added more and more test steps before the Telemetry unit tests until he found the first one that caused problems: the systrace unit tests.
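The search described above (adding steps in front of the Telemetry unit tests until they start failing) can be sketched as a simple linear scan. This is an illustration only: `first_interfering_step` and the `target_passes_after` callback are hypothetical names, and the step names in the usage example are placeholders apart from the systrace unit tests.

```python
def first_interfering_step(candidate_steps, target_passes_after):
    """Return the first step whose addition makes the target suite fail.

    candidate_steps: ordered step names that normally run before the target.
    target_passes_after: callable taking the list of steps to run first and
        returning True if the target suite still passes afterwards.
    """
    prefix = []
    for step in candidate_steps:
        prefix.append(step)
        # Pass a copy so the callback cannot mutate our running prefix.
        if not target_passes_after(list(prefix)):
            return step
    return None


# Placeholder step list; only "systrace_unittests" comes from this thread.
steps = ["step_a", "step_b", "systrace_unittests", "step_c"]
culprit = first_interfering_step(
    steps, lambda ran_first: "systrace_unittests" not in ran_first)
print(culprit)
```

In practice each call to the callback is a full tryserver run, so the scan is slow but needs no assumptions about how the steps interact.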
He now has a CL out to @ChrisCraik that force closes the browser in the tear down steps for systrace unit tests. However, @ChrisCraik is OOO until Thursday, at which point we can hopefully get that CL submitted and get my CL submitted that enables Telemetry unit tests on the Android Catapult tryserver.
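As a rough illustration of the kind of fix described (not the actual CL), a test case can force-stop whatever browser it started in `tearDown`, which `unittest` runs after each test whether the test passed or failed. `FakeBrowser` below is a stand-in invented for this sketch; it is not a systrace, devil, or Telemetry class.

```python
import unittest


class FakeBrowser(object):
    """Hypothetical stand-in for a browser handle on a device."""

    live_instances = 0  # Counts browsers that were started but not stopped.

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True
        FakeBrowser.live_instances += 1

    def force_stop(self):
        if self.running:
            self.running = False
            FakeBrowser.live_instances -= 1


class ExampleTracingTest(unittest.TestCase):
    def setUp(self):
        self.browser = FakeBrowser()
        self.browser.start()

    def tearDown(self):
        # Runs after every test, including failing ones, so no browser
        # instance is left behind to interfere with later suites.
        self.browser.force_stop()

    def test_browser_is_running_during_test(self):
        self.assertTrue(self.browser.running)
```

Running the class with `python -m unittest` should leave `FakeBrowser.live_instances` at zero afterwards, which is the property the later Telemetry tests depend on.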
Huge thanks to @jbudorick for figuring out that systrace wasn't killing Chrome after its tests, leading to later failed Telemetry unit tests!
My CL yesterday enabled Telemetry tests to run on Android. As a sanity check, I ran a CL that fails a unit test on Android through the CQ this morning. The results:
Success! Marking this as fixed.
When I tried to run a new test that ran only on Android:
on the Catapult Android tryserver, I got the STDOUT message:
which seemed to suggest that no Android tests had ever run on the tryserver. @nedn confirmed this fear after looking at the logs.
He pointed out that the problem lies with the command that we're using to run the Catapult tests on Android:
In order for these tests to run on Android, we need:
This will require a recipe change, and will likely also mean fixing a number of tests that are broken because they have never run successfully on Android.