Catapult Android tryserver doesn't run Android telemetry tests

zeptonaut commented 8 years ago

When I tried to run a new test that ran only on Android:

@decorators.Enabled('android')
def testClockSyncAtrace(self):
  ps = self.CreateEmptyPageSet()
  ps.AddStory(TestTimelinebasedMeasurementPage(ps, ps.base_dir))

  options = tbm_module.Options()
  options.config.enable_atrace_trace = True
  options.config.enable_chrome_trace = True
  options.config.chrome_trace_config.SetDefaultOverheadFilter()
  options.SetTimelineBasedMetrics(['clockSyncLatencyMetric'])

  tbm = tbm_module.TimelineBasedMeasurement(options)
  results = self.RunMeasurement(tbm, ps, self._options)

  self.assertEquals(0, len(results.failures))

on the Catapult Android tryserver, I got the STDOUT message:

[79/1095] telemetry.web_perf.timeline_based_page_test_unittest.TimelineBasedPageTestTest.testClockSyncAtrace was skipped 0.0000s:
  Skipping testClockSyncAtrace (<bound method TimelineBasedPageTestTest.testClockSyncAtrace of <telemetry.web_perf.timeline_based_page_test_unittest.TimelineBasedPageTestTest testMethod=testClockSyncAtrace>>) because it is only enabled for android. You are running ['reference', 'linux', 'trusty', 'has tabs', 'linux-reference', 'trusty-reference', 'has tabs-reference'].

which seemed to suggest that no Android tests had ever run on the tryserver. @nedn confirmed this fear after looking at the logs.
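A rough sketch of why the test was skipped, assuming the decorator simply checks whether any of its tags appears in the tags the run reports. `should_run` is a hypothetical helper for illustration, not Telemetry's actual implementation:

```python
def should_run(enabled_tags, platform_tags):
    """Return True if the test is enabled for at least one reported tag.

    Hypothetical helper; Telemetry's real decorator logic may differ.
    """
    return any(tag in platform_tags for tag in enabled_tags)


# Tags reported by the tryserver in the skip message above.
tryserver_tags = ['reference', 'linux', 'trusty', 'has tabs',
                  'linux-reference', 'trusty-reference', 'has tabs-reference']

# 'android' is not among the reported tags, so an android-only test is skipped.
android_test_runs = should_run(['android'], tryserver_tags)
print(android_test_runs)
```

Under that assumption, no test enabled only for `android` can run until the run itself reports an Android platform.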

@nedn pointed out that the problem lies with the command that we're using to run the Catapult tests on Android:

python /b/build/slave/catapult/build/catapult/telemetry/bin/run_tests '--browser=reference' --start-xvfb

In order for these tests to run on Android, we need:

python /b/build/slave/catapult/build/catapult/telemetry/bin/run_tests '--browser=reference'  --device=android --start-xvfb

This will require a recipe change, and will likely mean fixing a number of tests that are broken because they have never run successfully on Android.
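To make the difference between the two commands concrete, here is a minimal sketch of assembling the argument list. `build_run_tests_command` is a hypothetical helper, not part of the recipe; only the script path and the three flags come from the commands above:

```python
def build_run_tests_command(run_tests_path, browser='reference', device=None,
                            start_xvfb=True):
    """Assemble the argument list for Telemetry's run_tests script.

    Hypothetical helper for illustration; the real change belongs in the recipe.
    """
    cmd = ['python', run_tests_path, '--browser=%s' % browser]
    if device:
        # Without this flag the run only reports desktop platform tags.
        cmd.append('--device=%s' % device)
    if start_xvfb:
        cmd.append('--start-xvfb')
    return cmd


RUN_TESTS = '/b/build/slave/catapult/build/catapult/telemetry/bin/run_tests'

desktop_cmd = build_run_tests_command(RUN_TESTS)
android_cmd = build_run_tests_command(RUN_TESTS, device='android')
print(' '.join(android_cmd))
```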

zeptonaut commented 8 years ago

/cc @randalnephew

nedn commented 8 years ago

/cc @jbudorick @perezju @petrcermak

zeptonaut commented 8 years ago

Fix up here: https://codereview.chromium.org/2236493003#

Expecting a decent number of broken tests.

anniesullie commented 8 years ago

Updated bug title to clarify that this is specific to telemetry.

zeptonaut commented 8 years ago

Ah, sorry. Good call.

nedn commented 8 years ago

https://codereview.chromium.org/2236493003# is failing with a bunch of "[305/1094] telemetry.internal.browser.tab_unittest.TabTest.testTabBrowserIsRightBrowser failed unexpectedly 0.3019s: Traceback (most recent call last): File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/testing/browser_test_case.py", line 86, in setUpClass raise Exception('No browser found, cannot continue test.')"

I suspect that the code for pushing the reference browser's APK to the remote device doesn't work well in parallel, as the catapult android host is connected to 7 phones.

@Apeliotes can you help with investigating this bug?

jbudorick commented 8 years ago

Parallel installation should work, at least from devil's perspective.

Apeliotes commented 8 years ago

It should work from the dependency manager's perspective as well. I think this went through the CQ when the reference builds were updated (the Windows bots are failing in the same way), so these might be unrelated failures. I'm going to send the CL through a dry run and see if the tests are still broken.

nedn commented 8 years ago

Thanks Kari for looking into this!

Apeliotes commented 8 years ago

The problem appears to be that we are not prefetching the devil binaries. This isn't an issue on desktop, but means we'll always fail to prefetch on android.

zeptonaut commented 8 years ago

Interesting! Anything we can do to remedy the problem?

zeptonaut commented 8 years ago

Looks like https://codereview.chromium.org/2236493003/ still failed to submit, although it may have been due to ADB flakiness. Retrying now.

jbudorick commented 8 years ago

Thanks for the quick fix in my absence, Kari. I'll look into a less temporary solution.

zeptonaut commented 8 years ago

Looks like it's still failing :-/

jbudorick commented 8 years ago

Today's interesting failure of the day: the device_forwarder binary is an asan build for some reason.

  AdbShellCommandFailedError: (device: 06b3b31f003bf3cb) shell command run via adb failed on the device:
    command: LD_LIBRARY_PATH=/data/local/tmp/forwarder/  /data/local/tmp/forwarder/device_forwarder --kill-server
    exit status: 1
    output:
    - CANNOT LINK EXECUTABLE: could not load library "libclang_rt.asan-arm-android.so" needed by "/data/local/tmp/forwarder/device_forwarder"; caused by library "libclang_rt.asan-arm-android.so" not found
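A small sketch of how one might flag this kind of failure automatically from the adb output. The helper names are hypothetical and the regular expression only targets the linker message shown above:

```python
import re

# Matches the first library named in the linker error shown above.
_MISSING_LIB_RE = re.compile(r'could not load library "([^"]+)"')


def missing_library(adb_output):
    """Return the library the linker could not load, or None."""
    match = _MISSING_LIB_RE.search(adb_output)
    return match.group(1) if match else None


def looks_like_asan_build(adb_output):
    """Heuristic: the binary needs an ASan runtime that is not on the device."""
    lib = missing_library(adb_output)
    return lib is not None and 'asan' in lib


adb_output = ('- CANNOT LINK EXECUTABLE: could not load library '
              '"libclang_rt.asan-arm-android.so" needed by '
              '"/data/local/tmp/forwarder/device_forwarder"')
print(missing_library(adb_output))
```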

jbudorick commented 8 years ago

I think that the devices on that bot are in a bad state left over from previous runs, which means it's time to bring parts of the device provisioning logic up from chromium into devil.

jbudorick commented 8 years ago

... apparently telemetry has its own copies of the host and device binaries for both md5sum and the forwarder (e.g.). This seems wrong, but somehow telemetry is calling them anyway.

edit: never mind, those are a red herring.

nedn commented 8 years ago

I think that is due to legacy reasons. Telemetry should delegate the usage of those binaries to devil's API calls.

jbudorick commented 8 years ago

After some more investigation, I'm fairly certain this is a devil bug involving how we update the forwarder binary (and maybe md5sum binary) on the device. I've filed this as a separate issue.

(This isn't a problem on chromium bots because they run the wiping provision.)

zeptonaut commented 8 years ago

John: would you mind pinging this bug when you've fixed that bug?

jbudorick commented 8 years ago

ping :) the fix landed last night.

zeptonaut commented 8 years ago

Great - thanks! Running through the CQ again.

zeptonaut commented 8 years ago

Looks like it's still having problems: https://codereview.chromium.org/2236493003/

jbudorick commented 8 years ago

yep, new problem though: it's failing to install chrome. Not yet sure why the package manager is being killed.

    File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 643, in Install
      reinstall=reinstall, permissions=permissions)
    File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 693, in _InstallInternal
      device_apk_paths = self._GetApplicationPathsInternal(package_name)
    File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 488, in _GetApplicationPathsInternal
      ['pm', 'path', package], check_return=should_check_return)
    File "/b/build/slave/catapult/build/catapult/devil/devil/android/decorators.py", line 51, in timeout_retry_wrapper
      return impl()
    File "/b/build/slave/catapult/build/catapult/devil/devil/android/decorators.py", line 47, in impl
      return f(*args, **kwargs)
    File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 898, in RunShellCommand
      output = handle_large_output(cmd, large_output)
    File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 876, in handle_large_output
      return handle_large_command(cmd)
    File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 858, in handle_large_command
      return handle_check_return(cmd)
    File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 849, in handle_check_return
      return run(cmd)
    File "/b/build/slave/catapult/build/catapult/devil/devil/android/device_utils.py", line 845, in run
      return self.adb.Shell(cmd)
    File "/b/build/slave/catapult/build/catapult/devil/devil/android/sdk/adb_wrapper.py", line 492, in Shell
      command, output, status=status, device_serial=self._device_serial)
  AdbShellCommandFailedError: (device: 06ad8e47003b6c40) shell command run via adb failed on the device:
    command: pm path com.google.android.apps.chrome
    exit status: 137
    output:
    - Killed
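Exit status 137 together with the "Killed" output is consistent with the usual shell convention of reporting 128 plus the signal number, which here would be signal 9 (SIGKILL). A minimal Python 3 sketch of decoding such a status (the helper name is made up for illustration):

```python
import signal


def signal_from_exit_status(status):
    """Return the signal name implied by a shell exit status, or None.

    Assumes the common convention that a status above 128 means the
    process was terminated by signal (status - 128).
    """
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        # Not a signal number known on this host.
        return None


print(signal_from_exit_status(137))
```

This says only that the process was killed by a signal, not what sent it.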

zeptonaut commented 8 years ago

/cry

@jbudorick, is there any chance you could be a lifesaver and help with this one too?

jbudorick commented 8 years ago

yeah, see 2695. working on it now.

jbudorick commented 8 years ago

(I'm not sure that that'll fix the issue, but if it doesn't, it'll narrow down the possible causes.)

zeptonaut commented 8 years ago

Thank you!

jbudorick commented 8 years ago

@zeptonaut the android trybot runs device provisioning now, so this might work. Give it another shot when you get a chance.

jbudorick commented 8 years ago

... although I had to disable the existing telemetry suite on the Android trybot, which started misbehaving last night. (I don't think that was related to the addition of provisioning, though.)

zeptonaut commented 8 years ago

Hi John,

Just to clarify: is this good to try again now?

jbudorick commented 8 years ago

@zeptonaut I think so.

zeptonaut commented 8 years ago

Great. Trying now.

zeptonaut commented 8 years ago

Doh. Still seems to not be working: STDIO.

jbudorick commented 8 years ago

That's the linux bot; this is the android one.

More new problems:

  Traceback (most recent call last):
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/testing/browser_test_case.py", line 87, in setUpClass
      raise Exception('No browser found, cannot continue test.')
  Exception: No browser found, cannot continue test.

jbudorick commented 8 years ago

@zeptonaut there were issues w/ the weekend's reference build update, which has since been rolled back. I'm not sure if those issues were the cause of your tryjob failure, but could you kick off another tryjob to see?

zeptonaut commented 8 years ago

Retrying now

zeptonaut commented 8 years ago

It failed again, although it looks like this failure was on Android, not Linux (like the last one). Here's a link to the STDIO.

jbudorick commented 8 years ago

Good news: the reference build revert appears to have resolved the No browser found issue. There are now a lot of interesting errors in that log:

browser crashes, though nothing gets symbolized (unsurprisingly)
multiple instances of a timeout waiting for ... something in telemetry:

  Traceback (most recent call last):
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/testing/browser_test_case.py", line 41, in WrappedMethod
      method(self)
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/actions/seek_unittest.py", line 53, in testSeekWithAllSelector
      action.RunAction(self._tab)
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/actions/seek.py", line 48, in RunAction
      self._timeout_in_seconds)
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/actions/media_action.py", line 34, in WaitForEvent
      timeout=timeout_in_seconds)
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/core/util.py", line 94, in WaitFor
      (timeout, GetConditionString()))
  TimeoutException: Timed out while waiting 5s for util.WaitFor(lambda:
                       self.HasEventCompletedOrError(tab, selector, event_name),
                   timeout=timeout_in_seconds).

a JS eval failure?

  Traceback (most recent call last):
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/testing/browser_test_case.py", line 41, in WrappedMethod
      method(self)
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/actions/tap_unittest.py", line 22, in testTapSinglePage
      self.assertEqual(1, self._tab.EvaluateJavaScript('valueToTest'))
  AssertionError: 1 != 0

a tab existing when it shouldn't, AFAICT:

  Traceback (most recent call last):
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/testing/browser_test_case.py", line 41, in WrappedMethod
      method(self)
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/backends/chrome/tab_list_backend_unittest.py", line 49, in testTabIdStableAfterTabCrash
      self.assertRaises(KeyError, lambda: self.tabs.GetTabById(tabs[0].id))
  AssertionError: KeyError not raised

multiple instances of telemetry trying to use the blacklist file though none was provided:

  Traceback (most recent call last):
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/platform/android_device_unittest.py", line 70, in testAdbNoDevicesReturnsNone
      self.assertIsNone(android_device.GetDevice(finder_options))
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/platform/android_device.py", line 107, in GetDevice
      if android_platform_options.android_blacklist_file:
  AttributeError: 'NoneType' object has no attribute 'android_blacklist_file'

telemetry trying to call DeviceUtils (the class)?

  Traceback (most recent call last):
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/platform/profiler/android_profiling_helper_unittest.py", line 170, in setUp
      self._device = browser_backend.device()
  TypeError: 'DeviceUtils' object is not callable

atrace_tracing_agent.py not finding unqualified adb:

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/b/build/slave/catapult/build/catapult/systrace/profile_chrome/atrace_tracing_agent.py", line 98, in _CollectData
    self._RunATraceCommand('async_start')
  File "/b/build/slave/catapult/build/catapult/systrace/profile_chrome/atrace_tracing_agent.py", line 88, in _RunATraceCommand
    return self._RunAdbShellCommand(cmd)
  File "/b/build/slave/catapult/build/catapult/systrace/profile_chrome/atrace_tracing_agent.py", line 84, in _RunAdbShellCommand
    return cmd_helper.GetCmdOutput(cmd)
  File "/b/build/slave/catapult/build/catapult/devil/devil/utils/cmd_helper.py", line 137, in GetCmdOutput
    (_, output) = GetCmdStatusAndOutput(args, cwd, shell)
  File "/b/build/slave/catapult/build/catapult/devil/devil/utils/cmd_helper.py", line 172, in GetCmdStatusAndOutput
    args, cwd=cwd, shell=shell)
  File "/b/build/slave/catapult/build/catapult/devil/devil/utils/cmd_helper.py", line 197, in GetCmdStatusOutputAndError
    shell=shell, cwd=cwd)
  File "/b/build/slave/catapult/build/catapult/devil/devil/utils/cmd_helper.py", line 97, in Popen
    preexec_fn=lambda: signal.signal(signal.SIGPIPE, signal.SIG_DFL))
  File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

systrace trying to write None to a file:

Traceback (most recent call last):
  <module> at /b/build/slave/catapult/build/catapult/systrace/bin/adb_profile_chrome:14
    sys.exit(main.main())
  main at /b/build/slave/catapult/build/catapult/systrace/profile_chrome/main.py:162
    write_json=options.json)
  CaptureProfile at /b/build/slave/catapult/build/catapult/systrace/profile_chrome/profiler.py:130
    return _GetResults(agents, output, compress, write_json, interval)
  _GetResults at /b/build/slave/catapult/build/catapult/systrace/profile_chrome/profiler.py:75
    f.write(trace_results[0].raw_data)
TypeError: must be string or buffer, not None

telemetry not calling StopTracing correctly:

  Traceback (most recent call last):
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/testing/browser_test_case.py", line 41, in WrappedMethod
      method(self)
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/platform/profiler/android_systrace_profiler_unittest.py", line 26, in testSystraceProfiler
      result = profiler.CollectProfile()[0]
    File "/b/build/slave/catapult/build/catapult/telemetry/telemetry/internal/platform/profiler/android_systrace_profiler.py", line 64, in CollectProfile
      self._browser_backend.StopTracing(trace_result_builder)
  TypeError: StopTracing() takes exactly 1 argument (2 given)
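On the `'DeviceUtils' object is not callable` error above: one plausible way to get that exact message is calling something that is exposed as a property. This is only a guess at the cause. The classes below are stand-ins that borrow the `DeviceUtils` name and are not devil's or Telemetry's code:

```python
class DeviceUtils(object):
    """Stand-in class; only the name is borrowed from devil."""


class FakeBrowserBackend(object):
    """Hypothetical backend that exposes its device as a property."""

    def __init__(self):
        self._device = DeviceUtils()

    @property
    def device(self):
        return self._device


backend = FakeBrowserBackend()

# Property access returns the DeviceUtils instance.
device = backend.device

try:
    # The extra parentheses call the returned instance, not a method.
    backend.device()
except TypeError as e:
    error_message = str(e)
    print(error_message)
```

If that is what happened, dropping the parentheses in the test's `setUp` would be the fix.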

Apeliotes commented 8 years ago

Charlie: are you looking into any of those errors, or should I?

zeptonaut commented 8 years ago

Unfortunately I just don't have the bandwidth to look into the problems ATM :-/

zeptonaut commented 8 years ago

@jbudorick @Apeliotes do either of you know where we landed with this?

jbudorick commented 8 years ago

@zeptonaut I haven't done anything with this since my previous comment.

anniesullie commented 7 years ago

@zeptonaut: do you think you would be able to:

1. Update your CL to disable all the tests that still fail.
2. File a bug to re-enable the failing tests, with a list.
3. Land the CL?

That way we can get coverage on the tests that are passing, and enable the failures as we have time (and possibly in parallel).

nedn commented 7 years ago

Yeah, we are close to finishing this bug. I think a bunch of the failed tests are because we have never had integration test coverage against actual Android devices on the CQ. So disabling the failing ones and landing this to stop the bleeding SGTM.

zeptonaut commented 7 years ago

I'll do my best to prioritize this.

zeptonaut commented 7 years ago

The weirdest thing about this is that it seems like the retry attempts just stop in the middle. Both in the old tryserver attempts (from a couple months ago) and last night's tryserver attempts, about 50 or so Telemetry tests fail the first time through. Then, on either the first or second retry of those tests, output just seems to abruptly end in the middle of the tests. I think my inclination is to err on the side of disabling things and getting this up and running, and making sure that we swing back around later to look at what's wrong with the tests that need disabling.

nedn commented 7 years ago

All the error messages seem to be "WebSocketException: Handshake Status 500". I think this is worth further debugging.

zeptonaut commented 7 years ago

Just to give an update: this is currently blocked on fixing a bug where systrace unit tests fail to shut down Chrome after the test is complete. The result of this is that Chrome instances from one test interfere with instances in future tests.

We identified this after seeing that all Telemetry unit tests pass when running with a local Android device. @jbudorick then tested and saw that Telemetry unit tests also pass when you rearrange the test steps so that Telemetry unit tests run first. He then added more and more test steps before the Telemetry unit tests until he found the first one that caused problems: the systrace unit tests.
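The search described above can be sketched roughly as follows. `first_interfering_step` and the simulated check are hypothetical stand-ins for real tryserver runs, and the step names other than the systrace unit tests are made up:

```python
def first_interfering_step(candidate_steps, target_passes_after):
    """Return the first step whose addition makes the target suite fail.

    candidate_steps: step names in their original order.
    target_passes_after: callable that takes the list of steps run before
        the target suite and returns True if the target suite still passes.
    """
    preceding = []
    for step in candidate_steps:
        preceding.append(step)
        if not target_passes_after(preceding):
            return step
    return None


# Simulated outcome: the target suite breaks once the systrace tests run first.
steps = ['step A (hypothetical)', 'systrace unit tests', 'step C (hypothetical)']


def simulated_check(preceding):
    return 'systrace unit tests' not in preceding


culprit = first_interfering_step(steps, simulated_check)
print(culprit)
```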

He now has a CL out to @ChrisCraik that force closes the browser in the tear down steps for systrace unit tests. However, @ChrisCraik is OOO until Thursday, at which point we can hopefully get that CL submitted and get my CL submitted that enables Telemetry unit tests on the Android Catapult tryserver.
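A minimal sketch of the idea behind that CL, using only the standard `unittest` module. `FakeBrowser` and `SystraceStyleTest` are hypothetical and this is not the actual change:

```python
import unittest


class FakeBrowser(object):
    """Hypothetical stand-in for a browser running on the device."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def force_close(self):
        self.running = False


class SystraceStyleTest(unittest.TestCase):
    # Shared across tests, like a browser left running on a device.
    browser = FakeBrowser()

    def setUp(self):
        self.browser.start()

    def tearDown(self):
        # Runs whether the test passed or failed, so a leftover browser
        # cannot interfere with later test steps.
        self.browser.force_close()

    def testTraceSomething(self):
        self.assertTrue(self.browser.running)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SystraceStyleTest)
result = unittest.TestResult()
suite.run(result)
print(result.wasSuccessful(), SystraceStyleTest.browser.running)
```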

zeptonaut commented 7 years ago

Huge thanks to @jbudorick for figuring out that systrace wasn't killing Chrome after its tests, leading to later failed Telemetry unit tests!

My CL yesterday enabled Telemetry tests to run on Android. As a sanity check, I ran a CL that fails a unit test on Android through the CQ this morning. The results:

[image: telemetry_runs]

Success! Marking this as fixed.

catapult-project / catapult

Catapult Android tryserver doesn't run Android telemetry tests #2645