biancadanforth / tracking-protection-shield-study

A Shield study to determine the optimal messaging, if any, for Tracking Protection in Firefox.

Get an engineering review #26

Closed biancadanforth closed 6 years ago

biancadanforth commented 6 years ago

As part of a more rigorous review process for landing shield studies (currently modeled after this), this study should receive an engineering review.

Firefox Engineering has asked that this reviewer be a Firefox Peer.

This is a meta-issue for any engineering-related concerns.

Related issues:

Engineering reviewer: rhelmer

biancadanforth commented 6 years ago

Testing the study on the Try server

With @rhelmer's help, I was able to test this shield study on the try server (Nightly only).

First attempt: Cancelled in Nightly/Central branch

The first test ran the study as-is, but we discovered that was not going to work and the build was cancelled due to:

Second attempt: Success in Nightly/Central branch

Addon Configuration

The addon was run through the try server configured to:

What tests were run

./mach try -p linux64,win64,macosx64 -u all -t all --artifact

This runs all unit tests and perf (Talos) tests on all desktop platforms.

Results

The results of the test can be found in Treeherder.

Performance Tests

Results of our patch compared to mozilla-central (Nightly)

Non-Performance Tests

Note: Any tests of interest that failed in the try push can be run individually on my local machine without having to spin up cycles on the try server. Ex: ./mach test browser/components/sessionstore/test/browser_394759_behavior.js for MochiTests.

Next steps

Follow up tests in Nightly

Understand Responsiveness (Performance) Tests

Test on the Release branch

Since this study will ship on Firefox Release, we attempted to run the same set of tests on the Release branch as in Nightly above, again using artifact builds. Unfortunately, the build steps for each test failed before even getting to the test itself, so the tests were cancelled.

NOTE: rhelmer recommends testing in Nightly and locally with artifact builds as much as possible first, since that is faster/more cost-effective.

biancadanforth commented 6 years ago

Update for Next Step: Understand Responsiveness (Performance) Tests

I spoke with jmaher; here’s what he said:

On what Responsiveness tests measure

On whether our result is an outlier

On whether the result is significant (see attached screenshot)

@rhelmer, what do you think we should do now that we know what this test measures and the extent of the regression? My thinking is that this regression is largely due to an inefficient re-implementation of Tracking Protection. We might be able to reduce the extent of regression by memoizing blocked third parties in WebRequest.onBeforeRequest, but I'm not sure how big a dent that will make, and if that is worth doing given our other priorities.
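
To make the memoization idea concrete, here is a rough sketch of what I have in mind. This is illustration only, not the study's actual code: the WebRequest.jsm wiring and the isTrackingHost() helper are assumptions standing in for however the study's re-implementation of Tracking Protection actually checks the blocklist.

```js
// Sketch only: memoize per-host blocking decisions so the (expensive)
// blocklist check runs at most once per third-party host.
const { utils: Cu } = Components;
const { WebRequest } = Cu.import("resource://gre/modules/WebRequest.jsm", {});
Cu.importGlobalProperties(["URL"]);

const hostDecisionCache = new Map();

function onBeforeRequestListener(details) {
  const host = new URL(details.url).hostname;
  let shouldBlock = hostDecisionCache.get(host);
  if (shouldBlock === undefined) {
    shouldBlock = isTrackingHost(host); // hypothetical blocklist lookup
    hostDecisionCache.set(host, shouldBlock);
  }
  return shouldBlock ? { cancel: true } : {};
}

WebRequest.onBeforeRequest.addListener(
  onBeforeRequestListener,
  null,         // no URL filter; observe all requests
  ["blocking"]  // lets the listener cancel requests it decides to block
);
```

The cache trades a small amount of memory for skipping repeated blocklist lookups on hosts we have already classified; whether that meaningfully reduces the Responsiveness regression is exactly what we would need to measure.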

rhelmer commented 6 years ago

@rhelmer, what do you think we should do now that we know what this test measures and the extent of the regression? My thinking is that this regression is largely due to an inefficient re-implementation of Tracking Protection. We might be able to reduce the extent of regression by memoizing blocked third parties in WebRequest.onBeforeRequest, but I'm not sure how big a dent that will make, and if that is worth doing given our other priorities.

Yes I agree... the important thing is to note the regressions and what we believe is causing them. It seems unrealistic to expect 0 regressions when this is implemented as an add-on vs. as a built-in feature.

biancadanforth commented 6 years ago

Update for Next Step: Follow Up Tests in Nightly

Performance regressions on tspaint, sessionrestore and responsiveness

Non-Performance "leaked X windows" unit test failures

This patch and this patch were intended to address the memory leak errors from many of the unit tests. Unfortunately, the Treeherder results for unit tests did not improve, so I decided to ask _6a68 (jhirsch), per rhelmer's suggestion. Here's what he suggested:


_6a68's recommendations:

biancadanforth commented 6 years ago

Update for Next Step: Test in Release

Issue #88 is attempting to fix the window leak errors from Try server unit tests, which are unfortunately a blocker for shipping this study.

I did manage to run the exact same set of tests against the exact same patch in the release branch of Firefox (up to this point these tests had been my patch against mozilla-central, aka Nightly). Here's the process for doing that for posterity.

Here are the Treeherder results for my patch (as of this commit in PR #89 ) in mozilla-central. 45 failed unit tests.

Here are the Treeherder results for my patch (same as above) in release. 43 failed unit tests.

The good news:

The bad news:

biancadanforth commented 6 years ago

Update: All memory leaks fixed with PR #89. Let's hope it stays that way. Here are the Treeherder results against release for that patch, compared to before. See #92 for the kinds of fixes that led to 0 failed tests for "leaked X windows".

Next step: What are these other 22 non-memory leak unit tests that are failing? Do we need to be concerned about them?

biancadanforth commented 6 years ago

Update on Next Step: What are these other 22 non-memory leak unit tests that are failing?

Note: the alphanumeric designation for each test varies between Try runs and between builds of Firefox (ex: Linux64 opt versus Linux64 debug, etc.), so the same test could have a different designation (e.g. bc4 versus bc10) for two different builds. Some of the same tests are failing across builds. This list reflects all the unique tests that are failing.

UNEXPECTED FAILURES

EXPECTED FAILURES

Next steps

  1. Run a full (i.e. non-artifact) build against the release channel for this patch to see what test failures can be attributed to the fact that we were testing an artifact build.
  2. Reach out to myk@ to ask about the C1 test.
  3. Bisect the study to find why the bc2 test is failing. Failing that, ask dothayer@ or fgomes@.
biancadanforth commented 6 years ago

Update: Next step - Run a full (i.e. non-artifact) build on the try server against the release channel for this patch.

Full build Treeherder versus Artifact build Treeherder for changes including PR #89 .

I learned that my local build of Firefox saves changes I make in its own user profile, so when I had previously toggled Tracking Protection to "always", that setting affected subsequent ./mach runs and possibly Try testing. I have therefore re-run the original artifact build tests (the artifact link above in this comment). More specifically: ./mach run uses a temporary profile that is cleared out every time you ./mach clobber; ./mach run creates the profile if it does not exist, under ./${OBJDIR}/tmp/scratch_user/. Local testing and try server testing use a new profile each time.

Comparison: Both builds had all the same bc test failures listed in the previous comment.

L10N:

X6:

5:

enUS:

biancadanforth commented 6 years ago

Update: Next step - Reach out to myk@ to ask about the C1 test.

Test summary per myk@

This test opens headless Firefox. As soon as Firefox is ready to browse to a page and the page has loaded, it takes a screenshot of the page and then immediately closes Firefox.

Whereas most tests start with a single window in Firefox (with a head!) open to about:blank, this test actually starts up Firefox in headless mode itself.

myk@'s thoughts on what could be causing the test to time out

Headless screenshots as a feature have historically seen quite a lot of race conditions on startup, as Firefox is not designed to start up and quit very quickly. We actually delay some startup functionality so we can show a page to the user as quickly as possible. Parts of Firefox’s initialization may fail because they don’t expect Firefox to quit so soon after startup.

My hypothesis is that there are async calls in your bootstrap.js startup method that haven't returned before the test finishes and closes Firefox.

An alternative hypothesis: is Firefox quitting at all? Is something blocking it from closing because it's still in the process of being initialized? Or is something holding open a file in that profile directory at the time Firefox quits?

My guess is that there is async telemetry running or some async setup aspect to the study itself that may not have returned when headless Firefox is closing.

One thing you could try is to disable Shield studies when in headless mode to see if the test passes. If this is controlled by a pref, you could add that pref to the profile the test uses. Mochitest creates its own profile for these unit tests, but changing that may be too far upstream, since it would affect all other tests. Instead, you could edit the copy pulled from mochitest by editing this test directly at or before this line: read the mochitest prefs file using OS.File, add the pref to it, and write it back out to the new location.
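
To make that last suggestion concrete, here is a rough sketch of the read-modify-write step. The paths and the pref name are my assumptions for illustration, not necessarily what the test or Shield actually uses:

```js
// Sketch only: read the prefs file mochitest generated, append a pref that
// (hypothetically) disables Shield studies, and write the modified copy to
// the profile directory this headless test uses. Paths are placeholders.
Components.utils.import("resource://gre/modules/osfile.jsm"); // provides OS.File

async function writePrefsWithShieldDisabled(srcPrefsPath, destPrefsPath) {
  let prefsText = await OS.File.read(srcPrefsPath, { encoding: "utf-8" });
  // "app.shield.optoutstudies.enabled" is an assumption about which pref
  // controls Shield study enrollment.
  prefsText += '\nuser_pref("app.shield.optoutstudies.enabled", false);\n';
  await OS.File.writeAtomic(destPrefsPath, prefsText, { encoding: "utf-8" });
}
```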

Conclusion

There's already an intermittent failure bug on file for this test. Myk ran my patch locally several times and couldn't get the failure to reproduce. He also noticed this failure only occurs with my patch on opt builds, not on debug builds, which makes it much harder to track down.

The other test failure in this group, about message counts, is from a totally different test that happens to be grouped with it. Myk tested this one locally against my patch, and it passes repeatedly, so this is another intermittent failure.

After talking with rhelmer about this test, we agree it could be a bug in ShieldUtils.jsm, but since the test was intermittent against the same patch, it could also be a bug in the test itself. The risk here is that if a user is using Firefox in headless mode and taking screenshots where Firefox is only open for fractions of a second at a time while enrolled in this study, the study may not properly shut down.

What we're going to do about it

biancadanforth commented 6 years ago

Update - Next Step: Bisect the study to find why the bc2 test (browser/base/content/test/plugins/browser_pluginnotification.js) is failing. Failing that, ask dothayer@ or fgomes@.

TL;DR: The test first starts failing (intermittently) at this line when I add the WebRequest.onBeforeRequest listener. Since the failure is intermittent and the feature works with the addon installed under "normal" conditions, @rhelmer does not see this as a blocker. See below for a more detailed explanation.


I bisected the code and found where my code starts to fail in PR #94. Go there for more details.

I talked with dothayer@, and he suspects the problem is a race condition. To show the flash plugin notification, there must be a "plugin binding attached" event. Currently the logic in this test trusts that this event will fire before the content task (the line just before the failure) returns, but it's not a given. It's possible that with an onBeforeRequest listener, this event (which may require that the request for flash comes through) is not always firing before that content task returns.

He suggested I try adding the line:

await new Promise(resolve => executeSoon(resolve));

just after the content task above to try to process any events in the queue before attempting to get the notification element. Unfortunately, I was still seeing intermittent failures -- i.e. the "plugin binding attached" event has not always taken place before the content task returns.

After speaking with rhelmer, we have opted to note the failure but not treat it as a blocker: this is an intermittent failure likely due to a race condition specific to the test, not a condition we will see when the addon is deployed (users do not generally load several Flash pages within a fraction of a second), and the Flash plugin notification feature does work correctly with the addon installed under these more "normal" conditions.

biancadanforth commented 6 years ago

Final testing in Treeherder - Analysis:

Action items:

  1. Add manual checks to TESTPLAN.md for QA to
    • Manually test that about:home/about:newtab works and is not super slow.
    • Manually test that the SSL indicator turns on when a page loads from about:newtab.
    • @rhelmer, can you confirm this indicator is the green lock icon and that it should appear as you load a page from about:newtab?
    • Manually test that the zoom button works.
  2. Check Perfherder results for performance regressions.
    • Wondering if turning off browser.newtab.preload will cause any significant regressions.
  3. Look at Treeherder results for Release and see which of these failures could be flagged as intermittent/flaky/also failing.
  4. (Optional) Re-run some unit tests locally:
    • Run the following tests locally with and without browser.newtab.preload turned back on in the study to see if they pass: (bc3, Linux x64 opt) dom/base/test/browser_aboutnewtab_process_selection.js; (bc6, Linux x64 Stylo Disabled Debug) browser/base/content/test/urlbar/browser_urlbarAboutHomeLoading.js; (bc3, OSX 10.10 opt) browser/base/content/test/newtab/browser_newtab_background_captures.js

Addendum:

I was having trouble with artifact builds both locally and remotely; the Treeherder results below confirm there is a bug in artifact builds currently. I have discussed this with nalexander in IRC, so he is aware. While I could not do an artifact build locally and had intermittent build failures for artifact builds on the try server, I could successfully create a non-artifact build locally and all non-artifact builds with my patch were successful on the try server.

_Release channel, with study (v 1.0.3), artifact build_

Since today is merge day, per nalexander there is a bug in the build system for artifact builds (building locally or using a try build). I had intermittent build failures for ./mach try -p linux64,win64,macosx64 -u all -t all --artifact with an artifact try build. Here are the Treeherder results for an artifact build with my study as a patch.

_Release channel, no study, artifact build_

Also, per rhelmer's recommendation, I ran an artifact build try run without my patch on release with ./mach try -p linux64,win64,macosx64 -u all -t all --artifact. Here are the Treeherder results for an artifact build without my study.

biancadanforth commented 6 years ago

I discovered why I couldn't view any Perfherder results for my patch in Treeherder, thanks to help from osmose:

Background: I ran my study as a patch against Firefox Release (59) as a full, non-artifact build with the following command: ./mach try -p linux64,win64,macosx64 -u all -t all. This ran all performance tests once. I am seeing Perfherder results of my patch compared to mozilla-release as having no baseline, and Perfherder results of my patch compared to mozilla-central as having no overlap in which tests were performed.

What's going on?

Why I'm not seeing ANY baseline data for mozilla-release: The commit that was at the tip of the Release branch when I applied my patch has a Treeherder run in which no Talos tests were performed. mozilla-release is whatever branch the current release version of Firefox is based off of, so if I try to compare my Talos tests to mozilla-release, I should expect to see nothing. Per osmose, a Release commit that ran Talos tests doesn't exist by default.

Why I'm not seeing any overlapping test results compared to mozilla-central: Firstly, comparing my patch on Release to mozilla-central isn't a good idea:

the release branch probably has a bunch of changes to which tests are run. It looks like all the tests that central has and mine don't have "stylo" in their names. Also, looking at the baseline tests, they have error ranges (e.g. +/- 0.5%), while mine do not. That's because the baseline tests were run several times, whereas mine were only run once each; running tests multiple times has to be made explicit in the try syntax.

What to do about it? To have comparable Talos test results, I need a Try server run against Release that runs all the same tests as my patch does, multiple times, to get average values and a sense of the repeatability/reliability of the results.

  1. Push the parent commit of Release to the Try server running all TALOS tests at least 5 times.
  2. Push my patch added to that commit of Release to the Try server running all TALOS tests at least 5 times.

    If you go to https://mozilla-releng.net/trychooser/ and select "Both" for the build type and "linux64" for the platform, you get the following try syntax:

    ./mach try -b do -p linux64 -u none -t all

    Then, if you check the box on the right that says "To compare Talos numbers we need 5 retriggers", it adds --rebuild-talos:

    ./mach try -b do -p linux64 -u none -t all --rebuild-talos 5

    Those numbers will then be comparable.

biancadanforth commented 6 years ago

Action Item: Look at Treeherder results for Release and see which of these failures could be flagged as intermittent/flaky/also failing.

Compared to the Treeherder results for the tip of Release on top of which I submitted my study as a patch, my study's Treeherder results have no failed tests in common. This means that these failures I am seeing are either intermittent, or the result of my patch.

I basically Cmd+F searched my summary document of failed tests from my patch for each of the 11 failed tests from Release.

biancadanforth commented 6 years ago

Action Item: Check Perfherder results for performance regressions.

Perfherder results compared to the tip of Release

While my final patch is significantly slower than my mid-development patch, both sets of Perfherder results showed average performance regressions of 10-15% compared to the tip of either Nightly or Release without the patch.

See this spreadsheet (Sheet 2) for a comparison of performance tests from mid-development to final development. One reason why the final development results are considerably slower in an absolute sense than mid-development (beyond the added complexity in general) is that I turned off browser.newtab.preload so I could accurately measure the time at which the new tab page is opened.

One thing I could check is whether turning browser.newtab.preload back on and re-running the same tests dramatically reduces the regressions.
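
For reference, the pref flip itself is simple. A sketch of the toggle-and-restore pattern, assuming the study does this via Services.prefs (the actual study code may differ):

```js
// Sketch only: disable new tab preloading for the study and restore the
// user's previous state on cleanup.
const { utils: Cu } = Components;
Cu.import("resource://gre/modules/Services.jsm");

const PRELOAD_PREF = "browser.newtab.preload";
const hadUserValue = Services.prefs.prefHasUserValue(PRELOAD_PREF);
const previousValue = Services.prefs.getBoolPref(PRELOAD_PREF, true);

function disableNewTabPreload() {
  Services.prefs.setBoolPref(PRELOAD_PREF, false);
}

function restoreNewTabPreload() {
  if (hadUserValue) {
    Services.prefs.setBoolPref(PRELOAD_PREF, previousValue);
  } else {
    Services.prefs.clearUserPref(PRELOAD_PREF);
  }
}
```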

biancadanforth commented 6 years ago

Action Item: Update TESTPLAN.md

PR #188 adds 3 new manual checks for QA, to ensure that none of the unit test failures we were seeing has actually caused related functionality in Firefox to stop working.

biancadanforth commented 6 years ago

Okay, per @rhelmer, the performance regressions are acceptable, and as long as QA verifies that the zoom button, SSL indicator and new tab page function as normal with my add-on installed, we have Engineering sign-off.

biancadanforth commented 6 years ago

Performance Regression Testing Summary for Posterity

Aside: We had about 600,000 people in the Tracking Protection Messaging study: 75k people per branch, for each of 4 branches, across two distributions (one to new users; one to all users).

The only significant, repeatable performance regressions I had for the Tracking Protection Messaging study (using v4 StudyUtils) were largely in the 10-15% range compared to Release without my patch. If you are seeing higher regressions than that, then I don’t think StudyUtils would be the primary cause.

I attribute my remaining performance regressions to implementation details of the study (turning off new tab preload, listening on every network request, having a less optimized implementation of Tracking Protection…).

The only performance improvement I made was to delay execution of my study-specific code during Firefox startup.
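
For anyone curious what that delay looked like in practice, here is a rough sketch of the pattern in a bootstrapped add-on. The observer topic and the initializeStudy() name are illustrative only; the study's actual code differs in the details.

```js
// Sketch only: defer study-specific work until the browser session has been
// restored, rather than doing it synchronously in the add-on's startup().
const { utils: Cu } = Components;
Cu.import("resource://gre/modules/Services.jsm");

const BROWSER_READY_TOPIC = "sessionstore-windows-restored";

function startup(data, reason) {
  // Register an observer instead of initializing immediately, so the add-on
  // adds as little work as possible to Firefox's startup path.
  Services.obs.addObserver(onBrowserReady, BROWSER_READY_TOPIC);
}

function onBrowserReady() {
  Services.obs.removeObserver(onBrowserReady, BROWSER_READY_TOPIC);
  initializeStudy(); // hypothetical entry point for the study-specific code
}
```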