Closed by biancadanforth 6 years ago
With @rhelmer's help, I was able to test this shield study on the try server (Nightly only).
The first test ran the study as-is, but we discovered that was not going to work, and the build was cancelled due to:

Addon Configuration

The add-on was run through the try server configured to:

What tests were run
./mach try -p linux64,win64,macosx64 -u all -t all --artifact
This runs all unit tests and perf (TALOS) tests on all desktop platforms.
The results of the test can be found in Treeherder.
Performance Tests

Results of our patch compared to mozilla-central (Nightly):

- sessionrestore tests
- ts_paint tests
Non-Performance Tests
The function references passed to addEventListener and removeEventListener must point to the same function object. It looks like the bind method creates a new function reference each time it is called.

Note: Any tests of interest that failed in the try push can be run individually on my local machine without having to spin up cycles on the try server. Ex: ./mach test browser/components/sessionstore/test/browser_394759_behavior.js for MochiTests.
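The bind pitfall mentioned above can be demonstrated in a few lines of plain JavaScript:

```javascript
// addEventListener/removeEventListener only unregister a handler when
// passed the *same* function reference that was registered.
const handler = {
  name: "study",
  onEvent() { return this.name; },
};

// Each call to bind() creates a brand-new function object:
const a = handler.onEvent.bind(handler);
const b = handler.onEvent.bind(handler);
console.log(a === b); // false: two distinct references

// So this pair silently fails to remove anything:
//   el.addEventListener("click", handler.onEvent.bind(handler));
//   el.removeEventListener("click", handler.onEvent.bind(handler));

// Instead, bind once and reuse the same reference for add and remove:
const boundHandler = handler.onEvent.bind(handler);
//   el.addEventListener("click", boundHandler);
//   el.removeEventListener("click", boundHandler); // same reference: works
```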
Follow up tests in Nightly

- Use the sessionstore-windows-restored observer instead of browser-delayed-startup-finished for adding the Feature in bootstrap.js, to see if TALOS results are improved (Issue #73).
- Understand Responsiveness (Performance) Tests
Test on the Release branch

Since this study will be for Firefox release, we attempted to run the same set of tests on the Release branch as in Nightly above, again using artifact builds. Unfortunately, the build steps for each test failed before even reaching the test itself, so the tests were cancelled.
NOTE: rhelmer recommends testing in Nightly and locally with artifact builds as much as possible first, since that is faster/more cost-effective.
Update for Next Step: Understand Responsiveness (Performance) Tests
I spoke with jmaher, here’s what he said:
On what Responsiveness tests measure
On whether our result is an outlier
On whether the result is significant
@rhelmer, what do you think we should do now that we know what this test measures and the extent of the regression?
My thinking is that this regression is largely due to an inefficient re-implementation of Tracking Protection. We might be able to reduce the extent of the regression by memoizing blocked third parties in WebRequest.onBeforeRequest, but I'm not sure how big a dent that will make, or whether it is worth doing given our other priorities.
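Such memoization might look like the sketch below. This is hypothetical, not the study's actual code; the `classify` callback stands in for whatever expensive blocked-third-party check the add-on performs:

```javascript
// Hypothetical sketch: cache the per-hostname "is this a blocked third
// party?" verdict so the expensive classification runs at most once per
// host, rather than once per network request.
const blockedHostCache = new Map();

function isBlockedThirdParty(hostname, classify) {
  if (!blockedHostCache.has(hostname)) {
    blockedHostCache.set(hostname, classify(hostname));
  }
  return blockedHostCache.get(hostname);
}

// In the add-on, this check would sit inside the WebRequest listener,
// roughly (illustrative only):
//   WebRequest.onBeforeRequest.addListener(details => {
//     const host = new URL(details.url).hostname;
//     if (isBlockedThirdParty(host, classifier)) { /* block/count */ }
//   }, ...);
```

The trade-off is that the cache must be invalidated if the block list can change while the study is running.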
> @rhelmer, what do you think we should do now that we know what this test measures and the extent of the regression? My thinking is that this regression is largely due to an inefficient re-implementation of Tracking Protection. We might be able to reduce the extent of regression by memoizing blocked third parties in WebRequest.onBeforeRequest, but I'm not sure how big a dent that will make, and if that is worth doing given our other priorities.
Yes I agree... the important thing is to note the regressions and what we believe is causing them. It seems unrealistic to expect 0 regressions when this is implemented as an add-on vs. as a built-in feature.
Update for Next Step: Follow Up Tests in Nightly
- Performance regressions on ts_paint, sessionrestore and responsiveness
- Non-performance "leaked X windows" unit test failures
This patch and this patch were intended to address the memory leak errors from many of the unit tests. Unfortunately, the Treeherder results for unit tests did not improve, so I asked _6a68 (jhirsch), per rhelmer's suggestion. Here's what he suggested:
Click the log link next to the error message. The log shows the stack trace of the error, so you can see where in the test it failed. If nothing near that point in the test relates to the error, it might be that you violated a previous test's post-condition that this test did not itself check. This is the case for the "leaked X windows" errors.
_6a68's recommendations:

universalSearch is a global that I used to hold all the libraries I loaded into the window. I Cu.import everything onto that win.universalSearch global: https://github.com/mozilla/universal-search/blob/fb3ae049d290f56ea7f3ab7a857273376d9232d8/lib/universal-search.js#L175. Just make sure never to unload Firefox JSMs like Console.jsm; that's a global unload for all of Firefox!

Update for Next Step: Test in Release
Issue #88 is attempting to fix the window leak errors from Try server unit tests, which are unfortunately a blocker for shipping this study.
I did manage to run the exact same set of tests against the exact same patch in the release branch of Firefox (up to this point these tests had been my patch against mozilla-central, aka Nightly). Here's the process for doing that for posterity.
Here are the Treeherder results for my patch (as of this commit in PR #89) in mozilla-central: 45 failed unit tests.

Here are the Treeherder results for my patch (same as above) in release: 43 failed unit tests.
The good news: (master branch, Treeherder).

The bad news:
Update: All memory leaks fixed with PR #89. Let's hope it stays that way. Here are the Treeherder results against release for that patch, compared to before. See #92 for the kinds of fixes that led to 0 failed tests for "leaked X windows".
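The kinds of fixes in #92 generally amount to making sure every window a test opens is closed (and dereferenced) at teardown, even when an assertion fails. A self-contained sketch of the harness's leak check and the fix pattern, with a simplified stand-in for the real Mochitest harness (all names here are hypothetical):

```javascript
// Simplified stand-in for the harness's window-leak check: it tracks
// windows opened during a test and fails if any remain open afterward.
const openWindows = new Set();

function openTestWindow() {
  const win = { close() { openWindows.delete(win); } };
  openWindows.add(win);
  return win;
}

function runTest(testFn) {
  testFn();
  // Post-condition checked after every test:
  if (openWindows.size > 0) {
    throw new Error(`leaked ${openWindows.size} windows`);
  }
}

// A leaky test forgets to close its window. The fix: always close in
// a finally block so cleanup happens even if the test body throws.
runTest(() => {
  const win = openTestWindow();
  try {
    // ...exercise the feature under test...
  } finally {
    win.close();
  }
});
```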
Next step: What are these other 22 non-memory leak unit tests that are failing? Do we need to be concerned about them?
Update on Next Step: What are these other 22 non-memory leak unit tests that are failing?
Note: the alphanumeric designation for each test varies each Try run for each build of Firefox (ex: Linux64 opt versus Linux64 debug, etc.), so the same test could have a different number (e.g. bc4 versus bc10) for two different builds. Some of the same tests are failing across builds. This list reflects all the unique tests that are failing.
UNEXPECTED FAILURES
test_headless_screenshot.html
browser_pluginnotification.js
EXPECTED FAILURES
browser_all_files_referenced.js
browser_PageActions.js
browser/extensions/tracking-protection-messaging-study/test/browser/browser_tracking_protection_study_ui_check.js
http://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central/firefox-58.0.3.en-US.linux-x86_64.tar.bz2, which is located in mozilla-central, not in release.

/builds/worker/workspace/build/src/layout/base/nsAutoLayoutPhase.cpp
test_ext_startup_cache.js
test_ext_webrequest_basic.html
Next steps
Update: Next step - Run a full (i.e. non-artifact) build on the try server against the release channel for this patch.
Full build Treeherder versus Artifact build Treeherder for changes including PR #89.
I learned that my local build of Firefox saves changes I make in its own user profile, so when I had previously toggled Tracking Protection to "always" on, that affected subsequent ./mach builds and possibly Try testing. I have therefore re-run the original artifact build tests (the artifact link above in this comment). More specifically: there is a temporary profile used for ./mach run that is cleared out every time you ./mach clobber. ./mach run creates a new profile if one does not exist, and it goes into ./${OBJDIR}/tmp/scratch_user/. Local testing and try server testing use a new profile each time.
Comparison:
Both builds had all the same bc test failures listed in the previous comment.

- L10N:
- X6:
- 5:
- enUS:
Update: Next step - Reach out to myk@ to ask about the C1 test.
Test summary per myk@
This test opens headless Firefox. As soon as Firefox is ready to browse to a page and as soon as the page is loaded, it takes a picture of the page and then immediately closes Firefox.
Whereas most tests start with a single Firefox window (with a head!) open to about:blank, this test actually starts Firefox in headless mode itself.
myk@'s thoughts on what could be causing the test to time out
Headless screenshots as a feature have historically seen quite a lot of race conditions on startup, as Firefox is not designed to start up and quit very quickly. We actually delay some startup functionality so we can show a page to the user as quickly as possible. Parts of Firefox’s initialization may fail because it doesn’t expect Firefox to quit so soon on startup.
My hypothesis is that there are async calls in your bootstrap.js startup method that haven't returned before the test finishes and closes Firefox.
An alternative hypothesis is: Is Firefox quitting at all? Is something blocking it from closing, because it's still in the process of being initialized? Or if there's something that's holding open a file in that Profile directory at the time Firefox quits?
My guess is that there is async telemetry running or some async setup aspect to the study itself that may not have returned when headless Firefox is closing.
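The suspected race can be sketched in plain JavaScript. The names here are illustrative, not the study's actual bootstrap.js code:

```javascript
// Sketch of the suspected race: startup() kicks off async work that
// nothing awaits, so a near-immediate quit (as in the headless
// screenshot test) can interrupt half-finished setup.
let pendingSetup = null;
let setupDone = false;

async function initStudy() {
  // Stand-in for async telemetry init, pref reads, listener setup:
  await Promise.resolve();
  setupDone = true;
}

function startup() {
  // Fire-and-forget: the returned promise may still be pending when
  // Firefox begins shutting down.
  pendingSetup = initStudy();
}

async function shutdown() {
  // Awaiting (or cancelling) in-flight setup before teardown avoids
  // tearing down state that is still being initialized.
  await pendingSetup;
  // ...teardown...
}
```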
One thing you could try is to disable Shield studies when in headless mode to see if the test passes. If this is controlled by a pref, then you could add that pref to the profile the test uses. Mochitest creates its own profile for these unit tests, but that may be too upstream, since it will affect all other tests. You could edit the copy pulled from mochitest by editing this test directly at or before this line: read the mochitest profile file using OS.File, add the pref to it, and write it back out to the new location.
Conclusion
There's already an intermittent-failure bug on file for this test. Myk ran my patch locally several times and couldn't get the failure to reproduce. He also noticed this failure only occurs with my patch on opt builds, not on debug builds, which makes it much harder to track down.

The other test failure in this group, about message counts, is a totally different test that happened to be grouped with it. Myk tested it locally against my patch, and it passes repeatedly, so that one is another intermittent bug.
After talking with rhelmer about this test, we agree it could be a bug in ShieldUtils.jsm, but since the test was intermittent against the same patch, it could also be a bug in the test itself. The risk here is that if a user is using Firefox in headless mode and taking screenshots where Firefox is only open for fractions of a second at a time while enrolled in this study, the study may not properly shut down.
What we're going to do about it
/Applications/FirefoxNightly.app/Contents/MacOS/firefox-bin -headless -screenshot https://developer.mozilla.com
works with and without my study installed (with an experimental branch running).

Update - Next Step: Bisect the study to find why the bc2 test (browser/base/content/test/plugins/browser_pluginnotification.js) is failing. Failing that, ask dothayer@ or fgomes@.
TL;DR: The test first starts failing (intermittently) at this line when I add the WebRequest.onBeforeRequest listener. Since the failure is intermittent and the feature works with the add-on installed under "normal" conditions, @rhelmer does not see this as a blocker. See below for a more detailed explanation.
I bisected the code and found where my code starts to fail in PR #94; go there for more details.
I talked with dothayer@, and he suspects the problem is a race condition. To show the Flash plugin notification, there must be a "plugin binding attached" event. Currently the test's logic trusts that this event will fire before the content task (the line just before the failure) returns, but that's not a given. It's possible that with an onBeforeRequest listener, this event (which may require that the request for Flash comes through) is not always firing before that content task returns.
He suggested I try adding the line:
await new Promise(resolve => executeSoon(resolve));
just after the content task above, to process any events in the queue before attempting to get the notification element. Unfortunately, I was still seeing intermittent failures; i.e. the "plugin binding attached" event has not always fired before the content task returns.
After speaking with rhelmer, given that this is an intermittent failure likely due to a race condition specific to the test and not a condition we will see when the addon is deployed (users do not generally load several flash pages within a fraction of a second), and given that the flash plugin notification feature does work correctly with the addon installed under these more "normal" conditions, we have opted to note the failure, but not treat it as a blocker.
./mach try -p linux64,win64,macosx64 -u all -t all
(all desktop platforms, all unit tests, all performance tests).
Action items:

- Check whether preloading about:newtab (browser.newtab.preload) will cause any significant regressions.
- Re-run the following failed tests with browser.newtab.preload turned back on in the study to see if they pass:
  - (bc3, Linux x64 opt) dom/base/test/browser_aboutnewtab_process_selection.js
  - (bc6, Linux x64 Stylo Disabled Debug) browser/base/content/test/urlbar/browser_urlbarAboutHomeLoading.js
  - (bc3, OSX 10.10 opt) browser/base/content/test/newtab/browser_newtab_background_captures.js

Addendum:
I was having trouble with artifact builds both locally and remotely; the Treeherder results below confirm there is a bug in artifact builds currently. I have discussed this with nalexander in IRC, so he is aware. While I could not do an artifact build locally and had intermittent build failures for artifact builds on the try server, I could successfully create a non-artifact build locally and all non-artifact builds with my patch were successful on the try server.
Release channel, with study (v 1.0.3), artifact build
Since today is merge day, per nalexander there is a bug in the build system for artifact builds (building locally or using a try build). I had intermittent build failures for ./mach try -p linux64,win64,macosx64 -u all -t all --artifact with an artifact try build. Here are the Treeherder results for an artifact build with my study as a patch.
Release channel, no study, artifact build
Also, per rhelmer's recommendation, I ran an artifact build try run without my patch on release with ./mach try -p linux64,win64,macosx64 -u all -t all --artifact. Here are the Treeherder results for an artifact build without my study.
I discovered why I couldn't view any Perfherder results for my patch in Treeherder, thanks to help from osmose:
Background:
I ran my study as a patch against Firefox Release (59) for a full, non-artifact build with ./mach try -p linux64,win64,macosx64 -u all -t all. This ran all performance tests ONCE. I am seeing Perfherder results of my patch compared to mozilla-release as having no baseline, and Perfherder results of my patch compared to mozilla-central as having no overlap in which tests were performed.
What's going on?
Why I'm not seeing ANY baseline data for mozilla-release:

The commit that was at the tip of the Release branch when I applied my patch had a Treeherder run with NO talos tests performed. mozilla-release is whatever branch the current release version of Firefox is based off of, so if I try to compare my TALOS tests to mozilla-release, I should expect to see nothing. Per osmose, a commit for Release that ran TALOS tests doesn't exist by default.
Why I'm not seeing any overlapping test results compared to mozilla-central:

Firstly, comparing my patch on Release to mozilla-central isn't a good idea: the release branch probably has a bunch of changes to which tests are run. It looks like all the tests central has that mine doesn't have "stylo" in their names. Also, looking at the baseline tests, they have error ranges (e.g. +/- 0.5%), while mine do not; that's because the baseline tests were run several times, whereas mine were only run once each. Running tests multiple times has to be made explicit in the try syntax.
What to do about it? To have comparable TALOS test results, I need a Try server run against Release that runs all the same tests as my patch does, and multiple times to get average values and a sense of repeatability/reliability of the test results.
If you go to https://mozilla-releng.net/trychooser/ and select "Both" for the build type and "linux64" for the platform, you get the following try syntax:

./mach try -b do -p linux64 -u none -t all

Then, if you check the box on the right that says "To compare Talos numbers we need 5 retriggers", it adds --rebuild-talos:

./mach try -b do -p linux64 -u none -t all --rebuild-talos 5

Then those numbers will be comparable.
Compared to the Treeherder results for the tip of Release on top of which I submitted my study as a patch, my study's Treeherder results have no failed tests in common. This means that these failures I am seeing are either intermittent, or the result of my patch.
I basically CMD + F searched for each of the 11 failed tests from Release in my summary document for the failed tests in my patch.
Perfherder results compared to the tip of Release
While my final patch is significantly slower than my mid-development patch, both Perfherder results showed average performance regressions of 10-15% compared to the tip of either Nightly or Release without the patch.
See this spreadsheet (Sheet 2) for a comparison of performance tests from mid-development to final development. One reason the final development results are considerably slower in an absolute sense than mid-development (beyond added complexity in general) is that I turned off browser.newtab.preload to be able to accurately measure the time at which the new tab page opens.

One thing I could check is whether turning browser.newtab.preload back on and re-running the same tests dramatically reduces the regressions.
PR #188 adds 3 new tasks for QA to check, to ensure none of the unit test failures we were seeing have actually caused related Firefox functionality to stop working.
Okay, per @rhelmer, the performance regressions are acceptable, and as long as QA verifies that the zoom button, SSL indicator and new tab page function normally with my add-on installed, we have Engineering sign-off.
Aside: We had about 600,000 people in the Tracking Protection Messaging study. 75k people per branch for each of 4 branches for two distributions (one to new users; one to all users).
The only significant, repeatable performance regressions I had for the Tracking Protection Messaging study were largely in the 10-15% range compared to release without my patch using v4 StudyUtils. If you are seeing higher regressions than that, then I don’t think StudyUtils would be the primary cause.
I attribute my remaining performance regressions to implementation details of the study (turning off new tab preload, listening on every network request, having a less optimized implementation of Tracking Protection…).
The only performance improvement I made was to delay execution of my study-specific code during Firefox startup.
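That startup delay (also suggested in the Nightly follow-ups above: waiting for `sessionstore-windows-restored` rather than initializing directly in bootstrap.js startup) can be sketched as below. `Services.obs` is Firefox's observer service; it is mocked minimally here so the sketch is self-contained, and the surrounding names are illustrative, not the study's actual code:

```javascript
// Minimal stand-in for Firefox's nsIObserverService (Services.obs);
// in the add-on you would use the real one.
const Services = {
  obs: {
    observers: new Map(),
    addObserver(observer, topic) { this.observers.set(topic, observer); },
    removeObserver(observer, topic) { this.observers.delete(topic); },
    notifyObservers(subject, topic) {
      const observer = this.observers.get(topic);
      if (observer) observer.observe(subject, topic);
    },
  },
};

let featureStarted = false;

// Delay study-specific work until session restore has finished,
// instead of doing it during add-on startup:
const startupObserver = {
  observe(subject, topic) {
    Services.obs.removeObserver(startupObserver, "sessionstore-windows-restored");
    featureStarted = true; // ...initialize the Feature here...
  },
};
Services.obs.addObserver(startupObserver, "sessionstore-windows-restored");
```

This keeps the study's code off the startup path that ts_paint and sessionrestore measure, which is why it was the one change that improved the TALOS numbers.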
As part of a more rigorous review process for landing shield studies (currently modeled after this), this study should receive an engineering review.
Firefox Engineering has asked that this reviewer be a Firefox Peer.
This is a meta-issue for any engineering-related concerns.
Related issues:
#72 (P2)
#73 (P2, closed)
#75 (P3)
#88 (P1, closed)
Engineering reviewer: rhelmer