Open worldomonation opened 1 year ago
This has been an ongoing issue since the early days of the framework, but it's gotten a lot more frequent in the last few months.
There have been 8 occurrences for the editor/editor__revision.ts
spec alone for the last week:
and 13 cases in the editor__navbar.ts
spec in the same time period:
Enabled the pw:browser
debug level in Playwright and after the same build several times, I came across the crash.
This is the resulting log:
01:29:45
2022-12-03T01:29:45.152Z pw:browser [pid=4174][err] [1203/012945.152220:ERROR:render_process_host_impl.cc(5157)] Terminating render process for bad Mojo message: Received bad user message: Prefetch/IsValidCrossOrigin: frame origin mismatch
01:29:45
2022-12-03T01:29:45.152Z pw:browser [pid=4174][err] [1203/012945.152304:ERROR:bad_message.cc(29)] Terminating renderer for bad IPC message, reason 123
01:29:45
2022-12-03T01:29:45.153Z pw:browser writing frame 1670030985.146939
01:29:45
2022-12-03T01:29:45.153Z pw:browser writing frame 1670030985.148676
01:29:45
2022-12-03T01:29:45.153Z pw:browser writing frame 1670030985.148827
01:29:45
2022-12-03T01:29:45.155Z pw:browser writing frame 1670030985.151879
01:29:45
2022-12-03T01:29:45.158Z pw:browser writing frame 1670030985.151862
01:29:45
2022-12-03T01:29:45.162Z pw:browser writing frame 1670030985.156923
01:29:45
Event for import__site: hook_failure
01:29:45
ERROR: Site Import > beforeAll
01:29:45
01:29:45
page.goto: net::ERR_ABORTED; maybe frame was detached?
01:29:45
=========================== logs ===========================
01:29:45
navigating to "https://container-vigorous-khayyam.calypso.live/", waiting until "load"
01:29:45
============================================================
01:29:45
at TestAccount.authenticate (/home/teamcity-0/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:41:15)
01:29:45
at /home/teamcity-0/buildAgent/work/c4a9d5b38c1dacde/test/e2e/specs/tools/import__site.ts:18:3 {
01:29:45
name: 'Error'
01:29:45
}
Further logs:
01:29:45
Event for import__site: test_start
01:29:45
In test_fn_start, is browser connected: true
01:29:45
Event for import__site: test_fn_start
01:29:45
In test_fn_start, is browser connected: true
01:29:45
Event for import__site: test_done
01:29:45
Event for import__site: test_start
01:29:45
In test_fn_start, is browser connected: true
01:29:45
Event for import__site: test_fn_start
01:29:45
In test_fn_start, is browser connected: true
01:29:45
Event for import__site: test_done
01:29:45
Event for import__site: test_start
01:29:45
In test_fn_start, is browser connected: true
01:29:45
Event for import__site: test_fn_start
01:29:45
In test_fn_start, is browser connected: true
01:29:45
Event for import__site: test_done
01:29:45
Event for import__site: test_start
01:29:45
In test_fn_start, is browser connected: true
01:29:45
Event for import__site: test_fn_start
01:29:45
In test_fn_start, is browser connected: true
01:29:45
Event for import__site: test_done
01:29:45
Event for import__site: test_start
01:29:45
In test_fn_start, is browser connected: true
01:29:45
Event for import__site: test_fn_start
01:29:45
In test_fn_start, is browser connected: true
01:29:45
Event for import__site: test_done
01:29:45
Event for import__site: test_start
01:29:45
In test_fn_start, is browser connected: true
01:29:45
Event for import__site: test_fn_start
01:29:45
In test_fn_start, is browser connected: true
01:29:45
Event for import__site: test_done
01:29:45
Event for import__site: run_describe_finish
01:29:45
Event for import__site: run_describe_finish
Searching the web for the error messages don't lead to much, but here are some possibly related issues:
I wonder if this line is related.
I have a potential fix in #70970. After merging them, I will have to monitor for a bit.
Reopening to monitor.
Looks like #70970 did not really have the intended effect.
For example, in the support__popover-invalid
spec alone, we have seen 4 new instances of browser crashing:
Even today (December 13, 2022) there has been multiple occurrences of this across a variety of specs.
@dpasque I think it's time we try increasing the shm allocation for docker. I'm going to spin up a new PR for that.
Let's see if https://github.com/Automattic/wp-calypso/pull/71356 helps - all failure we've had today (2022/12/21) have been this failure.
Incidents of this issue have dropped off quite significantly, but still occurs occasionally.
Looking at the trend, this issue appears to occur with media related tests.
Still happening occasionally after increasing the shared memory to 4GB, albeit at a rate lower than it used to at 3GB.
Just adding what the newest trace stacks look like:
page.screenshot: Target crashed
=========================== logs ===========================
taking page screenshot
============================================================
page.screenshot: Target crashed
=========================== logs ===========================
taking page screenshot
============================================================
at JestEnvironmentPlaywright.teardown (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/jest-playwright-config/environment.ts:165:22)
at runTestInternal (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/runTest.js:461:5)
at runTest (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/runTest.js:475:34)
at Object.worker (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/testWorker.js:133:12)
I've been playing around with this on my 🎩 rotation, and I think I have some next steps to try...
Adding some more logs!
Got it to fail, here is what the console logging showed:
2023-04-07T16:06:28.554Z pw:browser <launching> /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/playwright-core/.local-browsers/chromium-1055/chrome-linux/chrome --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --disable-sync --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=/home/teamcity-2/buildAgent/temp/buildTmp/playwright_chromiumdev_profile-ae3r7K --remote-debugging-pipe --no-startup-window
2023-04-07T16:06:28.573Z pw:browser <launched> pid=3191
2023-04-07T16:06:28.757Z pw:browser [pid=3191][err] [0407/160628.755709:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.837Z pw:browser [pid=3191][err] [0407/160628.836220:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.837Z pw:browser [pid=3191][err] [0407/160628.800476:WARNING:sandbox_linux.cc(393)] InitializeSandbox() called with multiple threads in process gpu-process.
2023-04-07T16:06:28.838Z pw:browser [pid=3191][err] [0407/160628.836346:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.844Z pw:browser [pid=3191][err] [0407/160628.840861:WARNING:bluez_dbus_manager.cc(247)] Floss manager not present, cannot set Floss enable/disable.
2023-04-07T16:06:33.693Z pw:browser <launching> /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/playwright-core/.local-browsers/ffmpeg-1008/ffmpeg-linux -loglevel error -f image2pipe -avioflags direct -fpsprobesize 0 -probesize 32 -analyzeduration 0 -c:v mjpeg -i - -y -an -r 25 -c:v vp8 -qmin 0 -qmax 50 -crf 8 -deadline realtime -speed 8 -b:v 1M -threads 1 -vf pad=800:450:0:0:gray,crop=800:450:0:0 /home/teamcity-2/buildAgent/temp/buildTmp/431fea4297650193e2792c6b1385ee6c.webm
2023-04-07T16:06:33.770Z pw:browser <launched> pid=4686
console.log
TestAccount:gutenbergSimpleSiteUser => Logging in via Login Page
at TestAccount.log (../../packages/calypso-e2e/src/lib/test-account.ts:205:12)
2023-04-07T16:06:34.383Z pw:browser [pid=3191][err] [0407/160634.383470:ERROR:render_process_host_impl.cc(5252)] Terminating render process for bad Mojo message: Received bad user message: Prefetch/IsValidCrossOrigin: frame origin mismatch
2023-04-07T16:06:34.384Z pw:browser [pid=3191][err] [0407/160634.383549:ERROR:bad_message.cc(29)] Terminating renderer for bad IPC message, reason 123
ERROR: Theme: Preview And Activate > beforeAll
page.goto: net::ERR_ABORTED; maybe frame was detached?
=========================== logs ===========================
navigating to "https://container-flamboyant-hoover.calypso.live/log-in", waiting until "load"
============================================================
at LoginPage.visit (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/pages/login-page.ts:34:26)
at TestAccount.logInViaLoginPage (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:64:3)
at TestAccount.authenticate (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:47:4)
at /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/test/e2e/specs/appearance/theme__details-preview-activate.ts:38:3 {
name: 'Error'
}
Beginning page.screenshot()...
2023-04-07T16:06:34.582Z pw:browser [pid=3191] <gracefully close start>
2023-04-07T16:06:34.600Z pw:browser [pid=4686] <gracefully close start>
2023-04-07T16:06:34.617Z pw:browser Closing stdin...
2023-04-07T16:06:34.617Z pw:browser ffmpeg finished input.
2023-04-07T16:06:34.658Z pw:browser [pid=3191] <process did exit: exitCode=0, signal=null>
2023-04-07T16:06:34.658Z pw:browser [pid=3191] starting temporary directories cleanup
2023-04-07T16:06:34.668Z pw:browser [pid=3191] finished temporary directories cleanup
2023-04-07T16:06:34.668Z pw:browser [pid=3191] <gracefully close end>
2023-04-07T16:06:34.796Z pw:browser [pid=4686] <process did exit: exitCode=0, signal=null>
2023-04-07T16:06:34.796Z pw:browser ffmpeg onkill exitCode=0 signal=null
2023-04-07T16:06:34.796Z pw:browser [pid=4686] starting temporary directories cleanup
2023-04-07T16:06:34.796Z pw:browser [pid=4686] finished temporary directories cleanup
2023-04-07T16:06:34.796Z pw:browser [pid=4686] <gracefully close end>
And here is what the Playwright API logs showed:
2023-04-07T16:06:33.631Z 946 api info: => browser.newPage started
2023-04-07T16:06:33.659Z 946 api info: <= browser.newPage succeeded
2023-04-07T16:06:33.661Z 946 api info: => browser.newPage started
2023-04-07T16:06:33.805Z 946 api info: <= browser.newPage succeeded
2023-04-07T16:06:33.807Z 946 api info: => page.on started
2023-04-07T16:06:33.808Z 946 api info: => tracing.start started
2023-04-07T16:06:33.812Z 946 api info: <= page.on succeeded
2023-04-07T16:06:33.901Z 946 api info: <= tracing.start succeeded
2023-04-07T16:06:33.923Z 946 api info: => browserContext.clearCookies started
2023-04-07T16:06:33.929Z 946 api info: <= browserContext.clearCookies succeeded
2023-04-07T16:06:33.991Z 946 api info: => page.route started
2023-04-07T16:06:33.995Z 946 api info: <= page.route succeeded
2023-04-07T16:06:33.996Z 946 api info: => page.goto started
2023-04-07T16:06:34.406Z 946 api info: <= page.goto failed
2023-04-07T16:06:34.470Z 946 api info: => tracing.stop started
2023-04-07T16:06:34.573Z 946 api info: <= tracing.stop succeeded
2023-04-07T16:06:34.576Z 946 api info: => page.screenshot started
2023-04-07T16:06:34.580Z 946 api info: <= page.screenshot failed
2023-04-07T16:06:34.581Z 946 api info: => browser.close started
2023-04-07T16:06:34.668Z 946 api info: <= browser.close succeeded
For comparison, here is a console log from a healthy executing test:
2023-04-07T16:06:28.476Z pw:browser <launching> /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/playwright-core/.local-browsers/chromium-1055/chrome-linux/chrome --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --disable-sync --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=/home/teamcity-2/buildAgent/temp/buildTmp/playwright_chromiumdev_profile-6bb9AZ --remote-debugging-pipe --no-startup-window
2023-04-07T16:06:28.511Z pw:browser <launched> pid=3045
2023-04-07T16:06:28.741Z pw:browser [pid=3045][err] [0407/160628.740159:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.741Z pw:browser [pid=3045][err] [0407/160628.740703:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.742Z pw:browser [pid=3045][err] [0407/160628.740781:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.854Z pw:browser [pid=3045][err] [0407/160628.854251:WARNING:bluez_dbus_manager.cc(247)] Floss manager not present, cannot set Floss enable/disable.
2023-04-07T16:06:33.657Z pw:browser [pid=3045][err] [0407/160628.886265:WARNING:sandbox_linux.cc(393)] InitializeSandbox() called with multiple threads in process gpu-process.
2023-04-07T16:06:33.686Z pw:browser <launching> /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/playwright-core/.local-browsers/ffmpeg-1008/ffmpeg-linux -loglevel error -f image2pipe -avioflags direct -fpsprobesize 0 -probesize 32 -analyzeduration 0 -c:v mjpeg -i - -y -an -r 25 -c:v vp8 -qmin 0 -qmax 50 -crf 8 -deadline realtime -speed 8 -b:v 1M -threads 1 -vf pad=800:450:0:0:gray,crop=800:450:0:0 /home/teamcity-2/buildAgent/temp/buildTmp/cae3e409edeb015dcf966d7ca4024c9c.webm
2023-04-07T16:06:33.717Z pw:browser <launched> pid=4575
console.log
TestAccount:siteEditorSimpleSiteUser => Logging in via Login Page
at TestAccount.log (../../packages/calypso-e2e/src/lib/test-account.ts:205:12)
2023-04-07T16:07:02.556Z pw:browser [pid=3045] <gracefully close start>
2023-04-07T16:07:02.560Z pw:browser [pid=4575] <gracefully close start>
2023-04-07T16:07:02.560Z pw:browser Closing stdin...
2023-04-07T16:07:02.560Z pw:browser ffmpeg finished input.
2023-04-07T16:07:02.578Z pw:browser [pid=4575] <process did exit: exitCode=0, signal=null>
2023-04-07T16:07:02.578Z pw:browser ffmpeg onkill exitCode=0 signal=null
2023-04-07T16:07:02.578Z pw:browser [pid=4575] starting temporary directories cleanup
2023-04-07T16:07:02.579Z pw:browser [pid=4575] finished temporary directories cleanup
2023-04-07T16:07:02.579Z pw:browser [pid=4575] <gracefully close end>
2023-04-07T16:07:02.588Z pw:browser [pid=3045] <process did exit: exitCode=0, signal=null>
2023-04-07T16:07:02.588Z pw:browser [pid=3045] starting temporary directories cleanup
2023-04-07T16:07:02.627Z pw:browser [pid=3045] finished temporary directories cleanup
2023-04-07T16:07:02.627Z pw:browser [pid=3045] <gracefully close end>
So comparing the two, this seems to be the real error:
2023-04-07T16:06:34.383Z pw:browser [pid=3191][err] [0407/160634.383470:ERROR:render_process_host_impl.cc(5252)] Terminating render process for bad Mojo message: Received bad user message: Prefetch/IsValidCrossOrigin: frame origin mismatch
2023-04-07T16:06:34.384Z pw:browser [pid=3191][err] [0407/160634.383549:ERROR:bad_message.cc(29)] Terminating renderer for bad IPC message, reason 123
ERROR: Theme: Preview And Activate > beforeAll
page.goto: net::ERR_ABORTED; maybe frame was detached?
=========================== logs ===========================
navigating to "https://container-flamboyant-hoover.calypso.live/log-in", waiting until "load"
============================================================
at LoginPage.visit (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/pages/login-page.ts:34:26)
at TestAccount.logInViaLoginPage (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:64:3)
at TestAccount.authenticate (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:47:4)
at /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/test/e2e/specs/appearance/theme__details-preview-activate.ts:38:3 {
name: 'Error'
}
So, this confirms my suspicion that the screenshot taking process is not actually the root issue! It's just exposing a page
that is in a bad state from an earlier error.
So, I'm not entirely sure what these errors mean haha... But here's where the error is thrown in the Chromium source code:
Gonna keep digging
So, I'm not entirely sure how the IPC / Mojo communication (which appears to be how Chromium processes talk to each other) plays a roll...
But this might be suggesting that some iframe on our login page is violating our frame security (same-origin) policy? But this only happens very rarely? I'm not sure what that might be 🤔
This is the only iframe I'm seeing on the login page:
https://public-api.wordpress.com/wp-admin/rest-proxy/?v=2.0#https://container-flamboyant-hoover.calypso.live
Going back through a bunch of recent versions of this crash, I actually was able to find in the Playwright traces that they are all similarly crashes when trying to do a navigation, often the same ERR_ABORTED
error, sometimes it manifests in other ways. It always seems to happen in the beforeAll
hooks of scripts too.
So, pending discovering what it causing some initial navigations to fail in these specific conditions, I wonder if we can get around this with just some more defensive coding.
For one, we should be using try/catches in our teardown code to catch any failed page operations. This way, we'll make sure we are actually seeing the real errors that crash the page, not the errors thrown by trying to act on a failed page.
Secondly, we may want to wrap our initial authentication or landing in a couple of retry loops, so that if there's an unexpected crash right on the first navigation, we just try again.
@noahtallen or @WunderBart, I'm wondering if you have any quick insight on the error as outlined in this comment. No worries if not, just curious if that rings any initial bells as you might be more familiar with the Calypso / DotCom security policies than I am 😁
No quick insights unfortunately, but a big fan of capturing as much log info as we can in normal runs!
You're right, the only iframe in that scenario I can think of is the rest proxy. 🤔 Though a side point is that calypso.live really doesn't work well when 3rd party cookies are disabled, just because *.calypso.live is a weird different origin from normal production. (When testing calypso.live, I have to turn off enhanced tracking protection in firefox, for example)
Sounds good, thanks for weighing in @noahtallen! 🙇 You bring up a good point, and I'm also realizing we tend to only see these errors on our TC builds that use Calypso.live. That makes me think that maybe a good next step is adding some retry logic to our initial navigation/login, and see if that smooths things out!
Just catching up on this conversation and adding a few of my observations.
calypso.live
target is true. I don't think I've ever encountered this crash locally (as it hits staging when proxied/production if not proxied) and in the Pre-Release Tests (as it hits wpcalypso.wordpress.com
). On retries, I am not sure where we can add retrie. The browser startup is handled in environment.ts
while the initial navigation to the login page is handled in test-accounts.ts
. Since the actual page object itself crashes upon navigation to the /log-in
endpoint, I'm not sure how to "cross bridges" from POMs to the framework code itself.
👋 Sorry for such a late response, somehow I missed this one.
I'm wondering (and please ignore if you've already checked this scenario) if it's not the same case as https://github.com/microsoft/playwright/issues/18137.
From the explaining comment:
The intent was to authenticate user and redirect to them to /bar.
However, what happens in practice, the polling mechanism of the app detects that the user was > authenticated, and attempts to redirect them to post-login page.
The end result was that sometimes when baz() executes we would be on /home and sometimes on > /baz.
...which seems to be describing our problem as well. The proposed solution looks pretty straightforward:
The reason for adding page.reload() after authenticate() is to ensure that before we are calling goto there are no more outstanding redirects.
So it should be sufficient to reload the page before calling the goto
. There's also another solution proposed right in the next comment:
await authenticate(page);
await page.waitForURL('/home');
await page.goto('/bar');
await baz();
Not sure which of the above would be more viable in our case, but maybe we could try with simple reload
before goto
to clear the page's state and see what happens?
@WunderBart ahhh, I didn't see that issue, that seems exactly like it!!! 😄 Nice find!
I think either approach would be viable, and hopefully that does the trick! CC @karenroldan who is 🎩 rotation this week, in case you have some time to squeeze in this fix! Otherwise, I can pick it up later 👍
Excellent find @WunderBart! We can give this a try later today or next week, and see if it helps reduce or eliminate this intermittent failure.
CC @karenroldan who is 🎩 rotation this week, in case you have some time to squeeze in this fix
I can make time for this next week!
Spec file
various
TeamCity ID
9073509
Logs