Automattic / wp-calypso

The JavaScript and API powered WordPress.com
https://developer.wordpress.com
GNU General Public License v2.0
12.4k stars 1.98k forks source link

Flaky E2E: Jest failed to execute suite #70674

Open worldomonation opened 1 year ago

worldomonation commented 1 year ago

Spec file

various

TeamCity ID

9073509

Logs

page.screenshot: Target crashed
=========================== logs ===========================
taking page screenshot
============================================================
page.screenshot: Target crashed
=========================== logs ===========================
taking page screenshot
============================================================
    at JestEnvironmentPlaywright.handleTestEvent (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/jest-playwright-config/environment.ts:344:28)
    at dispatch (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-circus/build/state.js:75:5)
    at runAndTransformResultsToJestFormat (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapterInit.js:234:3)
    at jestAdapter (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapter.js:82:19)
    at runTestInternal (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/runTest.js:389:16)
    at runTest (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/runTest.js:475:34)
    at Object.worker (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/testWorker.js:133:12)
worldomonation commented 1 year ago

This has been an ongoing issue since the early days of the framework, but it's gotten a lot more frequent in the last few months.

There have been 8 occurrences for the editor/editor__revision.ts spec alone for the last week: image

and 13 cases in the editor__navbar.ts spec in the same time period: image

worldomonation commented 1 year ago

Enabled the pw:browser debug level in Playwright and after the same build several times, I came across the crash.

This is the resulting log:

01:29:45 
  2022-12-03T01:29:45.152Z pw:browser [pid=4174][err] [1203/012945.152220:ERROR:render_process_host_impl.cc(5157)] Terminating render process for bad Mojo message: Received bad user message: Prefetch/IsValidCrossOrigin: frame origin mismatch

01:29:45 
  2022-12-03T01:29:45.152Z pw:browser [pid=4174][err] [1203/012945.152304:ERROR:bad_message.cc(29)] Terminating renderer for bad IPC message, reason 123

01:29:45 
  2022-12-03T01:29:45.153Z pw:browser writing frame 1670030985.146939

01:29:45 
  2022-12-03T01:29:45.153Z pw:browser writing frame 1670030985.148676

01:29:45 
  2022-12-03T01:29:45.153Z pw:browser writing frame 1670030985.148827

01:29:45 
  2022-12-03T01:29:45.155Z pw:browser writing frame 1670030985.151879

01:29:45 
  2022-12-03T01:29:45.158Z pw:browser writing frame 1670030985.151862

01:29:45 
  2022-12-03T01:29:45.162Z pw:browser writing frame 1670030985.156923

01:29:45 
  Event for import__site: hook_failure

01:29:45 
  ERROR: Site Import > beforeAll

01:29:45 
  

01:29:45 
   page.goto: net::ERR_ABORTED; maybe frame was detached?

01:29:45 
  =========================== logs ===========================

01:29:45 
  navigating to "https://container-vigorous-khayyam.calypso.live/", waiting until "load"

01:29:45 
  ============================================================

01:29:45 
      at TestAccount.authenticate (/home/teamcity-0/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:41:15)

01:29:45 
      at /home/teamcity-0/buildAgent/work/c4a9d5b38c1dacde/test/e2e/specs/tools/import__site.ts:18:3 {

01:29:45 
    name: 'Error'

01:29:45 
  }
worldomonation commented 1 year ago

Further logs:

01:29:45 
  Event for import__site: test_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_fn_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_done

01:29:45 
  Event for import__site: test_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_fn_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_done

01:29:45 
  Event for import__site: test_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_fn_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_done

01:29:45 
  Event for import__site: test_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_fn_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_done

01:29:45 
  Event for import__site: test_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_fn_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_done

01:29:45 
  Event for import__site: test_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_fn_start

01:29:45 
  In test_fn_start, is browser connected: true

01:29:45 
  Event for import__site: test_done

01:29:45 
  Event for import__site: run_describe_finish

01:29:45 
  Event for import__site: run_describe_finish
worldomonation commented 1 year ago

Searching the web for the error messages don't lead to much, but here are some possibly related issues:

I wonder if this line is related.

worldomonation commented 1 year ago

I have a potential fix in #70970. After merging them, I will have to monitor for a bit.

worldomonation commented 1 year ago

Reopening to monitor.

worldomonation commented 1 year ago

Looks like #70970 did not really have the intended effect.

For example, in the support__popover-invalid spec alone, we have seen 4 new instances of browser crashing: image

Even today (December 13, 2022) there has been multiple occurrences of this across a variety of specs. image

worldomonation commented 1 year ago

@dpasque I think it's time we try increasing the shm allocation for docker. I'm going to spin up a new PR for that.

worldomonation commented 1 year ago

Let's see if https://github.com/Automattic/wp-calypso/pull/71356 helps - all failure we've had today (2022/12/21) have been this failure.

worldomonation commented 1 year ago

Incidents of this issue have dropped off quite significantly, but still occurs occasionally.

Looking at the trend, this issue appears to occur with media related tests.

image

image

worldomonation commented 1 year ago

Still happening occasionally after increasing the shared memory to 4GB, albeit at a rate lower than it used to at 3GB.

dpasque commented 1 year ago

Just adding what the newest trace stacks look like:

page.screenshot: Target crashed
      =========================== logs ===========================
      taking page screenshot
      ============================================================
 page.screenshot: Target crashed
      =========================== logs ===========================
      taking page screenshot
      ============================================================
          at JestEnvironmentPlaywright.teardown (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/jest-playwright-config/environment.ts:165:22)
          at runTestInternal (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/runTest.js:461:5)
          at runTest (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/runTest.js:475:34)
          at Object.worker (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/jest-runner/build/testWorker.js:133:12)
dpasque commented 1 year ago

I've been playing around with this on my 🎩 rotation, and I think I have some next steps to try...

dpasque commented 1 year ago

Adding some more logs!

Got it to fail, here is what the console logging showed:

2023-04-07T16:06:28.554Z pw:browser <launching> /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/playwright-core/.local-browsers/chromium-1055/chrome-linux/chrome --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --disable-sync --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=/home/teamcity-2/buildAgent/temp/buildTmp/playwright_chromiumdev_profile-ae3r7K --remote-debugging-pipe --no-startup-window
2023-04-07T16:06:28.573Z pw:browser <launched> pid=3191
2023-04-07T16:06:28.757Z pw:browser [pid=3191][err] [0407/160628.755709:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.837Z pw:browser [pid=3191][err] [0407/160628.836220:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.837Z pw:browser [pid=3191][err] [0407/160628.800476:WARNING:sandbox_linux.cc(393)] InitializeSandbox() called with multiple threads in process gpu-process.
2023-04-07T16:06:28.838Z pw:browser [pid=3191][err] [0407/160628.836346:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.844Z pw:browser [pid=3191][err] [0407/160628.840861:WARNING:bluez_dbus_manager.cc(247)] Floss manager not present, cannot set Floss enable/disable.
2023-04-07T16:06:33.693Z pw:browser <launching> /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/playwright-core/.local-browsers/ffmpeg-1008/ffmpeg-linux -loglevel error -f image2pipe -avioflags direct -fpsprobesize 0 -probesize 32 -analyzeduration 0 -c:v mjpeg -i - -y -an -r 25 -c:v vp8 -qmin 0 -qmax 50 -crf 8 -deadline realtime -speed 8 -b:v 1M -threads 1 -vf pad=800:450:0:0:gray,crop=800:450:0:0 /home/teamcity-2/buildAgent/temp/buildTmp/431fea4297650193e2792c6b1385ee6c.webm
2023-04-07T16:06:33.770Z pw:browser <launched> pid=4686
  console.log
    TestAccount:gutenbergSimpleSiteUser => Logging in via Login Page

      at TestAccount.log (../../packages/calypso-e2e/src/lib/test-account.ts:205:12)

2023-04-07T16:06:34.383Z pw:browser [pid=3191][err] [0407/160634.383470:ERROR:render_process_host_impl.cc(5252)] Terminating render process for bad Mojo message: Received bad user message: Prefetch/IsValidCrossOrigin: frame origin mismatch
2023-04-07T16:06:34.384Z pw:browser [pid=3191][err] [0407/160634.383549:ERROR:bad_message.cc(29)] Terminating renderer for bad IPC message, reason 123
ERROR: Theme: Preview And Activate > beforeAll

 page.goto: net::ERR_ABORTED; maybe frame was detached?
=========================== logs ===========================
navigating to "https://container-flamboyant-hoover.calypso.live/log-in", waiting until "load"
============================================================
    at LoginPage.visit (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/pages/login-page.ts:34:26)
    at TestAccount.logInViaLoginPage (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:64:3)
    at TestAccount.authenticate (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:47:4)
    at /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/test/e2e/specs/appearance/theme__details-preview-activate.ts:38:3 {
  name: 'Error'
} 

Beginning page.screenshot()...
2023-04-07T16:06:34.582Z pw:browser [pid=3191] <gracefully close start>
2023-04-07T16:06:34.600Z pw:browser [pid=4686] <gracefully close start>
2023-04-07T16:06:34.617Z pw:browser Closing stdin...
2023-04-07T16:06:34.617Z pw:browser ffmpeg finished input.
2023-04-07T16:06:34.658Z pw:browser [pid=3191] <process did exit: exitCode=0, signal=null>
2023-04-07T16:06:34.658Z pw:browser [pid=3191] starting temporary directories cleanup
2023-04-07T16:06:34.668Z pw:browser [pid=3191] finished temporary directories cleanup
2023-04-07T16:06:34.668Z pw:browser [pid=3191] <gracefully close end>
2023-04-07T16:06:34.796Z pw:browser [pid=4686] <process did exit: exitCode=0, signal=null>
2023-04-07T16:06:34.796Z pw:browser ffmpeg onkill exitCode=0 signal=null
2023-04-07T16:06:34.796Z pw:browser [pid=4686] starting temporary directories cleanup
2023-04-07T16:06:34.796Z pw:browser [pid=4686] finished temporary directories cleanup
2023-04-07T16:06:34.796Z pw:browser [pid=4686] <gracefully close end>

And here is what the Playwright API logs showed:

2023-04-07T16:06:33.631Z 946 api info: => browser.newPage started
2023-04-07T16:06:33.659Z 946 api info: <= browser.newPage succeeded
2023-04-07T16:06:33.661Z 946 api info: => browser.newPage started
2023-04-07T16:06:33.805Z 946 api info: <= browser.newPage succeeded
2023-04-07T16:06:33.807Z 946 api info: => page.on started
2023-04-07T16:06:33.808Z 946 api info: => tracing.start started
2023-04-07T16:06:33.812Z 946 api info: <= page.on succeeded
2023-04-07T16:06:33.901Z 946 api info: <= tracing.start succeeded
2023-04-07T16:06:33.923Z 946 api info: => browserContext.clearCookies started
2023-04-07T16:06:33.929Z 946 api info: <= browserContext.clearCookies succeeded
2023-04-07T16:06:33.991Z 946 api info: => page.route started
2023-04-07T16:06:33.995Z 946 api info: <= page.route succeeded
2023-04-07T16:06:33.996Z 946 api info: => page.goto started
2023-04-07T16:06:34.406Z 946 api info: <= page.goto failed
2023-04-07T16:06:34.470Z 946 api info: => tracing.stop started
2023-04-07T16:06:34.573Z 946 api info: <= tracing.stop succeeded
2023-04-07T16:06:34.576Z 946 api info: => page.screenshot started
2023-04-07T16:06:34.580Z 946 api info: <= page.screenshot failed
2023-04-07T16:06:34.581Z 946 api info: => browser.close started
2023-04-07T16:06:34.668Z 946 api info: <= browser.close succeeded
dpasque commented 1 year ago

For comparison, here is a console log from a healthy executing test:

2023-04-07T16:06:28.476Z pw:browser <launching> /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/playwright-core/.local-browsers/chromium-1055/chrome-linux/chrome --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --disable-sync --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --user-data-dir=/home/teamcity-2/buildAgent/temp/buildTmp/playwright_chromiumdev_profile-6bb9AZ --remote-debugging-pipe --no-startup-window
2023-04-07T16:06:28.511Z pw:browser <launched> pid=3045
2023-04-07T16:06:28.741Z pw:browser [pid=3045][err] [0407/160628.740159:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.741Z pw:browser [pid=3045][err] [0407/160628.740703:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.742Z pw:browser [pid=3045][err] [0407/160628.740781:ERROR:bus.cc(399)] Failed to connect to the bus: Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory
2023-04-07T16:06:28.854Z pw:browser [pid=3045][err] [0407/160628.854251:WARNING:bluez_dbus_manager.cc(247)] Floss manager not present, cannot set Floss enable/disable.
2023-04-07T16:06:33.657Z pw:browser [pid=3045][err] [0407/160628.886265:WARNING:sandbox_linux.cc(393)] InitializeSandbox() called with multiple threads in process gpu-process.
2023-04-07T16:06:33.686Z pw:browser <launching> /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/node_modules/playwright-core/.local-browsers/ffmpeg-1008/ffmpeg-linux -loglevel error -f image2pipe -avioflags direct -fpsprobesize 0 -probesize 32 -analyzeduration 0 -c:v mjpeg -i - -y -an -r 25 -c:v vp8 -qmin 0 -qmax 50 -crf 8 -deadline realtime -speed 8 -b:v 1M -threads 1 -vf pad=800:450:0:0:gray,crop=800:450:0:0 /home/teamcity-2/buildAgent/temp/buildTmp/cae3e409edeb015dcf966d7ca4024c9c.webm
2023-04-07T16:06:33.717Z pw:browser <launched> pid=4575
  console.log
    TestAccount:siteEditorSimpleSiteUser => Logging in via Login Page

      at TestAccount.log (../../packages/calypso-e2e/src/lib/test-account.ts:205:12)

2023-04-07T16:07:02.556Z pw:browser [pid=3045] <gracefully close start>
2023-04-07T16:07:02.560Z pw:browser [pid=4575] <gracefully close start>
2023-04-07T16:07:02.560Z pw:browser Closing stdin...
2023-04-07T16:07:02.560Z pw:browser ffmpeg finished input.
2023-04-07T16:07:02.578Z pw:browser [pid=4575] <process did exit: exitCode=0, signal=null>
2023-04-07T16:07:02.578Z pw:browser ffmpeg onkill exitCode=0 signal=null
2023-04-07T16:07:02.578Z pw:browser [pid=4575] starting temporary directories cleanup
2023-04-07T16:07:02.579Z pw:browser [pid=4575] finished temporary directories cleanup
2023-04-07T16:07:02.579Z pw:browser [pid=4575] <gracefully close end>
2023-04-07T16:07:02.588Z pw:browser [pid=3045] <process did exit: exitCode=0, signal=null>
2023-04-07T16:07:02.588Z pw:browser [pid=3045] starting temporary directories cleanup
2023-04-07T16:07:02.627Z pw:browser [pid=3045] finished temporary directories cleanup
2023-04-07T16:07:02.627Z pw:browser [pid=3045] <gracefully close end>
dpasque commented 1 year ago

So comparing the two, this seems to be the real error:

2023-04-07T16:06:34.383Z pw:browser [pid=3191][err] [0407/160634.383470:ERROR:render_process_host_impl.cc(5252)] Terminating render process for bad Mojo message: Received bad user message: Prefetch/IsValidCrossOrigin: frame origin mismatch
2023-04-07T16:06:34.384Z pw:browser [pid=3191][err] [0407/160634.383549:ERROR:bad_message.cc(29)] Terminating renderer for bad IPC message, reason 123
ERROR: Theme: Preview And Activate > beforeAll

 page.goto: net::ERR_ABORTED; maybe frame was detached?
=========================== logs ===========================
navigating to "https://container-flamboyant-hoover.calypso.live/log-in", waiting until "load"
============================================================
    at LoginPage.visit (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/pages/login-page.ts:34:26)
    at TestAccount.logInViaLoginPage (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:64:3)
    at TestAccount.authenticate (/home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/packages/calypso-e2e/src/lib/test-account.ts:47:4)
    at /home/teamcity-2/buildAgent/work/c4a9d5b38c1dacde/test/e2e/specs/appearance/theme__details-preview-activate.ts:38:3 {
  name: 'Error'
} 

So, this confirms my suspicion that the screenshot taking process is not actually the root issue! It's just exposing a page that is in a bad state from an earlier error.

dpasque commented 1 year ago

So, I'm not entirely sure what these errors mean haha... But here's where the error is thrown in the Chromium source code:

https://source.chromium.org/chromium/chromium/src/+/main:content/browser/loader/prefetch_url_loader_service.cc;l=258

Gonna keep digging

dpasque commented 1 year ago

So, I'm not entirely sure how the IPC / Mojo communication (which appears to be how Chromium processes talk to each other) plays a roll...

But this might be suggesting that some iframe on our login page is violating our frame security (same-origin) policy? But this only happens very rarely? I'm not sure what that might be 🤔

dpasque commented 1 year ago

This is the only iframe I'm seeing on the login page:

https://public-api.wordpress.com/wp-admin/rest-proxy/?v=2.0#https://container-flamboyant-hoover.calypso.live

dpasque commented 1 year ago

Going back through a bunch of recent versions of this crash, I actually was able to find in the Playwright traces that they are all similarly crashes when trying to do a navigation, often the same ERR_ABORTED error, sometimes it manifests in other ways. It always seems to happen in the beforeAll hooks of scripts too.

So, pending discovering what it causing some initial navigations to fail in these specific conditions, I wonder if we can get around this with just some more defensive coding.

For one, we should be using try/catches in our teardown code to catch any failed page operations. This way, we'll make sure we are actually seeing the real errors that crash the page, not the errors thrown by trying to act on a failed page.

Secondly, we may want to wrap our initial authentication or landing in a couple of retry loops, so that if there's an unexpected crash right on the first navigation, we just try again.

dpasque commented 1 year ago

@noahtallen or @WunderBart, I'm wondering if you have any quick insight on the error as outlined in this comment. No worries if not, just curious if that rings any initial bells as you might be more familiar with the Calypso / DotCom security policies than I am 😁

noahtallen commented 1 year ago

No quick insights unfortunately, but a big fan of capturing as much log info as we can in normal runs!

You're right, the only iframe in that scenario I can think of is the rest proxy. 🤔 Though a side point is that calypso.live really doesn't work well when 3rd party cookies are disabled, just because *.calypso.live is a weird different origin from normal production. (When testing calypso.live, I have to turn off enhanced tracking protection in firefox, for example)

dpasque commented 1 year ago

Sounds good, thanks for weighing in @noahtallen! 🙇 You bring up a good point, and I'm also realizing we tend to only see these errors on our TC builds that use Calypso.live. That makes me think that maybe a good next step is adding some retry logic to our initial navigation/login, and see if that smooths things out!

worldomonation commented 1 year ago

Just catching up on this conversation and adding a few of my observations.

  1. The observation that the crash only appears to happen on builds that use the calypso.live target is true. I don't think I've ever encountered this crash locally (as it hits staging when proxied/production if not proxied) and in the Pre-Release Tests (as it hits wpcalypso.wordpress.com).
  2. Playwright runs with third-party cookies completely disabled and cannot be enabled. I'm not certain whether this is a legitimate bug in Chromium and I am more than happy to report this upstream, though the challenge in that is they typically ask us for a minimally reproducible snippet...which in theory is our login process. But there may be additional parts involved (eg. our particular CI/docker image setup) that they may not be able to replicate.
  3. Defensive coding seems to be the right way to solve this problem, especially with the time risk of going upstream and digging really deep.

On retries, I am not sure where we can add retrie. The browser startup is handled in environment.ts while the initial navigation to the login page is handled in test-accounts.ts. Since the actual page object itself crashes upon navigation to the /log-in endpoint, I'm not sure how to "cross bridges" from POMs to the framework code itself.

WunderBart commented 1 year ago

👋 Sorry for such a late response, somehow I missed this one.

I'm wondering (and please ignore if you've already checked this scenario) if it's not the same case as https://github.com/microsoft/playwright/issues/18137.

From the explaining comment:

The intent was to authenticate user and redirect to them to /bar.

However, what happens in practice, the polling mechanism of the app detects that the user was > authenticated, and attempts to redirect them to post-login page.

The end result was that sometimes when baz() executes we would be on /home and sometimes on > /baz.

...which seems to be describing our problem as well. The proposed solution looks pretty straightforward:

The reason for adding page.reload() after authenticate() is to ensure that before we are calling goto there are no more outstanding redirects.

So it should be sufficient to reload the page before calling the goto. There's also another solution proposed right in the next comment:

await authenticate(page);
await page.waitForURL('/home');
await page.goto('/bar');
await baz();

Not sure which of the above would be more viable in our case, but maybe we could try with simple reload before goto to clear the page's state and see what happens?

dpasque commented 1 year ago

@WunderBart ahhh, I didn't see that issue, that seems exactly like it!!! 😄 Nice find!

I think either approach would be viable, and hopefully that does the trick! CC @karenroldan who is 🎩 rotation this week, in case you have some time to squeeze in this fix! Otherwise, I can pick it up later 👍

worldomonation commented 1 year ago

Excellent find @WunderBart! We can give this a try later today or next week, and see if it helps reduce or eliminate this intermittent failure.

karenroldan commented 1 year ago

CC @karenroldan who is 🎩 rotation this week, in case you have some time to squeeze in this fix

I can make time for this next week!