cloudflare / next-on-pages

CLI to build and develop Next.js apps for Cloudflare Pages
https://www.npmjs.com/package/@cloudflare/next-on-pages
MIT License
1.29k stars 125 forks source link

[šŸ› Bug]: React contexts break when loading pages in parallel #805

Closed ostenbom closed 3 months ago

ostenbom commented 4 months ago

next-on-pages environment related information

System:
    Platform: darwin
    Arch: arm64
    Version: Darwin Kernel Version 23.5.0: Wed May  1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000
    CPU: (10) arm64 Apple M1 Pro
    Memory: 32 GB
    Shell: /bin/zsh
Package Manager Used: npm (10.7.0)

Relevant Packages:
    @cloudflare/next-on-pages: 1.11.3
    vercel: 34.2.7
    next: 14.2.4

Description

When accessing any next pages that use contexts in parallel we get 500 Internal Server Error with:

āœ˜ [ERROR] TypeError: Cannot read properties of null (reading 'useContext')

This has been a very tricky bug to hunt down, but we experience this when running wrangler locally, in our end-to-end test suites, and we have metrics that show us that this affects our users in production.

The pages must be loaded in quick succession. There must be several pages that are visited, not the same page.

To reproduce this error, in the reproduction repo run:

  1. Run npm run preview in one tab.
  2. In the next tab npx playwright test --repeat-each=5
  3. Observe TypeError: Cannot read properties of null (reading 'useContext') in the logs + 500 internal server error

Reproduction

https://github.com/ostenbom/reproduce-next-parallel-context

Pages Deployment Method

Direct Upload (wrangler pages publish or the @cloudflare/pages-action GitHub Action)

Pages Deployment ID

No response

Additional Information

No response

Would you like to help?

dario-piotrowicz commented 4 months ago

Hi @ostenbom thanks for the issue šŸ™‚ and thanks a lot for the minimal reproduction! that's always extremely helpful and appreciated šŸ™

Unfortunately I am not really able to reproduce the issue šŸ˜“ I tried reproducing it following your repro instructions and that didn't work for me, I've also tweaked your playwright test in the following way:

import { test, expect, Page } from "@playwright/test";

const numOfPages = 50;

test(`open ${numOfPages} tabs and navigate to pages in parallel`, async ({
  browser,
}) => {
  // Create a new browser context
  const context = await browser.newContext();

  // Open three new pages (tabs)
  const pages: Page[] = await Promise.all(
    new Array(numOfPages).fill(null).map(() => context.newPage())
  );

  // URLs to navigate to
  const url1 = "http://localhost:8788/thing1";
  const url2 = "http://localhost:8788/thing2";

  // // Navigate to the pages in parallel
  await Promise.all(
    pages.map((page, i) => page.goto(i % 2 === 0 ? url2 : url1))
  );

  // Optional: Add assertions to verify the pages have loaded correctly
  for (const page of pages) {
    await expect(page.getByText("Hello, World!")).toBeVisible();
  }

  // Close the context
  await context.close();
});

(so opening 50 tabs in parallel instead of 3) and I still could not reproduce the issue šŸ˜“

I am guessing that this could be machine/OS dependent.... but I figured I'd check with you, what do you think?

ostenbom commented 4 months ago

Thanks for taking a look @dario-piotrowicz!

We deployed the app to https://my-parallel-app.pages.dev and pushed another commit to make it easier to test against that:

+  const host = "https://my-parallel-app.pages.dev";
+  // const host = "http://localhost:8788";
   // URLs to navigate to
-  const url1 = "http://localhost:8788/thing1";
-  const url2 = "http://localhost:8788/thing2";
-  const url3 = "http://localhost:8788/thing1";
+  const url1 = `${host}/thing1`;
+  const url2 = `${host}/thing2`;
+  const url3 = `${host}/thing1`;

Strange that it didn't show up in your preview - we're running the previews on mac machines if that helps.

Can you get the same behaviour from the production deployment? We're getting the behaviour consistently across machines and seeing it in our production logs. It's got to be a parallelism issue and then the timings can differ - so starting more browsers could be making the requests further away from each other?

dario-piotrowicz commented 4 months ago

so starting more browsers could be making the requests further away from each other?

I mean, I added more tabs just because I wasn't hitting the issue with only 3 šŸ˜…

I did try again with 3 tabs and also your production deployment, and still nothing, if I increase significantly the number of tabs I do get some failures (sometimes): Screenshot 2024-06-26 at 14 27 36

I imagine these are the ones you were expecting me to see? But I guess I won't see the context error logged since this is happening in your production application right? šŸ˜•

dario-piotrowicz commented 4 months ago

Strange that it didn't show up in your preview - we're running the previews on mac machines if that helps.

šŸ¤· šŸ˜“ I'm on a macbook M1 pro

ostenbom commented 4 months ago

But I guess I won't see the context error logged since this is happening in your production application right? šŸ˜•

That's exactly the behaviour! If you run with --headed, what you will see is "Internal Server Error". Which isn't much help without the logs šŸ˜…

This is a straight deployment of the repository using the npm run deploy command there - if you want to see the logs / errors you should hopefully be able to deploy your own copy?

dario-piotrowicz commented 4 months ago

But I guess I won't see the context error logged since this is happening in your production application right? šŸ˜•

That's exactly the behaviour! If you run with --headed, what you will see is "Internal Server Error". Which isn't much help without the logs šŸ˜…

This is a straight deployment of the repository using the npm run deploy command there - if you want to see the logs / errors you should hopefully be able to deploy your own copy?

I see šŸ™‚

I wonder why it takes that many parallel tabs on my end šŸ˜…

Yes I can deploy my own copy.... but it's not really going to help much debugging this šŸ˜“ I wish I had a way to locally reproduce this instead! šŸ˜“

Anyways thanks a lot for the repro and patience here šŸ™‚ , I'll try to see what I can do on my end šŸ‘

dario-piotrowicz commented 4 months ago

Screenshot 2024-06-27 at 13 03 26

šŸ‘

dario-piotrowicz commented 4 months ago

@ostenbom I've dug into the issue and I am quite sure that this is a workerd issue, in #819 I've implemented a workaround which should hopefully work quite well, I'll also check with the workerd team to see if things can be improved on their end (making my workaround in the future hopefully unnecessary).

If you want please check out the PR's prerelease and let me know if this works well for you šŸ™‚ (it definitely seems to be fixing the issue for me when testing it with your minimal reproduction).

ostenbom commented 4 months ago

Thank you @dario-piotrowicz! This has been a real head scratcher for us but works with both the reproduction and our own internal tests! ā­ ā­

dario-piotrowicz commented 4 months ago

@ostenbom nice! I'm glad it solved your issue :smiley:

I'll look into merging the fix and release it as soon as I can (I need to get the darn e2es to pass :sweat:) :+1:

juanferreras commented 4 months ago

I wasn't able to reproduce this issue locally yet ā€“ but I'm curious on this @ostenbom.

function useMyContext() {
+ console.log("React", typeof React, React?.version, typeof React?.useContext);
  const context = React.useContext(MyContext);

If you add that log in ā€“ when running it locally, do you see the same React version when it works and when it doesn't? Locally I'm seeing React object 18.3.0-canary-14898b6a9-20240318 function, which makes me wonder if there's some mismatch happening (e.g. npm resolves to 18.3.1 but next is swapping it to 18.3.0-canary-<hash>-<date>)

This issue you're seeing has a long history in the Next.js repo, most of the times when there are multiple versions of react (e.g. from dependencies incorrectly anchoring it or declaring it) https://github.com/vercel/next.js/discussions/43577.

I'm not seeing multiple versions of react on the reproducible example but Next.js does swap to a canary release when using app router so it's definitely an awkward territory to debug.

I'm surprised to see the global state pollution as the potential root cause, as I'd assume the same thing happens on AWS lambdas outside of cold starts when deploying to Vercel. Is there anything else that could make the deployment in Cloudflare pin to the wrong React version (or wrong React bundle e.g. the RSC one instead of the others)?

I wonder if doing something similar to server-only / client-only (https://github.com/vercel/next.js/blob/canary/packages/next/src/compiled/client-only/package.json#L13-L19) but with console.log could also shed any extra insight on what might be happening when it fails.

james-elicx commented 4 months ago

Related: https://github.com/vercel/next.js/issues/67694

ostenbom commented 3 months ago

Re-opening since there seems to be more going on here than potentially just the cloudflare-workers runtime issue!

thinhle-agilityio commented 3 months ago

+1 We also got this issue. Hopefully we can get your release soon @dario-piotrowicz

dario-piotrowicz commented 3 months ago

@thinhle-agilityio sorry for the delay, yeah we should be good to go, I'll make sure to get the PR reviewed, merged and a release done by early next week šŸ™‚šŸ‘

dario-piotrowicz commented 3 months ago

@ostenbom PR merged šŸ˜„

Sorry it took so long, I've been hitting different issues with the solution and whatnot šŸ˜“

You can find the fix in the latest next-on-pages beta release, I will make a stable release early next week if that's ok šŸ™‡ (this time early next week for real! šŸ˜…)

(If you can double check and let me know if it's all working for you in the beta release that's be much appreciated šŸ™ )