Quansight / Quansight-website

💻 Source code for Quansight Labs website
https://labs.quansight.org

[BUG] - Non-deterministic `TypeError: Cannot read property 'firstName' of undefined` during LLC build #353

Closed bskinn closed 2 years ago

bskinn commented 2 years ago

What site is this for?

Quansight LLC

Expected behavior

No response

Actual behavior

On some builds of the LLC website, the build fails with TypeError: Cannot read property 'firstName' of undefined errors for some, but not all, published blog posts. Here are three example failing builds: one, two, three.

The failure is not deterministic. Redeploying a build that failed with this error can eventually succeed, though it often takes 5+ redeployments before a build succeeds.

For reference, a full traceback for a representative error:

Error occurred prerendering page "/post/extending-numba-types-for-clean-fast-code". Read more: https://nextjs.org/docs/messages/prerender-error
--
14:10:42.587 | TypeError: Cannot read property 'firstName' of undefined
14:10:42.587 | at /vercel/path0/dist/apps/consulting/.next/server/chunks/629.js:245:141
14:10:42.587 | at Array.map (<anonymous>)
14:10:42.587 | at getBlogArticlesProps (/vercel/path0/dist/apps/consulting/.next/server/chunks/629.js:232:65)
14:10:42.587 | at getLibraryTiles (/vercel/path0/dist/apps/consulting/.next/server/chunks/629.js:299:83)
14:10:42.587 | at getStaticProps (/vercel/path0/dist/apps/consulting/.next/server/pages/post/[slug].js:174:123)
14:10:42.587 | at processTicksAndRejections (internal/process/task_queues.js:95:5)
14:10:42.587 | at async renderToHTML (/vercel/path0/node_modules/next/dist/server/render.js:492:20)
14:10:42.587 | at async /vercel/path0/node_modules/next/dist/export/worker.js:253:36
14:10:42.587 | at async Span.traceAsyncFn (/vercel/path0/node_modules/next/dist/trace/trace.js:79:20)

Originally, I thought there was something wrong with a specific team member's entry in /team. In the leadup to launch, while I was publishing all of the migrated blog posts, the build error would occur whenever either of the two posts by Adam Lewis, "Panel/Holoviews Learning Aid" or "Spatial Filtering at Scale with Dask and Spatialpandas", was set to the Published state. If both were Unpublished, the build would consistently succeed.

To emphasize again: the TypeError would occur on more posts than just these two posts I thought were problematic.

This assumption, that Adam Lewis's author entry was the problem, was reinforced when I switched the author of one of these posts to Dharhas, set that post to Published, and observed successful builds. If I then switched back to Adam Lewis as author, the build failure would recur.


Later on, after @kherma's fix for #310 was implemented, the same build errors started to occasionally occur again, even with Adam Lewis's posts set as Unpublished. I noted this in https://github.com/Quansight/Quansight-website/pull/310#issuecomment-1174962115 and the following conversation. It happened infrequently enough, though, that I decided it wasn't worth trying to fix before launch -- I would just manually kick off redeploys as needed until I got a successful build.


One final observation: as best I know, these build errors ONLY occur on builds that are configured to use only Published Storyblok content. The -staging builds, which are triggered on the exact same GitHub code and Storyblok content but pick up both Published and Unpublished content, do not experience this error. As an example, see the following two builds, from one point midway through @gabalafou's diagnosis efforts for the problem (commit 78e192e):


The non-determinism here makes me think that this may be a race condition in the pre-rendering step: somehow, author metadata information is being requested from some object(s) that will hold it in the future, but that have not yet been populated at the time of these errors. Given the appearance of getLibraryTiles in the traceback, it seems likely that the problem is coming from the library-population code for /library, where the page collects the information it needs in order to show the grid of library items.

I'm at a loss for why the error would only occur for builds with only Published content. Perhaps there's something in the logic where the Next.js code is querying the /post items for (Un)published status that is contributing to a possible race condition? Or triggering early/late population of the author metadata?

If it is a race condition, perhaps putting some sort of retryer or setTimeout() on the call to getAuthorName() in getBlogArticlesProps.ts, and possibly also to the similar call in getLibraryLinksProps.ts, might help? These would probably only be band-aids, though, not solutions.
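For concreteness, a retry wrapper of the kind suggested above might look roughly like the sketch below. This is only an illustration: `withRetry`, its parameters, and the default values are hypothetical and do not come from the repo, and it assumes the author lookup can be expressed as an async function that throws while the data is missing.

```typescript
// Hypothetical helper; none of these names exist in the Quansight-website repo.
// Calls `fn` up to `attempts` times, pausing `delayMs` between failed attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  delayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final one.
      if (i < attempts - 1) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}
```

As noted, this would only paper over the problem: if the author data never arrives in a given build, every attempt fails and the build still breaks, just more slowly.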

How to Reproduce the problem?

No response

Anything else?

No response

gabalafou commented 2 years ago

I'm seeing something that seems to support Brian's suggestion:

somehow, author metadata information is being requested from some object(s) that will hold it in the future, but that have not yet been populated at the time of these errors

I created PR #352 to add some logging when author.content is undefined. In the failing quansight-consulting-published build, the log shows that the author field holds some kind of ID string (it looks like a UUID rather than a SHA) instead of an author object, for example 5c56c772-bc6d-4e3d-aae9-e5d202fcf70a.
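To illustrate the kind of check involved, here is a minimal sketch of telling a populated author object apart from a bare ID string. The type shapes, `isResolvedAuthor`, and `describeAuthor` are assumptions made for this example (only `author.content` and `firstName` appear in the logs above; `lastName` is a guess), so the actual logging in PR #352 may look quite different.

```typescript
// Assumed shapes; the real Storyblok types in the repo may differ.
interface ResolvedAuthor {
  content: { firstName: string; lastName: string };
}

// In this sketch an author field is either a populated object or a bare ID string.
type AuthorField = ResolvedAuthor | string | undefined;

// Type guard: true only when the author has been populated with content.
function isResolvedAuthor(author: AuthorField): author is ResolvedAuthor {
  return (
    typeof author === "object" &&
    author !== null &&
    author.content !== undefined
  );
}

// Hypothetical helper that returns a display name, or a diagnostic message
// containing the raw value when the author was never populated.
function describeAuthor(author: AuthorField): string {
  if (isResolvedAuthor(author)) {
    return `${author.content.firstName} ${author.content.lastName}`;
  }
  return `unresolved author: ${String(author)}`;
}
```

Called with the string from the log, `describeAuthor("5c56c772-bc6d-4e3d-aae9-e5d202fcf70a")` returns the diagnostic message instead of throwing.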

bskinn commented 2 years ago

IIRC, aren't JavaScript Promises represented by SHAs/UIDs(?) of that sort?

I looked at the Next.js code yesterday for the GetStaticProps type, and noticed the return type is Promise<GetStaticPropsResult<P>> | GetStaticPropsResult<P>, so it seems to me that this would fit the hypothesis as well.
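As a small aside on what that union return type means in practice, the sketch below uses local stand-in types (`MaybePromise`, `PropsResult`, and the functions are invented for illustration and are not imported from Next.js). A caller that awaits the result gets the plain value whether the function was sync or async, which is consistent with the `at async renderToHTML` frame in the traceback.

```typescript
// Local stand-in mirroring the shape of the quoted return type; not the Next.js type.
type MaybePromise<T> = Promise<T> | T;

interface PropsResult {
  props: { title: string };
}

// A synchronous implementation returns the result directly.
function syncProps(): MaybePromise<PropsResult> {
  return { props: { title: "sync" } };
}

// An async implementation returns a Promise of the same result shape.
async function asyncProps(): Promise<PropsResult> {
  return { props: { title: "async" } };
}

// `await` unwraps a Promise and passes a plain value through unchanged,
// so the caller sees a PropsResult in both cases.
async function loadTitle(
  getProps: () => MaybePromise<PropsResult>,
): Promise<string> {
  const result = await getProps();
  return result.props.title;
}
```

Note also that a Promise is an object at runtime (`typeof` reports `"object"`), so it would not show up in a log as an ID-like string.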

bskinn commented 2 years ago

Ok, whatever that hash is, it's not pointing to a Promise -- this console.log() call doesn't ever seem to be hit. Seems it's just a string UID of some kind, returned instead of the actual author data.

Two further confusing things:

  1. Why are there so many hits on this logging call, which (more or less) reports when a non-Promise author is processed? In this build it was hit 193 times.
  2. In that same build, the articles/authors that hit this logging call aren't the same as the articles where the TypeError occurred:

    • Articles with an invalid author (six occurrences):
      • Extending Numba Types for Clean, Fast Code (occurs 3x!)
      • Working across Panel and ipywidgets ecosystems (occurs 2x)
      • Will Python Be #1 Forever? (occurs 1x)
    • Articles with TypeError (six occurrences):
      • why-we-are-excited-about-jupyterlab-3-0-dynamic-extensions
      • rapids-cucim-porting-scikit-image-code-to-the-gpu
      • quick-dashboarding-with-panel
      • extending-numba-types-for-clean-fast-code
      • acceleration-in-python-which-is-right-for-your-project
      • working-across-panel-and-ipywidgets-ecosystems

    It appears that rendering each library tile requires scanning across all of the blog posts. If so, then for any given page where the TypeError occurs, could the error stem from a failure to correctly populate author in any post on the site, not necessarily the post being rendered?
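A rough sketch of that failure mode, under the assumption that the tile code maps over every post and reads the author's first name without a guard. The `Post` and `Author` shapes and both function names are invented for illustration; only the `Array.map` call and the `firstName` access are taken from the traceback.

```typescript
// Assumed shapes for illustration only.
type Author = { content: { firstName: string } };

interface Post {
  slug: string;
  // Either a populated author object or a bare ID string, as seen in the logs.
  author: Author | string;
}

// Unguarded access: a single post whose author is still a string makes the
// whole map throw, no matter which page triggered the call.
function authorNamesUnsafe(posts: Post[]): string[] {
  return posts.map((post) => (post.author as Author).content.firstName);
}

// Guarded variant: substitutes a placeholder instead of throwing.
function authorNamesSafe(posts: Post[]): string[] {
  return posts.map((post) =>
    typeof post.author === "string"
      ? "(unknown author)"
      : post.author.content.firstName,
  );
}
```

With this structure, the page named in the error message tells you only which page was being prerendered when the shared list was built, not which post had the unpopulated author. That would explain why the two lists above do not match.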