facebook / docusaurus

Easy to maintain open source documentation websites.
https://docusaurus.io
MIT License

Core D2 dependency filename changes between builds breaks Netlify indexing #3383

Closed sserrata closed 2 years ago

sserrata commented 3 years ago

🐛 Bug Report

Docs/files that are otherwise unchanged between builds are marked as changed when the runtime~main.<hash>.js filename changes. This occurs since all generated HTML files import runtime~main.<hash>.js. Since static site hosts like Netlify rely on file hashes for indexing this results in files incorrectly getting marked as changed between builds, which can greatly increase the overall build/deploy time. Our team noticed this behavior after our D2 site grew beyond 1K docs.

Have you read the Contributing Guidelines on issues?

Yes.

To Reproduce


  1. Build your D2 site, i.e. yarn run build.
  2. Dump file hashes of generated files/HTML in build dir, e.g. cd build && find . -type f -exec md5 "{}" \; | sort
  3. Change/edit/add any file, e.g. doc, page, blog, etc.
  4. Repeat steps 1 and 2.
  5. Compare output of steps 2 and 4 using a diff tool and notice all the additional files that have changed, e.g. code --diff before_change_hashes.txt after_change_hashes.txt (example using vscode)
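The reproduction steps above can also be scripted. The following is a minimal Node sketch, not part of the original report: the helper names `hashBuildDir` and `diffManifests` are hypothetical, and it assumes each build output has been copied to its own directory before comparing.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const path = require("node:path");

// Recursively hash every file under `dir`.
// Keys of the returned Map are paths relative to the top-level directory.
function hashBuildDir(dir, root = dir, manifest = new Map()) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      hashBuildDir(fullPath, root, manifest);
    } else if (entry.isFile()) {
      // MD5 mirrors the `md5` command used in step 2; it is only a change detector here.
      const digest = crypto
        .createHash("md5")
        .update(fs.readFileSync(fullPath))
        .digest("hex");
      manifest.set(path.relative(root, fullPath), digest);
    }
  }
  return manifest;
}

// Compare two manifests and report which files were added, removed, or changed.
function diffManifests(before, after) {
  const added = [];
  const removed = [];
  const changed = [];
  for (const [file, digest] of after) {
    if (!before.has(file)) added.push(file);
    else if (before.get(file) !== digest) changed.push(file);
  }
  for (const file of before.keys()) {
    if (!after.has(file)) removed.push(file);
  }
  return { added: added.sort(), removed: removed.sort(), changed: changed.sort() };
}
```

For example, `diffManifests(hashBuildDir("build-before"), hashBuildDir("build-after")).changed` would list the files whose content differs, where `build-before` and `build-after` are hypothetical copies of the build directory taken in steps 2 and 4.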

Expected behavior

Only docs/files that were intentionally changed between builds should be modified.

Actual Behavior

All static HTML files that import any or all of the following dependencies are modified when dependency filenames change following a build. This appears to be caused when the <hash> portion of each filename is changed between builds.

Depending on the size of the D2 site, this could potentially introduce many more modified files than expected between builds, which could render indexing by hosting/build sites like Netlify, GitHub Actions, et al., ineffective.

In the following screenshot, note all the changed files despite only ./docs/contributing/index.html actually being modified:

[Screenshot: diff_between_builds]

Your Environment

Reproducible Demo

Can be reproduced on any D2 site.

slorber commented 3 years ago

Thanks for reporting this.

Does this happen on the Docusaurus 2 website too?

Does this happen if you change no md files at all?

Only one md file?


Since static site hosts like Netlify rely on file hashes for indexing this results in files incorrectly getting marked as changed between builds, which can greatly increase the overall build/deploy time.

I'm not sure I understand all of that. To me, the main problem is that when hashes change, URLs change, and thus more cached assets are invalidated.

Can you explain how it impacts build/deploy time and by how much?

sserrata commented 3 years ago

Does this happen on the Docusaurus 2 website too?

I would imagine it does, as all D2 sites appear to generate these dependencies.

Does this happen if you change no md files at all?

No, it only seems to happen after changing an md file, CSS or a custom page, i.e. /src/pages. Basically, whatever is necessary to regenerate runtime~main.<hash>.js and other dependencies.

Only one md file?

Yes, the screenshot I referenced in this issue was produced after editing a single md file. As you can see, building after modifying a single md file resulted in modifying all other doc and blog files, in addition to the root index.html and 404.html files.

I'm not sure I understand all of that. To me, the main problem is that when hashes change, URLs change, and thus more cached assets are invalidated.

Yes, that's the problem. Caching is invalidated after the URLs change, since that results in a different hash.

Can you explain how it impacts build/deploy time and by how much?

In the case of our largest site, https://xsoar.pan.dev, changing one file can result in as many as 2500+ "new files to upload" being detected by Netlify. Netlify performs additional processing of HTML before uploading to their CDN which can add as much as 15 or more minutes to the build time. Basically, without the benefit of caching/indexing, each deploy is treated as a brand new deploy.

slorber commented 3 years ago

I see, thanks.

I'm not sure I'll have time to look into this problem right now, but I'll keep it in mind.

Curious:

I find it surprising to see a difference of 15min between builds just for the Netlify processing/upload :o At the same time your site seems to be quite large

Maybe we could investigate something like incremental build (like Gatsby) to see if it's possible to rebuild faster, but it's unlikely we'll have time to do this soon (would be after 2.0.0 RC)

sserrata commented 3 years ago

  • what is the size of your site?

Roughly 1500 docs and growing.

  • how many docs (including versioned)?

See above. We aren't currently using the versioning feature.

  • what's the build time without any change? (ie you press "redeploy" on netlify)

It varies, but it can still take as much as 26 minutes total to complete the Netlify build. When files change, we can expect that time to increase by 10-15 minutes or more.

  • what's the build time with a single doc change?

+10-15 more minutes with a single file change, depending on which dependencies need to be regenerated, i.e. renamed.

I find it surprising to see a difference of 15min between builds just for the Netlify processing/upload :o At the same time your site seems to be quite large

We're working with Netlify closely on this. Their post processing can be tweaked but even with everything off they still perform some "processing" of static files before uploading to CDN.

Maybe we could investigate something like incremental build (like Gatsby) to see if it's possible to rebuild faster, but it's unlikely we'll have time to do this soon (would be after 2.0.0 RC)

That sounds intriguing. I was also wondering if you and the team have considered moving away from webpack? Or, at least, moving away from including a hash in the core JS dependency filenames. If the filenames are static, meaning they don't change between builds, this problem goes away. It's something to consider because I'm sure caching/indexing is important to all D2 users/sites, it just so happens we're one of the first to grow to a scale large enough to notice the bug.

Please let me know if there's anything I can help with to improve our understanding of this issue.

P.S. If you or another contributor could help point me to where in the codebase the core dependency filenames are generated it would be greatly appreciated! I've been having a difficult time figuring it out.

slorber commented 3 years ago

Thanks for the feedback, that's probably one of the largest Docusaurus sites :)

  • what's the build time without any change? (ie you press "redeploy" on netlify)

It varies, but it can still take as much as 26 minutes total to complete the Netlify build. When files change, we can expect that time to increase by 10-15 minutes or more.

  • what's the build time with a single doc change?

+10-15 more minutes with a single file change, depending on which dependencies need to be regenerated, i.e. renamed.

I find it surprising to see a difference of 15min between builds just for the Netlify processing/upload :o At the same time your site seems to be quite large

We're working with Netlify closely on this. Their post processing can be tweaked but even with everything off they still perform some "processing" of static files before uploading to CDN.

I really hope we can improve these build times.

was also wondering if you and the team have considered moving away from webpack?

v2.0 is still in alpha and we are focusing on the final release. Removing Webpack at this point in v2 would be an annoying breaking change for v2 early adopters and plugin authors, so I don't see it happening. I've been working with Facebook on Docusaurus for a few months and can't speak for them about the plans for v3+ of Docusaurus, but I guess moving away from Webpack is a possibility. It's also worth checking the new possibilities offered by native ESM support and tooling like Snowpack and Vite, but it's unlikely to happen anytime soon imho.

Or, at least, moving away from including a hash in the core JS dependency filenames. If the filenames are static, meaning they don't change between builds, this problem goes away. It's something to consider because I'm sure caching/indexing is important to all D2 users/sites, it just so happens we're one of the first to grow to a scale large enough to notice the bug.

We are focusing on shipping i18n and better versioning first, as they are blockers for releasing v2.0 RC, but I think we should try to solve this annoying problem (it's not blocking v2 RC, but it is still important). For now, I think the caching story on Docusaurus is not so good and has not received enough investment, and most people actually use Docusaurus without setting any cache headers on their host. As Netlify etc. automatically enable ETags, performance is still OK, but we should definitely explain how to optimize Docusaurus site hosting performance.

About including a hash in filename, this can enable immutable caching (see https://github.com/facebook/docusaurus/issues/3156). But maybe this hash is not required for all files and we can look for a Webpack config that produces a more "stable" output on docs changes.
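To make the trade-off concrete, here is a minimal sketch of how a host could choose `Cache-Control` values depending on whether a filename carries a content hash. It is an illustration only: the function name `cacheControlFor` is hypothetical, and the pattern assumes an 8-character hex hash before the extension, which may not match every build output.

```javascript
// Assumption: a "hashed" asset has an 8-character hex hash before its extension,
// e.g. main.3f2a9c1b.js. Adjust the pattern to match the real build output.
const HASHED_ASSET = /\.[0-9a-f]{8}\.(js|css)$/;

// Pick a Cache-Control header value for a given URL path.
function cacheControlFor(urlPath) {
  if (HASHED_ASSET.test(urlPath)) {
    // A content change produces a new filename, so the old URL can be cached indefinitely.
    return "public, max-age=31536000, immutable";
  }
  // Stable URLs (HTML pages, unhashed JS) must be revalidated, e.g. through ETags.
  return "public, max-age=0, must-revalidate";
}
```

Under this scheme, removing the hash from a file's name means that file falls back to revalidation on every request, while the files that keep their hash can still be cached aggressively.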

Please let me know if there's anything I can help with to improve our understanding of this issue.

P.S. If you or another contributor could help point me to where in the codebase the core dependency filenames are generated it would be greatly appreciated! I've been having a difficult time figuring it out.

If you want to investigate, help is welcome, because I have to continue my work on i18n this month and am the only full-time maintainer.

Our monorepo is not very hard to work with and contribute to. If your changes work on the Docusaurus 2 website, I guess they should work for your site too.

Unfortunately, I am no Webpack guru, so I don't have any particular insight on what might be the solution, but it should probably be here:

https://github.com/facebook/docusaurus/blob/master/packages/docusaurus/src/webpack

Easy steps to contribute on this:

sserrata commented 3 years ago

Thanks for the tips. I believe the following module is where the dependency file names are generated (at least where the hash portion gets added):

https://github.com/facebook/docusaurus/blob/master/packages/docusaurus/src/webpack/base.ts#L65

About including a hash in filename, this can enable immutable caching (see #3156). But maybe this hash is not required for all files and we can look for a Webpack config that produces a more "stable" output on docs changes.

I might be misunderstanding but I thought immutable hashing wasn't reliant on filename, but on including a cryptographic hash in the script/link tag. If base.ts really is the module where these file names are generated, I'm noticing the final filename only includes the first 8 chars of the hash. I'm going to experiment omitting the hash from the filenames to see how that affects functionality/performance.
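The "cryptographic hash in the script/link tag" reads like a description of Subresource Integrity (an assumption, since the comment does not name it). The sketch below, with hypothetical helper names, shows how such an `integrity` value is computed. Note that it verifies content but leaves the URL unchanged, so it does not by itself invalidate or enable HTTP caching.

```javascript
const crypto = require("node:crypto");

// Subresource Integrity value: "<algorithm>-<base64 digest of the file bytes>".
function sriIntegrity(content, algorithm = "sha384") {
  const digest = crypto.createHash(algorithm).update(content).digest("base64");
  return `${algorithm}-${digest}`;
}

// The URL stays the same; only the integrity attribute changes with the content.
function scriptTag(src, content) {
  return `<script src="${src}" integrity="${sriIntegrity(content)}" crossorigin="anonymous"></script>`;
}
```

Because the `src` is identical before and after a content change, a browser or CDN keyed on the URL cannot tell the two versions apart, which is why a hash in the filename is used for cache invalidation instead.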

slorber commented 3 years ago

I might be misunderstanding but I thought immutable hashing wasn't reliant on filename, but on including a cryptographic hash in the script/link tag.

I'm not sure what you mean, but if the content of the file changes, we indeed want the filename to change. Changing the filename is a common practice to automatically invalidate http cache, used by a lot of people (including Gatsby etc). We really want to keep this because it has clear benefits.

The issue is that if one doc changes, only a few output filenames should be modified (the ones related to the modified doc), not many/most/all of them (the behavior we seem to have).

Sometimes a change to a shared file (like runtime-) produces a cascade of other filename changes. We should make sure to prevent that behavior.
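The cascade can be illustrated with a toy model. This is a simplified sketch, not webpack's actual algorithm: it assumes one chunk per doc, a shared runtime whose content embeds the chunk filename map, and HTML pages that all load that runtime. The names `build` and `changedPages` are hypothetical.

```javascript
const crypto = require("node:crypto");

// First 8 hex characters of a content hash, like the truncated hash discussed above.
const shortHash = (content) =>
  crypto.createHash("md5").update(content).digest("hex").slice(0, 8);

// Toy build: one JS chunk per doc, one shared runtime, one HTML page per doc.
function build(docs, { hashRuntime }) {
  const chunkFiles = {};
  for (const [name, content] of Object.entries(docs)) {
    chunkFiles[name] = `${name}.${shortHash(content)}.js`;
  }
  // The runtime embeds the chunk filename map, so its content depends on every chunk hash.
  const runtimeContent = JSON.stringify(chunkFiles);
  const runtimeFile = hashRuntime
    ? `runtime~main.${shortHash(runtimeContent)}.js`
    : "runtime~main.js";
  const html = {};
  for (const [name, content] of Object.entries(docs)) {
    // Every page references the shared runtime by filename.
    html[`${name}/index.html`] =
      `<script src="/${runtimeFile}"></script><h1>${content}</h1>`;
  }
  return html;
}

// List the pages whose generated HTML differs between two builds.
function changedPages(before, after) {
  return Object.keys(after).filter((page) => before[page] !== after[page]);
}
```

In this model, editing a single doc changes every page when `hashRuntime` is true, because the runtime filename changes, but only the edited page when `hashRuntime` is false.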

slorber commented 3 years ago

This NextJS related discussion about advantages of Webpack 5 is interesting, regarding the deterministic output

https://stackoverflow.blog/2020/10/07/qa-with-the-creators-of-next-js-on-version-9-5/

sserrata commented 3 years ago

Hi @slorber! Deterministic IDs look like a game changer with respect to the caching issue described here. Is upgrading to Webpack 5 on the D2 roadmap? Do you know of a good way to upgrade to Webpack 5 in dev to test?

slorber commented 3 years ago

Hi @slorber! Deterministic IDs look like a game changer with respect to the caching issue described here. Is upgrading to Webpack 5 on the D2 roadmap? Do you know of a good way to upgrade to Webpack 5 in dev to test?

@sserrata unfortunately not, we need to ship i18n first and then move to beta. I don't know how much of a breaking change Webpack v5 would be, particularly for plugin authors that implement the configureWebpack lifecycle, so we may want this migration to be Docusaurus v3? Or postpone Docusaurus v2 RC to Webpack 5 migration? Don't know yet but hope to work on this in the upcoming months after i18n is merged.

slorber commented 3 years ago

@sserrata a Webpack 5 PR is ready for review and I published a canary release: https://github.com/facebook/docusaurus/pull/4089

If you preserve the node_modules/.cache folder across 2 builds, the 2nd build should be much faster. Let me know if that works better for you.

I'm not sure, however, that the deterministic output of Webpack 5 will satisfy you, as the runtime chunk hash still seems to change every time a file is modified. Let me know if things improve, but I guess it's somehow "normal", as the HTML files (which can't be cached because they have stable URLs) should all see the "new SPA".

slorber commented 3 years ago

@sserrata if you don't apply cache-control headers (like "immutable") to hashed assets on your CDN, you can try removing the hash from the JS output filenames.

On Netlify, it will still provide etags-based caching, which is not too bad imho (and I guess most users don't even set more aggressive caching headers on their CDN)

If you use configureWebpack(), you can pass a config such as:

{
  output: {
    // No hash placeholder, so the runtime/main filenames stay the same across builds
    filename: "[name].js",
  },
}

As far as I can see, the chunks under /assets/... will remain hashed (so you can still cache them aggressively), but the runtime/main files won't change anymore (which means it's unsafe to cache them aggressively, but it looks fine to me).
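For context, the fragment above could be wrapped in a small local plugin that implements the `configureWebpack` lifecycle. This is a sketch under assumptions: the plugin name and function are hypothetical, and skipping the server bundle is a precaution rather than something confirmed in this thread.

```javascript
// Hypothetical local plugin; the name and function are illustrative, not part of Docusaurus.
function stableRuntimeFilenamesPlugin() {
  return {
    name: "stable-runtime-filenames",
    // The object returned here is merged into the webpack config.
    configureWebpack(_config, isServer) {
      // Assumption: only the client bundle needs the override; leave the server bundle alone.
      if (isServer) {
        return {};
      }
      return {
        output: {
          // No hash placeholder, so entry files such as runtime~main keep the same name.
          filename: "[name].js",
        },
      };
    },
  };
}
```

How to register it depends on the Docusaurus version. One option is to export the function from a local file and list that file's path under `plugins` in docusaurus.config.js.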

If this setup works fine for you, I think we could make this a default for Docusaurus.

Docusaurus is not a typical webpack app: it has a single entrypoint for all the pages, so any page modification modifies this entrypoint.

sserrata commented 3 years ago

Thanks @slorber, we'll begin testing ASAP. I appreciate the tips.

@glicht, FYI