dgp1130 / rules_prerender

A Bazel rule set for prerending HTML pages.
13 stars 0 forks source link

Better tree shaking for `prerender_pages()` JS #26

Closed dgp1130 closed 1 year ago

dgp1130 commented 3 years ago

Spin off from #1.

Currently, prerender_multi_page_bundled() is somewhat suboptimal in generating bundled JavaScript. Every page declares some JavaScript modules it needs, each of which imports its own graph of dependencies. However, all the JS modules for every page are combined into a single entry point, which is then bundled by Rollup. That singular bundle is then used for all generated HTML pages.

This means that if a single page imports module A, and no other page uses A, they will all still bundle A, download it, and even execute it. This is additional bundle size which is not desirable or useful to users. It also means devs need to write their modules to be conservative about content on the page, as they can't guarantee a certain structure. For example, if A is only ever injected to modify an element with id="B", A may still be included on other pages that don't have element B. That means A needs to be written to detect this case and do nothing, when it really should have never been included in the first place.

Currently this is a limitation of rollup_bundle(). Users generate a directory with a set of HTML files which is not known until execution time, when the user writes PrerenderResource.of(/* ... */). However, rollup_bundle() expects the entry points to be declared in BUILD with each entry point known at loading/analysis time. This is impossible with our current design unless we require all users to pre-declare all the files they will output, which would get ugly fast.

I filed https://github.com/bazelbuild/rules_nodejs/issues/2471 with a FR for rollup_bundle() to accept entry points as an execution-time argument. This is a bit of a specific use case, but I found a semi-reasonable way to implement it. For now, progress on this is blocked on that.

dgp1130 commented 1 year ago

Now that we're on @aspect_rules_rollup, I took a pass at a PR to add this feature. Unfortunately I think it would complicate the contract of rollup_bundle() too much to be a good idea in @aspect_rules_rollup. The main problem is that the desired contract would be something like:

# my_app/BUILD.bazel

rollup_bundle(
    name = "bundle",
    entry_points = ":entry_points",
    deps = ["foo.js", "bar.js", "baz.js"],
)

make_tree_artifact(
    name = "entry_points",
    srcs = [
        "foo.js",
        "bar.js",
    ],
)

The problem with this is that any tree artifact is necessarily a subdirectory of the package. In this case, the entry points actually live at my_app/entry_points/{foo,bar}.js, yet baz.js (which is not an entry point) lives at my_app/baz.js like a normal ~human being~ file.

To work around this, you'd need to generate the import statements with the understanding that they will be coming from a different directory. I could definitely do that, though it would be way to weird and nuanced for @aspect_rules_rollup. I decided to try a custom implementation of rollup_bundle() for my purposes which captured the full transitive closure of files, wrote them a TreeArtifact which is re-rooted in the package, and then generate a separate TreeArtifact of entry points. The entry points tree just defines which paths we should be starting from, but they are actually passed into Rollup flags as file paths relative to the transitive closure TreeArtifact. This way we end up with a file path like:

.../execroot/bazel-bin/k8-fastbuild/bin/
    path/to/pkg/
        my_rollup_bundle_target__deps/
            path/to/pkg/
                foo.js
                bar.js
                baz.js
                ...

Then Rollup receives paths like path/to/pkg/my_rollup_bundle_target__deps/path/to/pkg/foo.js, and any relative imports from foo.js should work. I got things laid out correctly and invoking Rollup with the right path, but it still fails to find the file for some reason. If I put the same path (minus the bin dir prefix) into a Bash action, I'm able to cat the file, so I'm not sure what Rollup is doing wrong here?

Attempt is in ref/execution-time-entry-points-attempt-1

I'd need to figure out node_modules/ and external repositories which is where things get complicated. But I realized that if I'm building out my own implementation anyways, is there much value in an entry point TreeArtifact? I could just generate a manifest of entry points, invoke a custom tool, and have that tool parse the manifest then spawn a Rollup subprocess running over all the entry points. I was originally trying to use the TreeArtifact filled with only entry points as a trick to make args.add_all() do the right thing and avoid such a tool so it would make more sense to include in @aspect_rules_rollup. But if this is too complicated and nuanced to do that anyways, then there's not much value in taking advantage of args.add_all().

My second attempt took this idea and just made a standalone tool which called Rollup via a subprocess and accepted a list of entry points as a manifest file. With that, I was able to generate a TreeArtifact of only entry point files, but which only included their individual dependencies. This definitely seems like it could work. I'll need to clean it up some more and then try to insert it into the prerender_pages() macro. I'll likely have to update //tools/binaries/script_entry_generator to //tools/binaries/bundle_manifest_generator or something like that. Alternatively I could merge that tool directly into the bundler. Have to see how it goes.

Second attempt is in ref/execution-time-entry-points-attempt-2.

dgp1130 commented 1 year ago

I continued experimenting today and quickly realized that I hadn't considered how the entry point manifest gets generated. A single manifest of entry points isn't quite good enough, because a single HTML page might include multiple includeScript() calls. The page should only include a single entry point which imports all those included scripts. We can't express that via a single manifest of entry points, since the "entry points" are synthetically generated from includeScript() calls and need to be associated with a specific .html file.

I updated annotation extractor to keep track of which HTML file each annotation is extracted from, and then the script entry generator creates an entry point for each HTML file, instead of one for the whole site. We can then pass these entry points to Rollup to bundle and inject them back into the site. Ironically, since we need to generate a distinct entry point for each HTML file, we need a TreeArtifact of just entry points anyways, so args.add_all() works just fine for adding the files to Rollup. We do need to include an extra ../ to escape the TreeArtifact. I think it's reasonable to expect that imports should resolve relative to wherever that TreeArtifact is, and since that's passed as an input to rollup_bundle(), it's actually not that weird a contract. It's just that make_tree_artifact() in my example doesn't make a whole lot of sense, because putting source files in a tree inherently moves them to a new directory. Maybe it would be worth making a PR to @aspect_rules_rollup? I did need to use args.add_all(map_each = lambda entry: "...", allow_closure = True), which might be a concern given that storing data in closures like that could leak more memory than necessary. I wasn't even aware Bazel supported that now, but apparently it does. I think we could use that only when actually necessary from passing it a TreeArtifact, so it's probably feasible to land in @aspect_rules_rollup.

I was able to get a successful test case with two different pages on the same site including different sets of scripts. These two pages managed to only include the JavaScript they actually needed, and properly use a shared chunk for the pieces which were consistent between them. Output looks great, just need to land some changes to get there.

This third attempt is in ref/execution-time-entry-points-attempt-3.

dgp1130 commented 1 year ago

I was able to land all the changes necessary to address this issue and even added two new examples as regression tests to ensure that includeScript() calls don't leak between pages and that shared chunks are properly shared at runtime.

One unexpected hiccup is that I had to delete the custom bundling example. Previously we generated a single entry point which imported all the scripts referenced by includeScript(). That meant it was reasonably feasible for someone to plug in a custom bundler which processed that singular entry point and inject the bundle into all the pages. However, now that each page has its own entry point and those entry points aren't known until execution-time, it means that there is no easy to bundle JavaScript outside of prerender_pages().

If we update rollup_bundle() to accept a TreeArtifact, that could potentially be used, but I'm not too worried about it right now. If the divergence from @aspect_rules_rollup causes issues in the future, I can re-evaluate.