[native_assets_builder] Add `hook/link.dart` // Tree shaking

dcharkes commented 11 months ago

The hook/build.dart is run before kernel compilation.

We should consider running a hook/link.dart after kernel compilation. In AOT we would have access to tree-shaking information that would enable us to:

Not include an asset at all if it's not referenced from Dart code.
Shrinking the contents of an asset if only a subset of its contents are used.

Use cases:

If a dynamic library (native code asset) is not used at all, don't bundle it.
For localization, we want to bundle a message file, and we want to tree-shake that message file based on what is reachable from the code. (If we would have icons, it would probably follow something similar to locazation.) https://github.com/dart-lang/sdk/blob/main/pkg/compiler/doc/resource_identifiers.md

For use case 2, we want to standardize the annotations and output format for what is reachable, so that any user can implement tree-shaking for any type of resource they want to bundle (e.g. icons, images, localization, nothing is special.)

Flutter tracking issue:

https://github.com/flutter/flutter/issues/146263

Dart tracking issue:

https://github.com/dart-lang/sdk/issues/55369

dcharkes commented 11 months ago

Notes from discussion with @mkustermann:

Maybe we should not

conflate user-defines and tree-shaking "inverse meta data", and
run builds after kernel compilation.

Instead, we should consider

having user-defines being fully global (not scoped by only having them visible to dependencies),
doing "build" in order from dependencies to dependents before the Dart build, and
doing "tree-shaking" in reverse order after the Dart build.

Step 2 and 3 would be similar to C++ builds where the build-step is separate from the linking step.

The data-flow in 2 would be as meta-data currently works. The data-flow in 3 would be in reverse order ("reverse meta data").

A reason to make it work similar to a C++ build is that it would be easier to migrate the logic in Bazel, if Dart were to ever move to the Bazel build system. Also g3 requires custom build rules instead of a Dart script, so staying closer to what other build systems do is beneficial there as well.

This approach should also cover the use case for package:messages. Packages with a data/messages.json would declare it in the the tree-shaking phase, and package:messages would pick up this inverse metadata and output the data assets.

We have multiple options for what file this tree-shaking script should be in:

link.dart / tree_shake.dart (next to build.dart)
build.dart and introduce a --linking phase.

@mosuem Any thoughts about this approach?

dcharkes commented 11 months ago

Having a build phase and tree-shake phase does raise the question what the CLI interface should be for the tree-shake phase.

Should both the build phase and tree-shake phase be able to output code assets and data assets? Or should the code and data assets output in the build phase be input to the tree-shake phase?
Should the build meta-data also flow to the tree-shake phase? Or should info from build -> tree-shake be a separate data structure? Or should it only be data/code assets? (Data assets can be arbitrary files, so it could contain anything.)

Theoretically, we could do the code tree-shaking also in the tree-shake phase instead of baking it in to the SDK.

mosuem commented 11 months ago

What does in reverse order mean?

But generally, I believe this would correspond to the initial idea of having a build_after.dart.

I have no thoughts on whether having a second script is better or worse than different calls to build.dart. Also, there is the still the question of whether the latter would be implemented through a CLI or an implementation of an interface having build(); and tree_shake(); methods. An advantage to different scripts we discussed before would be the lower overhead of running only the available scripts, instead of running a build.dart with possibly empty instructions.

dcharkes commented 11 months ago

What does in reverse order mean?

Dependencies get the "meta data" / "config" from the dependants. E.g. package:foo depends on package:messages, so the config from package:foo data/messages.json flows to package:messages.

dcharkes commented 11 months ago

Notes from discussion with @mosuem:

The interaction between build.dart and link.dart should probably be as follows:

build.dart outputs code/data assets if they should not be tree-shaken
build.dart outputs "meta data" which are keyed on target package link.dart for tree-shaking by that package link.dart
link.dart outputs code/data assets

build.dart meta data output is keyed on target package, and link.dart of the target package gets the meta data. link.dart output from one package is not visible to other link.dart (We don't see any any situation in which one package's link.dart depends on another package's link.dart.)

BuildOutput should contain Map<String packageName, Object config> linkInput (and BuildOutput should have what it currently already has). LinkInput should contain List<Object> inputFromBuild in addition to some of what BuildInput already has (target OS, etc.) LinkOutput should contain data & code assets.

Data assets output by either build.dart or link.dart should be bundled in Dart and Flutter apps and accessible by https://api.flutter.dev/flutter/services/AssetBundle-class.html. TODO: figure out what the different bundles mean in Flutter.

The implementation plan would have multiple stages:

Add build.dart to dart2js and dart2wasm. It can't do anything yet because we don't support js/wasm code assets.
Add data assets to build.dart, add AssetBundle to a dart: library, and consume build.dart data assets in Flutters AssetBundle.
Add link.dart CLI to this package, and introduce a linking phase in both DartDev/FlutterTools for tree shaking.

edit: This plan only works when deferred loading is not considered. Deferred loading means something else in dart2js than in Flutter, which makes it hard to abstract over it.

mosuem commented 10 months ago

Sounds good. The deferred loading can be just delegated to the packages own AssetBundle implementation based on a the mapping between loading units and call sites reported from that packages link.dart.

dcharkes commented 10 months ago

Sounds good. The deferred loading can be just delegated to the packages own AssetBundle implementation based on a the mapping between loading units and call sites reported from that packages link.dart.

If I remember correctly, the shaker would output:

splitting_info.json contains K OR dart-library-uri OR package-name -> file-name
messages_1.json K -> V
messages_2.json K -> V

We still need to flash out how that works with the different deferred loading mechanisms in Flutter and dart2js.

Putting the actual keys in splitting_info.json is going to blow up that file, which nullifies the benefits of deferred loading.
package-name would be preferential, because that would only be one const string. However, I believe flutter splits on dart library. I don't know about dart2js.
library uri (dart file name) seems to be a middle ground, but you'd really only want to use that with code-gen. You don't want users to write their library uri inside the dart file the whole time. (With @Natives we default the optional assetId argument to default to library uri to circumvent this issue.)

Did you give this aspect more thought yet @mosuem?

dcharkes commented 10 months ago

Or did we want to depend on "component name"? But also that requires code generation. E.g. a package author does not know what the component name is where their file ends up in when that package is used inside an app that defines a split in components.

mosuem commented 10 months ago

If I remember correctly, the shaker would output:

splitting_info.json contains K OR dart-library-uri OR package-name -> file-name

messages_1.json K -> V

messages_2.json K -> V

The shaker would output whatever the package needs to load the component. So for package:messages, the link.dart would would build a MessagesResourcesAssets object using the information from the LinkInput/BuildOutput. This object could then be serialized into the AssetBundle. See the example at the example PR. In this case, it builds a Map<int, int> from message indices to parts. Probably doing it the other way around, Map<int, Set<int>> parts to set of message indices, is smaller (but has a worse lookup time, so we might want to build the Map<int, int> at startup).

Putting the actual keys in splitting_info.json is going to blow up that file, which nullifies the benefits of deferred loading.

I don't think it is such a large file, it should be basically number of used message indices*sizeof(int).

dcharkes commented 5 months ago

Having build.dart specify the linker that an asset is destined for covers our initial use cases. We can later extend it to have applications overwrite that. https://github.com/dart-lang/native/issues/994#issuecomment-1999363039

dcharkes commented 3 months ago

Not include an asset at all if it's not referenced from Dart code.

Shrinking the contents of an asset if only a subset of its contents are used.

Both of these can be done in a link hook.

And the first one of these can be done by the Dart/Flutter SDK if all uses of a certain asset ID are const.

@mosuem suggested we should have some kind of enum for BundlingBehavior to control whether the Dart/Flutter SDK throw assets out if their assetId does not occur as a const in the program:

alwaysKeep The link hook writer will take care of throwing the full asset out if it is not used based on the tree shaking information
onlyKeepIfConstReferenced The Dart and Flutter SDK may throw this asset out if there is no reference to it with a const assetId. (NativeCodeAssets currently do not support an API that enables dlopen with a dynamic asset id, @Native is always const asset id, so NativeCodeAssets always work like this.)

@mkustermann suggested to tie this behavior to what hook the asset is output from:

build hooks are onlyKeepIfConstReferenced by default.
link hooks are alwaysKeep by default. The link hook writer will have thought about whether the asset needs to be included already.

mosuem commented 3 months ago

I would even introduce two modes instead of onlyKeepIfConstReferenced, for a total of three possible values.

enum Treeshake {
  /// Never treeshake by the SDK, only manually in link hooks.
  never,
  /// Treeshake if `loadBytes` never has a non-const arguments passed
  /// and no const argument matches the asset id.
  safe,
  /// Treeshake even if `loadBytes` has non-const arguments passed, but
  /// none of the known const ones matches the asset id.
  unsafe;
}

When using the third mode, it is entirely possible that assets are treeshaken which are actually used, which would only be noticed at runtime. But in the safe mode, any package in the dependency tree which does a single loadBytes with a non-const argument poisons the treeshaking information, and results in nothing being treeshaken.

+1 on the defaults.

dcharkes commented 3 months ago

safe

And then we should probably add a warning on the stderr or stdout when building and one of the hooks reports assets with safe but there are dynamic uses of loadBytes somewhere in the app. "Warning: your app size might have significantly increased due to ....".

dart-lang / native

[native_assets_builder] Add `hook/link.dart` // Tree shaking #153