dart-lang / native

Dart packages related to FFI and native assets bundling.
BSD 3-Clause "New" or "Revised" License
115 stars 39 forks source link

[native_assets_builder] Add `hook/link.dart` // Tree shaking #153

Open dcharkes opened 11 months ago

dcharkes commented 11 months ago

The hook/build.dart is run before kernel compilation.

We should consider running a hook/link.dart after kernel compilation. In AOT we would have access to tree-shaking information that would enable us to:

  1. Not include an asset at all if it's not referenced from Dart code.
  2. Shrinking the contents of an asset if only a subset of its contents are used.

Use cases:

  1. If a dynamic library (native code asset) is not used at all, don't bundle it.
  2. For localization, we want to bundle a message file, and we want to tree-shake that message file based on what is reachable from the code. (If we would have icons, it would probably follow something similar to locazation.) https://github.com/dart-lang/sdk/blob/main/pkg/compiler/doc/resource_identifiers.md

For use case 2, we want to standardize the annotations and output format for what is reachable, so that any user can implement tree-shaking for any type of resource they want to bundle (e.g. icons, images, localization, nothing is special.)

Flutter tracking issue:

Dart tracking issue:

dcharkes commented 11 months ago

Notes from discussion with @mkustermann:

Maybe we should not

  1. conflate user-defines and tree-shaking "inverse meta data", and
  2. run builds after kernel compilation.

Instead, we should consider

  1. having user-defines being fully global (not scoped by only having them visible to dependencies),
  2. doing "build" in order from dependencies to dependents before the Dart build, and
  3. doing "tree-shaking" in reverse order after the Dart build.

Step 2 and 3 would be similar to C++ builds where the build-step is separate from the linking step.

The data-flow in 2 would be as meta-data currently works. The data-flow in 3 would be in reverse order ("reverse meta data").

A reason to make it work similar to a C++ build is that it would be easier to migrate the logic in Bazel, if Dart were to ever move to the Bazel build system. Also g3 requires custom build rules instead of a Dart script, so staying closer to what other build systems do is beneficial there as well.

This approach should also cover the use case for package:messages. Packages with a data/messages.json would declare it in the the tree-shaking phase, and package:messages would pick up this inverse metadata and output the data assets.

We have multiple options for what file this tree-shaking script should be in:

@mosuem Any thoughts about this approach?

dcharkes commented 11 months ago

Having a build phase and tree-shake phase does raise the question what the CLI interface should be for the tree-shake phase.

Theoretically, we could do the code tree-shaking also in the tree-shake phase instead of baking it in to the SDK.

mosuem commented 11 months ago

What does in reverse order mean?

But generally, I believe this would correspond to the initial idea of having a build_after.dart.

I have no thoughts on whether having a second script is better or worse than different calls to build.dart. Also, there is the still the question of whether the latter would be implemented through a CLI or an implementation of an interface having build(); and tree_shake(); methods. An advantage to different scripts we discussed before would be the lower overhead of running only the available scripts, instead of running a build.dart with possibly empty instructions.

dcharkes commented 11 months ago

What does in reverse order mean?

Dependencies get the "meta data" / "config" from the dependants. E.g. package:foo depends on package:messages, so the config from package:foo data/messages.json flows to package:messages.

dcharkes commented 11 months ago

Notes from discussion with @mosuem:

The interaction between build.dart and link.dart should probably be as follows:

  1. build.dart outputs code/data assets if they should not be tree-shaken
  2. build.dart outputs "meta data" which are keyed on target package link.dart for tree-shaking by that package link.dart
  3. link.dart outputs code/data assets

build.dart meta data output is keyed on target package, and link.dart of the target package gets the meta data. link.dart output from one package is not visible to other link.dart (We don't see any any situation in which one package's link.dart depends on another package's link.dart.)

BuildOutput should contain Map<String packageName, Object config> linkInput (and BuildOutput should have what it currently already has). LinkInput should contain List<Object> inputFromBuild in addition to some of what BuildInput already has (target OS, etc.) LinkOutput should contain data & code assets.

Data assets output by either build.dart or link.dart should be bundled in Dart and Flutter apps and accessible by https://api.flutter.dev/flutter/services/AssetBundle-class.html. TODO: figure out what the different bundles mean in Flutter.

The implementation plan would have multiple stages:

  1. Add build.dart to dart2js and dart2wasm. It can't do anything yet because we don't support js/wasm code assets.
  2. Add data assets to build.dart, add AssetBundle to a dart: library, and consume build.dart data assets in Flutters AssetBundle.
  3. Add link.dart CLI to this package, and introduce a linking phase in both DartDev/FlutterTools for tree shaking.

edit: This plan only works when deferred loading is not considered. Deferred loading means something else in dart2js than in Flutter, which makes it hard to abstract over it.

mosuem commented 10 months ago

Sounds good. The deferred loading can be just delegated to the packages own AssetBundle implementation based on a the mapping between loading units and call sites reported from that packages link.dart.

dcharkes commented 10 months ago

Sounds good. The deferred loading can be just delegated to the packages own AssetBundle implementation based on a the mapping between loading units and call sites reported from that packages link.dart.

If I remember correctly, the shaker would output:

We still need to flash out how that works with the different deferred loading mechanisms in Flutter and dart2js.

Did you give this aspect more thought yet @mosuem?

dcharkes commented 10 months ago

Or did we want to depend on "component name"? But also that requires code generation. E.g. a package author does not know what the component name is where their file ends up in when that package is used inside an app that defines a split in components.

mosuem commented 10 months ago

If I remember correctly, the shaker would output:

  • splitting_info.json contains K OR dart-library-uri OR package-name -> file-name
  • messages_1.json K -> V
  • messages_2.json K -> V

The shaker would output whatever the package needs to load the component. So for package:messages, the link.dart would would build a MessagesResourcesAssets object using the information from the LinkInput/BuildOutput. This object could then be serialized into the AssetBundle. See the example at the example PR. In this case, it builds a Map<int, int> from message indices to parts. Probably doing it the other way around, Map<int, Set<int>> parts to set of message indices, is smaller (but has a worse lookup time, so we might want to build the Map<int, int> at startup).

  • Putting the actual keys in splitting_info.json is going to blow up that file, which nullifies the benefits of deferred loading.

I don't think it is such a large file, it should be basically number of used message indices*sizeof(int).

dcharkes commented 5 months ago

Having build.dart specify the linker that an asset is destined for covers our initial use cases. We can later extend it to have applications overwrite that. https://github.com/dart-lang/native/issues/994#issuecomment-1999363039

dcharkes commented 3 months ago
  • Not include an asset at all if it's not referenced from Dart code.
  • Shrinking the contents of an asset if only a subset of its contents are used.

Both of these can be done in a link hook.

And the first one of these can be done by the Dart/Flutter SDK if all uses of a certain asset ID are const.

@mosuem suggested we should have some kind of enum for BundlingBehavior to control whether the Dart/Flutter SDK throw assets out if their assetId does not occur as a const in the program:

@mkustermann suggested to tie this behavior to what hook the asset is output from:

mosuem commented 3 months ago

I would even introduce two modes instead of onlyKeepIfConstReferenced, for a total of three possible values.

enum Treeshake {
  /// Never treeshake by the SDK, only manually in link hooks.
  never,
  /// Treeshake if `loadBytes` never has a non-const arguments passed
  /// and no const argument matches the asset id.
  safe,
  /// Treeshake even if `loadBytes` has non-const arguments passed, but
  /// none of the known const ones matches the asset id.
  unsafe;
}

When using the third mode, it is entirely possible that assets are treeshaken which are actually used, which would only be noticed at runtime. But in the safe mode, any package in the dependency tree which does a single loadBytes with a non-const argument poisons the treeshaking information, and results in nothing being treeshaken.

+1 on the defaults.

dcharkes commented 3 months ago

safe

And then we should probably add a warning on the stderr or stdout when building and one of the hooks reports assets with safe but there are dynamic uses of loadBytes somewhere in the app. "Warning: your app size might have significantly increased due to ....".