Data assets feature - Githubissues

dart-lang / sdk

The Dart SDK, including the VM, dart2js, core libraries, and more.

https://dart.dev

BSD 3-Clause "New" or "Revised" License


Data assets feature #54003

Open dcharkes opened 8 months ago

dcharkes commented 8 months ago

Problems:

Flutter has an assets system, Dart does not. This is problematic when wanting to access bytes at runtime in a platform independent way.
Various types of assets could benefit from minifying and/or tree-shaking. For example, a translation package might want to include in its translations.json asset only the translations whose keys are actually reachable from Dart code after Dart AOT optimizations.

High level outline:

Bring AssetBundle to dart:
Make build.dart (from https://github.com/dart-lang/sdk/issues/50565) be able to output data assets besides code assets.
Introduce a link.dart that runs after dart AOT compilation that can do minifying and tree-shaking for data assets.

The tree-shaking info that link.dart would be getting is const values from arguments to static function calls and constructors:

// package:foo/foo.dart

void main() {
  for (int i = 0; i < 10; i++) {
    foo(i, 'foo');
    foo(i, 'bar');
  }
}

@someTreeShakingAnnotationToBeDecided
void foo(int i, String s) {
  // ...
}

[
  {
    "uri": "package:foo/foo.dart",
    "name": "foo",
    "callers" : [
      { "s" : "foo" },
      { "s" : "bar" }
    ]
  }
]
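
To make the idea concrete, here is a minimal sketch of how a link.dart step might use such recorded arguments to shrink a translations asset. The helper name `shakeTranslations` and the assumption that the `s` argument of `foo` is a translation key are hypothetical; the JSON shape simply follows the example above and is not the final format.

```dart
import 'dart:convert';

/// Hypothetical helper: keeps only the translations whose keys were recorded
/// as constant `s` arguments in calls to `foo`.
Map<String, Object?> shakeTranslations(
  Map<String, Object?> translations,
  String recordedUsagesJson,
) {
  final usages = jsonDecode(recordedUsagesJson) as List<Object?>;
  final reachableKeys = <String>{};
  for (final entry in usages.cast<Map<String, Object?>>()) {
    if (entry['name'] != 'foo') continue;
    final callers = (entry['callers'] as List<Object?>)
        .cast<Map<String, Object?>>();
    for (final call in callers) {
      final s = call['s'];
      if (s is String) reachableKeys.add(s);
    }
  }
  // Drop every translation that no recorded call can reach.
  return {
    for (final entry in translations.entries)
      if (reachableKeys.contains(entry.key)) entry.key: entry.value,
  };
}

void main() {
  const usages = '''
[
  {
    "uri": "package:foo/foo.dart",
    "name": "foo",
    "callers": [{"s": "foo"}, {"s": "bar"}]
  }
]''';
  final shaken = shakeTranslations(
    {'foo': 'Foo!', 'bar': 'Bar!', 'unused': 'Never shown'},
    usages,
  );
  print(jsonEncode(shaken)); // {"foo":"Foo!","bar":"Bar!"}
}
```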

Some WIP format in the test cases: https://dart-review.googlesource.com/c/sdk/+/329961

More details to follow. Filing an issue so we have somewhere to point to.

lrhn commented 8 months ago

If we introduce something to support a feature like AssetBundle, we should consider doing something fairly simple and low-level, that a more complicated class like AssetBundle can be designed on top of.

We used to have Resource, which was removed because it didn't work with AoT compilation. It was a way to access files through package: URIs, which only makes sense when running in the same environment as compilation.

We'll need some abstraction between (location of) the raw bytes and the runtime system, which allows for including the resources in the deployment unit, and which can be tree-shaken to avoid including unnecessary resources.

If it's something that independent packages can use to get their own resources included in programs that use the package, then there also needs to be some kind of name-spacing — which is where using package: URIs worked so well, it clearly assigns a unique prefix to any resource introduced by a package.

mosuem commented 8 months ago

@lrhn : Yes, the idea is to introduce an interface which other frameworks can implement themselves, such as the already existing implementation from Flutter. The implementation to allow dart build to ship assets can be (hopefully) fairly simple.

Regarding the package: URIs, this is also how Flutter does it I believe, at least for images.

xvrh commented 8 months ago

That would be SO cool if we could tree-shake assets in flutter (https://github.com/flutter/flutter/issues/64106)

It probably requires using a generated ID in the code instead of a dynamic string to reference the asset (this is also a win btw).

// Use a generated id to reference assets (cfr Android way)
Image.asset(R.images.my_icon);

// instead of
Image.asset('assets/images/icon.png');
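
As a rough sketch of what such generated IDs could look like in Dart (the `AssetId` and `Images` names are hypothetical, not an existing Flutter or Dart API), a generator could emit one constant per asset so that every reference is a const the compiler can see:

```dart
/// Hypothetical typed handle for an asset; not an existing API.
final class AssetId {
  const AssetId(this.path);

  /// The path the asset is bundled under.
  final String path;
}

/// Hypothetical generated code: one constant per declared asset.
abstract final class Images {
  static const myIcon = AssetId('assets/images/icon.png');
}

void main() {
  // Code refers to the constant instead of spelling out the string.
  print(Images.myIcon.path); // assets/images/icon.png
}
```

Because each asset is only reachable through a constant, an unused constant (and, in principle, its asset) becomes visible to tree-shaking, and a typo in an asset name turns into a compile-time error.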

SantiiRepair commented 7 months ago

So what could be a solution?

For example, when I compile my project with dart compile exe project.dart, the files the project uses are not added to the compiled executable. That is a problem if I want my executable to work as a service on any computer.

lrhn commented 7 months ago

My, very simplistic, suggestion would be:

Introduce a Resource super-class:

abstract final class Resource<T> {
  external const Resource(String url);
  Future<T> load();
}

Introduce a set of specialized resource kinds:

/// A resource containing a sequence of bytes.
abstract final class ByteResource implements Resource<Uint8List> {
  external const factory ByteResource(String url);

  /// The length in bytes of this resource.
  Future<int> get length;

  /// An unmodifiable view of the bytes of the resource.
  Future<Uint8List> load();

  /// An unmodifiable buffer of the bytes of the resource.
  ///
  /// Can then be viewed as any type of typed data, like 
  /// ```dart
  /// var doubles = (await res.loadBuffer()).asFloat64List();
  /// ```
  Future<ByteBuffer> loadBuffer();

  /// Read the resource into an existing byte list.
  Future<Uint8List> loadInto(Uint8List target, [int offset = 0]);

  /// Read a range of the resource into an existing byte list.
  Future<Uint8List> loadRangeInto(int start, int end, Uint8List target, [int offset = 0]);
}

/// A resource containing a Dart [String].
abstract final class StringResource implements Resource<String> {
  /// A string resource, loaded from [url] and decoded using [encoding].
  ///
  /// Encoding *must* be one of [utf8], [latin1] or [ascii].
  // (TODO: Add `utf16` decoder, then also allow `utf16`, `utf16.littleEndian` and `utf16.bigEndian`.)
  external const factory StringResource(String url, [Encoding encoding = utf8]);
  /// Length of the string, in UTF-16 code units.
  Future<int> get length;
  /// Load the content of the string.
  Future<String> load();
}

/// A resource containing JSON data.
abstract final class JsonResource implements Resource<Object?> {
  /// The [url] must refer to a file containing JSON source, which must be UTF-8 encoded.
  external const factory JsonResource(String url);

  /// Read the JSON file into an unmodifiable Dart JSON value.
  ///
  /// A JSON value is either a `List` of JSON values, a `Map` from strings to JSON values,
  /// or a simple [String], [num], [bool] or `null` value.
  Future<Object?> load();
}

Then you specify a resource by declaring a constant:

const myStringFromFile = StringResource('package:my_package/src/res/text_file.txt');

It'll be a compile-time error to use the constructor in a non-const way. The constants can be tree-shaken as any other constant.

Whichever constants are still left after compilation, the files their URLs point to are included in the distributable, in a way such that myStringFromFile.load can load them. The compiler and runtime get to decide how and where. Data can be put into the .data segment of the executable, as unmodifiable, if that helps.

It's up to the runtime to decide whether to cache the file contents on first load, for how long, and in which format the content is stored. For example, it can be compressed. JSON can be massively compressed if it has the same structure many times, and since we control the format, we can parse the file as JSON at compile time, store it in a specialized binary format, and read it back from there (possibly even providing a cheap view on top of the compressed structure, instead of building a structure using normal Dart objects).

The one thing I'd consider is whether to support synchronous access. I'd probably have separate resource classes like StringSyncResource for that, with a synchronous load method. Then the compiler/linker can decide how that's best implemented. For the web, async resources can be lazy-loaded, while sync resources must be part of the initial deployment.
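
Since the external constructors above cannot run today, the following is only a runnable stand-in that mimics the shape of the proposal. The in-memory `_bundled` map is an assumption standing in for whatever the compiler and runtime would actually embed, and this `StringResource` is a plain class, not the proposed SDK type.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Stand-in for the data the compiler would embed: URL -> raw bytes.
final Map<String, Uint8List> _bundled = {
  'package:my_package/src/res/text_file.txt':
      Uint8List.fromList(utf8.encode('hello')),
};

/// Simplified mock of the proposed string resource (UTF-8 only).
final class StringResource {
  const StringResource(this.url);

  final String url;

  /// Decodes the bundled bytes, failing if the resource was not included.
  Future<String> load() async {
    final bytes = _bundled[url];
    if (bytes == null) throw StateError('Resource not bundled: $url');
    return utf8.decode(bytes);
  }
}

// Declared as a constant, so an unused resource can be tree-shaken.
const myStringFromFile =
    StringResource('package:my_package/src/res/text_file.txt');

Future<void> main() async {
  print(await myStringFromFile.load()); // hello
}
```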

SantiiRepair commented 7 months ago

I think it is the same as Isolate

dcharkes commented 2 months ago

Data asset as `(Pointer<Void>, int lengthInBytes)`

@mosuem I believe we should not only have Uint8List as a type, but also (Pointer<Void>, int lengthInBytes). (And Future<Uint8List> and Future<(Pointer<Void>, int)>.)

Reasoning: One might want to pass a data asset to native code. And if the Dart type is Uint8List, we can't see that it's an external typed data (and we must assume that it could be in the Dart heap and be moved by GC).

Data asset as `File` ?

Some C APIs actually want a file path instead of a buffer of bytes. I'm not entirely sure if we are able to support this. We could give a file path if, and only if, the embedder actually has the file on disk. If the asset is embedded in something else (a zip file, a data section in assembly, ...) there won't be a file path. So the only thing someone could possibly do is write the file manually to disk with File.writeAsBytes. Also, if we ever get data assets on the web backends, then there is no File type at all.

Of course, having to manually write the file to disk if the embedder already has the file on disk is maybe also undesirable. So maybe we should consider allowing File and Future<File>. And the embedder API should then have DataAsset_fopen and char* DataAsset_file_path. With the latter one returning the symlink resolved absolute path if it exists and otherwise nullptr.

(And then for the web backends we might be interested in somehow exposing a file to C code compiled to WASM via the WASI interface.)

API options

Just to pen down the thoughts about the API that @mosuem and I have discussed over the last couple of days. We have two options:

class AssetBundle {
  external Uint8List loadBytesSync(assetId);

  external Future<Uint8List> loadBytes(assetId);

  external (Pointer<Void>, int) loadBytesAsPointerSync(assetId);

  external Future<(Pointer<Void>, int)> loadBytesAsPointer(assetId);

  external bool availableAsFile(assetId);

  // No need for a sync and async variant.
  // Non-null if `availableAsFile` returns true.
  external File? asFile(assetId);
}

pro: discoverability (auto complete on methods)
pro: enables dynamic assetIds (the link hook writer has to make sure that these assets are bundled)
con: not aligned with @Native external functions/symbols
con: File in the API makes it incompatible with non-VM

@Data(assetId)
external Uint8List loadMyAsset();

@Data(assetId)
external Future<Uint8List> loadMyAsset();

@Data(assetId)
external (Pointer<Void>, int) loadMyAsset();

@Data(assetId)
external Future<(Pointer<Void>, int)> loadMyAsset();

@Data(assetId)
external File? get myAsset;

// On the web backends ?
@Data(assetId)
external Blob? get myAsset;

pro: aligned with @Native external functions/symbols
pro: does not conflict with data assets on the web.
con: does not enable dynamic asset Ids
con: discoverability

We can have option 3, which patches up the requirement for dynamic assets by making assetId optional in the @Data annotation, and allowing an argument on the definition:

@Data()
external Uint8List loadMyAsset(String assetId);

@Data()
external File? myAsset(String assetId);

// ...

pro: aligned with @Native external functions/symbols
pro: does not conflict with data assets on the web.
pro: enables dynamic asset Ids
con: discoverability

mosuem commented 2 months ago

con: File in the API makes it incompatible with non-VM

That could be helped by having different AssetBundles with different APIs

class AssetBundle {
  external Uint8List loadBytesSync(assetId);

  external Future<Uint8List> loadBytes(assetId);
}

class PointerAssetBundle {
  external (Pointer<Void>, int) loadBytesAsPointerSync(assetId);

  external Future<(Pointer<Void>, int)> loadBytesAsPointer(assetId);
}

class FileAssetBundle {
  external bool availableAsFile(assetId);

  // No need for a sync and async variant.
  // Non-null if `availableAsFile` returns true.
  external File? asFile(assetId);
}

dcharkes commented 2 months ago

I guess these bundles should then live in different places, and the ones that do not use File (because they must be available on the web) cannot refer in doc comments to the one with File.

Con: bad discoverability due to possible ways to interact being spread out.

Side question for the AssetBundles: should all methods be static? Otherwise we can have dynamic invocations on an object. With static methods we know if there are non-const invocations; with instance methods we always have to assume there are dynamic invocations. (Hence why half of the FFI is static methods and the other half is extension methods, which are also static.)

lrhn commented 1 month ago

Another possible API is:

abstract interface class Asset {
  const factory Asset(String key) = _SystemAsset;
  Future<ByteBuffer> loadBytes({bool readOnly = true});
}

Then you need to invoke the Asset constructor to have an asset, and const invocations are easily found.

But that's not much different from just a top-level external ByteBuffer loadAssetBytes(String key, {bool readOnly = true});

The loadBytes -> ByteBuffer is the only operation we need, assuming all assets are byte sequences. However, consider if assets could be typed:

a "byte asset" is a (well-aligned) chunk of memory, but
a "string asset" is something the system can load as a string, without you needing to know how it's stored; that's between the linker and the runtime system, doing what's most efficient. (For example, store as UTF-16, load by creating an external string backed by the loaded bytes.)

Then we will need a load-function per asset kind.

I'm a little worried that the keys are just strings, but I guess the linker will complain if two assets have the same name, and any statically detectable asset access can be checked against available assets at compile-time.
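
Both checks mentioned here are cheap to express. The sketch below assumes the tooling already has the list of declared asset ids and the list of constant key strings found in Dart code; the function names `duplicateIds` and `unknownKeys` are hypothetical.

```dart
/// Asset ids that were declared more than once.
Set<String> duplicateIds(Iterable<String> declaredIds) {
  final seen = <String>{};
  final duplicates = <String>{};
  for (final id in declaredIds) {
    // Set.add returns false when the id was already present.
    if (!seen.add(id)) duplicates.add(id);
  }
  return duplicates;
}

/// Constant keys used in Dart code that match no declared asset.
Set<String> unknownKeys(Set<String> declaredIds, Iterable<String> constKeys) =>
    {for (final key in constKeys) if (!declaredIds.contains(key)) key};

void main() {
  final declared = [
    'package:my_package/foo.txt',
    'package:my_package/bar.txt',
    'package:my_package/foo.txt',
  ];
  print(duplicateIds(declared)); // {package:my_package/foo.txt}
  print(unknownKeys(declared.toSet(), [
    'package:my_package/foo.txt',
    'package:my_package/baz.txt',
  ])); // {package:my_package/baz.txt}
}
```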

mosuem commented 1 month ago

I'm a little worried that the keys are just strings, but I guess the linker will complain if two assets have the same name, and any statically detectable asset access can be checked against available assets at compile-time.

I would also like typed keys, something like an enum, but this would require codegen when adding an asset to be able to use it in the application. Having typed keys with a const constructor would help in finding usages of asset loading for tracking in resources.json... :)

I added the suggestion to go/dart-assets.

dcharkes commented 1 month ago

I'm a little worried that the keys are just strings,

They are namespaced per package: package:my_package/foo.txt.

any statically detectable asset access can be checked against available assets at compile-time.

That is a good idea, we should do that once we have the resources.json integration.

lrhn commented 1 month ago

namespaces per package

Is that optional, or does the framework providing the assets enforce that the string has that form?

If the latter, can you compile a file with assets if that file doesn't have a package: URI, or is not in a package directory?

(Still means that someone can access an asset of another package if they know the name. Probably no way to avoid that without some trickery.)

dcharkes commented 1 month ago

Still means that someone can access an asset of another package if they know the name.

That is a feature for native code assets. We want to avoid bundling two identical dynamic libraries. Also for native code, if static linking is used, all code lives in the same namespace. So trying to create encapsulation would start to create weird behavior when a different linking mode is used.

I am not entirely sure if we should have the same goal with data assets or not.

Maybe we should consider making an asset with id package:my_package/foo.txt accessible from all code, but an asset with id package:my_package/_foo.txt only accessible from Dart code in package:my_package.

Is that optional, or does the framework providing the assets enforce that the string has that form?

Declaring assets in build and link hooks does enforce this. The usage from Dart does not (yet).

lrhn commented 1 month ago

only accessible from Dart code in package:my_package

That would require a loadBytes(String key) to check where it's called from. Let's just not do that. (I'd argue that it's not enough to check the top of the stack; you'd have to check the entire stack to see if the call was initiated by code from that package, otherwise you can't use helper libraries. And that's assuming you can even say which package code comes from after compiling.)

Giving a warning for a constant key string is fine. Trying to enforce runtime security is not worth it.
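
A compile-time warning of that kind could look roughly like the sketch below. It assumes ids of the form package:<name>/<path>, treats a file name starting with an underscore as package-private (a convention only floated in this thread, not decided), and uses hypothetical function names throughout.

```dart
/// Extracts `<name>` from `package:<name>/<path>`, or null if malformed.
String? packageOf(String assetId) {
  const prefix = 'package:';
  if (!assetId.startsWith(prefix)) return null;
  final slash = assetId.indexOf('/', prefix.length);
  if (slash < 0) return null;
  return assetId.substring(prefix.length, slash);
}

/// Assumed convention: a leading underscore in the file name means private.
bool isPrivate(String assetId) =>
    assetId.substring(assetId.lastIndexOf('/') + 1).startsWith('_');

/// Returns a warning for a constant key, or null if there is nothing to report.
String? keyWarning(String assetId, String callerPackage) {
  final owner = packageOf(assetId);
  if (owner == null) {
    return 'Asset id "$assetId" is not of the form package:<name>/<path>.';
  }
  if (owner != callerPackage && isPrivate(assetId)) {
    return 'package:$callerPackage uses private asset "$assetId" '
        'of package:$owner.';
  }
  return null;
}

void main() {
  print(keyWarning('package:my_package/foo.txt', 'other')); // null
  print(keyWarning('package:my_package/_foo.txt', 'my_package')); // null
  print(keyWarning('package:my_package/_foo.txt', 'other'));
}
```

This only inspects constant key strings at compile time; nothing is enforced at runtime, in line with the point above.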
