image package works slower when compiled with dart2native compared to JIT #39367

Open mindplay-dk opened 4 years ago

mindplay-dk commented 4 years ago

I was (am) pretty excited about the dart2native announcement, and decided to test it.

Where I would really expect this to shine, is with the sort of heavy number crunching that generally makes scripting languages fall short or delegate the heavy work to C.

So I installed the image package version 2.1.8, and wrote a very basic script:

import 'dart:io';
import 'package:image/image.dart';

int calculate() {
  // Time the decode and resize steps together.
  var stopwatch = Stopwatch()..start();

  var image = decodeImage(File('input.jpg').readAsBytesSync());

  Image thumbnail = copyResize(image, width: 200, interpolation: Interpolation.average);

  return stopwatch.elapsed.inMilliseconds;
}

And a basic console front-end:

import 'package:image_test/image_test.dart' as image_test;

main(List<String> arguments) {
  print('Time taken: ${image_test.calculate()}!');
}

I'm feeding it a big photo of 5760 x 3840 px, and as you can see, I'm using presumably the most expensive Interpolation algo available.

Run this with the VM:

> dart bin\main.dart
Time taken: 1262!

Let me interject here and say, this is by far the fastest I've ever seen any scripting language resize an image of this size - this library is single-threaded, so that is really incredibly fast! Kudos on delivering probably one of the fastest scripting language VMs ever created! 🤩

But (obviously?) I was expecting this to be even faster when compiled to a native binary.

So I built it:

> dart2native bin\main.dart -o bin\image_test.exe
Generated: c:\workspace\dart\image-test\bin\image_test.exe

And ran it:

> bin\image_test
Time taken: 2084!

Almost 80% slower?

I ran both many times, and the results are pretty consistent.

I also pulled up a CPU monitor, and it does look like the Dart VM uses more CPU power - I see a spike on two CPU cores, whereas with the compiled binary, I see a spike only on a single core. Presumably the code runs single-threaded on the Dart VM, and the second CPU core spike is due to the VM making optimizations or doing garbage collection on the fly or something?

Anyhow, this result is more than a little surprising to me. 🤔

Note that I'm using the 64-bit Windows build of the Dart VM, version 2.6.1 (Mon Nov 11 13:12:24 2019 +0100) - perhaps this isn't fully optimized for Windows yet?

Or perhaps the compiler has not been optimized for raw number crunching yet? I suppose the VM has been around for a lot longer and the native compiler is still very new, so maybe the VM has optimizations that the native compiler doesn't have yet?

mraleph commented 4 years ago

But (obviously?) I was expecting this to be even faster when compiled to a native binary.

It is a common misconception - which comes up often enough that we should probably add it to an FAQ (/cc @mit-mit).

AOT and JIT compilation have different performance trade-offs. JIT has access to an accurate runtime profile of your application (including information about which parts of the code are hot, which classes are allocated and which receiver types are seen by each individual call site). Using this information JIT can speculate and produce very good machine code tailored for how your program is actually running. This speculation does not even have to be correct for arbitrary inputs to your program - because JIT can always fall back to a slower version dynamically. That is why JIT usually gives you very good peak performance. However you have to pay for this with startup and warmup latency - which is visible when you need to run a lot of code before your application puts the first pixel on the screen.

AOT has a different story - it does not actually know how your code will run. It has to look at the application as a whole, run various global analyses and try to recover information that JIT gets by observing the execution. It can't speculate - it has to produce code that is guaranteed to work. Sometimes AOT can figure things out, sometimes it falls short of following the flow of types through the program and has to produce generic and rather inefficient code.

You might ask here: wait, is not Dart statically typed? why do we even need any sort of global analyses?

The answer to this question is: yes, Dart is statically typed, but static types don't necessarily give you enough information to produce good code. Take for example a variable v of static type List<int>. This variable can contain any of the following: const [10], [10], Uint8List(1) and Int32List(1) (and more!). Which means that in the general case an access v[0] needs to be compiled in a way that supports all of them - which is rather inefficient compared to what an element access specialised for a particular list type would look like.
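To make the List<int> point concrete, here is a minimal sketch. The helper name first is made up for illustration; the only thing it is meant to show is that one statically typed call site can receive several different list implementations at run time.

```dart
import 'dart:typed_data';

/// Reads the first element through the static type List<int>.
/// From this signature alone the compiler cannot tell which list
/// implementation it will be handed. (Hypothetical helper.)
int first(List<int> v) => v[0];

void main() {
  // Four values that all satisfy the static type List<int>.
  final List<List<int>> candidates = [
    const [10], // constant (immutable) list
    [10], // ordinary growable list
    Uint8List.fromList([10]), // typed list of unsigned bytes
    Int32List.fromList([10]), // typed list of 32-bit integers
  ];

  for (final v in candidates) {
    // Same call site, different runtime class each time.
    print('${v.runtimeType}: ${first(v)}');
  }
}
```

The exact runtime type names printed depend on the SDK version, so none are listed here. A JIT that only ever sees one of these classes at the v[0] site can specialise for it; an AOT compiler has to prove the same thing ahead of time or emit generic code.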

This just scratches the surface of the problem - in reality the situation is even more complex.

That said we do try to bring the difference between AOT and JIT down as much as possible where it matters.

mit-mit commented 4 years ago

Updating FAQ in https://github.com/dart-lang/site-www/pull/2098

mindplay-dk commented 4 years ago

Yep, I understand all of that.

And for complex functions, I would expect the JIT might be faster.

But for very simple functions, just sheer number crunching, AOT ought to be faster, since it doesn't need to do any of the run-time analyses or optimizations that the JIT needs to do.

And once you enter a very long loop, you know the data-type of the list before you start processing it, so at the very least, that should be faster?

It isn't:

> dart bin/resize.dart

1150 decodeJpg
743 copyResize
888 encodeJpg

> dart2native bin/resize.dart -o bin/resize.exe
> bin\resize.exe

1787 decodeJpg
1050 copyResize
1432 encodeJpg

I'd expect copyResize to be faster, at least?

The code is accurately type-hinted here and here to avoid e.g. type-checking list elements, so AOT really ought to be faster at least for this case, I think?

There should be enough static information available in this case for an AOT to at least beat the JIT on a tight closed loop with well-known types?

I'm not trying to be poignant here, but if AOT is going to be consistently slower than JIT, why even compile to native binary in the first place? Wouldn't it be more efficient to compile to bytecode and link the JIT run-time into the executable?

Wouldn't it be considerably less work and maintenance, too? You have the bytecode compiler and JIT run-time available anyhow - I'm sure maintaining a cross-platform binary back-end for the language is a pretty substantial effort.

Beyond producing stand-alone executables, what value proposition does dart2native have over the VM?

I was (am) excited about being able to produce stand-alone executables, but maybe compiling to native binaries isn't the best or simplest approach? If you could simply embed the JIT engine in a stand-alone executable instead, we'd have the same portability, ease of deployment, better performance, access to reflection, etc. without any further ado.

Perhaps the main benefit of a native binary over an embedded VM approach would be the smaller file size - but is that very important in this day and age? My example with a web server that resizes images comes out around 8 megabytes anyway. I don't know how big an embedded JIT would be, but the bytecode likely would be a few hundred KB, so for many common use-cases, I suspect an embedded VM might even be comparable in terms of size?

mraleph commented 4 years ago

And once you enter a very long loop, you know the data-type of the list before you start processing it, so at the very least, that should be faster?

To get peak performance you need to know the data-type of the list at compile time - knowing that the list type is invariant within the loop could theoretically help, but you would still have some sort of virtual dispatch in the loop itself.

[Note that the type annotation Uint8List does not yield enough information to enable the fastest possible way to access the list, because Uint8List has multiple representations: at the very least it could be a normal Uint8List or it could be a view into another typed list.]

But for very simple functions, just sheer number crunching, AOT ought to be faster, since it doesn't need to do any of the run-time analyses or optimizations that the JIT needs to do.

Again it is not a straightforward comparison. JIT for example has the chance to speculate on bitwidth of the numbers involved. AOT has to be conservative and prove things.

In general we do have a problem that our AOT compiler does not produce the best numeric code for tight loops (especially with integers) and this is something that we plan to eventually fix.

I looked at the code generated for copyResize - it is true that we can't produce the best code for accessing sData - but I don't think it is the biggest performance sink in the code. I think the biggest issue is that we don't keep r, g, b and a properly unboxed and that we do some pretty bad stuff with si because it is used both in arithmetic and in the indexing operation. I have filed a couple of issues to fix that.

if AOT is going to be consistently slower than JIT, why even compile to native binary in the first place?

As I have indicated before - we would really like to bring AOT performance as close to the JIT performance as possible. We are working on it continuously. It takes time because it is not a trivial problem. It is much easier to make a fast JIT for a language like Dart than a fast AOT, especially if you take certain additional constraints like code size into account. (Dart AOT was originally created for mobile devices - so every byte counts).

The reason to use AOT in the first place is low latency startup and good performance (it might not be as high as JIT performance in all cases, but it is still good enough for many kinds of applications). Also you can use AOT in places where you can't use JIT (e.g. iOS).

If you don't care about startup latency and care about peak performance - then you should certainly use JIT at the moment.

ConsoleTVs commented 4 years ago

Just found this issue; the same thing is happening here.


My code is nearly 80 lines, fully type annotated (no dynamics) and makes use of const / final variables and const constructors and fixed length lists when possible. Also, the part of the code where most of the time is spent is on a switch (this is normally optimized into a jump table in some compilers).

I think more consideration should be given to optimizations, especially since it took quite some time to compile the Dart code AOT (I thought part of that was due to optimizations?).

I will leave the code here, in case you may use it for further improvements, Have a nice day!


ansarizafar commented 4 years ago

We would really like to bring AOT performance as close to the JIT performance as possible. We are working on it continuously.

There is a ray of hope in this sentence and just because of this I am switching to Dart AOT for full-stack (back-end, front-end) development. Having said that, I think we can try to learn from other statically typed AOT compiled languages like Go, Rust, Crystal, and Nim.

ConsoleTVs commented 4 years ago

The issue is not about learning other languages. I can code in almost 20. The thing is that, as far as I can see, AOT compilation is only meant for startup-sensitive apps. However, most people expect run-time performance rather than startup performance.

mraleph commented 4 years ago

@ConsoleTVs you can replace List<int> with Int32List to speed up the AOT version of the code. We currently lose some type information in the backend that is needed to produce good code (filed https://github.com/dart-lang/sdk/issues/39515 to track fixing that).

ConsoleTVs commented 4 years ago

Great to hear! This could make the AOT version run faster, taking 12.380 s instead of 16.945 s (don't take those numbers seriously, those are not accurate). Still an improvement!

ansarizafar commented 4 years ago

In this new serverless world of AWS Lambda, startup performance, run-time performance and CPU/memory efficiency are all important.

ConsoleTVs commented 4 years ago

In this new serverless world of AWS Lambda, startup performance, run-time performance and CPU/memory efficiency are all important.

What are you trying to prove?

ansarizafar commented 4 years ago


be-thomas commented 4 years ago

May I ask, why g++ is able to produce so much faster code with AOT. Beating almost every JIT in existence. Yet Dart, with all those static types and ahead of time information becomes incompetent in front of JIT.

The languages using JIT, actually have their complex logic(and often their core library) implemented in statically typed and AOT compiled language.

And the false propaganda I have seen here, right after dart's AOT release is that JIT is faster than AOT.

Even Java's JIT cannot beat C/C++'s AOT. All the optimizations that the JIT is busy doing are enough to slow it down below AOT speed. AOT is awaiting optimizations.

ConsoleTVs commented 4 years ago

@thomasb892 Because of this: "We currently lose some type information in the backend that is needed to produce good code"

mraleph commented 4 years ago


May I ask, why g++ is able to produce so much faster code with AOT.

Because g++ is compiling C++, which is a much lower-level language. Imagine you write something like this in C++:

#include <vector>

struct S {
  int f;
};

int foo(int a, std::vector<int>& b, S* p) {
  return a + b[0] + p->f;
}

When a C++ compiler compiles this function it does not have to worry that a can be nullptr (because it can't - int is a primitive type, not a pointer), that b is anything other than an actual std::vector, or that p->f is a method call rather than just an access to an int member at a fixed offset.

In Dart none of this is true.

class S {
  final int f;
}

int foo(int a, List<int> b, S p) {
  return a + b[0] + p.f;
}

a can be null, b can be null or any instance of any implementation of List<int>, p can point to SImpl defined as

class SImpl implements S {
  get f => throw "Hahaha";
}

and so on and so forth.

So comparing Dart AOT to C++ does not really help. Compiling C++ is easier.

(As a sidenote: even a C++ compiler can be assisted by PGO, e.g. you can get significant performance improvements from relayouting binaries or doing profile guided devirtualization - which highlights pure AOT's shortcomings).

mindplay-dk commented 4 years ago

@mraleph I think Dart was supposed to get strict nulls soon? Which should address that issue at least.

mit-mit commented 4 years ago

Yes, we definitely plan on doing VM perf optimisations once we have null safety landed.

windrunner414 commented 4 years ago

@mit-mit any plans for AOT perf optimisation? Flutter uses AOT on iOS and possibly on Android too, and for serverless, startup speed, memory usage and runtime perf are all important.

mraleph commented 4 years ago

@windrunner414 we are continuously working on improving performance of AOT code.

If you have some specific code in mind which you think runs slow please file a separate issue. Then we can take a look and suggest if we can do something on our side to make the code faster or if the code could be changed to make it faster.

mindplay-dk commented 4 years ago

@mraleph the raw number crunching performed by the image library I mentioned in this issue is definitely a good candidate? It ought to perform better with AOT, as it's all statically-typed and, well, this is what CPUs do best. Getting close to bare-metal performance ought to be possible. 😄

mraleph commented 4 years ago

@mindplay-dk while in general we want to improve performance of working with numbers, I would say that using pure Dart ports of image manipulation routines does not make sense to me - if performance is important - instead I'd recommend calling some native library to do the image manipulation (you can sandbox it if you are worried about vulnerabilities).

be-thomas commented 4 years ago

class S {
  final int f;
}

int foo(int a, List<int> b, S p) {
  return a + b[0] + p.f;
}

a can be null, b can be null or any instance of any implementation of List<int>, p can point to SImpl defined as

class SImpl implements S {
  get f => throw "Hahaha";
}

and so on and so forth.

So comparing Dart AOT to C++ does not really help. Compiling C++ is easier.

(As a sidenote: even a C++ compiler can be assisted by PGO, e.g. you can get significant performance improvements from relayouting binaries or doing profile guided devirtualization - which highlights pure AOT's shortcomings).


List b is only supposed to be passed by reference. So it can be a pointer. Therefore it can be null. In C/C++ they are mostly pointers otherwise it's slow. We could use everything as pointers.

Also that OOP code, even C++ does it. And does it rather fast. Dart AOT has a lot of potential.

windrunner414 commented 4 years ago

When the new null safety lands, we can know that some values are never null. For a nullable object a check could be forced: you can't call anything on it unless you have checked that it isn't null. And for S, maybe we don't need to care about its runtimeType and can just use the offset within S, like C++ does: do not pass the SImpl pointer but the S pointer.

class S {int a=1;}
class S1 {int b=2;}
class SImpl extends S with S1 {int c=3;}

*SImpl, *S -> int a
      *S1 -> int b
            int c

void s(S1 s1) => print(s1.b);
SImpl anyImpl = SImpl();

If we call s(anyImpl), pass the *S1. I think we can know what type anyImpl is, unless it's dynamic. If it's dynamic, checking the runtimeType is necessary, but if not, this step can be skipped.

be-thomas commented 4 years ago

There are possibly more tricks one could use to speed up AOT because of Dart being very similar to Java. Android shifted from Dalvik VM(JIT) to ART runtime(AOT) and it has only been faster ever since.

Maybe we could learn from ART.

mraleph commented 4 years ago


List b is only supposed to be passed by reference. So it can be a pointer. Therefore it can be null. In C/C++ they are mostly pointers otherwise it's slow. We could use everything as pointers. Also that OOP code, even C++ does it. And does it rather fast. Dart AOT has a lot of potential.

I am not sure I understand what you are trying to say here. Yes, b is a pointer. Yes, it can be null. What's next? In reality it is more of a problem that variables of primitive types (like int) can be null - this is a much bigger issue for performance than the fact that variables of "complex" types like List<...> can be null.

That's where C++ differs a lot from Dart - variables of primitive types can't ever be nullptr there. Also, if you use pointers in C++ and then dereference them, the compiler is actually free to assume that the pointer is not nullptr (it is UB to dereference a NULL pointer). In Dart, null is an actual object which has some methods (null.hashCode and null.toString work), while an attempt to call anything else on null will trigger null.noSuchMethod. Drastic difference from C++. Though again: nullability is the biggest issue for primitives. For something like List<> the biggest issue is that often you don't know which implementation of List<> you are getting. It's as if in C++ instead of passing around std::vector<T>& you would pass around some sort of abstract interface with virtual methods and std::vector<> was one of the possible implementations. (Though it is even more complicated than that because of the covariance in Dart - C++ templates are invariant).

There are possibly more tricks one could use to speed up AOT because of Dart being very similar to Java. Android shifted from Dalvik VM(JIT) to ART runtime(AOT) and it has only been faster ever since.

Yes, there are tricks to speed up AOT. If you actually look through the git history you will discover that we are constantly applying new ones :)

Note that these days ART does not actually use a simple AOT - since Android N it actually uses profile guided AOT which is driven by profiles collected at runtime. You don't compile the whole app on installation - instead you run the application in a JIT and then use some background process to recompile hot parts of your application based on the profile information. Since Android 8 this profile information contains, among other things, inline cache states - which allows "AOT" (I'd rather call it asynchronous JIT though) to perform speculative optimisations.

Also as I have said before: when compiling Java you don't face all the same challenges that you face when compiling Dart - for example Java int and double are non-nullable primitives just like in C++.


When the new null safety lands, we can know that some values are never null. For a nullable object a check could be forced: you can't call anything on it unless you have checked that it isn't null.

It is true, though it must be clarified that initially most applications would be run in hybrid opt-in/opt-out mode in which you can actually violate non-nullability promises. Only if your application is fully opted in (no dependencies are opted-out) and you are running in strong checking mode you can be sure that int x is never null. We do plan to make good use of non-nullability information for such applications.

And for S, maybe we don't need to care about its runtimeType and can just use the offset within S, like C++ does: do not pass the SImpl pointer but the S pointer.

Yeah, I know how C++ implements inheritance. I am not sure why you are bringing it up here though. Notice that the original example with SImpl replaces a field with a getter. How is this technique helpful in addressing that? (It is not)

It's an interesting question whether there is a lot of performance sensitive code like that to begin with.

Leaving that aside (assuming for example this sort of code was important and we wanted to apply this technique), I can see at least a few challenges applying it:

windrunner414 commented 4 years ago

@mraleph We can know at compile time if it might be a getter / setter, and let S and any implementation of S to have a getter&setter, not just int f. It may improve performance but u are right, there are many challenges and it's complicated

hooluupog commented 4 years ago

Correct me if I'm wrong. Dart is now a real statically typed language (beginning with Dart 2.0) and is being used to develop mobile apps. Flutter is a native-performance cross-platform framework. To better compete with native apps (written in Java/Kotlin/Swift), high performance is important. So, is there any plan to support unboxed types (something like Java value types, inline classes[1], value types without object identity) to further improve performance and reduce memory usage?

[1]State of Valhalla. The Road to Valhalla(https://cr.openjdk.java.net/~briangoetz/valhalla/sov/01-background.html)

mraleph commented 4 years ago

@hooluupog Feature Request for value types is better raised at dart-lang/language, because it is a language design decision. We have discussed adding value types for many years now - and so far there have been much higher priority issues to tackle.

hooluupog commented 4 years ago

@mraleph Okay, got it.

jonaird commented 4 years ago

Forgive me if I’m getting this wrong but it sounds like it comes down to losing type metadata for the sake of file sizes? Are there other performance issues that this is important for? I can understand why this would be important for mobile apps and for dart2js but for serverside apps and cli’s, performance would be much more important than file size IMO.

mraleph commented 4 years ago

Forgive me if I'm getting this wrong but it sounds like it comes down to losing type metadata for the sake of file sizes?

No, I am not sure which part of this thread made you think this way.

It is true though that we take code size in consideration - which impacts for example our inlining decisions (AOT inlining is much less aggressive than JIT inlining as a result), but that's a somewhat separate topic.

sjapps commented 4 years ago

Forgive me, but this is kind of off topic: for a backend server application like aqueduct, is it recommended to deploy to production in AOT or JIT on, say, Google's Cloud Run (semi-serverless)? @devisions has been adding AOT support to aqueduct here

mit-mit commented 4 years ago

That would depend on a bunch of factors, such as how frequently your backend spins down/up, what kind of code it runs, etc. I'd recommend doing some benchmarking for your particular workload.

dxps commented 4 years ago

@sjapps It is true that - as @mit-mit Michael said - some stress testing would be needed on both AOT (native) and JIT (non-native), as your application's behavior under JIT optimizations may sometimes be better than the native version.

Indeed, startup time and memory usage may favor your needs and expectation.

This is applicable to all other similar platforms, such as Java (more specifically, look for Quarkus with GraalVM).

Oh, and the lovely AOT capability of Aqueduct has been added by the Aqueduct Team and @joeconwaystk himself. I am starting to invest time in it as I would love to contribute, and in this particular case I was just a messenger and gave back some feedback. 😊

entrptaher commented 3 years ago

I tested a simple fib and puppeteer program in dart.

Test case: Fib

num fib(num n) {
  if (n <= 1) return 1;
  return fib(n - 1) + fib(n - 2);
}

void main() {
  // ...
}

Running the compiled, jit, aot, kernel and vm version,

➜  time dart run bin/sample.dart
dart run bin/sample.dart  7.98s user 0.07s system 98% cpu 8.133 total

➜  time dart run bin/sample.jit      
dart run bin/sample.jit  7.58s user 0.06s system 98% cpu 7.763 total

➜  time dartaotruntime bin/sample.aot
dartaotruntime bin/sample.aot  13.33s user 0.00s system 99% cpu 13.333 total

➜  time bin/sample.exe 
bin/sample.exe  13.53s user 0.01s system 99% cpu 13.544 total

➜  time dart run bin/sample.dill            
dart run bin/sample.dill  7.86s user 0.07s system 99% cpu 7.973 total

Verdict: The compiled and AOT versions are about 1.7x slower than the vm, jit and kernel versions.
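Since time measures the whole process, startup included, it can help to also time the hot function from inside the program. The following is only a sketch: the helper name timeMs, the input size and the repeat count are assumptions chosen for illustration, not values taken from the runs above.

```dart
num fib(num n) {
  if (n <= 1) return 1;
  return fib(n - 1) + fib(n - 2);
}

/// Runs [body] once and returns the elapsed wall-clock milliseconds.
/// (Hypothetical helper for this sketch.)
int timeMs(void Function() body) {
  final sw = Stopwatch()..start();
  body();
  sw.stop();
  return sw.elapsedMilliseconds;
}

void main() {
  const n = 30; // assumed input size, pick one that suits your machine
  // Repeat the measurement so JIT warmup shows up as a difference
  // between the first run and the later ones.
  for (var run = 1; run <= 5; run++) {
    num result = 0;
    final ms = timeMs(() {
      result = fib(n);
    });
    print('run $run: fib($n) = $result in $ms ms');
  }
}
```

Running the same file with dart run and as a compiled executable then separates steady-state speed from startup cost, which is the distinction the two test cases here are probing.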

Test case: Puppeteer

This will run a headless chromium and close it after creating a new page.

import 'package:puppeteer/puppeteer.dart';

void main() async {
  var browser = await puppeteer.launch(headless: true);
  var myPage = await browser.newPage();
  await browser.close();
}

Running it:

➜  time dart run bin/sample.dart
dart run bin/sample.dart  2.58s user 0.32s system 131% cpu 2.204 total

➜  time dart run bin/sample.dill
dart run bin/sample.dill  0.46s user 0.10s system 93% cpu 0.603 total

➜  time dart run bin/sample.jit 
dart run bin/sample.jit  0.35s user 0.12s system 73% cpu 0.646 total

➜  time dartaotruntime bin/sample.aot
dartaotruntime bin/sample.aot  0.07s user 0.07s system 70% cpu 0.202 total

➜  time bin/sample.exe         
bin/sample.exe  0.08s user 0.04s system 63% cpu 0.187 total

Verdict: the compiled and AOT versions are much faster, and the vm version is about 10x slower.

Keithcat1 commented 2 years ago

Would it be possible for the Dart VM to write JIT profiles to disk while a program is running, possibly when an option is passed to Dart or when a specific function is called and then use those profiles when compiling in AOT mode?

mraleph commented 2 years ago

I am going to rename the issue from "Dart native slower than Dart VM?" to a more concrete "image package works slower when compiled with dart2native compared to JIT".

Since 2019 we have improved many things in the Dart AOT compiler (including TFA precision and better local optimizations). If I take the original benchmark which made @mindplay-dk file this issue, I see that we went from an 80% speed difference to an 11% speed difference (measuring on Dart SDK version: 2.17.0-222.0.dev (dev) (Fri Mar 18 12:54:18 2022 -0700) on "linux_x64"):

$ dart compile exe bin/main.dart
$ bin/main.exe
Time taken: 5822!
$ dart bin/main.dart
Time taken: 5203!

I have taken a quick look at the difference and the code quality has improved significantly; for example, we are now unboxing phis which were previously boxed, which leads to less memory traffic and tighter numeric code overall.

Now it seems that a lot of the difference originates from the JIT's ability to speculatively specialise for specific receiver types, e.g. the InputBuffer class is defined like this:

class InputBuffer {
  List<int> buffer;
  // ...
}

In JIT mode we figure out that buffer is always a Uint8List array and specialise accesses accordingly. In AOT mode TFA loses track of types somewhere (I did not try to figure out where) and instead generates virtual calls to List.operator[] when buffer is accessed.

If I replace List<int> buffer with Uint8List buffer the AOT runtime is decreased by ~140ms (which brings difference between JIT and AOT to 9% from 11%).

$ bin/main.exe
Time taken: 5682!

I think there are other cases like this. I looked at the flame graph and I can clearly see places where we perform virtual dispatch to array methods (including typed array methods) instead of properly handling these cases inline. I think at least some of these cases can be attributed to potential issues in the compilation pipeline (I will file bugs for those), but some of these (like InputBuffer.buffer being typed too loosely) are actually problems with the code itself.
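As a rough illustration of what "typed too loosely" means in source terms, here is a small sketch. LooseBuffer and TightBuffer are hypothetical stand-ins, not the image package's InputBuffer; the sketch only shows the change in the field's declared type and makes no claim about the machine code any particular SDK version produces.

```dart
import 'dart:typed_data';

// Hypothetical stand-ins for illustration; not the package's real class.
class LooseBuffer {
  LooseBuffer(this.buffer);
  final List<int> buffer; // any List<int> implementation may end up here
}

class TightBuffer {
  TightBuffer(this.buffer);
  final Uint8List buffer; // narrows the set of possible implementations
}

int sumLoose(LooseBuffer b) {
  var sum = 0;
  for (var i = 0; i < b.buffer.length; i++) {
    sum += b.buffer[i]; // element access through the general List<int> interface
  }
  return sum;
}

int sumTight(TightBuffer b) {
  var sum = 0;
  for (var i = 0; i < b.buffer.length; i++) {
    sum += b.buffer[i]; // element access through the narrower Uint8List type
  }
  return sum;
}

void main() {
  final data = Uint8List.fromList(List<int>.generate(256, (i) => i));
  // Both versions compute the same result; only the declared type differs.
  print(sumLoose(LooseBuffer(data)));
  print(sumTight(TightBuffer(data)));
}
```

As noted earlier in the thread, even a Uint8List annotation can still cover more than one representation (a plain typed list or a view), so this narrows the problem rather than removing it.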

Would it be possible for the Dart VM to write JIT profiles to disk while a program is running, possibly when an option is passed to Dart or when a specific function is called and then use those profiles when compiling in AOT mode?

@Keithcat1 this is possible though a question arises how to use these profiles. They can be used to guide inlining decisions, but they will not help to fully close the gap between JIT and AOT because AOT will not be able to apply these profiles speculatively, it will need to keep the fallback case around - which will degrade the code quality.

Keithcat1 commented 2 years ago

I imagined it would work similar to the way Clang does it with -fprofile-generate and -fprofile-use