dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License

Creating Uint8List is really slow on Android AOT and feels like should be faster overall #40966

Open KalilDev opened 4 years ago

KalilDev commented 4 years ago

While optimizing a function that creates an image in my app, I noticed that creating the Uint8List buffer took 21ms while the rest of the function took 5ms. I tried several things to make it faster without success, until I thought of using FFI with calloc to allocate a Uint8 pointer and calling asTypedList on it to get the buffer.

As you can see, performance is MUCH better with a plain calloc (granted, it is unsafe/unmanaged memory, but I would not expect this large a difference):
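For readers who have not used this approach, here is a minimal sketch of the calloc plus asTypedList technique described above. It assumes the current `package:ffi` API (`calloc<Uint8>(count)` and `calloc.free`), which differs from the older call shown in the benchmark code; `withNativeBuffer` is a hypothetical helper name, not something from the original benchmark.

```dart
import 'dart:ffi';
import 'dart:typed_data';

import 'package:ffi/ffi.dart';

/// Hypothetical helper: runs [body] with a zero-filled native buffer of
/// [length] bytes, then frees the buffer.
///
/// The buffer is unmanaged memory, so it must not be used after [body] returns.
T withNativeBuffer<T>(int length, T Function(Uint8List buffer) body) {
  // calloc returns zero-initialized memory that the Dart GC does not track.
  final Pointer<Uint8> pointer = calloc<Uint8>(length);
  try {
    // asTypedList creates a view over the native memory without copying it.
    return body(pointer.asTypedList(length));
  } finally {
    calloc.free(pointer);
  }
}

void main() {
  final sum = withNativeBuffer(1024, (buffer) {
    buffer[0] = 255;
    return buffer.fold<int>(0, (a, b) => a + b);
  });
  print(sum);
}
```

Freeing in a `finally` block keeps the unsafe part contained, at the cost of not being able to return the buffer itself to the caller.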

Android arm64 AOT (Avg: 118x, Median: 275x)
I/flutter ( 6194): Uint8List:
I/flutter ( 6194): Avg: 7774.8us
I/flutter ( 6194): Median: 8270.0us
I/flutter ( 6194): All: (8646us, 7219us, 8500us, 9596us, 8040us, 10504us, 9783us, 7771us, 7643us, 46us)
I/flutter ( 6194): calloc.asTypedList:
I/flutter ( 6194): Avg: 66.1us
I/flutter ( 6194): Median: 30.0us
I/flutter ( 6194): All: (404us, 37us, 30us, 54us, 33us, 22us, 30us, 12us, 28us, 11us)

Interestingly, this behavior is far more noticeable on Android arm64 AOT (I tried two different devices, one running Android 9.0 and the other running Android 10.0). On Windows (dart2native and JIT), Uint8List and calloc are much closer (NOTE: on Windows I am using HeapAlloc instead of calloc, which is a bit faster than calloc, so the difference is smaller in reality):

Dart2native (Avg: 5.5x, Median: 3.9x)
.\benchmark.exe
calloc.asTypedList:
Avg: 226.2us
Median: 218.5us
All: (274us, 212us, 224us, 238us, 242us, 212us, 224us, 213us, 211us, 212us)
Uint8List:
Avg: 1239.0us
Median: 854.5us
All: (1765us, 242us, 3457us, 1673us, 2078us, 503us, 497us, 1206us, 485us, 484us)

JIT (Avg: 4.8x, Median: 12x)
dart .\benchmark.dart
calloc.asTypedList:
Avg: 555.6us
Median: 206.0us
All: (3611us, 206us, 200us, 201us, 221us, 226us, 205us, 206us, 201us, 279us)
Uint8List:
Avg: 2687.4us
Median: 2544.0us
All: (4344us, 1885us, 2512us, 4416us, 2576us, 2135us, 3316us, 2135us, 3353us, 202us)

Code used was:


UnsafeUint8List callocUint8List(int length) {
  final pointer = calloc(count: length);
  return UnsafeUint8List._(pointer, length);
}

void benchmark(bool isUnsafe) {
  final arraySize = 10000 * 10000;
  final times = List(10);
  final vals = List(10);
  final timer = Stopwatch()..start();
  for (var i = 0; i < 10; i++) {
    timer.reset();
    vals[i] = isUnsafe
        ? callocUint8List(arraySize)
        : PackedUint8List(Uint8List(arraySize));
    times[i] = timer.elapsedMicroseconds;
  }
  timer.stop();
  if (isUnsafe) {
    for (final list in vals) {
      (list as UnsafeUint8List).free();
    }
  }
  print('${isUnsafe ? 'calloc.asTypedList:\n' : 'Uint8List:\n'}'
      '\tAvg: ${times.average}us\n'
      '\tMedian: ${times.median}us\n'
      '\tAll: ${times.map((t) => '${t}us')}');
}
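The `average` and `median` getters used in the print statement are not shown in the issue. A minimal sketch of what such helpers could look like follows; the extension name `TimingStats` and its definition on `List<int>` are assumptions, not the author's actual code.

```dart
/// Hypothetical stand-in for the helpers the benchmark calls on `times`.
extension TimingStats on List<int> {
  /// Arithmetic mean, or NaN for an empty list.
  double get average =>
      isEmpty ? double.nan : reduce((a, b) => a + b) / length;

  /// Middle value of the sorted list; for an even length, the mean of the
  /// two middle values. Returns NaN for an empty list.
  double get median {
    if (isEmpty) return double.nan;
    final sorted = [...this]..sort();
    final mid = sorted.length ~/ 2;
    return sorted.length.isOdd
        ? sorted[mid].toDouble()
        : (sorted[mid - 1] + sorted[mid]) / 2;
  }
}

void main() {
  // The Android arm64 AOT Uint8List timings reported in the issue.
  final timesUs = [8646, 7219, 8500, 9596, 8040, 10504, 9783, 7771, 7643, 46];
  print('Avg: ${timesUs.average}us');
  print('Median: ${timesUs.median}us');
}
```

Applied to the Android Uint8List timings, this reproduces the reported average of 7774.8us and median of 8270.0us. Note how the single 46us run pulls the average down while leaving the median nearly untouched.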

Improving this will probably benefit Flutter as well.

KalilDev commented 4 years ago

The Uint8List in the benchmark is 100,000,000 bytes (100,000 kB, approximately 100 MB).

That works out to roughly 12.9 GB/s for the average Uint8List allocation on Android (7774.8us), versus 247.5 GB/s for calloc even in its worst case (404us) on Android. I'm guessing the subsequent calloc allocations are much faster because the allocator is handing back the same memory block over and over again.
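The throughput figures follow directly from the buffer size and the measured times. A small sketch of the arithmetic, assuming 1 GB = 10^9 bytes; `throughputGBps` is a hypothetical helper name:

```dart
/// Converts a byte count and a duration in microseconds to GB/s,
/// with 1 GB = 10^9 bytes.
///
/// bytes / microseconds is bytes per microsecond, i.e. MB/s in units of
/// 10^6 bytes per second; dividing by 1000 gives GB/s.
double throughputGBps(int bytes, num micros) => bytes / micros / 1000;

void main() {
  const bufferBytes = 10000 * 10000; // 100,000,000 bytes, as in the benchmark

  // Average Uint8List allocation on Android arm64 AOT.
  print(throughputGBps(bufferBytes, 7774.8).toStringAsFixed(1));

  // Slowest calloc run on Android arm64 AOT.
  print(throughputGBps(bufferBytes, 404).toStringAsFixed(1));
}
```

A rate in the hundreds of GB/s is well beyond what simply writing zeros to that much memory would be expected to sustain on a phone, which suggests the calloc timing does not include actually touching all of the memory.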

OK, I've read a bit: both devices use jemalloc and both are 64-bit, so both are allocating a 20*2 MB bin (the default size for arm64). I do not yet know how fast the memory on them is, so I don't know whether this is a property of jemalloc, of the devices' memory speed, or of the Dart allocator. I'm willing to try all of this out, and I'll try to benchmark the memory speed. It may also be because of the way Uint8List zeroes out the memory; I have yet to look into that too. I'll read the code for this and try to wrap my mind around it.