dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License
10.24k stars 1.57k forks source link

UTF8 encoding is slow #25377

Open scheglov opened 8 years ago

scheglov commented 8 years ago

The following program encodes that same ASCII string using a naive approach and using actual UTF8.encode(). The naive approach is about 3 times faster. Could UTF8 be optimized to provide better performance when possible?

import 'dart:convert';

main() {
  String str = 'aaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbcccc';
  {
    Stopwatch sw = new Stopwatch()..start();
    //  var codeUnits2 = str.codeUnits.toList();
    for (int i = 0; i < 1000 * 1000; i++) {
      int length = str.length;
      List<int> bytes = new List(length);
      for (int j = 0; j < length; j++) {
        bytes[j] = str.codeUnitAt(j);
      }
    }
    print('Naive time: ${sw.elapsedMilliseconds}');
  }
  {
    Stopwatch sw = new Stopwatch()..start();
    for (int i = 0; i < 1000 * 1000; i++) {
      UTF8.encode(str);
    }
    print('UTF8 time: ${sw.elapsedMilliseconds}');
  }
}
scheglov commented 8 years ago

@stereotype441

iposva-google commented 8 years ago

@scheglov Thanks for pointing out that this is not a VM performance problem, but an underlying library issue.

johnmccutchan commented 8 years ago

Doesn't codeUnitAt return the UTF-16 encoding?

scheglov commented 8 years ago

Yes, it does. But if the string is ASCII-only, then it is the same as UTF-8, isn't it?

lrhn commented 8 years ago

The first version is actually not that naive - it knows, without checking anything, the size of the resulting output. That's one of the reasons UTF-8 encoding is not as fast. To avoid over-allocating it runs through the string once to find the result length, then runs through it again to do the actual encoding.

I'm not sure we have an "ASCII" special-case. That might be worth it - if the input length equals the output length, then you can just write the string into the result Uint8List directly.

scheglov commented 8 years ago

True, it knows the length. We could speculatively allocate Uint8List and fill it as we check.

When ASCII: 84 ms vs. 424 ms - 5 times faster. When not ASCII: 589 ms vs. 459 ms - about 30% slower.

import 'dart:convert';
import 'dart:typed_data';

main() {
//  String str = 'aaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbccccЙ';
  String str = 'aaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbcccc';
  {
    Stopwatch sw = new Stopwatch()..start();
    for (int i = 0; i < 1000 * 1000; i++) {
      toUTF8(str);
    }
    print('Optimized time: ${sw.elapsedMilliseconds}');
  }
  {
    Stopwatch sw = new Stopwatch()..start();
    for (int i = 0; i < 1000 * 1000; i++) {
      UTF8.encode(str);
    }
    print('UTF8 time: ${sw.elapsedMilliseconds}');
  }
}

List<int> toUTF8(String str) {
  int length = str.length;
  Uint8List bytes = new Uint8List(length);
  for (int i = 0; i < length; i++) {
    int unit = str.codeUnitAt(i);
    if (unit >= 128) {
      return UTF8.encode(str);
    }
    bytes[i] = unit;
  }
  return bytes;
}
mkustermann commented 4 years ago

@askeksa-google has worked on fast UTF-8 decoding earlier this year, and is probably going to take a look at faster UTF-8 encoding in Q4.