dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License

fromBuffer(bytes) in Dart is around 2 times slower than parseFrom(bytes) in Java #44175

Open a14n opened 3 years ago

a14n commented 3 years ago

When parsing PBF files from OpenStreetMap, it looks like reading a PrimitiveBlock is about 2 times slower in Dart compared to Java.

I attached a primitiveBlock.zip file containing a buffer (8497305 bytes) corresponding to a PrimitiveBlock.

Dart parses the bytes in 430ms:

final sw = Stopwatch()..start();
PrimitiveBlock.fromBuffer(bytes);
print(sw.elapsed);

Java parses the bytes in 220ms:

var start = System.currentTimeMillis();
PrimitiveBlock.parseFrom(bytes);
System.out.println("time:" + (System.currentTimeMillis() - start));

Versions:

$ dart --version
Dart SDK version: 2.10.2 (stable) (Tue Oct 13 15:50:27 2020 +0200) on "linux_x64"
$ java --version
openjdk 11.0.9 2020-10-20
OpenJDK Runtime Environment (build 11.0.9+11-Ubuntu-0ubuntu1.18.04.1)
OpenJDK 64-Bit Server VM (build 11.0.9+11-Ubuntu-0ubuntu1.18.04.1, mixed mode, sharing)

protobuf-1.0.1

mraleph commented 3 years ago

It's actually around 6 times slower when I measure it, if you allow the JVM to warm up by calling it a couple of times. The JVM goes down to around 100 ms after the first iteration:

0 took 333 ms
1 took 106 ms
2 took 108 ms
3 took 98 ms
4 took 71 ms
5 took 70 ms
6 took 71 ms
7 took 107 ms
8 took 111 ms
9 took 108 ms
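The numbers above come from a simple per-iteration timing loop. A Dart harness with the same shape might look like the sketch below; the workload in `main` is a stand-in, since in the real benchmark it would be `PrimitiveBlock.fromBuffer(bytes)` on the attached buffer:

```dart
// Runs `body` `iterations` times and prints per-iteration timings,
// matching the "N took X ms" output above. Measurement sketch only.
List<int> timeIterations(int iterations, void Function() body) {
  final times = <int>[];
  for (var i = 0; i < iterations; i++) {
    final sw = Stopwatch()..start();
    body();
    times.add(sw.elapsedMilliseconds);
    print('$i took ${times.last} ms');
  }
  return times;
}

void main() {
  // Stand-in workload; in the real benchmark this would be
  // PrimitiveBlock.fromBuffer(bytes).
  timeIterations(10, () {
    var sum = 0;
    for (var i = 0; i < 1000000; i++) {
      sum += i;
    }
  });
}
```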

Dart pays quite a price for the Int64 wrapper and also for dynamic function calls, which are used in the protobuf implementation for no good reason, but even after changing these things there is still around a 3x difference. It seems we additionally pay a price for the excessive generality of some code, which we achieve through closures. We should look at forcing enough inlining to eliminate any closure calls on the hot path (currently I can't achieve that purely through @pragma('vm:prefer-inline'), so an alternative might be needed).
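As a toy illustration of the closure cost (this is not the actual protobuf reader; both classes below are hypothetical), compare a reader whose per-element decode step is a stored closure with one that makes a direct, inlinable call:

```dart
import 'dart:typed_data';

// Generalized reader: the per-field decode step is a stored closure,
// so the VM sees an indirect call on the hot path that it cannot
// easily inline.
class GenericReader {
  final int Function(ByteData, int) readField;
  GenericReader(this.readField);

  int sum(ByteData data) {
    var total = 0;
    for (var i = 0; i + 4 <= data.lengthInBytes; i += 4) {
      total += readField(data, i); // closure call on the hot path
    }
    return total;
  }
}

// Specialized reader: the decode step is a direct call that the
// compiler is explicitly asked to inline.
class Int32Reader {
  @pragma('vm:prefer-inline')
  int readField(ByteData data, int offset) => data.getInt32(offset);

  int sum(ByteData data) {
    var total = 0;
    for (var i = 0; i + 4 <= data.lengthInBytes; i += 4) {
      total += readField(data, i); // direct call, inlining candidate
    }
    return total;
  }
}
```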

/cc @mkustermann

mkustermann commented 3 years ago

Firstly thanks a lot for the report!

Let me transfer this issue over to dart-lang/sdk and mark it as a VM issue for now. We'll investigate the performance issue.

mkustermann commented 3 years ago

@askeksa-google Could you investigate this?

mraleph commented 3 years ago

Per @askeksa-google's request I have put the benchmark code I had on GitHub. The Dart and Java benchmarks are in the osm_benchmark branch of the mraleph/protobuf repo. I have also put up some of the changes I played with (e.g. removing Int64 wrappers and experimenting with avoiding closures in the reader) in the some-performance-tweaks branch of the same repo, though these changes are not validated by tests.

mkustermann commented 3 years ago

@mraleph Maybe you can upstream those changes you already made? (e.g. specialise the internal int64 (de)coding logic for VM vs non-VM)

mraleph commented 3 years ago

I don't see a way to upstream the Int64 changes (which yield most of the improvement), as those change APIs. The changes related to declosurizing reading can probably be upstreamed, but they introduce a lot of code duplication and IIRC did not actually lead to a huge improvement; hence my original comment that additional research is needed.

mkustermann commented 3 years ago

Couldn't we at least change class Int64 to allow it to be as tight as a _Mint box? Something like:

const bool isVm = !identical(0, 0.0); // identical(0, 0.0) is true on the web, where all numbers are doubles

abstract class Int64 {
  static Int64 parseInt(String s) => isVm ? VmInt64.parse(s) : WebInt64.parse(s);
}

class VmInt64 implements Int64 {
  final int _value; // <- Ensure AOT can unbox this field.
  VmInt64(this._value);

  factory VmInt64.parse(String s) => VmInt64(int.parse(s));

  ...
}

class WebInt64 implements Int64 {
  final int _l;
  final int _m;
  final int _h;
  ...
}

and then make the protobuf decoder still use Int64 but specialize it for isVm?
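The key property of such a sketch is that isVm is a compile-time constant, so each compiler can drop the branch it doesn't need. A self-contained variant could look like the following (Fixed64 and its two implementations are hypothetical names, not the fixnum API; the web variant uses 22/22/20-bit limbs to stay within JS double precision):

```dart
// On the web (dart2js), all numbers are doubles, so 0 and 0.0 are
// identical there; on the VM they are distinct values.
const bool isVm = !identical(0, 0.0);

abstract class Fixed64 {
  // Const condition: each compiler tree-shakes the other branch.
  factory Fixed64(int value) =>
      isVm ? _VmFixed64(value) : _WebFixed64(value);
  int toInt();
}

// Single int field: as tight as a _Mint box, unboxable by AOT.
class _VmFixed64 implements Fixed64 {
  final int _value;
  _VmFixed64(this._value);
  @override
  int toInt() => _value;
}

// Three small-int limbs, each safe within JS double precision.
class _WebFixed64 implements Fixed64 {
  final int _l, _m, _h;
  _WebFixed64(int v)
      : _l = v & 0x3fffff,
        _m = (v >> 22) & 0x3fffff,
        _h = (v >> 44) & 0xfffff;
  @override
  int toInt() => _l | (_m << 22) | (_h << 44);
}
```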

mkustermann commented 3 years ago

@askeksa-google @mraleph We should consider adding this benchmark to golem.

mraleph commented 3 years ago

Interesting idea. It would still box Smis, which would otherwise be just tagged (e.g. in PbList), but it might shave off some of the performance difference. It would also require boxing in arguments and return values.

One thing I have thought about is that we could add support for passing the VmInt64 box around as a value type. We could add a special annotation, @pragma('vm:value-type'), telling the VM that its identity does not matter. Then in code like this:

VmInt64 foo(VmInt64 v) {
  return v + v;
}

the AOT compiler could completely unbox things. The box would only need to be created when crossing types (e.g. from VmInt64 to T in a List<T>). I think this is very promising and might prepare us for things like records as well.