Improve hashing performance

mraleph commented 2 months ago

The improvement is in the range of ~10-20x depending on platform and compilation mode, e.g. on my M1 MacBook Pro AOT mode goes from 6ns/byte to 0.3ns/byte and JIT mode goes from 5ns/byte to 0.4ns/byte.

The bulk of the improvement comes from avoiding allocating a temporary ByteData for every readLE64. While it might be reasonable to expect this allocation to be eliminated Dart VM compiler is currently unable to do that for several reasons (most of which can be easily fixed).

The second part of the win comes from avoiding boxed List type for accumulator and using Int64List instead.

Additionally JIT compiler trips over toUnsigned(64) which should be a no-op but instead hits a suboptimal implementation of out of range left shift (1 << 64) and falls over fast path, so I removed that as well.

SamJakob commented 2 months ago

Thanks! This is very interesting! I shall take a deeper dive and try and get this merged today.

SamJakob commented 2 months ago

I've managed to squeeze out an extra 200MB/s by avoiding a List allocation when creating the Int64List:

final acc = Int64List.fromList([
    kXXHPrime32_3,
    kXXHPrime64_1,
    kXXHPrime64_2,
    kXXHPrime64_3,
    kXXHPrime64_4,
    kXXHPrime32_2,
    kXXHPrime64_5,
    kXXHPrime32_1
  ]);

...becomes:

  final acc = Int64List(8);
  acc[0] = kXXHPrime32_3;
  acc[1] = kXXHPrime64_1;
  acc[2] = kXXHPrime64_2;
  acc[3] = kXXHPrime64_3;
  acc[4] = kXXHPrime64_4;
  acc[5] = kXXHPrime32_2;
  acc[6] = kXXHPrime64_5;
  acc[7] = kXXHPrime32_1;

EDIT: I've done a few more micro-optimizations and the benchmark is as follows so far:

AOT:

== Summary ==
Data size: 65536 bytes
Average: 0.2881808471679687 ns/byte
Average: 3.23 GB/s

JIT:

== Summary ==
Data size: 65536 bytes
Average: 0.29363156127929685 ns/byte
Average: 3.17 GB/s

Updated version of the benchmark script here:

xxh3_benchmark.dart

```dart import 'dart:io'; import 'dart:typed_data'; import 'package:xxh3/xxh3.dart'; const kBatchSize = 100; const kDataSize = 64 * 1024; void main() { final args = []; final bytes = args.isEmpty ? Uint8List(kDataSize) : File(args.first).readAsBytesSync(); final batchResults = []; for (var j = 0; j < kBatchSize; j++) { print('== Batch ${j + 1} of $kBatchSize =='); final N = 5000; final sw = Stopwatch()..start(); for (var i = 0; i < N; i++) { xxh3(bytes); } final totalNs = sw.elapsedMicroseconds * 1000; final nsPerIteration = totalNs / N; final nsPerByte = nsPerIteration / bytes.lengthInBytes; batchResults.add(nsPerByte); print(' -> $nsPerByte ns/byte'); } print(''); print('== Summary =='); print('Data size: $kDataSize bytes'); final averageNsPerByte = batchResults.reduce((a, b) => a + b) / kBatchSize; print('Average: $averageNsPerByte ns/byte'); final nsPerGB = averageNsPerByte * 1024 * 1024 * 1024; // kB -> mB -> GB final gigabytesPerNs = 1 / nsPerGB; final gigabytesPerSecond = gigabytesPerNs * 1e9; print('Average: ${gigabytesPerSecond.toStringAsFixed(2)} GB/s'); } ```

(This is on an M2).

SamJakob / xxh3

Improve hashing performance #4