**mraleph** closed this 2 months ago
Thanks! This is very interesting! I shall take a deeper dive and try and get this merged today.
I've managed to squeeze out an extra 200 MB/s by avoiding a `List` allocation when creating the `Int64List`:
```dart
final acc = Int64List.fromList([
  kXXHPrime32_3,
  kXXHPrime64_1,
  kXXHPrime64_2,
  kXXHPrime64_3,
  kXXHPrime64_4,
  kXXHPrime32_2,
  kXXHPrime64_5,
  kXXHPrime32_1
]);
```
...becomes:
```dart
final acc = Int64List(8);
acc[0] = kXXHPrime32_3;
acc[1] = kXXHPrime64_1;
acc[2] = kXXHPrime64_2;
acc[3] = kXXHPrime64_3;
acc[4] = kXXHPrime64_4;
acc[5] = kXXHPrime32_2;
acc[6] = kXXHPrime64_5;
acc[7] = kXXHPrime32_1;
```
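For what it's worth, the equivalence of the two initialization styles is easy to sanity-check with a tiny harness like this (the constants below are placeholders, not the real xxh3 primes):

```dart
import 'dart:typed_data';

int viaFromList() {
  // Allocates a temporary List<int> for the literal, then copies it
  // into the typed list.
  final acc = Int64List.fromList([1, 2, 3, 4, 5, 6, 7, 8]);
  return acc[0] + acc[7];
}

int viaIndexStores() {
  // A single allocation: the Int64List itself.
  final acc = Int64List(8);
  for (var i = 0; i < 8; i++) {
    acc[i] = i + 1;
  }
  return acc[0] + acc[7];
}

void main() {
  // Both produce the same accumulator contents.
  print(viaFromList() == viaIndexStores()); // true
}
```

The `fromList` path has to materialize the list literal before copying it, which is the allocation the index-store version avoids.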
EDIT: I've done a few more micro-optimizations and the benchmark is as follows so far:
AOT:

```
== Summary ==
Data size: 65536 bytes
Average: 0.2881808471679687 ns/byte
Average: 3.23 GB/s
```

JIT:

```
== Summary ==
Data size: 65536 bytes
Average: 0.29363156127929685 ns/byte
Average: 3.17 GB/s
```
Updated version of the benchmark script here:
```dart
import 'dart:io';
import 'dart:typed_data';

import 'package:xxh3/xxh3.dart';

const kBatchSize = 100;
const kDataSize = 64 * 1024;

void main(List<String> args) {
  final bytes =
      args.isEmpty ? Uint8List(kDataSize) : File(args.first).readAsBytesSync();
  // The script was cut off here in the original comment; this timing
  // loop is a minimal reconstruction matching the summary output above.
  final batchResults = List<double>.generate(kBatchSize, (_) {
    final sw = Stopwatch()..start();
    xxh3(bytes);
    sw.stop();
    return sw.elapsedMicroseconds * 1000 / bytes.length; // ns/byte
  });
  final average = batchResults.reduce((a, b) => a + b) / batchResults.length;
  print('== Summary ==');
  print('Data size: ${bytes.length} bytes');
  print('Average: $average ns/byte');
  print('Average: ${(1 / average).toStringAsFixed(2)} GB/s');
}
```
(This is on an M2).
The improvement is in the range of ~10-20x depending on platform and compilation mode; e.g. on my M1 MacBook Pro, AOT mode goes from 6 ns/byte to 0.3 ns/byte and JIT mode goes from 5 ns/byte to 0.4 ns/byte.
The bulk of the improvement comes from avoiding the allocation of a temporary `ByteData` for every `readLE64`. While it might be reasonable to expect this allocation to be eliminated, the Dart VM compiler is currently unable to do that, for several reasons (most of which can be easily fixed).
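The allocation-free version of the hot read boils down to creating one `ByteData` view over the input up front and reusing it for every read. A rough sketch of the idea (`readLE64` and the call shape here are my reconstruction, not the exact PR code):

```dart
import 'dart:typed_data';

// Before: a fresh ByteData wrapper is allocated on every call.
int readLE64Slow(Uint8List bytes, int offset) =>
    ByteData.sublistView(bytes, offset, offset + 8)
        .getUint64(0, Endian.little);

// After: one ByteData view, created once over the underlying buffer and
// reused, so the hot loop performs no allocations.
int readLE64(ByteData view, int offset) =>
    view.getUint64(offset, Endian.little);

void main() {
  final bytes = Uint8List(16)..[0] = 0x2A; // little-endian 42 at offset 0
  final view = ByteData.sublistView(bytes); // created once, reused for all reads
  print(readLE64(view, 0)); // 42
  print(readLE64Slow(bytes, 0) == readLE64(view, 0)); // true
}
```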
The second part of the win comes from avoiding a boxed `List` for the accumulator and using an `Int64List` instead.
Additionally, the JIT compiler trips over `toUnsigned(64)`, which should be a no-op but instead hits a suboptimal implementation of an out-of-range left shift (`1 << 64`) and falls off the fast path, so I removed that as well.
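To see why `toUnsigned(64)` is a no-op on the VM: native ints are already 64-bit two's-complement, so keeping the least-significant 64 bits returns the value unchanged, even for negatives. A small check (assuming the standalone VM, not the web, where ints behave differently):

```dart
void main() {
  final values = [0, 1, -1, 1 << 62, -(1 << 62)];
  for (final x in values) {
    // toUnsigned(64) keeps the low 64 bits, which on the 64-bit VM is
    // every bit the value has: an identity operation.
    if (x.toUnsigned(64) != x) {
      throw StateError('not an identity for $x');
    }
  }
  print('toUnsigned(64) is an identity on the VM');
}
```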