ByteData abstraction impacts performance

DanielThomas commented 6 years ago

I was thinking about rehashing cost in ByteArrayOrdinalMap when running a cycle, and I got curious about whether the hash itself was a bottleneck. So I threw together a quick scratch benchmark (https://github.com/DanielThomas/hollow/tree/murmur3) to see what's what.

The good news is that Yonik Seeley's port used by Hollow HashCodes (hash_hollow) is the fastest I could find (used in Solr), but it appears that the ByteData abstraction, i.e. it's per-byte reads, significantly hurts throughput. The only difference between these benchmarks is one reads bytes using com.netflix.hollow.core.memory.ArrayByteData:

Benchmark                       (length)   Mode  Cnt       Score       Error   Units
Murmur3Benchmark.hash_hollow           1  thrpt    5  136862.511 ± 11350.094  ops/ms
Murmur3Benchmark.hash_hollow          10  thrpt    5   81524.319 ±  2689.955  ops/ms
Murmur3Benchmark.hash_hollow         100  thrpt    5   17258.054 ±   302.410  ops/ms
Murmur3Benchmark.hash_hollow        1000  thrpt    5    1898.878 ±    54.612  ops/ms
Murmur3Benchmark.hash_hollow       10000  thrpt    5     194.730 ±     6.152  ops/ms
Murmur3Benchmark.hash_hollow      100000  thrpt    5      19.450 ±     0.537  ops/ms
Murmur3Benchmark.hash_yonik            1  thrpt    5  231423.849 ±  5818.267  ops/ms
Murmur3Benchmark.hash_yonik           10  thrpt    5  113567.630 ±   341.959  ops/ms
Murmur3Benchmark.hash_yonik          100  thrpt    5   22265.668 ±   174.804  ops/ms
Murmur3Benchmark.hash_yonik         1000  thrpt    5    2580.319 ±     7.761  ops/ms
Murmur3Benchmark.hash_yonik        10000  thrpt    5     268.995 ±     1.895  ops/ms
Murmur3Benchmark.hash_yonik       100000  thrpt    5      27.137 ±     0.203  ops/ms

It'd of course be worse again for SegmentedByteArray, where you add a logical shift and bitwise for every com.netflix.hollow.core.memory.ByteData#get.

That got me thinking that there are field read cases like com.netflix.hollow.core.read.engine.object.HollowObjectTypeReadStateShard#readString(com.netflix.hollow.core.memory.ByteData, long, int) that'd likely be impacted too.

So it feels like ByteData abstraction needs to be leakier and start with the a contract that assumes segmented data and provide ways of expose underlying arrays directly, or at least a byte[] contract that handles and copies across the potential spanned array segments.

Full Benchmark Results

Benchmark                       (length)   Mode  Cnt       Score       Error   Units
Murmur3Benchmark.hash_guava            1  thrpt    5  187422.727 ±  5990.667  ops/ms
Murmur3Benchmark.hash_guava           10  thrpt    5   90013.621 ±  4670.890  ops/ms
Murmur3Benchmark.hash_guava          100  thrpt    5   15803.129 ±   254.116  ops/ms
Murmur3Benchmark.hash_guava         1000  thrpt    5    1646.876 ±    15.252  ops/ms
Murmur3Benchmark.hash_guava        10000  thrpt    5     168.375 ±     3.224  ops/ms
Murmur3Benchmark.hash_guava       100000  thrpt    5      18.480 ±     0.923  ops/ms
Murmur3Benchmark.hash_hive             1  thrpt    5  225538.310 ± 28378.051  ops/ms
Murmur3Benchmark.hash_hive            10  thrpt    5  110997.225 ±  3593.197  ops/ms
Murmur3Benchmark.hash_hive           100  thrpt    5   19188.473 ±  1607.453  ops/ms
Murmur3Benchmark.hash_hive          1000  thrpt    5    2226.580 ±   149.374  ops/ms
Murmur3Benchmark.hash_hive         10000  thrpt    5     227.843 ±    22.904  ops/ms
Murmur3Benchmark.hash_hive        100000  thrpt    5      23.052 ±     2.436  ops/ms
Murmur3Benchmark.hash_hollow           1  thrpt    5  136862.511 ± 11350.094  ops/ms
Murmur3Benchmark.hash_hollow          10  thrpt    5   81524.319 ±  2689.955  ops/ms
Murmur3Benchmark.hash_hollow         100  thrpt    5   17258.054 ±   302.410  ops/ms
Murmur3Benchmark.hash_hollow        1000  thrpt    5    1898.878 ±    54.612  ops/ms
Murmur3Benchmark.hash_hollow       10000  thrpt    5     194.730 ±     6.152  ops/ms
Murmur3Benchmark.hash_hollow      100000  thrpt    5      19.450 ±     0.537  ops/ms
Murmur3Benchmark.hash_sangupta         1  thrpt    5  167244.867 ±  8154.188  ops/ms
Murmur3Benchmark.hash_sangupta        10  thrpt    5   66672.478 ± 18962.516  ops/ms
Murmur3Benchmark.hash_sangupta       100  thrpt    5    6507.477 ±  9612.646  ops/ms
Murmur3Benchmark.hash_sangupta      1000  thrpt    5     704.660 ±    25.304  ops/ms
Murmur3Benchmark.hash_sangupta     10000  thrpt    5      81.783 ±   110.223  ops/ms
Murmur3Benchmark.hash_sangupta    100000  thrpt    5       6.828 ±     8.232  ops/ms
Murmur3Benchmark.hash_yonik            1  thrpt    5  231423.849 ±  5818.267  ops/ms
Murmur3Benchmark.hash_yonik           10  thrpt    5  113567.630 ±   341.959  ops/ms
Murmur3Benchmark.hash_yonik          100  thrpt    5   22265.668 ±   174.804  ops/ms
Murmur3Benchmark.hash_yonik         1000  thrpt    5    2580.319 ±     7.761  ops/ms
Murmur3Benchmark.hash_yonik        10000  thrpt    5     268.995 ±     1.895  ops/ms
Murmur3Benchmark.hash_yonik       100000  thrpt    5      27.137 ±     0.203  ops/ms

DanielThomas commented 6 years ago

For the 100,000 byte case, throughput for Hollow vs Yonik is 1854MB/s vs 2586 MB/s.

DanielThomas commented 6 years ago

Relates to https://github.com/Netflix/hollow/issues/307

Netflix / hollow

ByteData abstraction impacts performance #305

Full Benchmark Results