dadhi / ImTools

Fast and memory-efficient immutable collections and helper data structures
MIT License
231 stars 10 forks source link

Try new .Net Core 2.1 PopCount() intrinsic for HAMT implementation #5

Open dadhi opened 6 years ago

dadhi commented 6 years ago

Currently it uses HammingWeight algorithm for calculating number of set bits in a int value: https://bitbucket.org/dadhi/dryioc/src/e788ffce78b727b35fda5355a2297d40f0e7731b/Net45/Playground/HashArrayMappedTrieTests.cs#lines-190

Here is the thing Aye, that's there System.Runtime.Intrinsics.X86.PopCount(uint/ulong)

First step will to BDN benchmark against Swar or HammingWeight with PopCount.

dzmitry-lahoda commented 5 years ago

Have you done that? May I do it?

dadhi commented 5 years ago

No, not yet. Yes, you can try!

dadhi commented 5 years ago

@dzmitry-lahoda I have added two basic HAMT implementations into the Playground project with some tests.

There are two (or more) implementations: IntHashTrie<V> and HashTrie<K, V> (for arbitrary keys).

The method(s) to consider for replace with PopCount are named uint GetSetBitsCount(uint n): https://github.com/dadhi/ImTools/blob/75b3b3b61db93653829ab9430ff9920745ea4b48/src/Playground/IntHashArrayMappedTrieTests.cs#L211

The hand-written benchmarks (without BenchmarkDotNet) are in https://github.com/dadhi/ImTools/blob/master/src/Playground/TreeBenchmarks.cs

dzmitry-lahoda commented 5 years ago
               Method | ItemCount |        Mean |     Error |       StdDev |
--------------------- |---------- |------------:|----------:|-------------:|
         SWAR_ImTools |      1000 |  2,362.5 ns |  47.07 ns |    73.284 ns |
               Popcnt |      1000 |    504.7 ns |  10.05 ns |     8.908 ns |
 SWAR_ImTools_Inlined |      1000 |  1,658.4 ns |  27.01 ns |    25.261 ns |
    SWAR_ForkOfCoreFx |      1000 |  2,378.0 ns |  45.01 ns |    42.106 ns |
         SWAR_ImTools |     10000 | 23,281.8 ns | 362.54 ns |   339.117 ns |
               Popcnt |     10000 |  4,912.9 ns |  73.60 ns |    68.845 ns |
 SWAR_ImTools_Inlined |     10000 | 16,221.2 ns | 265.45 ns |   248.305 ns |
    SWAR_ForkOfCoreFx |     10000 | 24,810.7 ns | 492.58 ns | 1,049.733 ns |

I did in my fork as it has .NET Core only build (i have hard time work with legacy projects:()

https://github.com/dzmitry-lahoda/ImTools/blob/unnoficial/src/Playground/BitCountsBenchmarks.cs

For synthetic benchmark.

  1. Seem better to mark for inlining to be 100% sure you are (I guess it will be better for exact that method, so will test https://gist.github.com/mrange/d6e7415113ebfa52ccb660f4ce534dd4#gistcomment-2026775)
  2. You SWAR is slightly better to some than other SWAR:) on my Intel 7gen
  3. Popcnt is 3 to 4 times faster.

I guess need to make TreeBenchmarks to be BDN. And add copy of Trie code. Will send results.

I guess I may merge these classes into separate branch and send pull. But better if whole solution will be .NET Core for contrib.

dzmitry-lahoda commented 5 years ago
Adding 10 items (ms):
Trie PopCnt- 0
Trie - 0

Getting one out of 10 items 1,000,000 times (ms):
Trie PopCnt - 21
Trie - 20
====================
Adding 100 items (ms):
Trie PopCnt- 2
Trie - 0

Getting one out of 100 items 1,000,000 times (ms):
Trie PopCnt - 39
Trie - 38
====================
Adding 1000 items (ms):
Trie PopCnt- 0
Trie - 0

Getting one out of 1000 items 1,000,000 times (ms):
Trie PopCnt - 40
Trie - 38
====================
Adding 10000 items (ms):
Trie PopCnt- 4
Trie - 3

Getting one out of 10000 items 1,000,000 times (ms):
Trie PopCnt - 49
Trie - 22
====================
Adding 100000 items (ms):
Trie PopCnt- 66
Trie - 80

Getting one out of 100000 items 1,000,000 times (ms):
Trie PopCnt - 64
Trie - 23
====================
Adding 1000000 items (ms):
Trie PopCnt- 1489
Trie - 1533

Getting one out of 1000000 items 1,000,000 times (ms):
Trie PopCnt - 18
Trie - 24

Strange result.

  1. May be transform to BDN to avoid any influences (but do not seems that issues relevant). I have intel 7700hq and process is 64 bits.
  2. Tried inline GetSetBitsCount. No changes.

So seem no need to popcnt for now for specific case.

dzmitry-lahoda commented 5 years ago

I see Trie is faster than Tree always. And only slower on 1M items to add. Interesting replacement of IoC. But may be be better try out BDN to measure memory and reduce any fluctuations. E.g. right now GC.Collect() seems(not sure) starts GC, but not waits to finish.

dzmitry-lahoda commented 5 years ago

popcnt is with netcoreapp3.0 preview. not with 2.1 or 2.2:) so I suggest to close story as it seems irrelevant for both performance and usage in near future.

dadhi commented 5 years ago

Thanks for the test.

I will keep it open until I played with it myself, will decide how to proceed later.

dzmitry-lahoda commented 5 years ago

I was trying to get the code of intrinsic for some reason. Found that for .NET Core 2.1 there should be kind of System.Runtime.Intrinsics.Experimental.dll. But I could not find such.