chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Slow non-O(1) amortized access times for HashTables/Maps/AssociativeArrays... #15874

Closed GordonBGood closed 4 years ago

GordonBGood commented 4 years ago

Summary of Problem

Most hash table implementations have O(1) amortized access, with access times in the hundreds of CPU clock cycles per operation, whether running a contains, an add, or a delete. The current version of Chapel (1.22) is something like ten times as slow, although that perception may be influenced by the fact that it does not have O(1) amortized access times, as can be seen by running the following test code.

Steps to Reproduce

Source Code:

use Time;

config const LIMIT = 1000000;

type Prime = uint(32);

class Primes { // needed so we can use next to get successive values
  var n: Prime; var obp: Prime; var q: Prime;
  var bps: owned Primes?;
  var keys: domain(Prime); var dict: [keys] Prime; // assoc domain + array: composite -> culling step
  proc next(): Prime { // odd primes!
    if this.n < 5 then { this.n = 5; return 3; }
    if this.bps == nil {
      this.bps = new Primes(); // secondary odd base primes feed
      this.obp = this.bps!.next(); this.q = this.obp * this.obp;
    }
    while true {
      if this.n >= this.q then { // advance secondary stream of base primes...
        const adv = this.obp * 2; const key = this.q + adv;
        this.obp = this.bps!.next(); this.q = this.obp * this.obp;       
        this.keys += key; this.dict[key] = adv;
      }
      else if this.keys.contains(this.n) then { // found a composite; advance...
        const adv = this.dict[this.n]; this.keys.remove(this.n);
        var nkey = this.n + adv;
        while this.keys.contains(nkey) do nkey += adv;
        this.keys += nkey; this.dict[nkey] = adv;
      }
      else { const p = this.n; this.n += 2; return p; }
      this.n += 2;
    }
    return 0; // to keep compiler happy in returning a value!
  }
  iter these(): Prime { yield 2; while true do yield this.next(); }
}

proc main() {
  var count = 0;
  write("The first 25 primes are:  ");
  for p in new Primes() { if count >= 25 then break; write(p, " "); count += 1; }
  writeln();

  var timer: Timer;
  timer.start();

  count = 0;
  for p in new Primes() { if p > LIMIT then break; count += 1; }

  timer.stop();
  write("Found ", count, " primes up to ", LIMIT);
  writeln(" in ", timer.elapsed(TimeUnits.milliseconds), " milliseconds.");  
}

Compile command:

With the source file named Bench.chpl, compile as: chpl --fast Bench.chpl.

Execution command:

Compiled as above, the output of ./Bench is as follows:

The first 25 primes are:  2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 
Found 78498 primes up to 1000000 in 1383.32 milliseconds.

When run as ./Bench --LIMIT=10000000, the output is as follows:

The first 25 primes are:  2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 
Found 664579 primes up to 10000000 in 35967.8 milliseconds.

This shows that the time does not increase linearly as expected; here it is about 26 times slower for only about ten times the work.

The performance on tio.run (also a Sandy Bridge class CPU, running at 2.9 GHz instead of my 3.1 GHz) is about the same as on my Cygwin installation with the specifications below.

I also tested this by converting the code to use the "Map" library with the same algorithm: it shows the same asymptotic performance, just a constant factor slower across all ranges, likely due to extra interface overhead costs.
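
For reference, the Map-based variant simply replaces the keys domain and dict array with a single map; here is a minimal sketch of the calls involved (using the standard Map module, not the full benchmark):

use Map;

var dict = new map(uint(32), uint(32));
dict.add(9, 6);                              // composite 9 advances by step 6
if dict.contains(9) then writeln(dict[9]);   // 6
dict.remove(9);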

To show that the fault lies not with the algorithm but with the Chapel Associative Array implementation, here is a tio.run implementation using a roll-your-own Pythonish hash table, which uses the same "an-integer-is-its-own-hash-value" scheme but, for simplicity, grows the table when it is half full instead of two-thirds full to reduce hash collisions.
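
The core of the idea is something like the following (a simplified sketch, not the actual tio.run code; deletion, which the sieve also needs, is omitted for brevity):

record IntHash {
  var numSlots = 16;              // always a power of two
  var numUsed = 0;
  var D = {0..#16};
  var keys: [D] uint(32);         // 0 marks an empty slot (prime keys are never 0)
  var vals: [D] uint(32);

  proc indexOf(key: uint(32)): int {
    var i = (key: int) & (this.numSlots - 1);   // an integer is its own hash value
    while this.keys[i] != 0 && this.keys[i] != key do
      i = (i + 1) & (this.numSlots - 1);        // probe the next slot
    return i;
  }

  proc contains(key: uint(32)): bool {
    return this.keys[this.indexOf(key)] == key;
  }

  proc ref add(key: uint(32), val: uint(32)) {
    if 2 * (this.numUsed + 1) > this.numSlots { // grow when over half full
      const oldKeys = this.keys, oldVals = this.vals;
      this.numSlots *= 2; this.numUsed = 0;
      this.D = {0..#this.numSlots};             // resizes keys and vals
      this.keys = 0;                            // clear carried-over entries
      for j in oldKeys.domain do                // re-insert everything
        if oldKeys[j] != 0 then this.add(oldKeys[j], oldVals[j]);
    }
    const i = this.indexOf(key);
    if this.keys[i] == 0 then this.numUsed += 1;
    this.keys[i] = key; this.vals[i] = val;
  }
}

var t: IntHash;
t.add(9, 6);
writeln(t.contains(9));   // true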

Sieving to a million it outputs the following:

The first 25 primes are:  2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 
Found 78498 primes up to 1000000 in 25.781 milliseconds.

and sieving to ten million it outputs the following:

The first 25 primes are:  2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 
Found 664579 primes up to 10000000 in 291.548 milliseconds.

showing approximately linear performance with sieving range (actually O(n log log n), as expected).

tio.run is currently running Chapel version 1.20, but in this regard there doesn't seem to be any significant difference from the Cygwin version 1.22 I am running on my actual machine.

I believe this demonstrates that Chapel's Associative Array implementation needs some serious work!

Configuration Information

mppf commented 4 years ago

There have been some recent changes to the associative domain implementation, and on master I'm not seeing the problem-size scaling issue you mention on my machine:

$ ./primes --LIMIT=1000000
The first 25 primes are:  2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 
Found 78498 primes up to 1000000 in 226.243 milliseconds.
$ ./primes --LIMIT=10000000
The first 25 primes are:  2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 
Found 664579 primes up to 10000000 in 3609.08 milliseconds.
$ ./primes --LIMIT=20000000
The first 25 primes are:  2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97 
Found 1270607 primes up to 20000000 in 8469.99 milliseconds.

This is not perfect linear scaling either but it's much closer...

Edit: By "linear scaling" people usually mean "O(N)" and in that event performance scaling like 30N is still technically "linear". I understand you don't want 30N scaling in this case, though.

mppf commented 4 years ago
else if this.keys.contains(this.n) then {

There's no need to write then followed by a curly - then is just for single-expression conditionals. Just do else if this.keys.contains(this.n) {.
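
For example:

const x = 1;

// `then` introduces a single-statement body:
if x > 0 then writeln("positive");

// with a block, the curly braces alone delimit the body:
if x > 0 {
  writeln("positive");
}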

ronawho commented 4 years ago

Here are some timings from one of the machines we use for nightly testing (chapcs). Results are for the version using an associative array, built with 1.22 and master, and for your custom hash table built with 1.22. As Michael mentioned, performance has gotten much better on master, but it could still use some work.

Once https://github.com/chapel-lang/chapel/pull/15869 (an optimization to Map) is in I'll look at results for using Map instead of associative array.

LIMIT         assoc arr (1.22)   assoc arr (master)   custom HT (1.22)
1_000_000         2,545.8 ms            287.6 ms             21.7 ms
10_000_000       32,620.2 ms          4,568.6 ms            255.6 ms
100_000_000     861,752.0 ms         81,083.6 ms          2,391.6 ms

GordonBGood commented 4 years ago

Thanks for the update, Michael. It's good you are making progress on this, but it really should have linear scaling.


cassella commented 4 years ago

performance scaling like 30N is still technically "linear".

That doesn't sound right to me. A lousy large constant factor (and/or constant overhead) would have already been reflected in the first measurement.

GordonBGood commented 4 years ago

Michael, after an extra week or two with Chapel, I now know that then for conditionals is the same as do for loops: only required for single statement use, optional for use with blocks delineated by curly brackets. I neglected to remove this redundancy before I posted this little benchmark; however, other than making the code less concise, I don't suppose it does any harm.


GordonBGood commented 4 years ago

Elliot, as mentioned in the OP, when converted to use Map this benchmark seems to show the same non-linear scaling so that implementation, too, seems to need work. It also seemed to be a constant factor even slower, which was/is a concern.


ronawho commented 4 years ago

Elliot, as mentioned in the OP, when converted to use Map this benchmark seems to show the same non-linear scaling so that implementation, too, seems to need work. It also seemed to be a constant factor even slower, which was/is a concern.

Right, today Map is implemented with an associative array, but that will no longer be true after #15869 is merged, so the performance characteristics will be different (hopefully better).

GordonBGood commented 4 years ago

performance scaling like 30N is still technically "linear".

That doesn't sound right to me. A lousy large constant factor (and/or constant overhead) would have already been reflected in the first measurement.

Paul, think about it: the current version 1.22 Map implementation seems to be about a 2.5x constant factor slower than the current Associative Array/Domain implementation. That constant relative factor of 2.5 is the same for the first comparison as for the second and every other one, so both implementations exhibit the same scaling, linear or, as in this case, not.

bradcray commented 4 years ago

I've been keeping this tab open in hopes that someone will concede in the debate over whether or not the current behavior is linear, but it seems to have gone silent instead. Neither Elliot's nor Michael's measurements of the current Chapel implementation look like they could be considered linear to me, whereas Elliot's measurements of @GordonBGood's implementation do seem to be much more so.

I thought Gordon was asserting our results were non-linear, Michael was asserting that they were linear, and Paul was asserting that they weren't, but then Gordon seemed to take a contrary stance toward Paul as well, leaving me confused overall.

Arguably, this debate isn't as important as agreement that our current performance is slower than it should be, but I'm reluctant to resolve all of the comments about whether or not the performance is linear without understanding whether we're in agreement or not.

Maybe give this comment a thumbs-up if you agree that Chapel's current behavior is non-linear, and follow up with an argument for why you think it's linear if you do? If everyone agrees that it's non-linear, I'll close all the comments that don't add new data to the OP.

GordonBGood commented 4 years ago

@bradcray, I agree with Michael and Elliot that Associative Domain/Array performance has been improved so that it is less non-linear than in the current stable release (version 1.22), but it is still non-linear.

GordonBGood commented 4 years ago

Perhaps a little more explanation of what constitutes "linear scaling" for this benchmark is in order here:

This benchmark performs a number of culling operations as given by the Wikipedia Sieve of Eratosthenes article; for odds-only sieving to a million, ten million, twenty million, and a hundred million, that is about 810,900, 8,925,000, 18,310,000, and 96,280,000 operations, respectively (to four significant figures). However, because the programming logic avoids adding duplicate entries to the hash table, not all hash table operations occur that many times: there are about that many contains operations, but only about as many delete and add operations as there are odd composites in the range minus the primes found, e.g. about (500,000 - 78,498) = 421,502 for a range of a million, which is less than half as many as there are contains operations. The maximum number of entries in the hash table is also modest, since it equals the number of odd primes up to the square root of the sieving range: 167 odd primes below a thousand for a range of a million, and only 1,228 odd primes below ten thousand for a range of a hundred million.

Thus, if the hash table implementation had linear scaling, we would expect the execution times to increase by these operation-count ratios; however, even for the improved version, the ratio of execution times is about 15.9 instead of 11.0 going from a range of a million to ten million, and about 17.7 instead of 10.8 going from ten million to a hundred million. This indicates that the empirical growth exponent is about 1.2 instead of the 1.0 it should be for linear growth (for instance, for ten to a hundred million, 81.0836/4.5686 = (96,280,000/8,925,000)^x gives x of about 1.2).
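
As a quick cross-check of that exponent:

use Math;

// time ratio = (ops ratio)**x  =>  x = log(time ratio) / log(ops ratio)
writeln(log(81.0836 / 4.5686) / log(96_280_000.0 / 8_925_000.0));  // ~1.21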

I haven't looked into how "master" implements Associative Domains/Arrays, but if it uses any kind of cryptographic hash, perhaps implementing "perfect hash" features, then that's overkill and not what is required here: for a hash table all we require is that most of the generated hashes don't collide, and if a few percent do collide it isn't serious, in contrast with a cryptographic hash, where we want to guarantee that the huge majority of hashes don't collide. What must be avoided is adding any computation that introduces a logarithmic amortized complexity factor.

This is also reflected in the handling of hash collisions: we need an algorithm that has a fairly high chance of finding an empty slot on the next try, but if it doesn't, and retrying doesn't cost too much execution time (nor add computations with a logarithmic complexity factor), it's not serious to just try again until potentially all slots have been tried. There is also a balance in choosing the ratio of unused empty slots to total slot table size so as to reduce the incidence of collisions: the Python authors seem to have found by testing that keeping between one third and two thirds of the slots empty is enough that the cost of handling collisions isn't noticeable in the amortized complexity; my custom version does something even simpler, keeping between one half and three quarters of the slots empty, for a little less collision-handling cost at the price of a slightly larger slot table.

The slot table grows by powers of two in order to level out the amortized cost of growth: with doubling, each entry is re-inserted O(1) times on average over the life of the table, since the re-insertion counts form a geometric series summing to less than 2N. A simple hashing or re-hashing algorithm matters here as well, since every entry needs its slot number recomputed when the table grows, which adds constant overhead if it is slow. However, if growth is not done by powers, the amortized cost of this recomputation will not be linear.
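
A toy count (not the benchmark code) makes the doubling bound concrete: the total number of re-insertions stays proportional to the number of adds.

config const n = 1_000_000;
var cap = 8, rehashed = 0;
for i in 1..n {
  if i > cap / 2 {      // grow at half full, as the custom table does
    rehashed += i - 1;  // every element already present is re-inserted once
    cap *= 2;
  }
}
writeln("re-insertions per element: ", rehashed: real / n);  // ~1.05, bounded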

There are tens or hundreds of hash table implementations out there that do have amortized O(1) "linear scaling" (amortized) asymptotic performance complexity; I don't see why Chapel can't have it too.

mppf commented 4 years ago

There are tens or hundreds of hash table implementations out there that do have amortized O(1) "linear scaling" (amortized) asymptotic performance complexity; I don't see why Chapel can't have it too.

I don't think that anybody is disagreeing with this point. I'm certainly not. I'm also not sure if it's obvious but we are taking the issue brought up here seriously and are working to improve the hashtable implementation.

I think I accidentally created a huge distraction in this issue about whether it was linear or not, and I'm pretty sure I was wrong about that, so thanks for your patience. But either way I think it's a distraction from the problem here.

but if it uses any kind of a cryptographic hash perhaps implementing "perfect hash" features

it does not

This is also reflected in the handling of what is done when there are hash collisions in that we need an algorithm that has a fairly high chance of finding an empty slot on the next try, but if not and it doesn't cost too much in execution time (nor add computations that have a logarithmic complexity factor) then it's not serious to just try again until potentially all slots have been tried.

It is possible that this is actually the core of the problem. I think that our current implementation is choosing when to resize the hashtable based on the number of full slots, but deleted slots are also contributing to the collision rate, which, in open addressing schemes, makes each lookup slower. If there are enough collisions, this could end up making even a contains operation O(n) instead of O(1).
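
Roughly, the fix would be along these lines (a sketch with invented names, not our actual code):

// In open addressing, deleted slots (tombstones) lengthen probe sequences
// just as full slots do, so the resize trigger must count both.
proc shouldResize(numFull: int, numDeleted: int, numSlots: int): bool {
  // counting full slots alone lets tombstones accumulate until probe
  // chains degrade toward O(n):   return 2 * numFull > numSlots;
  return 2 * (numFull + numDeleted) > numSlots;
}

writeln(shouldResize(3, 6, 16));  // true: tombstones push us over the limit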

But in any case I wish we could move to a hashtable implementation more current than the quadratic probing open addressing we have. Unfortunately it's been hard to prioritize such an effort.

GordonBGood commented 4 years ago

This is also reflected in the handling of what is done when there are hash collisions in that we need an algorithm that has a fairly high chance of finding an empty slot on the next try, but if not and it doesn't cost too much in execution time (nor add computations that have a logarithmic complexity factor) then it's not serious to just try again until potentially all slots have been tried.

It is possible that this is actually the core of the problem. I think that our current implementation is choosing when to resize the hashtable based on the number of full slots, but deleted slots are also contributing to the collision rate, which, in open addressing schemes, makes each lookup slower. If there are enough collisions, this could end up making even a contains operation O(n) instead of O(1).

But in any case I wish we could move to a hashtable implementation more current than the quadratic probing open addressing we have. Unfortunately it's been hard to prioritize such an effort.

It sounds like you know where the problem likely lies and it's only a lack of priority that holds back the implementation of a better algorithm.

Actually, it looks like the current "master" implementation only suffers from about an extra log(n) factor in amortized asymptotic complexity rather than a full n factor, although that could still indicate the worst-case n factor common to hash table implementations, triggered more often than it should be by an error in continuously processing deleted slots, along the lines of what you describe.

mppf commented 4 years ago

But in any case I wish we could move to a hashtable implementation more current than the quadratic probing open addressing we have. Unfortunately it's been hard to prioritize such an effort.

It sounds like you know where the problem likely lies and it's only a lack of priority that holds back the implementation of a better algorithm.

I asked @LouisJenkinsCS about the hashtable in PR #14409 - it sounds like it could be adjusted to be a sequential hashtable by removing the locks. The result would not necessarily be as fast as other hashtables in the constant factor - but it should not have any collision-related scaling issues - it should always have O(log log N) lookup.

ronawho commented 4 years ago

A performance test for this was added in https://github.com/chapel-lang/chapel/pull/15955. It uses the custom hash table code with wrappers for associative arrays and maps so that we can test the 3 variants, but be sure the computation/kernel is the same. Note that since https://github.com/chapel-lang/chapel/pull/15869, maps no longer use associative arrays and should be faster than associative arrays. The performance of that can be tracked at https://chapel-lang.org/perf/chapcs/?startdate=2020/06/27&enddate=today&graphs=associativetypeaddremove. It just runs the 3 variants with --LIMIT=10_000_000 and reports seconds instead of milliseconds.

Following Michael's suggestion in https://github.com/chapel-lang/chapel/issues/15874#issuecomment-649557774, https://github.com/chapel-lang/chapel/pull/15982 changed our hash table so that we resize based on the number of full+deleted slots, not just the number of full slots. This resolves the scaling issues and improves performance.

LIMIT          assoc arr (before)   assoc arr (now)   custom HT
1_000_000                 0.30 s            0.11 s       0.02 s
10_000_000                4.29 s            1.12 s       0.21 s
100_000_000              72.33 s           11.49 s       2.27 s

Raw performance for associative arrays is still ~5x off, but the scaling problem is resolved. Raw performance for map is a little better, though still ~3x off from the custom hash table.

LIMIT            assoc arr       map   custom HT
1_000_000           0.11 s    0.07 s      0.02 s
10_000_000          1.12 s    0.74 s      0.21 s
100_000_000        11.49 s    7.48 s      2.27 s
1_000_000_000     119.42 s   79.07 s     24.06 s

There are probably further improvements we can make to raw performance, but I expect we'll address those with a different hash table implementation down the road, and that's probably not something we plan to look at in the immediate future.

ronawho commented 4 years ago

And FYI, the map API allows for some optimizations that can improve performance, e.g.:

adv = dict[n]; dict.remove(n);
...
while dict.contains(nkey) do nkey += adv;
dict.add(nkey, adv);

Could be:

adv = dict.getAndRemove(n);
...
while !dict.add(nkey, adv) do nkey += adv;

That just removes 2 extra lookups, which improves performance by ~10% but does not impact scaling.

ronawho commented 4 years ago

@GordonBGood thanks for reporting this issue.

I believe the scaling issue has been resolved, and performance on master for --LIMIT=10_000_000 is ~30x better compared to 1.22 for the associative array variant, and ~43x better for the map variant.

Your custom hash table still has better raw performance, but I'm inclined to close this issue since the scaling issue is resolved. I expect raw performance will improve in the future when we look at different hash table implementations, and we'll be able to track that going forward since we run this as a performance test now.

GordonBGood commented 4 years ago

@ronawho (Elliot), yes, my major concern was the non-linear scaling and that has been resolved for both Associative Arrays and for the Map library.

I wouldn't expect any general solution to be as fast as my "custom" hash table version, as it truly is "custom": it makes some assumptions about the nature of the key, which is a 32-bit prime integer and therefore never zero; as well, it uses Python's quadratic probing technique in the case of collisions, which I believe was tuned and tested by Guido van Rossum himself for performance. It's great that you have retained the custom hash table in your performance testing suite as a target: if you can get anywhere close to its speed in some future implementation, Chapel will truly have a performant hashtable/Associative Array/Map implementation.
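
For reference, the probe recurrence CPython uses (and which my table borrows) looks roughly like this in sketch form:

// Sketch of CPython-style probing: `perturb` feeds the high bits of the
// hash into successive probes so that colliding keys spread out quickly.
proc nextSlot(j: uint, ref perturb: uint, mask: uint): uint {
  perturb >>= 5;                   // PERTURB_SHIFT in CPython
  return (5*j + 1 + perturb) & mask;
}

var perturb: uint = 12345;         // the full hash value
var j = perturb & 15;              // initial slot in a 16-slot table
for 1..4 {                         // a few probes after collisions
  j = nextSlot(j, perturb, 15);
  writeln(j);
}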

As the subject of this issue is the non-linear scaling, which has been resolved, go ahead and close it; any raw-performance regressions will be continuously flagged by the inclusion of the custom test in your testing suite.

Thanks for your attention to this issue!

Regards, Gordon

ronawho commented 2 years ago

FYI @bmcdonald3 recently optimized our hash table implementation in https://github.com/chapel-lang/chapel/pull/18427, and that made our map/associative array versions ~2x faster. From the nightly perf graphs we now see:

impl          time
assoc array   0.39s
chapel map    0.30s
custom map    0.24s

bradcray commented 2 years ago

Very cool! Thanks Ben!