dense_set: insert after erase is too slow

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?

#include "google/dense_hash_set"
#include "tr1/unordered_set"

int main()
{
//  typedef std::tr1::unordered_set<int> TSet;
  typedef google::dense_hash_set<int> TSet;
  TSet set;
  set.set_empty_key(-1);
  set.set_deleted_key(-2);
  const int size = 1000000;
  set.rehash(size);
  for (int i = 0; i < size; ++i) set.insert(i);
  for (int i = 0; i < size; ++i) { set.erase(i); set.insert(i); }
  return 0;
}

What is the expected output? What do you see instead?
time ./a.out ~ 3 sec (dense_set)
time ./a.out ~ 0.01 sec (unordered_set)

What version of the product are you using? On what operating system?
Google Hash 1.11 amd64
Linux DE3UNP-503726 2.6.32-26-generic #48-Ubuntu SMP Wed Nov 24 10:14:11 UTC 
2010 x86_64 GNU/Linux
gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
Intel Core 2 Duo 3 GHz
g++ -O3

Please provide any additional information below.

Original issue reported on code.google.com by bulovya...@gmail.com on 28 Jun 2011 at 5:39

GoogleCodeExporter commented 8 years ago

OK, I tracked this down.  The problem is one that I had thought about before, 
but never seen in action until now: I combine the logic for find() and 
insert().  As a result, the insert() call takes a lot more time for this 
test-case than it ought to, because you've created an access pattern where 
lookups are much more expensive than inserts.

I don't expect such an access pattern to occur very often in real-life data, so 
I don't think this is a code-red situation.  But it is worth cleaning up: the 
hashtable code is doing extra work for no good reason.  I'll look to get that 
fix into the next release.

Original comment by csilv...@gmail.com on 28 Jun 2011 at 9:08

Added labels: Priority-Medium, Type-Defect

GoogleCodeExporter commented 8 years ago

btw, was this motivated by an actual use-case you were seeing?  Or did you 
happen to just be trying out various benchmark-type codes?

Original comment by csilv...@gmail.com on 28 Jun 2011 at 9:09

GoogleCodeExporter commented 8 years ago

no real use case behind it, this example is from my bench for different hash 
maps implementations. but, there are many applications which do insert just 
after erase (although not erase(i), insert(i)), e.g. when you keep list of last 
used elements (cache).
but this case is really strange, as I understand from the dense_hash algorithm 
description it shouldn't make any problem: mark element as deleted, do find, 
check that the bucket deleted and put new element into it. all actions must be 
very fast for int.
P.S. anyway, thanks for the lib.

Original comment by bulovya...@gmail.com on 28 Jun 2011 at 11:04

GoogleCodeExporter commented 8 years ago

Hmm, on second thought this is not easy to fix.  When we see the bucket is 
deleted, we only know we can insert the item there if we know the item doesn't 
already exist in the hashtable.  Because we do open chaining, we need to go all 
the way to the end of the chain to tell this.

Your example is a pathological case for this: it requires all the items to be 
hashed right next to each other (which the identity hash does on this input), 
and requires erasing and re-inserting the same element.  Any deviation from 
that -- for instance, a real life top-n cache, where the item being deleted is 
different from the one being inserted -- would not have this bad behavior.

There are various hacks that would improve this benchmark, but not performance 
in real-life cases.  Without any indication that real apps are suffering from 
this, I'll probably leave it as-is.  I'll keep the bug open, though, but lower 
the priority.

Original comment by csilv...@gmail.com on 29 Jun 2011 at 2:27

Added labels: Priority-Low
Removed labels: Priority-Medium

GoogleCodeExporter commented 8 years ago

For future reference, here is the test I added to time_hash_map.cc to document 
the slowness here:

template<class MapType>
static void time_map_reverse_toggle(int iters) {
  MapType set;
  Rusage t;
  int i;

  // Just like toggle(), but we do erase+insert rather than insert+erase.
  // We prime the hashtable outside the benchmarking.
  set.resize(iters);
  for (i = 0; i < iters; i++) {
    set[i] = i+1;
  }

  const size_t start = CurrentMemoryUsage();
  t.Reset();
  for (i = 0; i < iters; i++) {
    set.erase(i);
    set[i] = i+1;
  }

  double ut = t.UserTime();
  const size_t finish = CurrentMemoryUsage();

  report("map_reverse_toggle", ut, iters, start, finish);
}

Original comment by csilv...@gmail.com on 29 Jun 2011 at 3:27

GoogleCodeExporter commented 8 years ago

just want to mention that the problem isn't related to the serial access 
pattern and hashing conclusions.
kind of this also works slow:
for (int i = 0; i < size; ++i) set.insert(i);
for (int i = 0; i < size; ++i) { set.erase(i); set.insert((i+1)*(i+1) + size); }

Original comment by bulovya...@gmail.com on 29 Jun 2011 at 9:29

GoogleCodeExporter commented 8 years ago

Hmm, I would expect we'd do better on this than we do on your first case, but 
maybe not totally great.

Basically, any time you have a really uneven distribution of hash values, 
you'll see bad results.  That's what happens here: we have 2^k buckets (for 
some k), but all the hash values are in the range 1..size, and there are none 
in the range size..2^k -- at least, for your initial insert.

If you used a non-trivial hashing function -- even just a simple bit-mix -- 
performance here would be much better.  In general, dense_hash_* and 
sparse_hash_* assume a good hashing function.  Often, the identity hash is not 
good.

Original comment by csilv...@gmail.com on 29 Jun 2011 at 8:34

GoogleCodeExporter commented 8 years ago

I just tested the code with MCT (https://launchpad.net/libmct) and it appears 
to work about at the speed of stdlib implementation (but note that rehash(size) 
for closed hashing is not very appropriate, you need sth. like rehash(2 * size) 
instead).

I have the following in the library code:

      start_probing (std::size_t hash) const
      {
        // We require that number of buckets is a power of 2.  To compensate, we involve a
        // prime number here; this usually improves speed with particularly bad hash
        // function choices.
        return (hash * size_t_constants <>::HUGE_PRIME) & (this->num_buckets - 1);
      }

Didn't really make many tests in different situations, but this trick seems to 
break such pathological cases nicely, while having little effect in "normal" 
cases.

Original comment by pogonys...@gmail.com on 30 Jun 2011 at 10:06

GoogleCodeExporter commented 8 years ago

Yeah, there's a philosophical debate here; we had a similar issue come up 
recently with respect to hashing pointers (which likewise do poorly with the 
identity hash).  Multiplying by a prime is a good way of improving a bad hash 
function, but it's not free.  My feeling is that people who already have a good 
hash function shouldn't have to pay the price for that extra multiplication all 
the time.

Another way of saying this is: all your tests would do much better if you 
replaced the identity hash with a hash function like
   return x * HUGE_PRIME

The flip side of this is that users need to know to do that, and need to 
recognize that hashtable performance is a bottleneck for them and this is a 
good way to fix it.  I don't know the best way to do that, but I feel that 
trying to do that is better than trying to correct for a bad hash function in 
the hashtable code.

Original comment by csilv...@gmail.com on 30 Jun 2011 at 10:23

GoogleCodeExporter commented 8 years ago

Yeah, I also had doubts about this line.  However, I considered that there are 
two "line of thoughts" when implementing a hash table: one dictates 2^n buckets 
(like both SparseHash and MCT use), the other to always use a prime number 
(e.g. GLib as far as I remember).  So "to compensate" in the comment was 
written having that second option in mind: some implementations already do 
something similar because they do hash % prime_num_buckets.

Original comment by pogonys...@gmail.com on 30 Jun 2011 at 10:46

GoogleCodeExporter commented 8 years ago

Yeah, hashing by a prime is more expensive than hashing a power of 2.  Again, I 
see these as all hacks to work around the fact the user has not picked a good 
hash function.  On the other hand, picking a good hash function can be 
difficult, and maybe not everyone shares my views of what a 'good' hash 
function is. :-)

For strings, I think it's pretty straightforward actually: you need lots of 
mixing, and something like MurmurHash is going to be good for everyone -- it 
has all the good properties a hashtable implementation might need.  The problem 
is for things like numbers: you're tempted to use the identity hash or 
something else really fast, which may be a perfectly fine 'shortcut' for some 
hashtable implementations but not for others.

I'm torn as to what to do here.  I don't want to speed-penalize folks who are 
using a 'good' hash function already.  But on the other hand, paying for an 
extra multiply is cheap compared to the cost of using a bad (for us) hash 
function.

I wish I could identify when a hash function was 'bad' and correct for it -- 
something like adaptive hash evaluation.  But I think that would be too slow: 
probably slower than just always multiplying.  My intuitions about speed are 
probably not that great here.

Overall, I'm not yet convinced this problem is widespread enough in the wild to 
justify making any changes.  I'm going to close the bug WillNotFix, but feel 
free to reopen it if you'd like to try to convince me I'm mistaken. :-)

Original comment by csilv...@gmail.com on 26 Aug 2011 at 1:26

Changed state: WontFix

epitzer / sparsehash

dense_set: insert after erase is too slow #71