efficient / libcuckoo

A high-performance, concurrent hash table

bucket size power of 2 limitation #7

Open voutcn opened 9 years ago

voutcn commented 9 years ago

Hi all,

If I am correct, the rehashing always doubles the number of buckets. In practice, if the number of items is slightly larger than a power of two, the load factor will be only ~50%, which wastes memory. I would like to ask whether the rehashing could grow the table by a smaller factor, say 1.5x or 1.3x, which would be more memory-friendly. Thanks.

manugoyal commented 9 years ago

Hi voutcn,

We use powers of 2 to size the table since it's a fairly standard way to grow vectors and other resizable arrays, and because a power-of-two sized table has some convenient properties for computing hashes in our model.

I'm not really sure resizing by a smaller factor would fix your problem, because say we resized by 1.5x. Then if the number of items happened to be slightly larger than a power of 1.5, the load factor would still only be ~50%. Do you have a specific use case where you need a large table with slightly more elements than a power of 2? Also remember that it is the number of buckets that is a power of 2, not the number of elements. By default, each bucket has 8 slots, so we can actually store 2^n * 8 values for each power-of-two sizing of the table.
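
For illustration, here's a rough sketch (not libcuckoo's actual code) of why a power-of-two number of buckets is convenient: the bucket index becomes a bitwise AND instead of an integer modulo.

```cpp
// Rough sketch of the indexing convenience (not libcuckoo's actual code):
// with a power-of-two number of buckets, the bucket index is a bitwise AND;
// with an arbitrary bucket count, it needs an integer modulo (a div
// instruction), which is slower.
#include <cstddef>
#include <cstdint>

inline std::size_t index_pow2(std::uint64_t hash, std::size_t num_buckets) {
    // Requires num_buckets to be a power of two.
    return hash & (num_buckets - 1);
}

inline std::size_t index_general(std::uint64_t hash, std::size_t num_buckets) {
    // Works for any bucket count, but compiles to a division.
    return hash % num_buckets;
}
```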

dave-andersen commented 9 years ago

I've actually been working on an implementation of fractional resizing for a couple of days. It's not done yet, and I likely won't have it finished until next week, but I will have something to evaluate reasonably soon.

The trade-off is that insertion and access will be slower, but memory usage will be lower. Maybe a 10 or 15 percent slowdown.


manugoyal commented 9 years ago

That's interesting. Is the slowdown related to the hashing and bucket indexing?

-Manu

apc999 commented 9 years ago

Hi, Dinghua,

Growing the table size by a smaller factor (rather than 2x every time) is certainly an interesting and attractive feature. Actually, I have been told by different people that this could be useful in certain practical applications. For libcuckoo, we made this choice (growing the table size by 2x) to take advantage of some tricks that speed up the table growth: the buckets can be copied with memcpy and the table cleaned up in the background (as opposed to locking the table and rehashing each item). So this is a tradeoff. But if you can pull it off with some clever trick, I am very happy to learn :)
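
For anyone curious, below is a minimal sketch of the kind of 2x-growth trick being alluded to, under the assumption that indexes are computed by masking; this is my own illustration, not libcuckoo's implementation.

```cpp
// Sketch of the 2x-growth + memcpy idea (an illustration assuming mask-based
// indexing, not libcuckoo's real code). With new_index = hash & (2*old_n - 1),
// an item that lived in bucket i of an old_n-bucket table can only land in
// bucket i or bucket i + old_n after doubling, so copying the old buckets
// into the upper half keeps every item findable immediately; the stale
// duplicates can then be purged lazily in the background.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Bucket {
    std::uint64_t slots[8];  // fixed number of slots per bucket
};

void grow_by_doubling(std::vector<Bucket>& buckets) {
    const std::size_t old_n = buckets.size();
    buckets.resize(2 * old_n);  // the first old_n buckets keep their contents
    // One bulk copy instead of rehashing every item under a lock.
    std::memcpy(buckets.data() + old_n, buckets.data(),
                old_n * sizeof(Bucket));
    // A background pass can later drop, from each pair (i, i + old_n), the
    // entries whose recomputed index no longer matches the bucket they're in.
}
```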

Best,

- Bin

Computer Science Department, Carnegie Mellon University

dave-andersen commented 9 years ago

Yeah - we have to move from a mask to a mod to compute the bucket indexes.

I use a fast computation of the mod for a fixed divisor, but it's still slower than a mask.
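
For context, one common way to avoid a full division when the divisor is fixed is a multiply-and-shift range reduction (Lemire's trick); a sketch is below, which may or may not match what the branch actually does.

```cpp
// One fast alternative to '%' for a non-power-of-two bucket count: a
// multiply-and-shift range reduction (Lemire). This is just an example of
// the kind of trick available, not necessarily what the branch uses.
#include <cstdint>

// Maps a well-distributed 32-bit hash onto [0, num_buckets) without a div.
inline std::uint32_t fast_range(std::uint32_t hash32, std::uint32_t num_buckets) {
    return static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(hash32) * num_buckets) >> 32);
}
```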


voutcn commented 9 years ago

Thank you all.

@manugoyal Theoretically resizing by say 1.5x should increase the lower bound the load factor. If the number of elements happen to be slightly larger than power of 1.5, the load factor will ~1/1.5 = 2/3. @apc999 understand that power of 2 has many good properties, but I didn't go that deep to study those tricks :) @dave-andersen look forward to your work!
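
Just to spell the arithmetic out: right after a growth triggered by a single extra element, the load factor is roughly the reciprocal of the growth factor, so smaller factors raise the worst case.

```cpp
// Back-of-the-envelope: worst-case (just-resized) load factor ~= 1 / growth.
#include <cstdio>

int main() {
    const double growth_factors[] = {2.0, 1.5, 1.3};
    for (double g : growth_factors) {
        std::printf("growth %.1fx -> worst-case load factor ~%2.0f%%\n",
                    g, 100.0 / g);
    }
    // Prints roughly: 2.0x -> 50%, 1.5x -> 67%, 1.3x -> 77%
    return 0;
}
```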

dave-andersen commented 8 years ago

Following up on this - sorry for the very long delay. I now know how to do this in a reasonably efficient way, but it's going to be a somewhat complex implementation path to get this version of the table to handle it cleanly. Will start a design doc...

univerone commented 1 year ago

> Following up on this - sorry for the very long delay. I now know how to do this in a reasonably efficient way, but it's going to be a somewhat complex implementation path to get this version of the table to handle it cleanly. Will start a design doc...

Hi David, may I ask if there is any update?