estraier / tkrzw

a set of implementations of DBM
Apache License 2.0

Maximum HashDBM database size #28

Closed svenha closed 2 years ago

svenha commented 2 years ago

I am considering porting a lot of code from tchdb (tokyocabinet) to Tkrzw. I thought that HashDBM would be the natural candidate, but the default maximum database size is 32 GB, while I need 0.5 to 2 TB (without compression, because I currently rely on transparent filesystem compression). If HashDBM is the right choice for my port, which values of offset width and alignment power are recommended, and how should they be applied?

BTW: I am so happy that a successor of the excellent tokyocabinet is there and maintained! Thanks a million.

estraier commented 2 years ago

As you say, HashDBM is suitable for your purpose. Tuning "align_pow" or "offset_width" can increase the maximum file size. See these examples: C++: https://github.com/estraier/tkrzw/blob/master/example/hashdbm_ex2.cc C: https://github.com/estraier/tkrzw/blob/master/example/langc_ex1.c

By default, align_pow is 3, which means that the start offset of every record is aligned to 2^3 = 8 bytes. A 32-bit address gives a 4GB limit, and 8 * 4GB is 32GB. If you set align_pow to 9, the alignment becomes 2^9 = 512 bytes, and 512 * 4GB = 2TB. However, there's a side effect: records smaller than 512B take 512B. So, if you have a lot of small records, increasing the alignment deteriorates space efficiency.
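To make the space cost of a larger alignment concrete, here is a minimal sketch of the round-up rule described above. `AlignedSize` is a hypothetical helper, not part of Tkrzw, and `size` is assumed to be the whole on-disk record (key, value, and metadata), which is a simplification of the real record layout.

```cpp
#include <cstdint>

// Hypothetical helper (not a Tkrzw API): rounds a record's on-disk size up to
// the next multiple of 2^align_pow, which is the space the record occupies
// when every record must start at an aligned offset.
constexpr std::uint64_t AlignedSize(std::uint64_t size, int align_pow) {
  return ((size + (std::uint64_t{1} << align_pow) - 1) >> align_pow) << align_pow;
}

// A 100-byte record with the default align_pow of 3 wastes only 4 bytes...
static_assert(AlignedSize(100, 3) == 104, "8-byte alignment");
// ...but with align_pow of 9 it occupies a full 512-byte slot.
static_assert(AlignedSize(100, 9) == 512, "512-byte alignment");
// Records larger than the alignment lose at most one partial slot.
static_assert(AlignedSize(600, 9) == 1024, "512-byte alignment");
```

Under these assumptions, a database of 100-byte records would take roughly five times more space at align_pow 9 than at align_pow 3, while records of several kilobytes would barely notice the difference.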

Another solution is to increase the address width. By default, offset_width is 4, which means 32-bit addressing. You can set it to 5, which means 40-bit addressing, whose maximum size is 1TB before the alignment multiplier is applied. If offset_width is 5 and align_pow is 3, the maximum size is 8TB. The side effect of increasing offset_width is that the footprint of the hash table and of each record increases by approximately 1 byte per record.
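The arithmetic in this reply boils down to one formula: the maximum file size is 2^(8 * offset_width) addressable slots times 2^align_pow bytes per slot. The sketch below checks the figures quoted in the thread; `MaxFileSize` is a hypothetical helper written for illustration, not a Tkrzw function, and it treats GB and TB as powers of two.

```cpp
#include <cstdint>

// Hypothetical helper (not a Tkrzw API): maximum addressable file size in
// bytes for a given offset_width (bytes per stored offset) and align_pow
// (log2 of the record alignment).
// Only valid while 8 * offset_width + align_pow is below 64.
constexpr std::uint64_t MaxFileSize(int offset_width, int align_pow) {
  return std::uint64_t{1} << (8 * offset_width + align_pow);
}

constexpr std::uint64_t kGiB = std::uint64_t{1} << 30;
constexpr std::uint64_t kTiB = std::uint64_t{1} << 40;

// Defaults: 32-bit offsets, 8-byte alignment.
static_assert(MaxFileSize(4, 3) == 32 * kGiB, "default limit");
// Raising only the alignment to 512 bytes.
static_assert(MaxFileSize(4, 9) == 2 * kTiB, "align_pow = 9");
// Raising only the offset width to 5 bytes, with no alignment multiplier.
static_assert(MaxFileSize(5, 0) == 1 * kTiB, "40-bit addressing");
// 5-byte offsets combined with the default 8-byte alignment.
static_assert(MaxFileSize(5, 3) == 8 * kTiB, "offset_width = 5");
```

For the 0.5 to 2 TB range asked about, both `(offset_width=4, align_pow=9)` and `(offset_width=5, align_pow=3)` are large enough by this formula, though the former sits exactly at 2 TB with no headroom.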

In short, if most of your records are larger than 512B, you should modify align_pow. Otherwise, you should modify offset_width.

svenha commented 2 years ago

Thanks for the explanation. (The mentioned C example does not contain such tuning yet, but I get the idea from the C++ one.)