basho / leveldb

Clone of http://code.google.com/p/leveldb/
BSD 3-Clause "New" or "Revised" License
408 stars 182 forks source link

Accuracy of bloom2 impacted by compilers #240

Open matthewvon opened 2 years ago

matthewvon commented 2 years ago

I have ported util/bloom2.cc to a different environment. The bloom filter, with 16 bits per key, is supposed to have 0.0004 false positive rate. The filter was showing 0.0117 and 0.0519 within two different compiler environments (Debian 10 and Debian 11 respectively using default gcc packages). The bloom filter is important due to Riak's read before write characteristics.

The problem appears related to these two lines:

https://github.com/basho/leveldb/blob/develop/util/bloom2.cc#L70

https://github.com/basho/leveldb/blob/develop/util/bloom2.cc#L97

The existing code uses a mix of variable declarations: size_t, unsigned, and uint32_t. This appears to cause portions of the code to be 64 bit and portions 32 bit. Forcing all variables around these two lines to consistent uint32_t declaration was the cure.

However, just patching the current code does not address how to "fix" bad bloom filters in existing data sets. The corrected bloom filter code has the potential to make data written with the old code to seemingly disappear. As is, there might be a scenario where data also disappears due to distribution of new Riak releases that use a different compiler than before. I have not proven this, just guessing.

The solution might be to create a corrected bloom3.cc. Then the existing leveldb code can automatically pick the bloom filter that matches existing data files and use the corrected bloom3.cc bloom filters for new data and data newly passing through a normal compaction.

martinsumner commented 1 year ago

Sorry @matthewvon , I didn't see this issue when you raised it. Thank you for highlighting the issue here.

In terms of contributors who are left with rights to basho/leveldb and basho/eleveldb, we have none with the c++ skills necessary to maintain this, or the test setup to prove non-functional changes. So we're making only the minimal changes we have to - and most of the main customers are migrating to leveled or bitcask, to protect themselves from future issues.

There are other (i.e. non-Riak) users of leveledb/eleveldb out there though, and it seems a shame that this should wither away due to lack of skills in our community. Some of the other generally useful tools in the basho box have been freed from the confines of the basho repo (e.g. lager, webmachine). Perhaps if a volunteer was to come forward to maintain we should do the same with basho leveldb/eleveldb.

Sorry for the late and unsatisfactory answer. I will leave the issue open, as it is clearly an important problem, even if we have no plans to resolve it.