Lendfating / leveldb

Automatically exported from code.google.com/p/leveldb
BSD 3-Clause "New" or "Revised" License
0 stars 0 forks source link

The defective binary search code in Block::Iter::Seek() in block.cc #103

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The following code is from Seek() function in block.cc. As the comments says, 
it performs a stl::lower_bound operation in the restart array. However, I 
believe the implementation is flawed. 

    // Binary search in restart array to find the first restart point
    // with a key >= target
    uint32_t left = 0;
    uint32_t right = num_restarts_ - 1;
    while (left < right) {
      uint32_t mid = (left + right + 1) / 2;

      ......

      Slice mid_key(key_ptr, non_shared);
      if (Compare(mid_key, target) < 0) {
        // Key at "mid" is smaller than "target".  Therefore all
        // blocks before "mid" are uninteresting.
        left = mid;
      } else {
        // Key at "mid" is >= "target".  Therefore all blocks at or
        // after "mid" are uninteresting.
        right = mid - 1;
      }
    }

In some cases, it may find a previous restart point instead of the correct one. 
Although the final result is correct, it involves unnecessary iteration and 
confusing. I think the following code is better.

    // Binary search in restart array to find the first restart point
    // with a key >= target
    uint32_t left = 0;
    uint32_t right = num_restarts_;
    while (left < right) {
      uint32_t mid = (left + right) / 2;

      ......

      Slice mid_key(key_ptr, non_shared);
      if (Compare(mid_key, target) < 0) {
        // Key at "mid" is smaller than "target".  Therefore all
        // blocks before "mid" are uninteresting.
        left = mid + 1;
      } else {
        // Key at "mid" is >= "target".  Therefore all blocks at or
        // after "mid" are uninteresting.
        right = mid;
      }
    }

Original issue reported on code.google.com by yangzhuo...@gmail.com on 15 Jul 2012 at 1:36

GoogleCodeExporter commented 9 years ago
This code needs to implement something other than
std::lower_bound.  I suspect there are misleading
comments in the code that are leading you astray.

In particular, the block stores multiple keys
between each restart point.  And what the binary
search needs to do is find the largest restart
point with a key < target so we can start
decoding from there and find any occurrence of
target that might occur between restart points.

Consider the following example with target==250.

  restart point 0:  key == 100
                    ... some keys in range [100..199] here ...
  restart point 1:  key == 200
                    ... some keys in range [200..299] here ...
  restart point 2:  key == 300
                    ... some keys in range [300..inf] here ...

In this example, we would want the binary search
to find restart point 1 (key == 200) and then
sequentially decode from there.

Now consider your proposed code.
First iteration:
  left = 0
  right = 3 
  mid = (left + right) / 2 == 1
  Compare(mid_key, target) == Compare(200, 250)
  The comparison returns < 0, so we do:
     left = mid + 1 = 2
We have skipped past the correct result of binary
search (restart point 1).  So we will end up starting
the decoding process too late in the block.

Sorry about the misleading comment.  I will change
the following comment:
    // Binary search in restart array to find the first restart point
    // with a key >= target
to:
    // Binary search in restart array to find the last restart point
    // with a key < target

Original comment by san...@google.com on 16 Jul 2012 at 5:04