BytesRefHash can be simplified (and made faster) [LUCENE-5854] #6916

Open asfimport opened 10 years ago

asfimport commented 10 years ago

Currently BytesRefHash stores the length of each byte sequence as either one or two bytes inside the byte pool. This is redundant: it slows down the add operation and increases the required memory.
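
Roughly, the current prefix scheme looks like this (a simplified sketch of the idea, not the actual Lucene code; the helper names are made up):

  // Sketch of the 1-or-2-byte length prefix written into the byte pool before
  // each sequence. Lengths < 128 take one byte; longer sequences take two bytes,
  // with the high bit of the first byte acting as a continuation marker.
  static int writeLengthPrefix(byte[] pool, int offset, int length) {
    if (length < 128) {
      pool[offset] = (byte) length;
      return offset + 1;                                   // bytes start here
    } else {
      pool[offset] = (byte) (0x80 | (length & 0x7f));
      pool[offset + 1] = (byte) (length >>> 7);
      return offset + 2;                                   // bytes start here
    }
  }

  static int readLengthPrefix(byte[] pool, int offset) {
    byte b = pool[offset];
    return (b & 0x80) == 0
        ? b                                                // 1-byte length
        : (b & 0x7f) | ((pool[offset + 1] & 0xff) << 7);   // 2-byte length
  }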

Logically, what BytesRefHash does is assign linear IDs (0..n) to each unique byte sequence on input. So what's really needed are two data structures:

- an append-only list of the unique byte sequences, addressable by ID (so that ID i maps back to its bytes),
- a hash table mapping each byte sequence to its ID.

The first item is already implemented efficiently in BytesRefArray. Note that the length of each byte sequence is stored implicitly, as the difference between the next sequence's start offset and the current sequence's start offset (clever!). This doesn't allow for removals, but it saves on encoding and representation.
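
A minimal sketch of that implicit-length idea (hypothetical names, not the actual BytesRefArray code): only start offsets are recorded, and a sequence's length falls out as the difference between consecutive offsets.

  // Hypothetical append-only pool: lengths are never written, only start offsets.
  final class AppendOnlyByteSequences {
    private byte[] pool = new byte[32];  // concatenated byte sequences
    private int[] starts = new int[8];   // starts[id] = start offset of sequence id
    private int count;                   // number of sequences added so far
    private int used;                    // bytes used in the pool

    // Appends a copy of bytes[off..off+len) and returns its new ID (0..n).
    int append(byte[] bytes, int off, int len) {
      if (count == starts.length) {
        starts = java.util.Arrays.copyOf(starts, starts.length * 2);
      }
      while (used + len > pool.length) {
        pool = java.util.Arrays.copyOf(pool, pool.length * 2);
      }
      starts[count] = used;
      System.arraycopy(bytes, off, pool, used, len);
      used += len;
      return count++;
    }

    // The length is implicit: next sequence's start offset minus this one's.
    int length(int id) {
      int end = (id + 1 < count) ? starts[id + 1] : used;
      return end - starts[id];
    }
  }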

The second bullet point above is trivial: a linear-probing hash table of IDs, with -1 indicating empty slots.
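
That second structure could look roughly like this (again a hedged sketch with made-up names, matching the fields referenced by the put method shown further below; the 0.75 load factor and the allocateTable helper are assumptions):

  // Hypothetical state for the ID hash table used by put() below.
  private static final int EMPTY_SLOT = -1;

  private int[] ordByHash;  // open-addressed table of IDs; EMPTY_SLOT marks a free slot
  private int mask;         // table length - 1 (table length is a power of two)
  private int resizeAt;     // grow once this many entries are in the table

  private void allocateTable(int capacity) {     // capacity must be a power of two
    ordByHash = new int[capacity];
    java.util.Arrays.fill(ordByHash, EMPTY_SLOT);
    mask = capacity - 1;
    resizeAt = (int) (capacity * 0.75);          // assumed load factor
  }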

I have a clean-room implementation of the above (based on HPPC data structures, though), and it does show some performance improvement on simplistic randomized-data benchmarks:

  BenchmarkByteBlockOrdSet.testBytesRefHash: [measured 5 out of 7 rounds, threads: 1 (sequential)]
    round: 2.62 [+- 0.03], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 5, GC.time: 0.20, time.total: 18.43, time.warmup: 5.35, time.bench: 13.08

  BenchmarkByteBlockOrdSet.testByteBlockOrdSet: [measured 5 out of 7 rounds, threads: 1 (sequential)]
    round: 1.91 [+- 0.01], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 8, GC.time: 0.20, time.total: 13.37, time.warmup: 3.83, time.bench: 9.54

But I think the real reason it'd be worth looking at this in Lucene is to make BytesRefHash simpler to understand. For example, the put operation in my code looks like this:

  public int put(ByteBlockRef ref) {
    assert ref != null;

    // Linear probing over the open-addressed table of IDs.
    int slot = hash(ref) & mask;
    for (int id = ordByHash[slot];
             id != EMPTY_SLOT;
             id = ordByHash[slot]) {
      if (blocks.compareTo(id, ref) == 0) {
        return id;                          // already present: return its ID
      }
      slot = (slot + 1) & mask;
    }

    // Not found: the new ID is simply the next position in the sequence list.
    int newId = ordByHash[slot] = blocks.size();
    blocks.add(ref);

    if (size() == resizeAt) {
      grow();
    }

    return -newId - 1;                      // negative return marks a new entry
  }

and get simply delegates to the list of byte sequences:

  public ByteBlockRef get(int index, ByteBlockRef ref) {
    return blocks.get(index, ref);
  }

What makes this refactoring slightly more complicated is that there is a fair bit of hairy stuff in BytesRefHash that the person doing the refactoring would have to look at. The craziest parts, to me, are:

  private void rehash(final int newSize, boolean hashOnData)
  -> when hashOnData is true this class becomes insane :) (see the rehash sketch below for how this simplifies in the ID-based design)

  shrinking is not really used anyway; neither are element removals (and they are not really needed, I think).

  compacting and in-memory sorting should be moved outside of this class and placed somewhere else. You could still grab the internal storage buffers, but do the compacting/in-memory sorting logic somewhere else and thus not leave the BytesRefHash object in an invalid/crazy state.
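
For contrast, growing the table in the ID-based layout never has to re-read lengths from the pool: rehashing just re-inserts every ID by its sequence's hash. A hedged sketch of what grow() from the put method above could look like (hashOf is an assumed helper that hashes blocks.get(id)):

  // Hypothetical grow(): doubles the ID table and re-inserts all existing IDs.
  // No length decoding or data copying is involved; hashes are recomputed from
  // the stored sequences.
  private void grow() {
    int newCapacity = ordByHash.length * 2;
    int[] newTable = new int[newCapacity];
    java.util.Arrays.fill(newTable, EMPTY_SLOT);
    int newMask = newCapacity - 1;

    for (int id = 0; id < blocks.size(); id++) {
      int slot = hashOf(id) & newMask;          // hashOf(id): hash of blocks.get(id)
      while (newTable[slot] != EMPTY_SLOT) {    // linear probing, as in put()
        slot = (slot + 1) & newMask;
      }
      newTable[slot] = id;
    }

    ordByHash = newTable;
    mask = newMask;
    resizeAt = (int) (newCapacity * 0.75);      // assumed load factor
  }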

Migrated from LUCENE-5854 by Dawid Weiss (@dweiss), updated May 09 2016

asfimport commented 10 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I'll leave this unassigned in case somebody has the time to dig into it. I only needed a fast ord-lookup set without a Lucene dependency, so I'm just noting the conclusions from my experiment/reimplementation here. :)

asfimport commented 10 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Wow, this would be a nice speedup :)