Open asfimport opened 10 years ago
Dawid Weiss (@dweiss) (migrated from JIRA)
I'm leaving this unassigned in case somebody has the time to dig into it. I only needed a fast ord-lookup set without a Lucene dependency, and I'm noting the conclusions from my experiment/reimplementation. :)
Michael McCandless (@mikemccand) (migrated from JIRA)
Wow, this would be a nice speedup :)
Currently BytesRefHash stores the length of each byte sequence as either one or two bytes inside the byte pool. This is redundant: it slows down the add operation and increases the memory required.
Logically, what BytesRefHash does is assign linear IDs (0..n) to each unique byte sequence on input. So what's really needed are two data structures:
The first item is already implemented efficiently in BytesRefArray. Note that the length of each byte sequence is stored implicitly, as the difference between the next sequence's start offset and the current sequence's start offset (clever!). This doesn't allow for removals, but saves on encoding and representation.
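To illustrate the implicit-length idea, here is a minimal sketch of an append-only list of byte sequences. It is not Lucene's BytesRefArray: the class name `ByteSequenceList`, its methods, and the flat `byte[]` storage (instead of a block pool) are assumptions made for the example.

```java
import java.util.Arrays;

/** Hypothetical append-only list of byte sequences; an illustration, not Lucene code. */
final class ByteSequenceList {
  private byte[] bytes = new byte[16];
  // offsets[i] is the start of sequence i; offsets[size] is the end of the last sequence.
  private int[] offsets = new int[8];
  private int size;

  /** Appends a copy of src[start, start + length) and returns its ordinal. */
  int add(byte[] src, int start, int length) {
    int begin = offsets[size];
    int end = begin + length;
    if (end > bytes.length) {
      bytes = Arrays.copyOf(bytes, Math.max(end, bytes.length * 2));
    }
    if (size + 2 > offsets.length) {
      offsets = Arrays.copyOf(offsets, offsets.length * 2);
    }
    System.arraycopy(src, start, bytes, begin, length);
    offsets[size + 1] = end;
    return size++;
  }

  /** The length is never stored; it is the gap between two adjacent start offsets. */
  int length(int ord) {
    return offsets[ord + 1] - offsets[ord];
  }

  /** Returns a copy of the sequence with the given ordinal. */
  byte[] get(int ord) {
    return Arrays.copyOfRange(bytes, offsets[ord], offsets[ord + 1]);
  }

  int size() {
    return size;
  }
}
```

Because only start offsets are kept, removing an entry from the middle would require shifting everything after it, which is why this layout suits append-only use.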
The second bullet point above is trivial (a linear hash table of IDs, with -1 indicating empty slots).
I have a clean-room implementation of the above (based on HPPC data structures though) and it does show some performance improvement (on simplistic randomized data benchmarks).
But I think the real reason to look at this in Lucene is that it would make BytesRefHash simpler to understand. For example, the put operation in my code looks like this:
and get simply delegates to the list of byte sequences:
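The original snippets did not survive the migration, so the following is only a rough sketch of what such a structure could look like, not the author's HPPC-based code. It pairs an append-only byte store with a linear-probing table of ordinals where -1 marks an empty slot. The class name `ByteSequenceOrdSet`, the hash function, and the 50% load factor are all assumptions chosen for the example.

```java
import java.util.Arrays;

/** Hypothetical sketch: assigns ordinals 0..n-1 to unique byte sequences. */
final class ByteSequenceOrdSet {
  // Append-only storage: sequence i occupies bytes[offsets[i], offsets[i + 1]).
  private byte[] bytes = new byte[64];
  private int[] offsets = new int[16];
  private int count;
  // Open-addressing table of ordinals; -1 marks an empty slot. Length is a power of two.
  private int[] slots = newTable(16);

  private static int[] newTable(int capacity) {
    int[] table = new int[capacity];
    Arrays.fill(table, -1);
    return table;
  }

  /** Returns the ordinal of the key, appending it first if it is not present yet. */
  int put(byte[] key) {
    int mask = slots.length - 1;
    int slot = hash(key, 0, key.length) & mask;
    for (int ord; (ord = slots[slot]) >= 0; slot = (slot + 1) & mask) {
      if (equalsAt(ord, key)) return ord; // already known: reuse its ordinal
    }
    int ord = append(key);
    slots[slot] = ord;
    if (count * 2 > slots.length) rehash(); // keep the table at most half full
    return ord;
  }

  /** get only touches the byte store; the hash table is not involved. */
  byte[] get(int ord) {
    return Arrays.copyOfRange(bytes, offsets[ord], offsets[ord + 1]);
  }

  int size() {
    return count;
  }

  private int append(byte[] key) {
    int start = offsets[count];
    int end = start + key.length;
    if (end > bytes.length) bytes = Arrays.copyOf(bytes, Math.max(end, bytes.length * 2));
    if (count + 2 > offsets.length) offsets = Arrays.copyOf(offsets, offsets.length * 2);
    System.arraycopy(key, 0, bytes, start, key.length);
    offsets[count + 1] = end;
    return count++;
  }

  private boolean equalsAt(int ord, byte[] key) {
    int start = offsets[ord];
    if (offsets[ord + 1] - start != key.length) return false;
    for (int i = 0; i < key.length; i++) {
      if (bytes[start + i] != key[i]) return false;
    }
    return true;
  }

  // Simple polynomial hash with a final mix; any decent hash function would do here.
  private static int hash(byte[] b, int start, int length) {
    int h = 1;
    for (int i = start; i < start + length; i++) h = 31 * h + b[i];
    return h ^ (h >>> 16);
  }

  // Doubles the table and reinserts every ordinal by rehashing its stored bytes.
  private void rehash() {
    int[] grown = newTable(slots.length * 2);
    int mask = grown.length - 1;
    for (int ord = 0; ord < count; ord++) {
      int slot = hash(bytes, offsets[ord], offsets[ord + 1] - offsets[ord]) & mask;
      while (grown[slot] >= 0) slot = (slot + 1) & mask;
      grown[slot] = ord;
    }
    slots = grown;
  }
}
```

Note that the table holds nothing but ordinals: lengths and bytes live only in the append-only store, so growing the table never copies or re-encodes any byte data.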
What makes this refactoring slightly more complicated is that there is a fair bit of hairy stuff in BytesRefHash that the person doing the refactoring would have to look at. The craziest parts are, to me:
Migrated from LUCENE-5854 by Dawid Weiss (@dweiss), updated May 09 2016