
Add unsigned packed int impls in oal.util [LUCENE-1990] #3065

Closed: asfimport closed this issue 14 years ago

asfimport commented 15 years ago

There are various places in Lucene that could take advantage of an efficient packed unsigned int/long impl. EG the terms dict index in the standard codec in #2532 could substantially reduce its RAM usage. FieldCache.StringIndex could as well. And I think "load into RAM" codecs like the one in TestExternalCodecs could use this too.

I'm picturing something very basic like:

interface PackedUnsignedLongs  {
  long get(long index);
  void set(long index, long value);
}

Plus maybe an iterator for getting and maybe also for setting. If it helps, most of the usages of this inside Lucene will be "write once" so eg the set could make that an assumption/requirement.

And a factory somewhere:

  PackedUnsignedLongs create(int count, long maxValue);

I think we should simply autogen the code (we can start from the autogen code in LUCENE-1410), or, if there is a good existing impl that has a compatible license, that'd be great.

I don't have time near-term to do this... so if anyone has the itch, please jump!


Migrated from LUCENE-1990 by Michael McCandless (@mikemccand), resolved Apr 06 2010 Attachments: generated_performance-te20100226.txt, LUCENE-1990_PerformanceMeasurements20100104.zip, LUCENE-1990.patch (versions: 3), LUCENE-1990-te20100122.patch, LUCENE-1990-te20100210.patch, LUCENE-1990-te20100212.patch, LUCENE-1990-te20100223.patch, LUCENE-1990-te20100226.patch, LUCENE-1990-te20100226b.patch, LUCENE-1990-te20100226c.patch, LUCENE-1990-te20100301.patch, perf-mkm-20100227.txt, performance-20100301.txt, performance-te20100226.txt Linked issues:

asfimport commented 15 years ago

Fuad Efendi (migrated from JIRA)

Specifically for FieldCache, let's see... suppose a field may have 8 different values, and the number of documents is high.

Value0  0  1  0  0  0  0  0   0  1  0  0  0  0  0 ...  
Value1  1  0  1  0  0  0  0   0  0  0  0  0  0  0 ...  
Value2  0  0  0  1  1  0  0   0  0  0  0  0  0  0 ...  
Value3  0  0  0  0  0  0  0   0  0  0  0  1  0  0 ...  
Value4  0  0  0  0  0  0  1   0  0  0  0  0  0  0 ...  
Value5  0  0  0  0  0  1  0   0  0  0  1  0  1  0 ...  
Value6  0  0  0  0  0  0  0   1  0  1  0  0  0  0 ...  
Value7  0  0  0  0  0  0  0   0  0  0  0  0  0  1 ...  

And now, if we go "horizontally" we will end up with 8 int[] arrays. What if we go "vertically"? The field could be encoded as 3 bits (8 different values).

CONSTRAINT: specifically for FieldCache, each column must contain exactly one "1".

And we end up with an array of 3-bit values, each storing the position within its column! The size of the array is IndexReader.maxDoc().
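
For illustration, a minimal sketch of such a "vertical" 3-bit lookup (hypothetical code, not from Lucene; blocks holds maxDoc() 3-bit entries packed back to back into longs):

// Returns the 3-bit value index (row) stored for a document.
static int getValueIndex(long[] blocks, int doc) {
  long bitPos = (long) doc * 3;         // 3 bits per document
  int block = (int) (bitPos >>> 6);     // which long holds the first bit
  int offset = (int) (bitPos & 63);     // bit offset within that long
  long bits = blocks[block] >>> offset;
  if (offset > 61) {                    // the 3 bits span two longs
    bits |= blocks[block + 1] << (64 - offset);
  }
  return (int) (bits & 7);              // mask to 3 bits
}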

hope I am reinventing the bicycle :)

P.S. Of course each solution has pros and cons; I am trying to focus on FieldCache-specific use cases.

  1. For a given document ID, find a value for a field
  2. For a given query results, sort it by a field values
  3. For a given query results, count "facet" for each field value

I don't think such naive compression is slower than plain int[] arrays... and we need to change the public API of FieldCache too: if a method returns int[], we are not saving any RAM.

It would be better to compare with Solr use cases and make the API closer to the real requirements; Solr operates on bitsets instead of arrays...

asfimport commented 15 years ago

Earwin Burrfoot (migrated from JIRA)

hope I am reinventing the bicycle

I believe some databases do that.

asfimport commented 15 years ago

Fuad Efendi (migrated from JIRA)

Suttiwat sent me a link: http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/

This is vendor-specific and may cause unexpected problems, but we can try it in some specific cases: "Compressed Oops have been included (but disabled by default) in the performance release JDK6u6p (requires you to fill a survey), so I decided to try it in an internal application and compare it with 64-bit mode and 32-bit mode."

-XX:+UseCompressedOops

There are other vendors around too, such as Oracle JRockit, which is much faster server-side...

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I have very recently done some experiments with random-access packed positive integer arrays (we could also call them unsigned). The basic approach is to put the bits after each other in an int- or long-array, then extract the value by shifting and masking. Unfortunately I have not been able to determine a single winning strategy for space/speed trade-offs.

I've tried four simple implementations:

  1. Pack the bits right after each other.
  2. Pack the bits right after each other but allow only 1, 2, 4, 8, 16 or 32 as value-bits.
  3. Wrap an int[] or long[] and store the values directly (for comparison with 1 & 2).
  4. Use int[] directly without any wrapping.

(Code can be found at http://summa.svn.sourceforge.net/viewvc/summa/trunk/Common/src/dk/statsbiblioteket/summa/common/util/bits/ where #1 is BitsArrayPacked, #2 is BitsArrayAligned and #3 is BitsArrayInt)

The obvious benefit from #2 is that the bits for values are always contained in a single block (int or long), which reduces the amount of operations for set and get considerably. Unfortunately this means that as soon as a value greater than 65535 needs to be stored, the internal representation will require 32 bits/value. This means no space-benefit compared to #3 while the speed penalty remains. A hybrid approach might be considered where the implementation is determined by the number of bits needed to store a value.
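
To make the single-block benefit of #2 concrete, here is a sketch of an aligned get (hypothetical code): with value-bits restricted to 1, 2, 4, 8, 16 or 32, a 32-bit block always holds a whole number of values, so one shift and one mask suffice.

static int alignedGet(int[] blocks, int index, int bitsPerValue) {
  int valuesPerBlock = 32 / bitsPerValue;               // e.g. 8 values of 4 bits
  int block = index / valuesPerBlock;
  int offset = (index % valuesPerBlock) * bitsPerValue;
  // (1L << bitsPerValue) - 1 also yields the right mask for bitsPerValue == 32
  return (blocks[block] >>> offset) & (int) ((1L << bitsPerValue) - 1);
}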

I've done some performance tests of the implementations on an aging 1.8GHz single-core laptop (Dell 820). The code can be found at http://summa.svn.sourceforge.net/viewvc/summa/trunk/Common/test/dk/statsbiblioteket/summa/common/util/bits/BitsArrayPerformanceTest.java?revision=2038&view=markup

In the tables below, the measurements for Packed, Aligned, Int, Constant and int[] are the total time in milliseconds to perform actionCount actions (either read or write). Random numbers (using the same seed) were used for index as well as value, and the overhead of constructing the random numbers was subtracted from the measurements. "Constant" is a dummy where no values are stored and the same value is always returned on a get. It is used to measure method-call overhead.

For arrays of 10M values, 6-33 million values can be read or written per second. If this is within acceptable limits, I'd be happy to try and make a contribution to Lucene based on the code. However, I probably won't find the time before February or March 2010.

actionCount arrayLength  actionType    valueMax  Packed(#1) Aligned(#2)     Int(#3)    Constant   int[](#4)
   10000000     1000000       write           1         583         416         984         254         635
   10000000     1000000       write          15         594         499        1172         286         843
   10000000     1000000       write         255         604         478        1057         213         656
   10000000     1000000       write         256         734         765        1062         109         729
   10000000     1000000       write       65535        1036         802        1124         417         734
   10000000     1000000       write       65536        1015        1130        1072         172         781
   10000000     1000000       write     2097151        1020        1187        1052         223         614
   10000000     1000000       write  2147483646        1136        1073         839          73         719
   10000000     1000000        read           1         286         203         786         104           0
   10000000     1000000        read          15         291         182         859          78           0
   10000000     1000000        read         255         672         494         989          67           0
   10000000     1000000        read         256         568         729        1104          93           0
   10000000     1000000        read       65535         833         755        1088          99           0
   10000000     1000000        read       65536         947         974        1062         104           0
   10000000     1000000        read     2097151         999         963        1062          88           0
   10000000     1000000        read  2147483646        1349         869        1260          84           0

   10000000    10000000       write           1         833         568        1458         239        1229
   10000000    10000000       write          15        1427        1255        1432         276        1615
   10000000    10000000       write         255        1599        1578        1448         250        1244
   10000000    10000000       write         256        1656        1520        1317         109        1109
   10000000    10000000       write       65535        1734        1630        1385         245        1307
   10000000    10000000       write       65536        1718        1640        1395         182        1208
   10000000    10000000       write     2097151        1807        1781        1447         250        1291
   10000000    10000000       write  2147483646        1718        1599        1281          73        1099
   10000000    10000000        read           1         562         296        1301         109           0
   10000000    10000000        read          15        1187        1198        1322          83           0
   10000000    10000000        read         255        1421        1432        1526          99           0
   10000000    10000000        read         256        1521        1583        1588         104           0
   10000000    10000000        read       65535        1578        1427        1374          31           0
   10000000    10000000        read       65536        1634        1499        1489         113           0
   10000000    10000000        read     2097151        1625        1374        1468         224           0
   10000000    10000000        read  2147483646        1609        1349        1499          83           0

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

These are great results – thanks Toke!

What's the difference between the two "sections" of your results? The bottom section seems to have slower times overall than the top one.

It's curious that "aligned" isn't always a win.

For the packed case, did you generate specialized code for each of the cases? I wonder how much / if that'd really help in practice.

For the standard codec's terms dict, my guess is we'd just go with packed ints.

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

The first section is for 1M values in the structure, the second is for 10M. As the CPU on the test-machine (Intel T2400) has only 2MB of level 2 cache, the increased processing time for the seemingly same amount of work is an effect of more cache-misses.

Caching also accounts for why the packed version is sometimes better than the aligned. For values representable as 9 or 17 bits, the aligned version needs 16 and 32 bits respectively. In the case with 10M values, the packed version uses 1.1MB and 2.1MB for 9 and 17 bits respectively, while the aligned uses 2MB and 4MB respectively. The simpler logic of the aligned version does not compensate enough for the higher amount of trips around main memory.

I did not generate any specialized code for the aligned case: No matter the number of bits/value, the amount of shifts, masks and ors is always the same. If the number of bits/value is known beforehand, specialized cases should be used (I made a factory that selects between packed, aligned and direct (#3), depending on the number of bits/value). The reason for not doing so in the first place is that I wanted to let the structure auto-adjust the bits/value when a new value was added. Having different implementations encapsulated in the same class means another level of indirection or conditionals, both of which I wanted to avoid for performance reasons. That being said, I haven't tested how much of a penalty this would be.

The standard use case seems to be some sort of update-round, after which no updates are performed. Having a cleanupAndOptimize-call that potentially creates a new and optimized structure, would fit well into this and would avoid the indirection / conditional penalty.

A whole other matter is long vs. ints. I've tried using longs instead of ints as the backing array and the penalty on my 32bit processor was very high (I need to make some tests on this). If it must be possible to set and get longs, it's hard to avoid using long[] as the internal structure, but if ints are accepted as the only valid values, selecting long[] as backing array for 64 bit machines and int[] for 32 bit, might be the solution.

All this calls for a factory-approach to hide the fairly complex task of choosing the right implementation.

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I made some small tweaks to improve performance and added long[]-backed versions of Packed (optimal space) and Aligned (no values span underlying blocks), then ran the performance tests on 5 different computers. It seems very clear that level 2 cache (and presumably RAM-speed, but I do not know how to determine that without root-access on a Linux box) plays a bigger role for access speed than mere CPU speed. One 3GHz machine with 1MB of level 2 cache was about half as fast as a 1.8GHz laptop with 2MB of level 2 cache.

There are a whole lot of measurements, and it is getting hard to digest them. I've attached logs from the 5 computers, should anyone want to have a look. Some observations are:

  1. The penalty of using long[] instead of int[] on my 32 bit laptop depends on the number of values in the array. For less than a million values, it is severe: the long[]-version is 30-60% slower, depending on whether packed or aligned values are used. Above that, it was 10% slower for Aligned, 25% slower for Packed. On the other hand, 64 bit machines do not seem to care that much whether int[] or long[] is used: there was a 10% win for arrays below 1M for one machine, a 50% win for arrays below 100K (8% for 1M, 6% for 10M) for another, and a small loss of below 1% for all lengths above 10K for a third.

  2. There's a fast drop-off in speed when the array reaches a certain size that is correlated to level 2 cache size. After that, the speed does not decrease much when the array grows. This also affects direct writes to an int[] and has the interesting implication that a packed array out-performs the direct access approach for writes in a number of cases. For reads, there's no contest: Direct access to int[] is blazingly fast.

  3. The access-speed of the different implementations converges when the number of values in the array rises (think 10M+ values): The slow round-trip to main memory dwarfs the logic used for value-extraction.

Observation #3 supports Mike McCandless's choice of going for the packed approach, and #1 suggests using int[] as the internal structure for now. Using int[] as the internal structure makes it unfeasible to accept longs as input (or rather: longs with more than 32 significant bits). I don't know if this is acceptable?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The first section is for 1M values in the structure, the second is for 10M. As the CPU on the test-machine (Intel T2400) has only 2MB of level 2 cache, the increased processing time for the seemingly same amount of work is an effect of more cache-misses.

Got it.

Caching also accounts for why the packed version is sometimes better than the aligned. For values representable as 9 or 17 bits, the aligned version needs 16 and 32 bits respectively. In the case with 10M values, the packed version uses 1.1MB and 2.1MB for 9 and 17 bits respectively, while the aligned uses 2MB and 4MB respectively. The simpler logic of the aligned version does not compensate enough for the higher amount of trips around main memory.

OK, though, I think we should somehow exclude these CPU cache effects from the test. In a real production situation, lots of other things are competing for that cache, so I think the mostly-cache-miss case is most realistic here.

Actually a good way to do that is to test it in a wider context, eg once I can get #3262 "integrated", we can test sorting by int field with these different ways of encoding.

I did not generate any specialized code for the aligned case: No matter the number of bits/value, the amount of shifts, masks and ors is always the same. If the number of bits/value is known beforehand, specialized cases should be used (I made a factory that selects between packed, aligned and direct (#3), depending on the number of bits/value).

OK I have a start at this under #3262. I'll try to clean up & post here...

The reason for not doing so in the first place is that I wanted to let the structure auto-adjust the bits/value when a new value was added. Having different implementations encapsulated in the same class means another level of indirection or conditionals, both of which I wanted to avoid for performance reasons. That being said, I haven't tested how much of a penalty this would be.

The standard use case seems to be some sort of update-round, after which no updates are performed. Having a cleanupAndOptimize-call that potentially creates a new and optimized structure, would fit well into this and would avoid the indirection / conditional penalty.

I think the writing API can be a write-once API, where I declare the maxValue I will write. This fits with the indexing chain, where we buffer values in RAM and then flush to the segment files.

This way we don't need the complexity of having to resize the bits internally when a new max value is seen.

So, the factory would take number of values we will write, the max value, and I guess an enum for Packed/Aligned/DedicatedArray, and return a simple writer that just exposes an "add(long value)" method?

A whole other matter is long vs. ints. I've tried using longs instead of ints as the backing array and the penalty on my 32bit processor was very high (I need to make some tests on this). If it must be possible to set and get longs, it's hard to avoid using long[] as the internal structure, but if ints are accepted as the only valid values, selecting long[] as backing array for 64 bit machines and int[] for 32 bit, might be the solution.

Hmm... I guess we could still support values requiring more than 32 bits, encoded into int[], but it's more hairy as the value could span 2 or 3 ints.

Probably we should specialize/default the factory selection based on whether the JVM is 32/64 bit? And, whether the bit size is <= 32.

All this calls for a factory-approach to hide the fairly complex task of choosing the right implementation.

Yes.

I made some small tweaks to improve performance and added long[]-backed versions of Packed (optimal space) and Aligned (no values span underlying blocks), then ran the performance tests on 5 different computers. It seems very clear that level 2 cache (and presumably RAM-speed, but I do not know how to determine that without root-access on a Linux box) plays a bigger role for access speed than mere CPU speed. One 3GHz machine with 1MB of level 2 cache was about half as fast as a 1.8GHz laptop with 2MB of level 2 cache. There are a whole lot of measurements, and it is getting hard to digest them. I've attached logs from the 5 computers, should anyone want to have a look. Some observations are:

1. The penalty of using long[] instead of int[] on my 32 bit laptop depends on the number of values in the array. For less than a million values, it is severe: the long[]-version is 30-60% slower, depending on whether packed or aligned values are used. Above that, it was 10% slower for Aligned, 25% slower for Packed. On the other hand, 64 bit machines do not seem to care that much whether int[] or long[] is used: there was a 10% win for arrays below 1M for one machine, a 50% win for arrays below 100K (8% for 1M, 6% for 10M) for another, and a small loss of below 1% for all lengths above 10K for a third.

2. There's a fast drop-off in speed when the array reaches a certain size that is correlated to level 2 cache size. After that, the speed does not decrease much when the array grows. This also affects direct writes to an int[] and has the interesting implication that a packed array out-performs the direct access approach for writes in a number of cases. For reads, there's no contest: Direct access to int[] is blazingly fast.

3. The access-speed of the different implementations converges when the number of values in the array rises (think 10M+ values): The slow round-trip to main memory dwarfs the logic used for value-extraction.

Observation #3 supports Mike McCandless's choice of going for the packed approach, and #1 suggests using int[] as the internal structure for now. Using int[] as the internal structure makes it unfeasible to accept longs as input (or rather: longs with more than 32 significant bits). I don't know if this is acceptable?

I think we should agree on an API that #3262 can use to write/read packed ints, then get our two patches talking. I'll pull out the barebones packed ints I'm currently using, and post here... then let's merge/iterate to a common API, so I can cutover to the patch from here in #3262?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

How about something like this API, for writing packed ints:

abstract class Writer {
  public abstract void add(long v) throws IOException;
  public abstract void finish() throws IOException;
}

then a factory:

enum Mode {Packed, Aligned, FixedArray};

public static Writer getWriter(IndexOutput out, int valueCount, long maxValue, Mode mode);

(we can iterate on the names... always the hardest part).

Packed means full bit packing (most space efficient, but slowest decode time), Aligned might waste some bits (eg for nbits=4, that's naturally aligned, but for nbits=7, we'd waste 1 bit per long), FixedArray (which'd use byte[], short[], int[], long[]) would potentially waste the most bits but have the fastest decode.

If nbits happens to be 8, 16, 32, 64, the factory should just always use FixedArray, I think? And of course powers of two will automatically be Aligned (with the per-nbits specialized code).

We can also default impls to an underlying int[] vs long[] backing store depending on 64/32 bit JRE, and nbits. If the JRE is 32 bit but nbits is > 32, I think we just use long[] backing.

For reading, a similar API:

abstract class Reader {
  public abstract long get(int index);
}

public static Reader getReader(IndexInput in);
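
For clarity, an end-to-end usage sketch of this proposal (PackedInts as the factory class and the Directory calls are my assumptions, not part of the proposal):

// Assuming a Directory dir and a long[] values of length 1000, filled elsewhere.
// Write 1000 values, none larger than 255 (8 bits), fully packed:
IndexOutput out = dir.createOutput("packed.bin");
Writer w = PackedInts.getWriter(out, 1000, 255, Mode.Packed);
for (int i = 0; i < 1000; i++) {
  w.add(values[i]);
}
w.finish();
out.close();

// Read the values back with random access:
IndexInput in = dir.openInput("packed.bin");
Reader r = PackedInts.getReader(in);
long v = r.get(42);
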
asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch with my current roughed up approach for packed ints (from LUCENE-2186).

Let's try to standardize the API, then merge the two approaches, then I'll cutover with #3262.

It includes gen.py, which autogens dedicated decoders for each of the nbits cases, excluding 8, 16, 32, 64 bits, since these are done with dedicated array reader impls.

It uses a single writer (I don't think we need specialized writers), but the writer encodes in the same byte order as IndexOutput.writeLong, so that the byte order matches the dedicated array reader impls.

It only encodes into long[] – we should create cases for int[] (selected by the factory depending on 32 vs 64 bit jre).

We should also explore just reading in a full byte[] and using Int/Short/Long buffer to decode. This API should also allow for a future mmap impl as well.

Probably we should name all of these UnsignedPackedInts, since they require values >= 0. (Hmm, though, the 64 bit case is tricky – I guess we make an exception for that case).

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Introducing yet another level of indirection and making a byte/short/int/long provider detached from the implementation of the packed values is tempting. I'm fairly afraid of the overhead of the extra method-calls, but I'll try it and see what happens.

I've read your (Michael McCandless) code and I can see that the tiny interfaces for Reader and Writer work well for your scenario. However, as the Reader must have (fast) random access, wouldn't it make sense to make it possible to update values? That way the same code can be used to hold ords for sorting and similar structures.

Instead of Reader, we could use

abstract class Mutator {
  public abstract long get(int index);
  public abstract void set(int index, long value);
}

...should the index also be a long? No need to be bound by Java's 31-bit limit on array-length, although I might very well be over-engineering here.

The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Introducing yet another level of indirection and making a byte/short/int/long provider detached from the implementation of the packed values is tempting.

You mean the layer that stores the minValue, so that the full range is supported? I actually think we should absorb that into packed ints, so it's only one method call per lookup, and specialize the "positive only" cases to avoid the extra add per lookup.

With that fix, it's still a method call per lookup, but I don't see how we can get away from that, unless we allow for exposure of the raw array for the no-packing cases (which we could consider...).
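
For reference, the minValue layer being discussed could be as small as this (a hypothetical wrapper; absorbing it into the packed impl itself saves the extra method call per lookup):

class OffsetReader extends Reader {
  private final Reader delegate; // packed, non-negative values
  private final long minValue;
  OffsetReader(Reader delegate, long minValue) {
    this.delegate = delegate;
    this.minValue = minValue;
  }
  public long get(int index) {
    return minValue + delegate.get(index); // the extra add per lookup
  }
}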

Remember we use packed ints in places where we can accept some loss of CPU perf. for improvements in RAM usage (see the comment I just added to LUCENE-2186).

However, as the Reader must have (fast) random access, wouldn't it make sense to make it possible to update values?

Yeah, we do eventually want CSF to be updateable, but I don't think we need this for phase 1? Likewise, I think all we need now for Lucene is a "WriteOnceWriter", not a "RandomAccessWriter". Ie, you open a writer, you add (sequentially) all values, you close.

...should the index also be a long?

I would stick with int now (we are doing this for Lucene, whose docIDs are still ints...). Design for today.

The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).

I'm confused – a future mmap impl shouldn't put pressure on the file format used by packed ints today? Ie, a future mmap impl can use a totally different format than the designed-to-be-slurped-into-RAM format for packed ints, today?

Also, what do you mean by optimized for building not searching?

Note that on 32 bit machines, if there is actually a gain, we can make a backing store with ints yet still allow for storage of nbits>32? It "just" means a value may be split across 2 or 3 values?

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

I've made a remark at #2484 (a first attempt at a PFOR implementation) about the header structure for encoding this. One thing that is not covered here is how to deal with input arrays of intermediate length, shorter than 32 but longer than 3 or 4. Shorter ones can easily be encoded as vByte. Simple9 might be a solution, but it has only 28 data bits and 9 different encoding cases, so it appears to be somewhat small. There is a first attempt at Simple9 at #3265.

Since the discussion here is on alignment (int/long) I'm wondering how (and whether) to go from the current byte aligned structures to int aligned. Using aligned ints would save the shifting done at IndexInput.getInt() that reads 4 bytes and shifts them into place to create an int from them. Simple9 can be int aligned and I'd like to add bigger variations of that, but preferably only ones that add a multiple of 4 bytes.

So would it make sense to add functionality to IndexInput and IndexOutput to allow int aligned access? Are Java's data streams and/or nio buffers smart enough to avoid the byte shifting for ints in such cases?

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Using aligned ints would save the shifting done at IndexInput.getInt() that reads 4 bytes and shifts them into place to create an int from them.

How's that? Is there some JVM intrinsic?

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

For the record: on the flex branch I just saw IntIndexInput and IntIndexOutput.

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

Using aligned ints would save the shifting done at IndexInput.getInt() that reads 4 bytes and shifts them into place to create an int from them.

How's that? Is there some JVM intrinsic?

In the aligned case, and with the right byte order, getInt() on a java data stream might be reduced to processor instructions operating on 4 bytes at a time.

Is that what you mean by JVM intrinsic?

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

In the aligned case, and with the right byte order, getInt() on a java data stream might be reduced to processor instructions operating on 4 bytes at a time. Is that what you mean by JVM intrinsic?

Yes. But are you aware of that having been implemented in any JVMs?

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Toke: Introducing yet another level of indirection and making a byte/short/int/long provider detached from the implementation of the packed values is tempting.

You mean the layer that stores the minValue, so that the full range is supported? I actually think we should absorb that into packed ints, so it's only one method call per lookup, and specialize the "positive only" cases to avoid the extra add per lookup.

No, I mean the backing array of ints or longs. For memory, the obvious choice is int[] or long[], but designing for flexibility calls for an API, which coincidentally is identical to Reader. So, a 4-bit Aligned could be backed by a DirectInt (which will contain an int[]) or a Persistent (doing mmap or some other persistent-oriented lookup).

...I should show this in code instead. I'll try and find the time.

Yeah, we do eventually want CSF to be updateable, but I don't think we need this for phase 1?

Not in the specific scenario, no.

Toke: The whole 32bit vs. 64bit as backing array does present a bit of a problem with persistence. We'll be in a situation where the index will be optimized for the architecture used for building, not the one used for searching. Leaving the option of a future mmap open means that it is not possible to do a conversion when retrieving the bits, so I have no solution for this (other than doing memory-only).

I'm confused - a future mmap impl shouldn't put pressure on the file format used by packed ints today? Ie, a future mmap impl can use a totally different format than the designed-to-be-slurped-into-RAM format for packed ints, today?

Also, what do you mean by optimized for building not searching?

When we're using Aligned, the choice of using int or long for the backing array dictates how the persistent bitstream will be. If the index builder is a 64 bit machine and 7 bits/value is used, the result will be longs, each containing 9 values.
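
In numbers (a sketch of the arithmetic for 7 bits/value):

int bitsPerValue = 7;
int valuesPerLong = 64 / bitsPerValue; // 9 values, 64 - 9*7 = 1 wasted bit
int valuesPerInt  = 32 / bitsPerValue; // 4 values, 32 - 4*7 = 4 wasted bits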

When the structure is read into memory by the searcher, the backing array will again be long. But if the searcher happens to be a 32 bit machine, the performance will be less than if ints were used for the backing array.

One way to handle this is to do a conversion, when loading into memory. If the searcher is 64 bit, it will always convert into longs. If the searcher is 32 bit, it will convert into int, if the bits/value is <= 32. The conversion is cheap, so this is no problem in itself.

However, if we're planning for the future (or for flexibility, depending on point of view), we would very much like the persistent format to be directly usable, so that we don't need to load the whole structure into memory. This rules out conversion and sets us back to step 1: The index will be optimized for either 32bit or 64 bit searchers.

Oh well, we could always just ignore it and say that Aligned is 64 bit based. As that is more memory-efficient than 32-bit-based Aligned, maybe the slightly smaller number of backing longs will compensate for the overhead of retrieving longs on a 32 bit machine.

Note that on 32 bit machines, if there is actually a gain, we can make a backing store with ints yet still allow for storage of nbits>32? It "just" means a value may be split across 2 or 3 values?

My guess is that the number of values vs. the available level 2 cache will play a big role here: For a relatively small number of values, the added logic will be costly. For a larger number of values, the cache-misses will dwarf that.

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

Paul: In the aligned case, and with the right byte order, getInt() on a java data stream might be reduced to processor instructions operating on 4 bytes at a time. Is that what you mean by JVM intrinsic?

Yonik: Yes. But are you aware of that having been implemented in any JVMs?

I remember reading about an implementation doing that, but I can't find it now.

If one cannot have such ints directly, there is not much point in unpacking via IndexInput.getInt(); it would be better to unpack directly from the bytes. Also, in that case, I'd prefer to use single byte increments (over 4 byte increments) for the size of the encoded data for variations on Simple9.

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Looking at bit patterns and persistence, I see 3 different ones: packed, aligned32 and aligned64. Regardless of whether 32bit or 64bit is used when a packed structure is created, it can be read as both 32bit and 64bit packed. As for the special cases of 8, 16, 32 and 64 bits/value, the bit patterns are identical for both packed and aligned. This leads me to propose a header designating one of the three structures mentioned.

The current draft from Michael McCandless states both bitsPerValue and maxValue in the persistent format. It seems redundant to have both, but I might be missing something here? Either way, the bitsPerValue is ambiguous, as it does not translate to memory usage the same way for packed, aligned32 or aligned64. Should I choose, I'd go for maxValue.

What about a header stating

format (String "packed", "aligned32" or "aligned64")
valueCount (vInt)
maxValue (vLong)

?
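
In terms of the existing IndexOutput/IndexInput primitives, such a header would be written and read roughly like this (a sketch):

out.writeString("packed"); // or "aligned32" / "aligned64"
out.writeVInt(valueCount);
out.writeVLong(maxValue);

String format = in.readString();
int valueCount = in.readVInt();
long maxValue = in.readVLong();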

I have working code for packed32 and packed64 and am currently fitting it into Michael's patch. I hope to finish it this weekend.

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

How about encoding the header something like this:

VByte: 1 bit. Simple9: 4 bits. These cases imply valueCount and maxValue. For around 4-16 numbers, encode the complete header in 5-6 bits, also implying valueCount and maxValue. For FrameOfRef, encoding 32 or more numbers, a larger header can be used, 4 bytes for example; maxValue is implied from the number of frame bits, and valueCount could be 32, 64 or 128 (i.e. 2 bits). Also the number of exceptions could be put there.

This header "type" can be chosen depending on the given length of the encoded sequence.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I have working code for packed32 and packed64 and am currently fitting it into Michael's patch. I hope to finish it this weekend.

Nice! Sounds like good progress Toke!

The current draft from Michael McCandless states both bitsPerValue and maxValue in the persistent format.

I was only storing maxValue as a convenience for the layer above – we don't need to do that – I think storing format (packed, aligned32, aligned64) and bitsPerValue makes sense.

Regardless of whether 32bit or 64bit is used when a packed structure is created, it can be read as both 32bit and 64bit packed.

Right, but with the challenge (if we use 32bit backing array) of properly handling the nbits>32 case (this is perfectly doable... "it's just software" ;) ).

As for the special cases of 8, 16, 32 and 64 bits/value, the bit patterns are identically to both packed and aligned.

I had chosen to match IndexOutput/IndexInput's byte order (big-endian) so that the packed format naturally reads back with IndexInput's readLong/Int/Short (I added a readShort).

I'm assuming for these special cases that dedicated Reader impls, with byte[], short[], int[], long[] backing array, is faster than eg backing with a long[] and shift/masking per lookup.

But eg for the nbits=3 case, aligned 32/64 would ensure that no value spans across two underlying entries in the backing array (wasting some bits of storage in exchange). Whereas the nbits=2 or 4 cases would naturally be aligned anyway...

One question: the Reader api is now this:

long get(int index);

Which is convenient since obviously long can accommodate all of the underlying possible nbits, but... for small nbits values, this logically entails a cast. EG say nbits=8, so it's a direct byte[] backing array. get() must cast up to long, and caller must operate with long... I'm wondering whether that forced casting is going to hurt performance enough to make us want to have dedicated precision (8, 16, 32, 64) Reader interfaces....
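
Concretely, the widening in the nbits=8 case might look like this (a sketch, not code from the patch):

class DirectByteReader extends Reader {
  private final byte[] values;
  DirectByteReader(byte[] values) { this.values = values; }
  public long get(int index) {
    return values[index] & 0xFFL; // unsigned widening from byte to long
  }
}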

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This header "type" can be chosen depending on the given length of the encoded sequence.

I think we should separate the header needed for this issue (single header stored once at the beginning of a potentially massive file), from the per-block header used by int block based formats like SimpleN/PForDelta/etc?

I think Toke's proposed header, if we swap in bitsPerValue in place of maxValue, is good for this issue?

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I've uploaded a preliminary patch with packed32, packed64, directByte, directShort, directInt and directLong implementations. I've used Michael McCandless's patch as a foundation, but the new patch was generated to be independent of the old one. It uses maxValue instead of bitsPerValue for the header, there's no test of packed32, and there's a general need for cleanup. The main missing components are aligned32 and aligned64.

I've done quite a bit of refactoring and (cheater that I am) added setters to all implementations of Reader, although not to the interface. Besides the nitty-gritty details of the implementation, I suspect that the code for selecting which implementation to use is a prime candidate for discussion. It is located in PackedInts and tries to select the best implementation based on preference for packed, aligned and direct paired with preference for 32bit and 64bit.

  private static IMPLEMENTATION getImplementation(
          long maxValue, PRIORITY priority, BLOCK_PREFERENCE block) {
    int bits = bitsRequired(maxValue);
    switch (priority) {
      case direct: {
        bits = getNextFixedSize(bits);
        break;
      }
      case aligned: {
        if (block == BLOCK_PREFERENCE.bit32) {
          if (bits == 7 || bits >= 11) {
            bits = getNextFixedSize(bits); // Align to byte, short, int or long
          }
        } else {
          if ((bits >= 13 && bits <= 15) || (bits >= 22)) {
            bits = getNextFixedSize(bits); // Align to short, int or long
          }
        }
      }
    }
    switch (bits) { // The safe choices
      case 8: return IMPLEMENTATION.directByte;
      case 16: return IMPLEMENTATION.directShort;
      case 32: return IMPLEMENTATION.directInt;
      case 63:
      case 64: return IMPLEMENTATION.directLong;
    }

    if (priority == PRIORITY.aligned || bits == 1 || bits == 2 || bits == 4) {
      return block == BLOCK_PREFERENCE.bit32 && bits < 32 ?
              IMPLEMENTATION.aligned32 : IMPLEMENTATION.aligned64;
    }
    return block == BLOCK_PREFERENCE.bit32 && bits < 32 ?
            IMPLEMENTATION.packed32 : IMPLEMENTATION.packed64;
  }

I think that an "auto"-value for priority is worth considering: For 9, 17 and 33 bits/value, packed is often faster than aligned due to only using half the memory and thus having lower risk of level 2 cache misses. For high bits/value, such as 30, 31, 62, 63 and 64 (guesstimating here), choosing direct seems to be the best choice for most situations. Users of PackedInts should not be expected to know this.

I'll start work on aligned32 and aligned64, but I will leave the rest of the patch alone for now, as I suspect that there'll be some changes to the current draft.

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

The generated code in the patches has quite a few switch statements to decode a single value. These switch statements could be avoided by using something like this (adapted from the 1410b patch):

/** Decode a value from the compressed array of numBits-bit values by retrieving the corresponding bits.
 * Since numBits is always smaller than the number of bits in an int,
 * at most two ints in the buffer will be used.
 */
public int decodeCompressedValueBase(int compressedPos, int numBits) {
  int compressedBitPos = numBits * compressedPos;
  int intIndex = (compressedBitPos >> 5);
  int firstBitPosition = compressedBitPos & 31;
  int value = intBuffer.get(intIndex) >>> firstBitPosition;
  if ((firstBitPosition + numBits) > 32) { // value does not fit in first int
    intIndex++;
    value |= (intBuffer.get(intIndex) << (32 - firstBitPosition));
  }
  final int maxValue = (int) ((1L << numBits) - 1);
  return value & maxValue;
}

As maxValue is essentially a mask, it could also be looked up in an array.

Could that be faster than these generated switch statements?
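
The mask lookup mentioned above could be as simple as this (a sketch, assuming numBits stays in 0..31 as the method comment states):

// MASK[b] keeps the low b bits.
static final int[] MASK = new int[32];
static {
  for (int b = 0; b < 32; b++) {
    MASK[b] = (int) ((1L << b) - 1);
  }
}
// ... and in decodeCompressedValueBase: return value & MASK[numBits];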

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I think Michael's generated code was meant as a temporary solution, until a handcrafted version was available. In packed32 and packed64, the code for decoding a value is

  public long get(final int index) {
    final long majorBitPos = index * elementBits;
    final int elementPos = (int)(majorBitPos >>> BLOCK_BITS); // / BLOCK_SIZE
    final int bitPos =     (int)(majorBitPos & MOD_MASK); // % BLOCK_SIZE

    final int base = bitPos * FAC_BITPOS;

    return ((blocks[elementPos] << shifts[base]) >>> shifts[base+1]) |
            ((blocks[elementPos+1] >>> shifts[base+2]) & readMasks[bitPos]);
  }

which looks a lot like your (Paul Elschot) suggestion. It avoids all conditionals at the cost of more bit-operations and a dummy element at the end of the backing array. I must admit that my performance testing of the different solutions has been fairly ad hoc (measure, tweak, repeat), so an appropriate test would be in order.

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

Nice to see a more mature alternative. The trade off between (dummy element/unconditioned extra access) and (conditioned extra access) is of later concern. Both the extra access and the condition take cycles, so it's not clear which one will be faster. It might even depend on the value of elementBits.

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

As to whether to use int or long in the unsigned packed int interface, the only numbers that will probably need to be long in the foreseeable future are docids. However, this change can be delayed by not allowing an index segment to grow beyond 2^32 or 2^31-1 docs, and by only implementing long docids for multiple index segments. So as long as it is OK to assume that an index segment can have MAXINT docs at most, we could use an int interface here. Do Nutch and/or Solr already have long docids implemented on multiple index readers/writers or segments?

The other limit is the max size of a document field. If that goes beyond MAXINT, the positions and maybe even the frequencies would need to be changed from int to long. But for now I can't think of a real use case with a document field that has more than MAXINT positions. That would be like a book with ten million pages of text. Did anyone ever run into this limitation?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Good progress !

I think Michael's generated code was meant as a temporary solution, until a handcrafted version was available

Actually that was intended to be a fast impl... the switch should be compiled to a direct lookup (maybe plus a conditional to catch the "default" case even though it will never happen...ugh). But I like your impl with no conditional at all. We should test both.

As to whether to use int or long in the unsigned packed int interface, the only numbers that will probably need to be long in the foreseeable future are docids.

Also the file offsets into the terms dict, possibly the offsets in RAM into the terms dict character data (UTF8 byte[]). Also, when we do column stride fields, we allow storing values > int. I think we should stick with long get(index) for now.

Other comments:

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

As to whether to use int or long in the unsigned packed int interface, the only numbers that will probably need to be long in the foreseeable future are docids.

Also the file offsets into the terms dict, possibly the offsets in RAM into the terms dict character data (UTF8 byte[]). Also, when we do column stride fields, we allow storing values > int. I think we should stick with long get(index) for now.

Indeed I missed the offsets. However, a long get(index) will probably get in the way, because there are far fewer offsets than normal data, and the 32 bit processors will be around for a long time. So I think we'll need both long (for the offsets) and int (for the rest) in the end.

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I think we should remove getMaxValue() from the Reader interface?

Yes. I only left maxValue in the code because I ran out of time.

Why create the IMPLEMENTATION enum? Why not simply return an [anonymous] instance of Writer?

The IMPLEMENTATION enum is only used internally and is package private. It was introduced to separate decision-making from specific implementations - e.g. the packed writer is the same for packed32 and packed64, although the reader differs. But it could very well be that it confuses more than it helps.

Why not store bitsPerValue in the header instead of maxValue?

As above - I did not have the time to fix it and wanted to push the patch in order to move the discussion along.

Also, the maxValue at write time should not have to be known - eg the factory API should let me ask for a direct short writer without declaring the maxValue I will store.

Since Packed and Aligned need maxValue (or bitsPerValue), this would require two distinct methods in the factory, each returning a subset of the possible implementations. I find that rather confusing.

I wonder if we should add an optional Object getDirectBackingArray(). The packed/aligned impls would return null, but the direct byte/short/int/long impls would return their array. [...]

Speaking of API additions, I find that

public int getBitsPerValue();
public int size();
public void set(long value);
public void clear();

are trivial to implement for the known implementations. They open the door to things like auto-growing to fit higher values via a delegating wrapper, re-using the structure for counting purposes, and sorting in-place.

I think we shouldn't put a getWriter on every Reader impl... because it's a one to many mapping? Eg the format written by PackedWriter can be read by direct byte/short/int/long, Packed32/64.

Quite right.

For starters I don't think we should make reader impls that can read nbits > 31 bits with an int[] backing array. I think long[] backing array is fine.

The current patch limits nbits to 32 for Packed32. I am confident that an int-backed reader with nbits > 32 will be slower than a long-backed reader on a 32 bit machine.

I don't think we need separate PRIORITY and BLOCK_PREFERENCE? Can't we have a single enum (STORAGE?) with: packed, aligned32, aligned64? "Direct" is really just packed with nbits rounded up to 8,16,32,64.

I agree that it does complicate matters somewhat to have them separated. When calling getReader, the BLOCK_PREFERENCE should also be removed, as the block preference will always follow the architecture. Removing the "direct" option would require the caller to do some of the logic in some cases: if low processing overhead is a priority, direct is preferable, and when the bitsPerValue is calculated, the caller would have to do the "if (bitsPerValue > 32) bitsPerValue = 64" adjustment themselves, and so on.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Also, the maxValue at write time should not have to be known - eg the factory API should let me ask for a direct short writer without declaring the maxValue I will store.

Since Packed and Aligned need maxValue (or bitsPerValue), this would require two distinct methods in the factory, each returning a subset of the possible implementations. I find that rather confusing.

Maybe the caller just always uses the bitsRequired method to get the required bit width per value?

Though, when we enable specialized storing of negative values as well, that'll be a hassle...

OK let's leave it as you must pass the maxValue for now.

Speaking of API additions, I find that

public int getBitsPerValue();
public int size();
public void set(long value);
public void clear();

are trivial to implement for the known implementations. They open up for things like auto-growing to fit higher values by using a delegating wrapper, re-using the structure for counting purposes and sorting in-place.

I think the first 2 make sense, but I'd rather not pursue the latter two at this time. Ie, I think this API only needs write-once, and then read-only.

If we open up random writing (set/clear), with auto-growing, etc., that does add complexities to the impl. EG the backing store can no longer be final, we'd have to do some locking (or mark the array volatile) for thread safety, etc.

As far as I can tell... Lucene today doesn't yet need random write to the packed ints. The terms dict index and CSF are the two needs I think we have now. Someday (when CSF supports writing) we will... but not yet today?

I don't think we need separate PRIORITY and BLOCK_PREFERENCE? Can't we have a single enum (STORAGE?) with: packed, aligned32, aligned64? "Direct" is really just packed with nbits rounded up to 8,16,32,64.

I agree that it does complicate matters somewhat to have them separated. When calling getReader, the BLOCK_PREFERENCE should also be removed, as the block preference will always follow the architecture. Removing the "direct" option would require the caller to do some of the logic in some cases: if low processing overhead is a priority, direct is preferable, and when the bitsPerValue is calculated, the caller would have to do the "if (bitsPerValue > 32) bitsPerValue = 64" adjustment themselves, and so on.

(There's a bug in the patch in PackedInts.getReader, where it switches the block size based on whether JRE is 64 bit: it's always choosing 64 bit now).

The "direct" option only applies during writing (ie, you round up to the nearest native type bit width). At read time it's just a packed 8/16/32/64.

Hmm... maybe we could just add an optional 2nd arg to bitsRequired, a boolean eg "roundUpToNative" or something, which if true does that rounding for you? (And then go back to caller computes bit width and passes it in?).
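
Such an optional 2nd arg might look like this (a sketch; bitsRequired itself is in the patch, the boolean variant is hypothetical):

static int bitsRequired(long maxValue, boolean roundUpToNative) {
  // Smallest number of bits that can represent maxValue (1 for maxValue=0).
  int bits = Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
  if (!roundUpToNative) {
    return bits;
  }
  // Round up to the nearest native width: byte, short, int or long.
  if (bits <= 8)  return 8;
  if (bits <= 16) return 16;
  if (bits <= 32) return 32;
  return 64;
}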

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Toke, are you still working on this...? If not, I can take a crack? I'd really like to get something online here before we land flex, so the terms dict index isn't so wasteful of RAM.

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Changing the code to use bitsPerValue instead of maxValue for constructors and the persistent format took a bit longer than anticipated. To get things flowing, I've attached the code as it is now. I've moved the classes to o.a.l.util.packed and performed some cleanup too. It still needs aligned32 and aligned64 implementations and more cleanup, which I'll work on for the next hour today and hopefully some hours tomorrow.

One current use-case for mutable packed ints would be for StringOrdValComparator (using an auto-grow wrapper), although the gain might be small as the overhead of the Strings is so large. I understand the problem of making all packed ints mutable, but a compromise might be to have a Mutable interface and a new factory-method that returns the same implementations as Mutable instead of Reader? That way it is possible to use the implementations for things such as sorting instead of having to re-implement them. I've left the interface for Reader clean as you suggested, but kept the implementations of set in the classes for now, as the code has already been made.

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I've read through the comments on LUCENE-1990 and implemented most of what has been suggested. The attached patch contains implementations for all the variants we've talked about, including aligned. There's a known bug in persistence for aligned64 (and probably also for aligned32) that I haven't stomped yet. There's also a clear need for a more elaborate unit-test with regard to persistence.

Other outstanding issues, as I see them, are whether or not mutable packed arrays should be requestable (as general purpose data structures) and how the factory for creating a writer should work. I have added a getMutable-method to the factory and not touched the return type Reader for the getReader-method. That way read-only users will not be tempted to try and update the received structure. As for the arguments to the factory, Michael McCandless suggested that the preferences should be expressed with (packed | aligned32 | aligned64 | auto). As far as I can see, this should work. However, I've only just reached this conclusion and haven't had the time to implement it.

A speed-test has been added and the results from my machine can be seen below. In order for it to be really usable, it should be tried on other machines too.

I won't touch the code before sometime next week, but I'll keep an eye on LUCENE-1990 comments until then.

        bitsPerValue          valueCount            getCount    PackedDirectByte   PackedDirectShort            Packed32     PackedAligned32     PackedDirectInt            Packed64     PackedAligned64    PackedDirectLong
                   1                1000            10000000                 167                 141                 258                 242                 172                 264                 242                 183
                   1             1000000            10000000                 224                 232                 266                 233                 246                 262                 238                 338
                   1            10000000            10000000                 359                 469                 280                 278                 508                 278                 272                 551
                   3                1000            10000000                 168                 166                 265                 241                 163                 262                 243                 166
                   3             1000000            10000000                 227                 226                 261                 251                 239                 274                 249                 330
                   3            10000000            10000000                 406                 476                 301                 304                 522                 300                 308                 547
                   4                1000            10000000                 167                 168                 266                 239                 164                 285                 239                 169
                   4             1000000            10000000                 228                 231                 294                 274                 262                 291                 269                 314
                   4            10000000            10000000                 385                 480                 308                 333                 514                 331                 315                 557
                   7                1000            10000000                 172                 174                 278                 248                 162                 271                 238                 177
                   7             1000000            10000000                 224                 236                 289                 281                 272                 278                 277                 345
                   7            10000000            10000000                 405                 473                 389                 447                 516                 399                 402                 553
                   8                1000            10000000                 192                 171                 268                 242                 174                 291                 240                 163
                   8             1000000            10000000                 226                 232                 291                 284                 286                 274                 265                 314
                   8            10000000            10000000                 381                 467                 406                 428                 512                 422                 419                 580

        bitsPerValue          valueCount            getCount   PackedDirectShort            Packed32     PackedAligned32     PackedDirectInt            Packed64     PackedAligned64    PackedDirectLong
                   9                1000            10000000                 166                 274                 241                 170                 261                 237                 163
                   9             1000000            10000000                 229                 299                 273                 250                 284                 275                 327
                   9            10000000            10000000                 483                 443                 477                 519                 438                 455                 568
                  15                1000            10000000                 170                 265                 239                 174                 264                 235                 162
                  15             1000000            10000000                 232                 285                 274                 240                 278                 269                 339
                  15            10000000            10000000                 473                 518                 524                 523                 519                 521                 550
                  16                1000            10000000                 166                 263                 236                 172                 264                 235                 160
                  16             1000000            10000000                 229                 285                 278                 244                 293                 272                 332
                  16            10000000            10000000                 470                 513                 517                 509                 534                 529                 548

        bitsPerValue          valueCount            getCount            Packed32     PackedAligned32     PackedDirectInt            Packed64     PackedAligned64    PackedDirectLong
                  17                1000            10000000                 262                 255                 177                 260                 234                 160
                  17             1000000            10000000                 290                 306                 273                 304                 290                 320
                  17            10000000            10000000                 532                 572                 533                 529                 556                 551
                  28                1000            10000000                 269                 256                 187                 267                 238                 163
                  28             1000000            10000000                 293                 295                 253                 293                 296                 312
                  28            10000000            10000000                 542                 567                 501                 548                 567                 542
                  31                1000            10000000                 260                 235                 177                 266                 232                 158
                  31             1000000            10000000                 292                 294                 244                 296                 297                 328
                  31            10000000            10000000                 552                 563                 516                 562                 568                 548

        bitsPerValue          valueCount            getCount     PackedDirectInt            Packed64     PackedAligned64    PackedDirectLong
                  32                1000            10000000                 172                 263                 241                 166
                  32             1000000            10000000                 241                 291                 297                 320
                  32            10000000            10000000                 519                 556                 573                 546

        bitsPerValue          valueCount            getCount            Packed64     PackedAligned64    PackedDirectLong
                  33                1000            10000000                 264                 239                 159
                  33             1000000            10000000                 293                 374                 319
                  33            10000000            10000000                 559                 595                 552
                  47                1000            10000000                 264                 242                 164
                  47             1000000            10000000                 319                 369                 322
                  47            10000000            10000000                 577                 601                 548
                  49                1000            10000000                 261                 243                 162
                  49             1000000            10000000                 323                 413                 319
                  49            10000000            10000000                 584                 610                 551
                  63                1000            10000000                 269                 235                 161
                  63             1000000            10000000                 396                 369                 313
                  63            10000000            10000000                 592                 596                 559

(Java 1.6.0_15-b03, default settings on a Dell Precision M6500: Intel i7 Q 820 @ 1.73GHz, 8 MB level 2 cache, dual-channel PC 1333 RAM, running Ubuntu Karmic)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Great progress Toke!

I guess we should do Mutable since you're so far along already :)

But, now that we have getMutable, can we make the concrete impls package private? Javadocs for Mutable.set should note that the size is fixed once you allocate it. We have no way to save a Mutable... should we add that? If so, we may want to rename Writer -> WriteOnceWriter. This way consumers can also get a Mutable, do random writes, then save, if the "write once" model isn't a good fit.

Maybe we should just merge Mutable & Reader, then? (LongStore? LongArray? PackedLongs?)

We should state clearly that these are all unsigned ints storage.

Maybe rename PackedDirectInt to PackedDirect32 (and Short to 16, Byte to 8). Because... while it is using a direct int[] under the hood, it's really using all 32 bits for the full positive int range. So PackedDirect32 can be used even for pos ints that would overflow a normal java "int". (Though, for long we obviously can't use that 64th bit for positive ints...).
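Ie, something along these lines (just a sketch of the idea, not the code in the patch):

    final int[] blocks = new int[valueCount];    // backing array

    public void set(final int index, final long value) {
      blocks[index] = (int) value;               // keep only the low 32 bits
    }

    public long get(final int index) {
      return blocks[index] & 0xFFFFFFFFL;        // mask prevents sign extension
    }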

The @see in the new IndexInput.readShort is wrong (referencing writeInt).

Can you add @lucene.internal to the javadocs?

Seems like once we stomp the bugs, beef up the tests, and merge PRIORITY and BLOCK_PREFERENCE (into maybe STORAGE?) for the public API, we are nearly done? Thanks Toke!

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I am sorry, but personal issues sapped my time and energy this week, so Lucene got bumped down my priority-list. I am going to code4lib next week and I'll try to get some hacking done on the plane from Denmark to the USA, but that depends on whether or not there is a power socket near my seat. If I don't upload a patch late Monday, it will be early March before I get it done.

But, now that we have getMutable, can we make the concrete impls package private? Javadocs for Mutable.set should note that the size is fixed once you allocate it.

Agreed on both.

We have no way to save a Mutable... should we add that?

I don't know enough about persistence in Lucene to make that call. Since the writer is tied to Lucene, it would not work for general purposes, so making a writer for Mutables only seems to make sense if the user uses it to build index structures?

Maybe we should just merge Mutable & Reader, then? (LongStore? LongArray? PackedLongs?)

I don't understand that one? You made a compelling argument for returning immutables to readers earlier (problems with concurrency and having all back ends support writes).

As for the name... I don't know. None of them sounds right, but I have no other suggestion.

We should state clearly that these are all unsigned ints storage.

Maybe rename PackedDirectInt to PackedDirect32 (and Short to 16, Byte to 8). Because... while it is using a direct int[] under the hood, it's really using all 32 bits for the full positive int range.

Good point. The rest of your suggestions are also very valid.

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I've renamed most of the classes to short form, as the "Packed"-prefix was not that descriptive, and fixed some bugs. Still pending are the mutable writer and a bug in persistence for aligned64. Good news (for Lucene at least) is that an airplane-blocking snowdrift means that I have time this week for continued hacking.

But, now that we have getMutable, can we make the concrete impls package private? Javadocs for Mutable.set should note that the size is fixed once you allocate it.

The implementations are now package private, but I only put the note about fixed size on the getMutable-method. There's nothing wrong with creating a custom auto-growing Mutable.
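Something along these lines would do (purely illustrative — GrowingMutable is not part of the patch, and it assumes the factory exposes getMutable(valueCount, bitsPerValue)):

    class GrowingMutable implements Mutable {
      private Mutable inner;

      GrowingMutable(final Mutable initial) { inner = initial; }

      public void set(final int index, final long value) {
        if (index >= inner.size()) {             // copy into a larger structure
          final Mutable larger = PackedInts.getMutable(
              Math.max(index + 1, inner.size() * 2), inner.getBitsPerValue());
          for (int i = 0 ; i < inner.size() ; i++) {
            larger.set(i, inner.get(i));
          }
          inner = larger;
        }
        inner.set(index, value);
      }

      public long get(final int index) { return inner.get(index); }
      public int size() { return inner.size(); }
      public int getBitsPerValue() { return inner.getBitsPerValue(); }
    }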

We should state clearly that these are all unsigned ints storage.

Done.

Maybe rename PackedDirectInt to PackedDirect32 (and Short to 16, Byte to 8).

Done (Direct8, Direct16, Direct32 and Direct64).

The @see in the new IndexInput.readShort is wrong (referencing writeInt).

Fixed.

Can you add @lucene.internal to the javadocs?

Should this also be applied to package private classes? Marking those as internal seems redundant.

Seems like once we stomp the bugs, beef up the tests, and merge PRIORITY and BLOCK_PREFERENCE (into maybe STORAGE?) for the public API, we are nearly done?

I've removed BLOCK_PREFERENCE from the API. It's still used internally, mainly to do controlled testing. Tests are beefed up (and currently fail for aligned, so clearly the beefing worked).

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Now we're getting somewhere. I finally squashed the persistence bug and the tests have been turned up another notch. Everything seems to run as it should. Pending issues, as I see them:

The last one is interesting. The code for getting a value from aligned uses division and a single RAM-request:

  public long get(final int index) {
    final int blockPos = index / valuesPerBlock;                              // block holding the value
    final int bitPos = (index - (blockPos * valuesPerBlock)) * bitsPerValue;  // offset inside the block
    return (blocks[blockPos] >>> shifts[bitPos]) & readMask;                  // single array access
  }

where the code for packed uses shift and two RAM-requests:

    final long majorBitPos = index * bitsPerValue;
    final int elementPos = (int)(majorBitPos >>> BLOCK_BITS); // / BLOCK_SIZE
    final int bitPos =     (int)(majorBitPos & MOD_MASK);     // % BLOCK_SIZE

    final int base = bitPos * FAC_BITPOS;

    // a value may straddle two blocks, so both neighbours are read and combined
    return ((blocks[elementPos] << shifts[base]) >>> shifts[base+1]) |
            ((blocks[elementPos+1] >>> shifts[base+2]) & readMasks[bitPos]);

I have done some tests (see the TODO-file in the attached patch) and on 64 bit machines, the difference in access speed for aligned vs. packed is not that great and not always in favor of aligned, probably because some space is wasted and the RAM-cache is not as well utilized. If this is also the case for 32 bit machines, I vote for removing aligned and only using packed with the special-case optimizations direct8, direct16, direct32 and direct64. This would also mean that there is only one persistent format.

java -cp lucene-core-3.1-dev.jar org.apache.lucene.util.packed.PackedIntsPerformance

Runs through the performance tests and delivers a simple report, so it should be very easy to test on different platforms. It only measures access speed.

I consider this patch ready for review and will concentrate on other matters until I hear more.

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I couldn't help making a tiny tweak to the performance test so that it outputs execution time means for the different implementations. I have attached measurements from 5 different 64 bit machines. Looking at the means, I observe the following:

The direct implementations outperform packed and aligned for all sane cases (using direct8 to hold only 1 bit/value is clearly a bad idea). No surprise there.

Caveat: The tests were run without any other significantly resource-heavy processes disturbing them. This means that there was no fighting over the CPU cache.

Major caveat: Tests are needed on processors other than 64 bit Intel.

It would be great if someone could figure out how to make an aligned getter without using division, as that is surely the thing that hampers aligned performance.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Great progress! I think this is very close.

Airplane blocking snow drifts!? Where on earth are you anyway?

Can you add @lucene.internal to the javadocs?

Should this also be applied to package private classes? Marking those as internal seems redundant.

Yeah I agree package private APIs don't need the @lucene.internal...

It's very interesting that align is never a win – I think in that case removing it makes sense? It'll be a nice simplification.

I think we don't need to make a MutableWriter, at least before committing? Nobody needs it now... (I think?).

Other small things:

  • Can you use @lucene.internal instead of the NOTE that I had put on the classes?
  • We lost "final" in the RamUsageEstimator constants
  • Did we ever test performance of the specialized (generated) decoders using switch statements?

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Airplane blocking snow drifts!? Where on earth are you anyway?

In Denmark. The guy responsible for clearing the runway did indeed clear the runway. He just forgot that the plane needs to taxi onto the runway in the first place. That made us miss our connecting flight.

It's very interesting that align is never a win – I think in that case removing it makes sense? It'll be a nice simplification.

Well, it practically never wins on the machines I tested and never wins with my implementation.

Can you use @lucene.internal instead of the NOTE that I had put on the classes?

Done... I think. I'm not very good at this part, so if someone else wants to do some cleanup in JavaDoc and such, they are very welcome by me.

We lost "final" in the RamUsageEstimator constants

Strange. Oh well, fixed.

Did we ever test performance of the specialized (generated) decoders using switch statements?

I just did a quick hack in order to measure performance and I was very surprised that the generated switch-based implementations perform so well. It's nearly on par with packed most of the time and exceeds it in some cases. I only tested on 3 machines though. The hack is in the LUCENE-1990-te20100226c.patch and is called when the performance test is executed.

Attachment generated_performance-te20100226.txt contains measurements where the py-generated code is tested together with the other implementations.

Note to self: a switch is not performance-equivalent to a series of if-elses, at least not when the cases have no omissions.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Airplane blocking snow drifts!? Where on earth are you anyway?

In Denmark. The guy responsible for clearing the runway did indeed clear the runway. He just forgot that the plane needs to taxi into the runway in the first place. That made us miss our connecting flight.

Good grief!

It's very interesting that align is never a win - I think in that case removing it makes sense? It'll be a nice simplification.

Well, it practically never wins on the machines I tested and never wins with my implementation.

I think we should remove it...

Did we ever test performance of the specialized (generated) decoders using switch statements?

I just did a quick hack in order to measure performance and I was very surprised that the generated switch-based implementations perform so well. It's nearly on par with packed most of the time and exceeds it in some cases. I only tested on 3 machines though. The hack is in the LUCENE-1990-te20100226c.patch and is called when the performance test is executed.

Thanks for testing this! It is interesting.

I ran the perf test on a CentOS 5.4 machine, java 1.6.0_17-b04 64 bit server, Intel core 2 duo E8400 (3 GHz) – attached as perf-mkm-20100227.txt. It also shows the switch impl close, though always a bit behind.

Seems like we should just stick with the non-gen'd packed impl?

Note to self: a switch is not performance-equivalent to a series of if-elses, at least not when the cases have no omissions.

Right, if the switch cases are compact, it should compile into a fast jump table (though it may still do an unnecessary bounds check).
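For the power-of-two sizes such a specialized getter is basically this (a sketch only, not the generated code from the patch; non-power-of-two sizes need the two-block logic shown earlier, and blocks is the backing long[]):

    public long get(final int index) {
      switch (bitsPerValue) {  // compact cases can compile to a single jump table
        case 1:  return (blocks[index >>> 6] >>> (index & 63)) & 0x1L;
        case 2:  return (blocks[index >>> 5] >>> ((index & 31) << 1)) & 0x3L;
        case 4:  return (blocks[index >>> 4] >>> ((index & 15) << 2)) & 0xFL;
        case 8:  return (blocks[index >>> 3] >>> ((index & 7) << 3)) & 0xFFL;
        default: throw new IllegalArgumentException("unsupported bits/value: " + bitsPerValue);
      }
    }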

I think, once we removed aligned, this is ready to commit? I think we should land this on flex branch? (It's using CodecUtil, BytesRef – I'll merge them when I commit). Then I can cutover the terms index to use packed ints.

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

I've tested on two 32 bit Windows machines: an Intel T2400 (32 bit only) running XP and an Athlon X2 4850e (64 bit capable) running 32 bit XP. The results can be seen in attachment performance-20100301.txt. Something curious happens at high (32+) bits/value on the T2400, where aligned overtakes packed. However, the overall picture is still that aligned only wins in a few special cases, so now I'll be happy to remove it from the patch. As a note, generated is also slower than packed on the AMD processor, although not by as much as on Intel.

I have removed all traces of aligned from PackedInts, but kept the classes in the patch, in case someone finds a faster way to handle aligned. PackedIntsPerformance still includes the generated switch-implementation as well as Aligned32 and Aligned64. It should be possible to apply the patch without Aligned32, Aligned64, AlignedWriter and PackedIntsPerformance.

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Some thoughts on avoiding the generic division by experimenting with reciprocal multiplication: For aligned, the sane numbers of values/block are [3, 5, 6, 7, 8, 9, 10, 16, 21, 32, 64]. I tried testing indexes from 0 to Integer.MAX_VALUE with these divisors and reciprocal multiplication. It worked perfectly for all divisors except [5, 7, 9, 10, 21]. Unfortunately it already fails for divisor 21 at index 252645140, which makes it useless as a full replacement. If one were so inclined, it would be possible to select the aligned implementation based on valueCount, with fallback to the "slow" version. The gain from using fast division seems quite substantial, as it makes aligned 14-40% faster than packed (note: just tested on a single machine). However, re-introducing aligned with four different implementations (Aligned32, Aligned32Fast, Aligned64, Aligned64Fast) is rather daunting, and it would make the selection code really messy.
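For one of the well-behaved divisors, the trick boils down to this (my test code, simplified; 0xAAAAAAABL is ceil(2^33 / 3)):

    // Exact for every non-negative int: index / 3 == (index * ceil(2^33/3)) >>> 33.
    // For divisors like 5, 7, 9, 10 and 21 the analogous constant drifts off by
    // one for some indexes, which is the rounding problem mentioned below.
    static int divByThree(final int index) {
      return (int) ((index * 0xAAAAAAABL) >>> 33);
    }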

I can see that there are well-known tricks to get around the rounding errors. Some are described at http://www.cs.uiowa.edu/~jones/bcd/divide.html#fixed . I don't know if these extra tricks would negate the 14-40% speed gain though. Since I would like to get the patch out of the door, I vote for keeping aligned disabled and just noting that more bit fiddling might make it attractive at some point.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks great Toke – a few small things:

  • I think we shouldn't add Aligned*.java to svn? It'll just add unused bits to the JAR, and, we can always fallback to this issue to pull them in at a future time?
  • Can you resolve the remaining nocommits? EG (since we are unsigned) we can't get the 64 bit case working. I don't think we should rename to UnsignedXXX, nor, support minValue at this point, and remove the ComparableBytesRef, and I'll merge BytesRef into flex's when I commit.

I can take these too – I think it's ready to commit on flex after this. Thanks!

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

We should also add the @lucene.internal javadoc comments everywhere instead of the big NOTE. Why does one class have an all-uppercase class name?

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Michael McCandless:

  • I think we shouldn't add Aligned*.java to svn? It'll just add unused bits to the JAR, and, we can always fallback to this issue to pull them in at a future time?

I agree. At the current state, Aligned is just dead weight.

This also means that the performance tester won't be part of the commit though. I can quickly make a performance tester that does not use aligned, if it is preferable to keep performance testing.

  • Can you resolve the remaining nocommits? EG (since we are unsigned) we can't get the 64 bit case working. I don't think we should rename to UnsignedXXX, nor, support minValue at this point, and remove the ComparableBytesRef, and I'll merge BytesRef into flex's when I commit.

I can take these too - I think it's ready to commit on flex after this

It will help a lot if you take care of these issues, thanks.

asfimport commented 14 years ago

Toke Eskildsen (@tokee) (migrated from JIRA)

Uwe Schindler:

We should also add the @lucene.internal javadoc comments everywhere instead of the big NOTE. Why has one class a full-uppercase class name?

Are you looking at patch LUCENE-1990-te20100301.patch? I don't see any NOTE and no full-uppercase class name?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It will help a lot if you take care of these issues, thanks.

OK will do. I'll commit soon to flex... thanks Toke!