PFOR implementation [LUCENE-1410]

asfimport commented 16 years ago

Implementation of Patched Frame of Reference.

Migrated from LUCENE-1410 by Paul Elschot, 1 vote, resolved Aug 19 2012 Attachments: autogen.tgz, for-summary.txt, LUCENE-1410.patch (versions: 4), LUCENE-1410b.patch, LUCENE-1410c.patch, LUCENE-1410-codecs.tar.bz2, LUCENE-1410d.patch, LUCENE-1410e.patch, TermQueryTests.tgz, TestPFor2.java (versions: 3)

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

20081001: Initial implementation of PFOR.

The target package is o.a.l.util.pfor, mostly for testing convenience, even though the code has no dependencies on Lucene.

To try it out please use jvmarg -server during the test, see also TestPFor.java on how to do this. The command ant -Dtestcase=TestPFor test-core should start the test after applying the patch. The test will take about 25 seconds to run.

There is some optimization for decompression included. On my machine with a 1.6.0_03-b05 Sun jvm, the decompression performance for 1-7 bits frame size varies between about 60M ints/sec unoptimized and 200M ints/sec optimized, as reported by the test. This appears adequate for practical use, but it should be noted that this performance is from CPU cache to CPU cache.

The implementation still needs quite a bit of work. I'm posting it now because I'd like feedback on the interface for compression and decompression. Typical intended usage is present in TestPFor.java.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Awesome, I'll have a look!

The TestPFor.java didn't make it into the patch (it just references a link).

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

Sorry about the mistake in the patch, a correction will follow shortly.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

A corrected b.patch including TestPFor.java

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

Another detail about the decompression performance as reported above: these are all cases without exceptions, so an expected performance degradation for patching exceptions is not included in the performance results. Patching exceptions is provided for in the code, but that code is not yet optimized.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Paul, I'm eager to test pfor on real Lucene vInts.

So I created a simple test program (attached TestPFor2.java). Run it like this:

  Usage: java org.apache.lucene.util.pfor.TestPFor2 <indexDirName> <vIntFileNameIn> <pForFileNameOut>

  Eg: java org.apache.lucene.util.pfor.TestPFor2 /lucene/index _l.prx _l.prx.pfor

where indexDirName is the directory of a Lucene index, vIntFileNameIn should be a file that just has a bunch of vInts (Lucene's *.frq and *.prx fit this) and pForFileNameOut is a file it writes with blocks encoded in PFor.

It first encodes the vInts from vIntFileNameIn into pfor blocks written to pForFileNameOut. Then it measures decode time of reading all vInts from vIntFileNameIn vs decode time of reading all pfor blocks. It runs each round 5 times.

The test has certain serious flaws:

Can you add a method that figures out the right frame size to use for a given block of ints (so that \~90% of the ints are < N bits)?
I'm using fixed 6-bit frame size. Can you add bigger bit sizes to your pfor decompress?

With these fixes the test would be more fair to pfor.

In the PFor file that I write, I simply write an int (# bytes) followed by the bytes, for each block. Then when reading these blocks I read the #bytes, then read into the byte array backing the intArray passed to the PFor for decompress. In a real integration I think writing the int #bytes should be unecessary (is pfor self puncuating?).

This is inefficient because in doing this for real we should avoid the double-copy of the byte[]. In fact, we might push it even lower (under IndexInput, eg, create a IntBlockIndexInput) to possibly avoid the copy into byte[] in BufferedIndexInput by maybe using direct ByteBuffers from the OS. So even once we fix the top two issues above, the results of a "real" integration should be still faster.

I ran this on a 622 MB prx file from a Wikipedia index, and even with the above 2 limitations it's still a good amount faster:

java org.apache.lucene.util.pfor.TestPFor2 /lucene/big _l.prx _l.prx.pfor

encode _l.prx to _l.prx.pfor...
442979072 vints; 888027375 bytes compressed vs orig size 651670377

decompress using pfor:
4198 msec to decode 442979072 vInts (105521 vInts/msec)
4332 msec to decode 442979072 vInts (102257 vInts/msec)
4165 msec to decode 442979072 vInts (106357 vInts/msec)
4122 msec to decode 442979072 vInts (107467 vInts/msec)
4061 msec to decode 442979072 vInts (109081 vInts/msec)

decompress using readVInt:
7315 msec to read 442979104 vInts (60557 vInts/msec)
7390 msec to read 442979104 vInts (59943 vInts/msec)
5816 msec to read 442979104 vInts (76165 vInts/msec)
5937 msec to read 442979104 vInts (74613 vInts/msec)
5970 msec to read 442979104 vInts (74200 vInts/msec)

It's really weird how the time gets suddenly faster during readVInt. It's very repeatable. on another machine I see it get suddenly slower starting at the same (3rd) round. Adding -server and -Xbatch doesn't change this behavior. This is with (build 1.6.0_10-rc-b28) on Linux and (build 1.6.0_05-b13-120) on Mac OS X 10.5.5.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

Q: Can you add a method that figures out the right frame size to use for a given block of ints (so that \~90% of the ints are <N bits)? A: PFor.getNumFrameBits() does this for a given int[] at offset and size. Choosing the right size is a dilemma: too small will need too much header decoding and too large may result in using too large number of frame bits, i.e. too large compressed size.

Q: I'm using fixed 6-bit frame size. Can you add bigger bit sizes to your pfor decompress? A: It should work for 1<=numFrameBits<=30.

Q: is pfor self punctuating? A: PFor.bufferByteSize() returns the number of bytes used in the compressed buffer, see also the javadocs. For practical use, the start of each compressed block must be known, either from somewhere else, or from the size of the previously encoded block. The number of compressed integers is encoded in the header, but I'm not sure whether I made that info available before decompression to allow allocation of an int[] that is large enough to hold the decompressed data.

>: It's really weird how the time gets suddenly faster during readVInt. A: it's probably the JIT jumping in. That's why I preferred to test in 3 1-second loops and reporting performance each second. The 2nd second always has better performance.

It's nice to see that PFor is faster than VInt, I had not tested that yet. Which block size did you use for PFor? Never mind, I'll take a look at the code of TestPFor2.

Btw. after decompressing, the things are ints not vInts.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

Q: I'm using fixed 6-bit frame size. Can you add bigger bit sizes to your pfor decompress? A: It should work for 1<=numFrameBits<=30.

That answer was too short. There is some fallback code (decodeAnyFrame) that will decompress for any frame or reference. The patch contains unrolled versions of that for up to 7 bits iirc. I'll add higher numbers of bits later, the code is straightforward and it could actually be generated. Btw. on my machine the unrolled 9 bit version is actually a bit slower than the loop, I don't know why, maybe there are too many buffer references (9) in the loop for the jit to cope with.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

A: PFor.getNumFrameBits() does this for a given int[] at offset and size.

Super, I missed that – I'll use it.

Btw. after decompressing, the things are ints not vInts.

Oh yeah, I'll fix the prints in my test.

There is some fallback code (decodeAnyFrame) that will decompress for any frame or reference

Right I was wanting the unrolled version to see the fastest result we can get for pfor, but your next observation is spooky so maybe this isn't really going to help our test:

Btw. on my machine the unrolled 9 bit version is actually a bit slower than the loop, I don't know why, maybe there are too many buffer references (9) in the loop for the jit to cope with.

We should look at the asm that's produced?

it's probably the JIT jumping in.

But strangely with -Xbatch I still see the effect. And even stranger, on another machine (Mac OS X) it gets slower when the JIT jumps in. I'm spooked.

Which block size did you use for PFor?

I used 128 but I'll try different sizes.

For practical use, the start of each compressed block must be known, either from somewhere else, or from the size of the previously encoded block.

But can I also get the "bytes consumed" when I ask decompress() to run?

When we really integrate, things like term infos and skipping data will know how to jump to the start of blocks. But for raw sequential scanning, if the header is self-punctuating we would not need to store the block size between each block.

asfimport commented 16 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

For people not intimately familiar with PFOR (like me), I found the following to be helpful: http://cis.poly.edu/cs912/indexcomp.pdf

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

There's also this recent comparison of index compression approaches: http://www2008.org/papers/pdf/p387-zhangA.pdf and http://homepages.cwi.nl/\~heman/downloads/msthesis.pdf.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

Q: We should look at the asm that's produced? A: Maybe. But would that be java bytecodes or x86 or powerpc? I'd prefer to wait with that. There are some nice machine instrns to this on the various architectures, but that would require to use native code.

Q: But can I also get the "bytes consumed" when I ask decompress() to run? A; Yes, but only after decompression or after compression. The number of exceptions is not explicitly coded in the header, so the size to encode the exceptions is not known beforehand. That could be changed, but than the header will get bigger. (For compression, it's possible to do a run without buffer first.)

Block size 128 should be fine, one could also try 64 and 32.

Thanks for posting the links here, they are the ones I used to code this up. (Does that count as homework? :) ) I did not put the links in the code because of possible link rot. The article titles and authors are in the javadocs. Currently the easy way to find the links is via google scholar.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New TestPFor2 attached. I changed vInt -> int in the prints and now compute "best" bit size per-block using getNumFrameBits() and use that per block.

I took a big frq file and computed %tg for each #bits:

bits	count	pct
1	0	0.0
2	24328	0.9
3	94689	3.7
4	213887	8.4
5	277510	10.8
6	284905	11.1
7	193081	7.5
8	262857	10.3
9	260460	10.2
10	212046	8.3
11	162872	6.4
12	109801	4.3
13	77366	3.0
14	56620	2.2
15	34294	1.3
16	31000	1.2
17	28803	1.1
18	21617	0.8
19	22317	0.9
20	30326	1.2
21	162663	6.4

And for prx:

bits	count	pct
1	23034	0.7
2	12637	0.4
3	49286	1.4
4	56344	1.6
5	69385	2.0
6	81964	2.4
7	108847	3.1
8	179296	5.2
9	428787	12.4
10	873828	25.2
11	957913	27.7
12	534426	15.4
13	81856	2.4
14	2436	0.1
15	474	0.0
16	200	0.0
17	43	0.0
18	17	0.0
19	1	0.0

It's interesting that prx is so much more tightly clustered than frq.

New results for decoding vInt vs pfor, for frq file:

327864576 ints; 431137397 bytes compressed vs orig size 447077047

decompress using pfor:
2514 msec to decode 327864576 ints (130415 ints/msec)
2171 msec to decode 327864576 ints (151020 ints/msec)
2137 msec to decode 327864576 ints (153422 ints/msec)
2148 msec to decode 327864576 ints (152637 ints/msec)
2067 msec to decode 327864576 ints (158618 ints/msec)

decompress using readVInt:
4549 msec to read 327864691 ints (72074 ints/msec)
4421 msec to read 327864691 ints (74160 ints/msec)
3240 msec to read 327864691 ints (101192 ints/msec)
3199 msec to read 327864691 ints (102489 ints/msec)
3323 msec to read 327864691 ints (98665 ints/msec)

PFor is 54.766% faster

and prx file:

encode _l.prx to _l.prx.pfor...
442979072 ints; 628580878 bytes compressed vs orig size 651670377
decompress using pfor:
6385 msec to decode 442979072 ints (69378 ints/msec)
5904 msec to decode 442979072 ints (75030 ints/msec)
5796 msec to decode 442979072 ints (76428 ints/msec)
5807 msec to decode 442979072 ints (76283 ints/msec)
5803 msec to decode 442979072 ints (76336 ints/msec)

decompress using readVInt:
6893 msec to read 442979104 ints (64265 ints/msec)
6759 msec to read 442979104 ints (65539 ints/msec)
5304 msec to read 442979104 ints (83517 ints/msec)
5275 msec to read 442979104 ints (83977 ints/msec)
5792 msec to read 442979104 ints (76481 ints/msec)

PFor is 8.989% slower

Some comments:

In both cases, the pfor file a bit smaller than the original Lucene file. And, it's unfair because I inject the 4 extra bytes per block (which I think won't be needed "for real").
For frq file, pfor is quite a bit faster (54.8%) but for prx file it's slower (9.0%). Maybe it's because prx file has much higher occurrence of bigger bit sizes (where we don't have unrolled version)? It's odd; maybe I'm doing something wrong. With a fixed 6 bits it is quite a bit faster (I retested to verify!) which is also weird because there would be many exceptions. Exception processing ought to be slower than non-exception processing!

Stepping back a bit: I wonder what %tg of the time in a "typical" Lucene search is spent decoding vInts? That would bound how much improvement we could ever expect from this optimization during searching.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Q: We should look at the asm that's produced? A: Maybe. But would that be java bytecodes or x86 or powerpc? I'd prefer to wait with that. There are some nice machine instrns to this on the various architectures, but that would require to use native code.

I was thinking local CPU's native asm, so that we could at least see if the benefits of pfor (no hard-for-cpu-to-predict if statement inside inner loop) "survive" through the Java stack. I'll try to look.

Q: But can I also get the "bytes consumed" when I ask decompress() to run? A; Yes, but only after decompression or after compression. The number of exceptions is not explicitly coded in the header, so the size to encode the exceptions is not known beforehand. That could be changed, but than the header will get bigger.

OK so it is self-punctuating, so, we wouldn't need to encode block size in bytes into the file.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

The number of bits is not really informative, it would be better to have a distribution of the result of getNumFrameBits(). Then we can see for which nrs of bits loop unrolling might be tried.

As for decompression speed, please remember that the patching code (that decodes the exceptions into the output) has not yet been optimized at all.

The lucene .frq file contains the docids as deltas and the frequencies but with a special encoding in case the frequency is one. I'd rather try compressing the real delta docids and the real frequencies than the encoded versions.

The .prx file should be useable as it is, and from the results reported in the articles I would expect a real performance advantage for PFor for that. Michael, could you post a distribution of the number of frame bits for the .prx file? I'd like to know a reasonable maximum for unrolling the corresponding loops.

>: maybe I'm doing something wrong I don't think so, the code is still young. Try running TestPFor and have a look at the output near the end for the case of 6 frame bits. That should show the unrolled decoding speed for the case without exceptions. Then compare that to the cases with lower and higher nrs of frame bits.

>: Stepping back a bit: I wonder what %tg of the time in a "typical" Lucene search is spent decoding vInts? That would bound how much improvement we could ever expect from this optimization during searching.

A: there is also the influence of the reduction of data to be fetched (via various caches) from the index. As reported in the articles, PFor strikes a nice optimum between decompression speed and fetching speed from (fast) disks.

>: I was thinking local CPU's native asm. A: I'd try a C version first. Iirc gcc has a decent optimizer for bit ops, but it's been a while for me that I used C.

For the record, to show the variation in decompression speeds for different numbers of frame bits without exceptions, here is my current output from TestPFor:

test901PerfFor1Decompress starting
Compressed 128 bytes into 8, ratio 0.0625
test901PerfFor1Decompress 0 numFrameBits 1 decompressed 104857600 in 1021 msecs, 102700 ints/msec, (25 iters).
test901PerfFor1Decompress 1 numFrameBits 1 decompressed 171966464 in 1020 msecs, 168594 ints/msec, (41 iters).
test901PerfFor1Decompress 2 numFrameBits 1 decompressed 171966464 in 1012 msecs, 169927 ints/msec, (41 iters).

test902PerfFor2Decompress starting
Compressed 128 bytes into 12, ratio 0.09375
test902PerfFor2Decompress 0 numFrameBits 2 decompressed 130023424 in 1017 msecs, 127849 ints/msec, (31 iters).
test902PerfFor2Decompress 1 numFrameBits 2 decompressed 155189248 in 1000 msecs, 155189 ints/msec, (37 iters).
test902PerfFor2Decompress 2 numFrameBits 2 decompressed 159383552 in 1022 msecs, 155952 ints/msec, (38 iters).

test903PerfFor3Decompress starting
Compressed 128 bytes into 16, ratio 0.125
test903PerfFor3Decompress 0 numFrameBits 3 decompressed 188743680 in 1016 msecs, 185771 ints/msec, (45 iters).
test903PerfFor3Decompress 1 numFrameBits 3 decompressed 205520896 in 1018 msecs, 201886 ints/msec, (49 iters).
test903PerfFor3Decompress 2 numFrameBits 3 decompressed 201326592 in 1026 msecs, 196224 ints/msec, (48 iters).

test904PerfFor4Decompress starting
Compressed 128 bytes into 20, ratio 0.15625
test904PerfFor4Decompress 0 numFrameBits 4 decompressed 146800640 in 1014 msecs, 144773 ints/msec, (35 iters).
test904PerfFor4Decompress 1 numFrameBits 4 decompressed 159383552 in 1007 msecs, 158275 ints/msec, (38 iters).
test904PerfFor4Decompress 2 numFrameBits 4 decompressed 159383552 in 1011 msecs, 157649 ints/msec, (38 iters).

test905PerfFor5Decompress starting
Compressed 128 bytes into 24, ratio 0.1875
test905PerfFor5Decompress 0 numFrameBits 5 decompressed 159383552 in 1010 msecs, 157805 ints/msec, (38 iters).
test905PerfFor5Decompress 1 numFrameBits 5 decompressed 176160768 in 1009 msecs, 174589 ints/msec, (42 iters).
test905PerfFor5Decompress 2 numFrameBits 5 decompressed 176160768 in 1004 msecs, 175458 ints/msec, (42 iters).

test906PerfFor6Decompress starting
Compressed 128 bytes into 28, ratio 0.21875
test906PerfFor6Decompress 0 numFrameBits 6 decompressed 117440512 in 1001 msecs, 117323 ints/msec, (28 iters).
test906PerfFor6Decompress 1 numFrameBits 6 decompressed 125829120 in 1002 msecs, 125577 ints/msec, (30 iters).
test906PerfFor6Decompress 2 numFrameBits 6 decompressed 125829120 in 1004 msecs, 125327 ints/msec, (30 iters).

test907PerfFor7Decompress starting
Compressed 128 bytes into 32, ratio 0.25
test907PerfFor7Decompress 0 numFrameBits 7 decompressed 125829120 in 1016 msecs, 123847 ints/msec, (30 iters).
test907PerfFor7Decompress 1 numFrameBits 7 decompressed 150994944 in 1015 msecs, 148763 ints/msec, (36 iters).
test907PerfFor7Decompress 2 numFrameBits 7 decompressed 150994944 in 1012 msecs, 149204 ints/msec, (36 iters).

test908PerfFor8Decompress starting
Compressed 124 bytes into 35, ratio 0.28225806
test908PerfFor8Decompress 0 numFrameBits 8 decompressed 85327872 in 1015 msecs, 84066 ints/msec, (21 iters).
test908PerfFor8Decompress 1 numFrameBits 8 decompressed 121896960 in 1021 msecs, 119389 ints/msec, (30 iters).
test908PerfFor8Decompress 2 numFrameBits 8 decompressed 121896960 in 1022 msecs, 119272 ints/msec, (30 iters).

test909PerfFor9Decompress starting
Compressed 124 bytes into 39, ratio 0.31451613
test909PerfFor9Decompress 0 numFrameBits 9 decompressed 52822016 in 1016 msecs, 51990 ints/msec, (13 iters).
test909PerfFor9Decompress 1 numFrameBits 9 decompressed 56885248 in 1028 msecs, 55335 ints/msec, (14 iters).
test909PerfFor9Decompress 2 numFrameBits 9 decompressed 56885248 in 1029 msecs, 55282 ints/msec, (14 iters).

The loop for 9 bits is not unrolled, unrolling makes it a tiny little bit (55->54 Mint/sec) slower.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

I wrote: The number of bits is not really informative, it would be better to have a distribution of the result of getNumFrameBits(). Then we can see for which nrs of bits loop unrolling might be tried.

But I misread what was meant by the number of bits, actually the number of frame bits that I requested is already listed there, so I'll have a go at unrolling for larger numbers of frame bits, especially for the .prx data.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

As for decompression speed, please remember that the patching code (that decodes the exceptions into the output) has not yet been optimized at all.

But this is what I find so weird: for prx, if I fix the encoding at 6 bits, which generates a zillion exceptions, we are \~31% faster than decoding vInts, and the pfor file size is much bigger (847 MB vs 621 MB) than vInt. But if instead I use the bit size as returned by getNumFrameBits(), which has far fewer exceptions, we are 9.0% slower and file size is a bit smaller than vInt. Exception processing seems to be very fast, or, it's the non-unrolled (ForDecompress.decodeAnyFrame) that's slow which could very well be.

The lucene .frq file contains the docids as deltas and the frequencies but with a special encoding in case the frequency is one. I'd rather try compressing the real delta docids and the real frequencies than the encoded versions.

I'll try that. I bet if we had two separate streams (one for the docIDs and one for the corresponding freqs) we'd get even better pFor compression. If we did that "for real" in Lucene that'd also make it fast to use a normally-indexed field for pure boolean searching (you wouldn't have to index 2 different fields as you do today to gain the performance at search time). In fact, for AND queries on a normal Lucene index this should also result in faster searching since when searching for the common docIDs you at first don't need the freq data.

Marvin has been doing some performance testing recently and seems to have concluded that keeping prx and frq as separate files (like Lucene does today but KinoSearch does not) gives better performance. Pushing that that same trend further, I think it may very well make sense to separate docID and frq as well.

A: there is also the influence of the reduction of data to be fetched (via various caches) from the index. As reported in the articles, PFor strikes a nice optimum between decompression speed and fetching speed from (fast) disks.

True, but we are seeing just a bit of compression vs Lucene's current encoding. I think splitting out frq from docID may show better compression.

>: I was thinking local CPU's native asm. A: I'd try a C version first. Iirc gcc has a decent optimizer for bit ops, but it's been a while for me that I used C.

Well... that would just make me depressed (if from Javaland the CPU level benefits of pFor don't really "survive", but from C-land they do) ;) But yes I agree.

For the record, to show the variation in decompression speeds for different numbers of frame bits without exceptions, here is my current output from TestPFor:

I have similar results (up to 7 bits – can you post your new TestPFor.java?).

The even-byte sizes (16, 24) have very sizable leads over the others. The 9-bit size is fantastically slow; it's insane that unrolling it isn't helping. Seems like we will need to peek at asm to understand what's going on at the "bare metal" level....

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

I'm working on the 10 right now, but I'm hitting another bug, it will take a while.

The performance of TestPFor2 should be better after running TestPFor for all bit sizes, this could be a good reason to combine them.

The correctness tests in TestPFor are not really exhaustive, so I'd like to have a simple test for correctness in TestPFor2, for example a running (long) sum of all decompressed ints that should equal before and after.

To avoid having to peek at the asm level, there is also the possibility to use a slightly higher number of frame bits when we know one number of bits will be slow to decompress.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached new TestPfor2.java:

If you run it on _x.frq, it will split that into _x.frq.frq and _x.frq.doc so we can separately test frq vs docs
Added checksum.
Added zeroing of reused IntBuffer before calling PFor.compress (seems to be necessary? Else I trip an assert inside PFor).

Paul, indeed I get a different checksum for vint vs pfor decoding.

I think the bug is somewhere in pfor, I'm guessing in the exception logic, because the difference I see is suddenly pfor returns 0 when it should have returned a large int relative to the other ints nearby.

Maybe this is why exception processing looked so much faster :)

I'll hold off posting more perf results until we can resolve that.

To see the checksum run it with asserts, eg like this:

java -ea oal.util.pfor.TestPFor2 /path/to/index _x.prx

It then prints out SUM lines after each iteration.

If you set DEBUG = true, it'll print the first 1000 values and then search for "v=0".

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

(Attached autogen.tgz).

I started peeking at the ASM generated by different tweaks to the bit decoder methods.

It's rather depressing: small things, like using relative not absolute getInt from ByteBuffer, declaring things final, reading into array first instead of getInt per int, make sizable differences in the resulting asm and decode speed.

To try out these various changes to the Java sources, I created a Python script to generate the 31 different decode methods. I then made a new standalone perf test that reads in a prx file as a series of packed N-bit ints, and reports the best of 5 runs.

These results are NOT testing the full pfor – just different possible methods for doing the N-bit int unpacking part of pfor. Paul, I haven't integrated these into the pfor code, but I think we probably eventually should.

Finally, I switched the autogen to C++/gcc to see how much faster simple C code is.

In both the java and C tests, by far the fastest way was to mmap the file and read ints from it, sequentially (relative gets) using ByteBuffer, so all 3 take that approach.

(To run these, extract autogen.tgz, then open *.py and see the comment at the top; you'll have to edit the sources to fix the hardwired path to the prx file).

Java1 code calls getInt() one at a time, like this:

  static void decode4(final ByteBuffer in, final int[] out) {
    final int i1 = in.getInt();
    out[0] = i1 & 15;
    out[1] = (i1 >>> 4) & 15;
    out[2] = (i1 >>> 8) & 15;
    out[3] = (i1 >>> 12) & 15;
    out[4] = (i1 >>> 16) & 15;
    out[5] = (i1 >>> 20) & 15;
    out[6] = (i1 >>> 24) & 15;
    out[7] = (i1 >>> 28) & 15;
    final int i2 = in.getInt();
    out[8] = i2 & 15;
    out[9] = (i2 >>> 4) & 15;
    out[10] = (i2 >>> 8) & 15;
    ...
  }

Java 2 code gets all N ints up front, like this:

  static void decode4(IntBuffer in, int[] inbuf, int[] out) {
    in.get(inbuf, 0, 16);
    out[0] = inbuf[0] & 15;
    out[1] = (inbuf[0] >>> 4) & 15;
    out[2] = (inbuf[0] >>> 8) & 15;
    out[3] = (inbuf[0] >>> 12) & 15;
    out[4] = (inbuf[0] >>> 16) & 15;
    out[5] = (inbuf[0] >>> 20) & 15;
    out[6] = (inbuf[0] >>> 24) & 15;
    out[7] = (inbuf[0] >>> 28) & 15;
    out[8] = inbuf[1] & 15;
    out[9] = (inbuf[1] >>> 4) & 15;
    out[10] = (inbuf[1] >>> 8) & 15;
    ...
  }

C++ code is analogous to Java 1 (data is mmap'd):

static bool decode4(int* out) {
  int i1 = *(data++);
  *(out++) = i1 & 15;
  *(out++) = (i1 >> 4) & 15;
  *(out++) = (i1 >> 8) & 15;
  *(out++) = (i1 >> 12) & 15;
  *(out++) = (i1 >> 16) & 15;
  *(out++) = (i1 >> 20) & 15;
  *(out++) = (i1 >> 24) & 15;
  *(out++) = (i1 >> 28) & 15;
  int i2 = *(data++);
  *(out++) = i2 & 15;
  ...
}

Here's the performance for each bit size:

bits	Java 1 (kints/msec)	Java 2 (kints/msec)	C++ (kints/msec)	C advantage
1	916.6	648.5	1445.3	1.6x
2	793.8	593.4	1118.3	1.4x
3	616.7	541.9	1068.8	1.7x
4	656.6	512.1	1161.6	1.8x
5	499.8	469.0	897.3	1.8x
6	410.6	444.9	899.5	2.0x
7	367.4	409.0	801.7	2.0x
8	414.0	386.7	816.6	2.0x
9	306.3	366.9	710.8	1.9x
10	278.8	341.9	665.8	1.9x
11	258.1	307.8	623.6	2.0x
12	245.9	311.7	592.7	1.9x
13	223.9	285.0	574.5	2.0x
14	204.2	271.8	538.1	2.0x
15	190.3	260.0	522.6	2.0x
16	229.9	250.4	519.7	2.1x
17	190.8	223.7	488.3	2.2x
18	171.6	198.1	461.2	2.3x
19	158.3	180.5	436.2	2.4x
20	163.1	207.4	416.4	2.0x
21	156.3	166.3	403.0	2.4x
22	147.0	163.5	387.8	2.4x
23	141.6	174.1	357.5	2.1x
24	141.9	162.0	362.6	2.2x
25	133.2	177.6	335.3	1.9x
26	125.8	153.5	334.7	2.2x
27	121.6	139.6	314.0	2.2x
28	122.7	130.1	316.7	2.4x
29	107.3	123.9	296.7	2.4x
30	111.0	127.6	300.7	2.4x
31	108.1	94.0	290.5	2.7x

C code is between 1.4-2.7 X faster than the best Java run. Reading one int at a time is faster when the #bits is small <=5), but then reading all N ints up front is general faster for larger bit sizes.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

I can't see the .tgz attachment at the moment.

Very nice table, this java/c++ comparison. Meanwhile I also have a java program generator for decompression, and I got very similar results.

I found a bug for the 17 bits case in decodeAnyFrame(), this might explain the sum test failure that you saw.

One detail on the generated code: the final mask can be omitted, for example the last bit extraction for the 5 bit case:

 (i4 >>> 27)

On the java side leaving this last mask out did not make a difference, but it might be noticeable in c++.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Woops I forgot the attachment...

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

One detail on the generated code: the final mask can be omitted, for example the last bit extraction for the 5 bit case:

Ahh good catch. I'll fix in my autogen.

Meanwhile I also have a java program generator for decompression, and I got very similar results.

Excellent! Did you also switch to relative getInts? This way, I could change my TestPFor2 to rely on self-punctuation when reading. And, I can pass down a ByteBuffer derived directly from the file, instead of copying bytes into byte[] first. This would make the test more fair to pfor.

I found a bug for the 17 bits case in decodeAnyFrame(), this might explain the sum test failure that you saw.

OK I can re-test when you post this, to see if the checksums match.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

Did you also switch to relative getInts?

No. I thought about that, but I wanted to make it work correctly first.

This way, I could change my TestPFor2 to rely on self-punctuation when reading. And, I can pass down a ByteBuffer derived directly from the file, instead of copying bytes into byte[] first.

The only concern I have there is that when using a byte[] directly from the file getting the int values may result in non 4 byte aligned fetches. Can current hardware do this well?

It's tempting to move to a full C implementation directly now. Should we do that? A similar move was made in the past by letting gcc deal with vInts, but meanwhile the jvms caught up there.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The only concern I have there is that when using a byte[] directly from the file getting the int values may result in non 4 byte aligned fetches. Can current hardware do this well?

Good question, though if that's the case we could presumably work around it by ensuring the header of the file is 0 mod 4?

Or... is this because a full block (header + bits + exception data) may not be 0 mod 4? (Though we too could pad if necessary)

I think the API we want to reach (eventually) is an IntBlockInput in Directory where you call read(int[]) and it returns the next 128 (say) ints in an array and moves itself to the end of that block (start of the next block).

It's tempting to move to a full C implementation directly now. Should we do that? A similar move was made in the past by letting gcc deal with vInts, but meanwhile the jvms caught up there.

Maybe we should explore this eventually, but I think for now we should first try to get the Java version online?

I'd really love to get a prototype integration working so that we can then do an end-to-end performance test (ie, actual searches). I'm still wondering how much speedups at this layer will actually affect overall search time.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

Or... is this because a full block (header + bits + exception data) may not be 0 mod 4? (Though we too could pad if necessary)

The exceptions can be 1, 2 or 4 bytes and there is one more coding possible, we might reserve that for long. I'd like to use some padding space to get the exceptions aligned to their natural border, 2 bytes exceptions at even byte position, and 4 bytes exceptions at (... mod 4 == 0) natural int border. That way the rest of the padding space (if any) could be used to align to natural int border.

I'd really love to get a prototype integration working so that we can then do an end-to-end performance test (ie, actual searches). I'm still wondering how much speedups at this layer will actually affect overall search time.

The decoding speeds reported here so far are about the same as the ones reported by Zhang in 2008. That means there is room for disk speeds to catch up with CPU speed. In other words, adding this for proximities (and maybe docids and freqs) should make lucene a natural fit for ssd's, see also ZhangfFig. 15.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

That means there is quite a bit of room for disk speeds to catch up with CPU speed.

Exactly!

Search optimization is tricky because for an index that can't fit entirely in the OS's IO cache, reducing the CPU cost of searching (which we are diong, here) is basically usless: the end user won't see much benefit.

For an index entirely in the IO cache, I think these optimizations might make a big difference.

In some sense, we have been allowed to hide behind the slow performance of magnetic hard drives and not worry much about reducing the CPU cost of searching.

However: relatively soon most computers will use SSDs, and then suddenly it's as if every index is in the IO cache (albeit a somewhat slower one, but still far faster than magnetic media). So now is the time for us to reduce the cpu cost of searching for Lucene.

And this means for this issue and other sources of optimizing search performance, we should largely focus only on indices entirely cached in the IO cache.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

The 1410c patch is much like the 1410b. It adds some bug fixes and code cleanups and it includes a generator for java decompression classes. After applying the patch, run the generator once in src/java/org/apache/lucene/util/pfor :

python gendecompress.py

After that, in the trunk directory:

ant -Dtestcase=TestPFor test-core

should finish successfully in just over a minute, as it also includes decompression performance tests.

To be done:

possibly split the performance tests from the functional tests, this might require adding an ant target for performance tests,
possibly add byte padding to move exceptions to natural position,
add performance tests for decompression with exception patching,
switch decompression to relative buffer addressing (get() instead of get(index)),
optimize the exception decoding (i.e. patching) code,
optimize the compression code,
generate/correct the javadocs.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Paul, in decompress I added "inputSize = -1" at the top, so that the header is re-read. I need this so I can re-use a single PFor instance during decompress.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

Did you also move to relative addressing in the buffer?

Another question: I suppose the place to add this initially would be in IndexOutput and IndexInput? In that case it would make sense to reserve (some bits of) the first byte in the compressed buffer for the compression method, and use these bits there to call PFor or another (de)compression method.

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Another thing that bit me was the bufferByteSize(): if this returns something that's not 0 mod 4, you must increase it to the next multiple of 4 otherwise you will lose data since ByteBuffer is big endian by default. We should test little endian to see if performance changes (on different CPUs).

Did you also move to relative addressing in the buffer?

No I haven't done that, but I think we should. I believe it's faster. I'm trying now to get a rudimentary test working for TermQuery using pfor.

Another question: I suppose the place to add this initially would be in IndexOutput and IndexInput? In that case it would make sense to reserve (some bits of) the first byte in the compressed buffer for the compression method, and use these bits there to call PFor or another (de)compression method.

This gets into flexible indexing...

Ideally we do this in a pluggable way, so that PFor is just one such plugin, simple vInts is another, etc.

I could see a compression layer living "above" IndexInput/Output, since logically how you encode an int block into bytes is independent from the means of storage.

But: such an abstraction may hurt performance too much since during read it would entail an extra buffer copy. So maybe we should just add methods to IndexInput/Output, or, make a new IntBlockInput/Output.

Also, some things you now store in the header of each block should presumably move to the start of the file instead (eg the compression method), or if we move to a separate "schema" file that can record which compressor was used per file, we'd put this there.

So I'm not yet exactly sure how we should tie this in "for real"...

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

I'm trying now to get a rudimentary test working for TermQuery using pfor.

It should really make a difference for stop words and disjunction queries depending on DocIdSetIterator.next().

Conjunctions that depend on skipTo(docNum) will probably make it necessary to impose an upperbound the size of the compressed arrays. This upperbound could be the same as the skip distance in the index.

I'm wondering whether it would make sense to add skip info to the term positions of very large documents. Any ideas on that?

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

The strange behaviour at 9 bits I reported earlier was due to a mistake in the test data. It contained only 31 integers (124 bytes, see posted listing), so the unrolled loop would never be called.

And another point to be done:

change the decompression code generator to use an array get() from the buffer for 6 or more frame bits (see the Java 1/2 performance number numbers listed above.)

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It should really make a difference for stop words and disjunction queries depending on DocIdSetIterator.next().

Yes.

Conjunctions that depend on skipTo(docNum) will probably make it necessary to impose an upperbound the size of the compressed arrays.

Yes. Though, I think when optimizing search performance we should focus entirely on the high-latency queries. TermQuery on very frequent terms, disjunctions queries involving common terms, phrase/span queries that have many matches, etc.

EG if PFOR speeds up high-latency queries say by 20% (say 10 sec -> 8 sec), but causes queries that are already fast (say 30 msec) to get a bit slower (say 40 msec) I think that's fine. It's the high-latency queries that kill us because those ones limit how large a collection you can put on one box before you're forced to shard your index.

At some point we should make use of concurrency when iterating over large result sets. EG if estimated # total hits is > X docs, use multiple threads where each threads skips to it's own "chunk" and iterates over it, and then merge the results. Then we should be able to cut down on the max latency query and handle more documents on a single machine. Computers are very quickly become very concurrent.

I'm wondering whether it would make sense to add skip info to the term positions of very large documents. Any ideas on that?

Probably we should – yet another issue :)

asfimport commented 16 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

In order to understand how time is spent overall during searching, I took TermQuery and reimplemented it in several different ways.

While each implementatation is correct (checksum of final top N docs matches) they are very much prototypes and nowhere near committable.

I then took the 100 most frequent terms (total 23.3 million hits) from my Wikipedia index and ran them each in sequence. Each result is best of 25 runs (all results are from the OS's IO cache):

Test	Time (msec)	Hits/msec	Speedup
Baseline	674	34496	1.00X
+ Code Speedups	591	39340	1.14X
+ Code Speedups + PFOR	295	78814	2.28X
+ Code Speedups + BITS	247	94130	2.73X
+ Code Speedups + BITS (native)	230	101088	2.93X

Here's what the test names mean:

Baseline is the normal TermQuery, searching with TopDocsCollector for top 10 docs.
Code Speedups means some basic optimizations, eg made my own specialized priority queue, unrolled loops, etc.
PFOR means switching to PFOR for storing docs & freqs as separate streams. Each term's posting starts a new PFOR block.
BITS just means using packed n-bit ints for each block (ie, it has no exceptions, so it sets N so that all ints will fit). The resulting frq file was 18% bigger and doc file was 10% bigger – but this is just for the 100 most frequent terms.
BITS (native) is BITS but running as JNI (C++) code.

Next, I tried running the same things above, but I turned off collection of hits. So this really just tests decode time:

Test	Time (msec)	Hits/msec	Speedup
+ Code Speedups	384	60547	1.76X
+ Code Speedups + PFOR	91	255497	7.41X
+ Code Speedups + BITS	49	474496	13.76X
+ Code Speedups + BITS (native)	32	726572	21.06X

Some observations:

PFOR really does speed up TermQuery overall, so, I think we should pursue it and get it committed.
BITS is a good speedup beyond PFOR, but we haven't optimized PFOR yet. Also, BITS would be very seekable. We could also use PFOR but increase the bit size of blocks, to get the same thing.
Once we swap in PFOR and/or BITS or other interesting int block compression, accumulating hits becomes the slowest part of TermQuery.
Native BITS decoding code is quite a bit faster for decoding, but, probably not worth pursuing for now since with PFOR decode time becomes a small part of the overall time.
TermQuery is the simplest query; other queries will spend more CPU time coordinating, or time handling positions, so we can't yet conclude how much of an impact PFOR will have on them.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

So there's roughly a factor 2 between PFOR and BITS (which is actually just FOR i.e. PFOR without patches/exceptions). Meanwhile I've started working on speeding up the patching, combined with always using a multiple of 4 bytes. This also involves changing the layout for the exceptions slightly, but it should allow faster patching and avoid the byte ordering problems mentioned before. It will take another while to finish.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

1410d patch: as 1410c, with the following further changes:

moved exceptions to their natural borders,
unrolled exception patching code,
some more testcases for exceptions,
a specialized version for "decompressing" with 32 frame bits,
all other frame decompression is left to generated code (not contained in the patch),
compressed and decompressed sizes are now reported in numbers of integers,
also contains an adapted version of TestPFor2.

This should diminish the difference in performance between PFOR and BITS.

asfimport commented 16 years ago

Paul Elschot (migrated from JIRA)

I've been at this quite irregularly. I'm trying to give the PFor class a more OO interface and to get exception patching working at more decent speeds. In case someone else wants to move this forward faster than it is moving now, please holler.

After rereading this, and also after reading up a bit on MonetDb performance improvement techniques, I have few more rants:

Taking another look at the decompression performance figures, and especially the differences between native C++ and java, it could become worthwhile to also implement TermQuery in native code.

With the high decompression speeds of FOR/BITS at lower numbers of frame bits it might also become worthwhile to compress character data, for example numbers with a low number of different characters. Adding a dictionary as in PDICT might help compression even further. This was probably one of the reasons for the column storage discussed earlier, I'm now sorry I ignored that discussion. In the index itself, column storage is also useful. One example is the splitting of document numbers and frequency into separate streams, another example is various offsets for seeking in the index.

I think it would be worthwhile to add a compressed integer array to the basic types used in IndexInput and IndexOutput. I'm still strugling with the addition of skip info into a tree of such compressed integer arrays (skip offsets don't seem to fit naturally into a column, and I don't know whether the skip size should be the same as the decompressed array size). Placement of such compressed arrays in the index should also be aware of CPU cache lines and of VM page (disk block) boundaries. In higher levels of a tree of such compressed arrays, frame exceptions would be best avoided to allow direct addressing, but the leafs could use frame exceptions for better compression.

For terms that will occur at most once in one document more compression is possible, so it might be worthwhile to add these as a key. At the moment I have no idea how to enforce the restriction of at most once though.

asfimport commented 15 years ago

Paul Elschot (migrated from JIRA)

I'd like to split off the performance test cases from the functional test cases for this, but there does not seem to be a nice way to do that using the current test methods via ant.

I'm thinking of using a special class name prefix like PerfTest... (or maybe something shorter like Perf...) and adding a test target for that in the build scripts.

Are there other places where splitting performance tests and functional tests could be useful? When so, what would be a good way to implement it?

asfimport commented 15 years ago

Paul Elschot (migrated from JIRA)

The 1410e patch splits off a FrameOfRef class as the superclass from the PFor class. Test cases are split similarly, all added test cases pass. This should make it easier to chose between the patching version PFor and the non patching version FrameOfRef.

There lots of small changes compared to the 1410d patch, mostly in the o.a.l.util.pfor package, and a few in the util package. Some javadocs are added, but javadoc generation is not yet tested.

After applying the patch, run gendecompress.py as for the 1410b patch to generate the java sources for decompression. Then: ant -Dtestcase=TestFrameOfRef test-core and ant -Dtestcase=TestPFor test-core should pass.

The following (at least) is still to be done:

peformance tests for patching decompression,
relative compressed data buffer addressing,
generate code for compression in the same way as for decompression,
a few fixme's and checkme's in the code,
javadoc generation.

asfimport commented 15 years ago

Eks Dev (migrated from JIRA)

It looks like Google went there as well (Block encoding),

see: Blog http://blogs.sun.com/searchguy/entry/google_s_postings_format http://research.google.com/people/jeff/WSDM09-keynote.pdf (Slides 47-63)

asfimport commented 15 years ago

Paul Elschot (migrated from JIRA)

The encoding in the google research slides is another one. They use 2 bits prefixing the first byte and indicating the number of bytes used for the encoded number (1-4), and then they group 4 of those prefixes together to get a single byte of 4 prefixes followed by the non prefixed bytes of the 4 encoded numbers. This requires a 256 way switch (indexed jump) for every 4 encoded numbers, and I would expect that jump to limit performance somewhat when compared to pfor that has a 32 way switch for 32/64/128 encoded numbers. But since the prefixes only indicate the numbers of bytes used for the encoded numbers, no shifts and masks are needed, only byte moves. So it could well be wortwhile to give this encoding a try, too, especially for lists of numbers shorter than 16 or 32.

asfimport commented 15 years ago

Paul Elschot (migrated from JIRA)

A very recent paper with some improvements to PFOR: Yan, Ding, Suel, Inverted Index Compression and Query Processing with Optimized Document Ordering, WWW 2009, April 20-24 2009, Madrid, Spain

Roughly quoting par. 4.2, Optimizing PForDelta compression: For an exception, we store its lower b bits instead of the offset to the next exception in its corresponding slot, while we store the higher overflow bits and the offset in two separate arrays. These two arrays are compressed using the Simple16 method. Also b is chosen to optimize decompression speed. This makes the dependence of b on the data quite simple, (in the PFOR above here this dependence is more complex) and this improves compression speed.

Btw. the document ordering there is by URL. For many terms this gives more shorter delta's between doc ids allowing a higher decompression speed of the doc ids.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attaching sep, intblock and pfordelta codecs, spun out of the last patch on #2532.

Once #2532 is in, we should finish the pfordelta codec to make it a real choice.

I actually think some combination of pulsing, standard, pfordelta and simple bit packing (in order by increasing term's docFreq), within a single codec, may be best.

Ie, rare terms (only in a doc or two) could be inlined into the the terms dict. Slightly more common terms can use the more CPU intensive standard codec. Common terms can use cpu-friendly-yet-still-decent-compression pfordelta. Obsenely common terms can use bit packing for the fastest decode.

asfimport commented 15 years ago

Eks Dev (migrated from JIRA)

Mike, That is definitely the way to go, distribution dependent encoding, where every Term gets individual treatment.

Take for an example simple, but not all that rare case where Index gets sorted on some of the indexed fields (we use it really extensively, e.g. presorted doc collection on user_rights/zip/city, all indexed). There you get perfectly "compressible" postings by simply managing intervals of set bits. Updates distort this picture, but we rebuild index periodically and all gets good again. At the moment we load them into RAM as Filters in IntervalSets. if that would be possible in lucene, we wouldn't bother with Filters (VInt decoding on such super dense fields was killing us, even in RAMDirectory) ...

Thinking about your comments, isn't pulsing somewhat orthogonal to packing method? For example, if you load index into RAMDirecectory, one could avoid one indirection level and inline all postings.

Flex Indexing rocks, that is going to be the most important addition to lucene since it started (imo)... I would even bet on double search speed in first attempt for average queries :)

Cheers, eks

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

To encode the exceptions as suggested by Yan,Ding&Suel, Simple9 or Simple16 can be used. A Simple9 implementation is at #3265.

asfimport commented 14 years ago

Renaud Delbru (migrated from JIRA)

Hi,

I have performed some benchmarks using the PFOR index I/O interface in order to check if the index reader and block reader were not adding too much overhead (I was afraid that the block reading interface was adding too much overhead, and, as a consequence, loosing the decompression speed benefits of block based compression.

In this benchmark, I have jsut tested FOR (and not PFOR) using various block size. The benchmark is setup as follow:

I generate an integer array of size 33554432, containing uniformly distributed integer (0 <= x < 65535)
I compress the data using PFORDeltaIndexOutput (in fact, aa similar class that I modified for this benchmark in order to support various blocksize)
I measure the time to decompress the full list of integers using PFORDeltaIndexInput#next()

I have performed a similar test using a basic IndexInput/Output with VInt encoding. The performance was 39Kints/msec.

For FrameOfRef, the best block size seems to be 4096 (256 - 270 kints / msec), larger block size do not provide significant performance improvement,

We can observe that FOR provides \~7 times read performance increase. To conclude, it looks like the reader and block reader interface do not add to much overhead ;o).

P.S.: It is possible that these results are not totally correct. I'll try to double check the code, and upload it here.

Results:

BlockSize = 32
FrameOfRef 0 decompressed 33554432 in 368 msecs, 91 kints/msec, (1 iters).
FrameOfRef 1 decompressed 33554432 in 314 msecs, 106 kints/msec, (1 iters).
FrameOfRef 2 decompressed 33554432 in 294 msecs, 114 kints/msec, (1 iters).
BlockSize = 64
FrameOfRef 0 decompressed 33554432 in 242 msecs, 138 kints/msec, (1 iters).
FrameOfRef 1 decompressed 33554432 in 239 msecs, 140 kints/msec, (1 iters).
FrameOfRef 2 decompressed 33554432 in 237 msecs, 141 kints/msec, (1 iters).
BlockSize = 128
FrameOfRef 0 decompressed 33554432 in 223 msecs, 150 kints/msec, (1 iters).
FrameOfRef 1 decompressed 33554432 in 228 msecs, 147 kints/msec, (1 iters).
FrameOfRef 2 decompressed 33554432 in 224 msecs, 149 kints/msec, (1 iters).
BlockSize = 256
FrameOfRef 0 decompressed 33554432 in 219 msecs, 153 kints/msec, (1 iters).
FrameOfRef 1 decompressed 33554432 in 218 msecs, 153 kints/msec, (1 iters).
FrameOfRef 2 decompressed 33554432 in 219 msecs, 153 kints/msec, (1 iters).
BlockSize = 512
FrameOfRef 0 decompressed 33554432 in 170 msecs, 197 kints/msec, (1 iters).
FrameOfRef 1 decompressed 33554432 in 176 msecs, 190 kints/msec, (1 iters).
FrameOfRef 2 decompressed 33554432 in 173 msecs, 193 kints/msec, (1 iters).
BlockSize = 1024
FrameOfRef 0 decompressed 33554432 in 136 msecs, 246 kints/msec, (1 iters).
FrameOfRef 1 decompressed 33554432 in 139 msecs, 241 kints/msec, (1 iters).
FrameOfRef 2 decompressed 33554432 in 147 msecs, 228 kints/msec, (1 iters).
BlockSize = 2048
FrameOfRef 0 decompressed 33554432 in 133 msecs, 252 kints/msec, (1 iters).
FrameOfRef 1 decompressed 33554432 in 135 msecs, 248 kints/msec, (1 iters).
FrameOfRef 2 decompressed 33554432 in 139 msecs, 241 kints/msec, (1 iters).
BlockSize = 4096
FrameOfRef 0 decompressed 33554432 in 124 msecs, 270 kints/msec, (1 iters).
FrameOfRef 1 decompressed 33554432 in 131 msecs, 256 kints/msec, (1 iters).
FrameOfRef 2 decompressed 33554432 in 131 msecs, 256 kints/msec, (1 iters).
BlockSize = 8192
FrameOfRef 0 decompressed 33554432 in 126 msecs, 266 kints/msec, (1 iters).
FrameOfRef 1 decompressed 33554432 in 128 msecs, 262 kints/msec, (1 iters).
FrameOfRef 2 decompressed 33554432 in 127 msecs, 264 kints/msec, (1 iters).
BlockSize = 16384
FrameOfRef 0 decompressed 33554432 in 127 msecs, 264 kints/msec, (1 iters).
FrameOfRef 1 decompressed 33554432 in 125 msecs, 268 kints/msec, (1 iters).
FrameOfRef 2 decompressed 33554432 in 129 msecs, 260 kints/msec, (1 iters).
BlockSize = 32768
FrameOfRef 0 decompressed 33554432 in 123 msecs, 272 kints/msec, (1 iters).
FrameOfRef 1 decompressed 33554432 in 132 msecs, 254 kints/msec, (1 iters).
FrameOfRef 2 decompressed 33554432 in 135 msecs, 248 kints/msec, (1 iters).

EDIT: Here is new results comparing FOR and a block-based VInt using various block size. The decompression loop is repeated to reach > 300ms (for JIT effect, see post below). The loop recreates a new block reader each time (which causes some overhead, as you can see with the FOR performance results, compared to the one below).

32	64	128	256	512	1024	2048	4096	8192	16384	32768
VInt (kints/msec)	28	30	30	31	48	65	84	94	100	104	104
FOR (kints/msec)	104	126	131	132	164	195	202	214	217	220	223

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

Did you see any effect of the JIT? This is possible by making a loop that runs for about 1 second, and use that loop 3 times, see the output posting of 3 Oct 2008. When the 2nd and 3rd time are just about equal one can assume that the JIT is done. Perhaps a shorter time than 1 second can be used nowadays.

For the blocks that you've been measuring one might also see an effect of L1/L2 data cache size. This effect is probably not present in my previous postings here because I have not yet used this on larger blocks.

asfimport commented 14 years ago

Renaud Delbru (migrated from JIRA)

Did you see any effect of the JIT? This is possible by making a loop that runs for about 1 second, and use that loop 3 times, see the output posting of 3 Oct 2008. When the 2nd and 3rd time are just about equal one can assume that the JIT is done.

I haven't seen the effect of the JIT on FOR. I have re-performed the benchmark, repeating the decompression until the test time is superior to 300 ms or 1s (as it is done in TestFrameOfRef#doDecompPerfTestUsing() method), but I haven't observed a difference with the previous benchmark. In fact, it seems that it happens only for block size of 32, in the other cases, it seems imperceptible (in the previous benchmark, the number looks unstable, in the new benchmark however it looks more stable, but no significant increase between the first loop and the other ones).

In the first benchmark, I was not repeating the loop until 1 second is reached since there is no easy way to "reset" the reader. In the new benchmark, I am recreating the reader in the loop (which causes some overhead). Do you think it can have a consequence on the JIT ?

For the blocks that you've been measuring one might also see an effect of L1/L2 data cache size.

Yes, it should be an effect of the cache size. It was to check the increase of performance and find the optimal block size. This optimal block size may be dependent on my hardware (Intel(R) Core(TM)2 Duo CPU T7300 @ 2.00GHz, cache size : 4096 KB).

asfimport commented 14 years ago

Paul Elschot (migrated from JIRA)

In the first benchmark, I was not repeating the loop until 1 second is reached since there is no easy way to "reset" the reader. In the new benchmark, I am recreating the reader in the loop (which causes some overhead). Do you think it can have a consequence on the JIT ?

I don't know how wheather JIT deals with the code for invidual objects (other than classes and their code). There could be an indirect effect via the garbage collector.

The uniform random range of about 65000 generates almost always 14-16 bits per number, so these results are roughly comparable to the 14-16 bit cases posted on 6 October 2008 by Michael. As it happens they are quite close.

apache / lucene

PFOR implementation [LUCENE-1410] #2484