MMapDirectory speedups [LUCENE-2816]

asfimport commented 13 years ago

MMapDirectory has some performance problems:

When the file is larger than Integer.MAX_VALUE, we use MultiMMapIndexInput, which does a lot of unnecessary bounds-checks for its buffer-switching etc. Instead, like MMapIndexInput, it should rely upon the contract of these operations in ByteBuffer (which will do a bounds check always and throw BufferUnderflowException). Our 'buffer' is so large (Integer.MAX_VALUE) that its rare this happens and doing our own bounds checks just slows things down.
the readInt()/readLong()/readShort() are slow and should just defer to ByteBuffer.readInt(), etc This isn't very important since we don't much use these, but I think there's no reason users (e.g. codec writers) should have to readBytes() + wrap as bytebuffer + get an IntBuffer view when readInt() can be almost as fast...

Migrated from LUCENE-2816 by Robert Muir (@rmuir), resolved Dec 26 2010 Attachments: LUCENE-2816.patch

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here's the most important benchmark: speeding up the MultiMMap's readByte(s) in general:

MultiMMapIndexInput readByte(s) improvements [trunk, Standard codec]

Query	QPS trunk	QPS patch	Pct diff
spanFirst(unit, 5)	12.72	12.85	1.0%
+nebraska +state	137.47	139.33	1.3%
spanNear([unit, state], 10, true)	2.90	2.94	1.4%
"unit state"	5.88	5.99	1.8%
unit\~2.0	7.06	7.20	2.0%
+unit +state	8.68	8.87	2.2%
unit state	8.00	8.23	2.9%
unit\~1.0	7.19	7.41	3.0%
unit*	22.66	23.41	3.3%
uni*	12.54	13.12	4.6%
united\~1.0	10.61	11.12	4.8%
united\~2.0	2.52	2.65	5.1%
state	28.72	30.23	5.3%
un*d	44.84	48.06	7.2%
u*d	13.17	14.51	10.2%

In the bulk postings branch, I've been experimenting with various techniques for FOR/PFOR and one thing i tried was simply decoding with readInt() from the DataInput. So I adapted For/PFOR to just take DataInput and work on it directly, instead of reading into a byte[], wrapping it with a ByteBuffer, and working on an IntBuffer view.

But when I did this, i found that MMap was slow for readInt(), etc. So we implement these primitives with ByteBuffer.readInt(). This isn't very important since lucene doesn't much use these, and mostly theoretical but I still think things like readInt(), readShort(), readLong() should be fast... for example just earlier today someone posted an alternative PFOR implementation on #2484 that uses DataInput.readInt().

MMapIndexInput readInt() improvements [bulkpostings, FrameOfRefDataInput codec]

Query	QPS branch	QPS patch	Pct diff
spanFirst(unit, 5)	12.14	11.99	-1.2%
united\~1.0	11.32	11.33	0.1%
united\~2.0	2.51	2.56	2.1%
unit\~1.0	6.98	7.19	3.0%
unit\~2.0	6.88	7.11	3.3%
spanNear([unit, state], 10, true)	2.81	2.96	5.2%
unit state	8.04	8.59	6.8%
+unit +state	10.97	12.12	10.5%
unit*	26.67	29.80	11.7%
"unit state"	5.59	6.27	12.3%
uni*	15.10	17.51	15.9%
state	33.20	38.72	16.6%
+nebraska +state	59.17	71.45	20.8%
un*d	35.98	47.14	31.0%
u*d	9.48	12.46	31.4%

Here's the same benchmark of DataInput.readInt() but with the MultiMMapIndexInput

MultiMMapIndexInput readInt() improvements [bulkpostings, FrameOfRefDataInput codec]

Query	QPS branch	QPS patch	Pct diff
united\~2.0	2.43	2.54	4.3%
united\~1.0	10.78	11.39	5.7%
unit\~1.0	6.81	7.21	5.8%
unit\~2.0	6.62	7.05	6.5%
spanNear([unit, state], 10, true)	2.77	2.96	6.6%
unit state	7.85	8.53	8.7%
spanFirst(unit, 5)	10.50	11.71	11.5%
+unit +state	10.26	11.94	16.3%
"unit state"	5.39	6.31	17.0%
state	31.95	39.17	22.6%
unit*	24.39	31.02	27.2%
+nebraska +state	54.68	71.98	31.6%
u*d	9.53	12.62	32.5%
uni*	13.72	18.23	32.9%
un*d	35.87	48.19	34.3%

Just to be sure, I ran this last one on sparc64 (bigendian) also.

MultiMMapIndexInput readInt() improvements [bulkpostings, FrameOfRefDataInput codec]

Query	QPS branch	QPS patch	Pct diff
united\~2.0	2.23	2.26	1.5%
unit\~2.0	6.37	6.47	1.6%
united\~1.0	11.33	11.59	2.3%
unit\~1.0	9.68	10.05	3.7%
spanNear([unit, state], 10, true)	15.60	17.54	12.5%
unit*	127.14	144.08	13.3%
unit state	44.93	51.30	14.2%
spanFirst(unit, 5)	58.42	68.37	17.0%
uni*	56.66	67.53	19.2%
+nebraska +state	215.62	262.99	22.0%
+unit +state	63.18	77.86	23.2%
"unit state"	32.24	40.05	24.2%
u*d	29.13	36.69	26.0%
state	145.99	188.33	29.0%
un*d	65.27	87.20	33.6%

I think some of these benchmarks also show that MultiMMapIndexInput might now be essentially just as fast as MMapIndexInput... but lets not go there yet and keep them separate for now.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Awesome results robert!! :)

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Awesome, Ro bert is changing to the MMap Performance Policeman!

I like the idea to simply delegate the methods and catch exception to fallback to manual read with boundary transition! I just wanted to be sure that the position pointer in the buffer does not partly go forward when you read request fails at a buffer boundary, but that seems to be the case.

asfimport commented 13 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

One thing to add: When using readFloat & co, we should make sure that we set the endianness explicitely in the ctor. I just want to explicitely make sure that the endianness is correct and document it that it is big endian for Lucene.

We don't need that: "The initial order of a byte buffer is always BIG_ENDIAN."

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I just wanted to be sure that the position pointer in the buffer does not partly go forward when you read request fails at a buffer boundary, but that seems to be the case.

Yes, this is guaranteed in the APIs, and also tested well by TestMultiMMap, which uses random chunk sizes between 20 and 100 (including odd numbers etc) Though we should enhance this test, i think it just retrieves documents at the moment... probably better if it did some searches too.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Good grief! What amazing gains, especially w/ PFor codec which of course makes super heavy use of .readInt(). Awesome Robert!

This will mean w/ the cutover to FORPFOR codec for 4.0, MMapDir will likely have a huge edge over NIOFSDir?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Good grief! What amazing gains, especially w/ PFor codec which of course makes super heavy use of .readInt(). Awesome Robert! This will mean w/ the cutover to FORPFOR codec for 4.0, MMapDir will likely have a huge edge over NIOFSDir?

This isn't really a 'gain' for the bulkpostings branch? This is just making DataInput.readInt() faster. Currently the bulkpostings branch uses readByte(byte[]), then wraps into a ByteBuffer and processes an IntBuffer view of that. I switched to just using readInt() from DataInputDirectly [FrameOfRefDataInput] and found it to be much slower than this IntBuffer method.

this whole benchmark is just benching DataInput.readInt()...

So, we shouldn't change anything in bulkpostings, this isn't faster than the intbuffer method in my tests, at best its equivalent... but we should fix this slowdown in our APIs.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

committed revision 1050737. I'll wait a bit for branch_3x.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Committed revision 1052892 to branch_3x.

apache / lucene

MMapDirectory speedups [LUCENE-2816] #3890