apache / lucene

Apache Lucene open-source search software

https://lucene.apache.org/

Apache License 2.0

2.46k stars 980 forks source link

InstantiatedIndex - faster but memory consuming index [LUCENE-550] #1628

Closed asfimport closed 16 years ago

asfimport commented 18 years ago

Represented as a coupled graph of class instances, this all-in-memory index store implementation delivers search results up to a 100 times faster than the file-centric RAMDirectory at the cost of greater RAM consumption.

Performance seems to be a little bit better than log2n (binary search). No real data on that, just my eyes.

Populated with a single document InstantiatedIndex is almost, but not quite, as fast as MemoryIndex.

At 20,000 document 10-50 characters long InstantiatedIndex outperforms RAMDirectory some 30x, 15x at 100 documents of 2000 charachters length, and is linear to RAMDirectory at 10,000 documents of 2000 characters length.

Mileage may vary depending on term saturation.

Migrated from LUCENE-550 by Karl Wettin, 1 vote, resolved Mar 13 2008 Attachments: BinarySearchUtils.Apache.java, classdiagram.png, HitCollectionBench.jpg, LUCENE-550_20071021_no_core_changes.txt, LUCENE-550.patch (versions: 3), test-reports.zip Linked issues:

1659

asfimport commented 18 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Thanks Karl, it's interesting stuff...

> You might notice that norms are float[] and not byte[]. That is me who refactored it to see if it would do > any good. Bit shifting don't take many ticks, so I might just revert that.

Since there are only 256 byte values, many scorers use a simple lookup table Similarity.getNormDecoder() After I sped up norm decoding, a lookup table was only marginally faster anyway (see comments in SmallFloat class). So I wouldn't expect float[] norms to be mesurably faster than byte[] norms in the context of a complete search.

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

> > You might notice that norms are float[] and not byte[]. That is me who refactored it to see if it would do > > any good. Bit shifting don't take many ticks, so I might just revert that.

> Since there are only 256 byte values, many scorers use a simple lookup table Similarity.getNormDecoder() > After I sped up norm decoding, a lookup table was only marginally faster anyway (see comments in SmallFloat > class). So I wouldn't expect float[] norms to be mesurably faster than byte[] norms in the context of a complete > search.

The hypthesis is that instanciation and unnecessary data parsing is the bad guy. Converting bytes to floats fit that profile, so I moved it to the IO-classes (readFloat -> readByte). I relize that for the the norms alone, it is a marginal win, but if I find enough of these things it might show in the end. Don't know if I'll find enough things to work with though. Been looking at getting ridth of things in the IndexReader as the information it returns in many situations already available in the information passed IndexReader, but I'm afraid it might be a Pyrrhus victory as the Jit usually automatically "caches" things like that. There are more obvious places to save ticks, e.g. replacing collections with arrays.

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

The whole Lucene core branch.

I think I've messed something up, queries with Directory-implementations are much slower than normal. :-)

See the class diagram to understand what I did.

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

Class diagram over InstanciatedIndex

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

Due to read and write locks, this is how one must use the extention:

InstanciatedIndex ii = new InstanciatedIndex();

IndexWriter iw = ii.new InstanciatedIndexWriter(analyzer, clear); // locks iw.close(); // commits

IndexReader ir = ii.new InstanciatedIndexReader();

Searcher = ii.getSearcher();

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

This is a class diagram that explains what it will look like when I'm done.

It is pretty much only the IndexReader that needs to be refactored.

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

Some new statistics.

A corpus of 500 documents, 1-5K text per document.
Placed 150 000 term and boolean queries.
Retrieved the top <100 hits from each result.

Query alone is about 5x faster, but 9x if you include the hits collection.

I belive that span queries will be about 10x-20x faster as the skipTo() is really really optimized. There is a bug in my term position code, so I have not been able to messure it for real yet.

Hope to have that working and an updated class diagram for you soon.

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

Oups

InstanciatedIndex: Corpus creation took 14011 ms. Term queries took 33608 ms.

RAMDirectory: Corpus creation took 9144 ms. Term queries took 1123565 ms.

That it 35x the speed.

Something might be wrong. But my initial tests tells me that it is right. Will look in to this tomorrow. Need to sleep now.

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

There is a minor norms bug. The value differst +-3 from the Directory norms. Other than that it seems to work great.

Now about 40x faster than RAMDirectory.

Stats for test: 500 documents. 1-5K text content. 10 000 * 5 spans 10 000 * 13 term and boolean term queries. collected top 100 documents for each search results.

InstanciatedIndex is 40x faster than the RAMDirectory.

InstanciatedIndex running on Lucene 1.9-karl1 Corpus creation took 14903 ms. Span queries took 12884 ms. Term queries took 30221 ms.

RAMDirectory run on Licene 1.9 Corpus creation took 9337 ms. Span queries took 253412 ms. Term queries took 1188492 ms.

asfimport commented 18 years ago

Doug Cutting (@cutting) (migrated from JIRA)

This looks very promising. Unfortunately the code you provide makes many incompatible API changes (e.g., turning Term into an interface that has far fewer methods) removes lots of useful javadoc, etc. So please don't expect it to be committed soon!

A back-compatible way to add an interface is to add it above the old class. So you might add a TermInteface, AbstractTerm, and TermImpl, then change term to extend TermImpl and deprecate it.

Then there's also the question of whether you really must convert Term to an interface. I would not undertake that change for aesthetic reasons. Is it really required to achieve your goals? You should generally try hard to minimize the size of your diffs and maximize the back-compatiblity.

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

Doug Cutting commented on LUCENE-550:

> This looks very promising. Unfortunately the code you provide makes many incompatible API > changes (e.g., turning Term into an interface that has far fewer methods) removes lots of > useful javadoc, etc. So please don't expect it to be committed soon!

I agree, there is lots of work to be done on it. It was eaiser for me to think clear when everything was seperated. Basically there are only a few changes to the API that is needed:

Document nor Term may be final.
Something other minor that I forgot about.

It can all be fixed, but is nothing that I prioritize right now. If you feel it would be a nice thing for 2.0, tolk me what changes you are OK with and gave me at least two weeks notice I /might/ find time to back-factor the code.

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

This is the diagram of InstanciatedIndex as of 1.9-karl1

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

This update makes InstanciatedIndex compatible with Lucene, given that issue 580 and 581 is adopted.

It depends on generics and concurrent locks from J2SE 5.0.

Contains one update in Field:

public setFieldData(Object fieldData)

And one in Document:

public List<Field> getFields() { return fields; }

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

ArrayBoundsOutOfIndex-bugfix.

If eveything works as it should (I think so) then I'm happy to report that a FuzzyQuery seems to be about 1500 (one thousand five hundred) times faster on this memory implementation than on a RAMDirectory. The speed is gained by not creating a new instance of each Term in a TermEnum.

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

> If eveything works as it should

I doesn't. I keep taking out the victories in advance. I'll try not to in the future. So forget about the 1500. I'll come with a new number soon enough.

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

> I'll come with a new number soon enough.

Right, it was 25% faster. So forget everthing I said about anything.

asfimport commented 18 years ago

Karl Wettin (migrated from JIRA)

There is a bug with phrase queries. Possible term positions. Low priority for me.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

To make this index work flawless (I hope), remove the if-statement around the following row in InstatiatedIndexWriter (row 477 or so):

termDocumentInformation.termPositions.add(fieldSettings.position);

This will fix the termposition bug noted in an earlier comment.

I'll keep posting bugfixes as comments here, but when I work on it it's really in my branch of lucene 2.0.0, available here: http://www.ginandtonique.org/trac/snigel/wiki/Lucene2-karl

If someone feels that this layer is an interesting thing to add to Lucene, let me know what is required for commit and I'll make those changes. It still seems to be about 40 times (mean value on a "nomal" index with "normal" amount of terms. have seen 20x-200x) than RAMDirectory when comparing search and to retrieve documents time combined.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

A comment on memory usage: about 2x a RAMDirectory (900MB and 1800MB) on a 150,000 document corpus (when the corpus term count have been reached?)

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

In order to find the norm-error I ported all test cases. I'm sorry to report that 70 of them fails.

So if anyone use this code, don't. :-)

Hopefully most of the problems share the same problem. I'll be at the code this weekend, and perhaps a few days next week if needed.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

New code. More backwards compatible. Just a very few changes required to the Lucene core.

Now with test cases from distribution, but only search/* has been ported. Fails some (11 of 172) score and RMI related tests that I can not explain. Could really need some help with that

Except for that this seems to work really great now. I've been running this in a live environment for a few hours (some hundred thousand user queries) and it is really fast.

Output from failing tests:

junit.framework.AssertionFailedError: expected:<3> but was:<0> at org.apache.lucene.search.TestPhraseQuery.testSlopScoring(TestPhraseQuery.java:298) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

junit.framework.AssertionFailedError: Using 10 documents per index: at org.apache.lucene.search.TestMultiSearcher.testNormalization(TestMultiSearcher.java:247) at org.apache.lucene.search.TestMultiSearcher.testNormalization10(TestMultiSearcher.java:220) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

------- testSimpleEqualScores1 -------

0: 1.000000000 - d3

1: 1.000000000 - d4

2: 0.500000000 - d1

3: 0.500000000 - d2

junit.framework.AssertionFailedError: score #2 is not the same expected:<1.0> but was:<0.5> at org.apache.lucene.search.TestDisjunctionMaxQuery.testSimpleEqualScores1(TestDisjunctionMaxQuery.java:142) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

------- testSimpleEqualScores2 -------

0: 1.000000000 - d2

1: 0.500000000 - d1

2: 0.500000000 - d4

junit.framework.AssertionFailedError: score #1 is not the same expected:<1.0> but was:<0.5> at org.apache.lucene.search.TestDisjunctionMaxQuery.testSimpleEqualScores2(TestDisjunctionMaxQuery.java:166) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

------- testSimpleEqualScores3 -------

0: 1.000000000 - d2

1: 1.000000000 - d3

2: 1.000000000 - d4

3: 0.500000000 - d1

junit.framework.AssertionFailedError: score #3 is not the same expected:<1.0> but was:<0.5> at org.apache.lucene.search.TestDisjunctionMaxQuery.testSimpleEqualScores3(TestDisjunctionMaxQuery.java:191) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

junit.framework.AssertionFailedError: A,B,D, only B in range expected:<1> but was:<2> at org.apache.lucene.search.TestRangeQuery.testExclusive(TestRangeQuery.java:39) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

junit.framework.AssertionFailedError: A,B,D - A and B in range expected:<2> but was:<5> at org.apache.lucene.search.TestRangeQuery.testInclusive(TestRangeQuery.java:63) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

junit.framework.AssertionFailedError: Using 10 documents per index: at org.apache.lucene.search.TestMultiSearcher.testNormalization(TestMultiSearcher.java:247) at org.apache.lucene.search.TestMultiSearcher.testNormalization10(TestMultiSearcher.java:220) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

java.rmi.server.ExportException: internal error: ObjID already in use at sun.rmi.transport.ObjectTable.putTarget(ObjectTable.java:197) at sun.rmi.transport.Transport.exportObject(Transport.java:90) at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:231) at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:398) at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:131) at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:195) at sun.rmi.registry.RegistryImpl.setup(RegistryImpl.java:107) at sun.rmi.registry.RegistryImpl.<init>(RegistryImpl.java:93) at java.rmi.registry.LocateRegistry.createRegistry(LocateRegistry.java:198) at org.apache.lucene.search.TestSort.startServer(TestSort.java:704) at org.apache.lucene.search.TestSort.getRemote(TestSort.java:689) at org.apache.lucene.search.TestSort.testRemoteSort(TestSort.java:410) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

java.rmi.server.ExportException: internal error: ObjID already in use at sun.rmi.transport.ObjectTable.putTarget(ObjectTable.java:197) at sun.rmi.transport.Transport.exportObject(Transport.java:90) at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:231) at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:398) at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:131) at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:195) at sun.rmi.registry.RegistryImpl.setup(RegistryImpl.java:107) at sun.rmi.registry.RegistryImpl.<init>(RegistryImpl.java:93) at java.rmi.registry.LocateRegistry.createRegistry(LocateRegistry.java:198) at org.apache.lucene.search.TestSort.startServer(TestSort.java:704) at org.apache.lucene.search.TestSort.getRemote(TestSort.java:689) at org.apache.lucene.search.TestSort.testRemoteCustomSort(TestSort.java:417) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

java.rmi.server.ExportException: internal error: ObjID already in use at sun.rmi.transport.ObjectTable.putTarget(ObjectTable.java:197) at sun.rmi.transport.Transport.exportObject(Transport.java:90) at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:231) at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:398) at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:131) at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:195) at sun.rmi.registry.RegistryImpl.setup(RegistryImpl.java:107) at sun.rmi.registry.RegistryImpl.<init>(RegistryImpl.java:93) at java.rmi.registry.LocateRegistry.createRegistry(LocateRegistry.java:198) at org.apache.lucene.search.TestSort.startServer(TestSort.java:704) at org.apache.lucene.search.TestSort.getRemote(TestSort.java:689) at org.apache.lucene.search.TestSort.testNormalizedScores(TestSort.java:440) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.junit2.JUnitStarter.main(JUnitStarter.java:32) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:90)

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

Updated to match the current svn with Fieldable, et.c.

All changes to Lucene core are now gathered in a small patch (de-finalized Document and Term) and one new class (InterfaceIndexWriter implemented by IndexWriter in patch) instead of attaching the whole trunk.

Still fails a few score- and RMI-tests.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

Performance from live environemt:

150,000 documents, average size is 2K.
Consumes 2x the memory of a RAMDirectory.
Average user query match 90 documents.
RAMDirectory takes 60x more time to collect and instantiate the resulting documents.

I would very much apreciate if someone with knowledge of the scoring code could take a look at the seven final(tm) failing tests. Them failing is not a problem for me, but it would be nice if they passed.

asfimport commented 17 years ago

Dejan Nenov (migrated from JIRA)

Can we please get the class diagrams in PDF format - the PNGs are so tny - they are undreadable :(

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

> Can we please get the class diagrams in PDF format - > the PNGs are so tny - they are undreadable :(

Shamless promotion:

I'm actually in the progress of porting all my old diagrams to <http://www.appliedmodels.com/>, this fantastic MDA-tool a friend of mine just released to the public. So quite soon there will be new diagrams. Pehaps even PDF.

Until then you're stuck to zooming :)

asfimport commented 17 years ago

Dejan Nenov (migrated from JIRA)

And whil ewe wait - may we please have highres PNGs - so that the zoomed-in versions are a little more readable?

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

Here is what I just sent to Wolgang. I've adapted his bench test case to also work with InstantiatedIndex. It is worth noticing this is a test with one document only, and the speed is not linear according to my previous tests. InstantiatedIndex is much more than 3x faster than RAMDirectory in a larger index. So this is really only to compare MemoryIndex with InstantiatedIndex, and not as a bench against RAMDirectory.

RAMDirectory:

secs = 95.159 queries/sec= 315.26184 MB/sec = 9.900338 Done benchmarking (without checking correctness).

MemoryIndex:

secs = 26.692 queries/sec= 1123.9323 MB/sec = 35.295456 Done benchmarking (without checking correctness).

InstantiatedIndex:

secs = 27.44 queries/sec= 1093.2944 MB/sec = 34.333317 Done benchmarking (without checking correctness).

MemoryIndex is a bit faster than InstantiatedIndex. But I'm aware of a couple of small optimizations I can do.

asfimport commented 17 years ago

Wolfgang Hoschek (@whoschek) (migrated from JIRA)

What's the benchmark configuration? For example, is throughput bounded by indexing or querying? Measuring N queries against a single preindexed document vs. 1 precompiled query against N documents? See the line

boolean measureIndexing = false; // toggle this to measure query performance

in my driver. If measuring indexing, what kind of analyzer / token filter chain is used? If measuring queries, what kind of query types are in the mix, with which relative frequencies?

You may want to experiment with modifying/commenting/uncommenting various parts of the driver setup, for any given target scenario. Would it be possible to post the benchmark code, test data, queries for analysis?

asfimport commented 17 years ago

Wolfgang Hoschek (@whoschek) (migrated from JIRA)

Other question: when running the driver in test mode (checking for equality of query results against RAMDirectory) does InstantiatedIndex pass all tests? That would be great!

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

wolfgang hoschek [21/Nov/06 10:22 AM] > Other question: when running the driver in test mode (checking for equality of query > results against RAMDirectory) does InstantiatedIndex pass all tests? That would be great!

It sure does!

xfiles = [./CHANGES.txt, ./LICENSE.txt]

########### iteration=0

*********** FILE=./CHANGES.txt diff=-0.020341659, query=term, scoreII=0.020341659, scoreRAM=0.020341659 diff=-0.024093388, query=term*, scoreII=0.024093388, scoreRAM=0.024093388 diff=-0.025180675, query=term\~, scoreII=0.025180675, scoreRAM=0.025180675 diff=-0.018685007, query=Apache, scoreII=0.018685007, scoreRAM=0.018685007 diff=-0.014089426, query=Apach\~ AND Copy*, scoreII=0.014089426, scoreRAM=0.014089426

*********** FILE=./LICENSE.txt diff=0.0, query=term, scoreII=0.0, scoreRAM=0.0 diff=-0.027122213, query=term*, scoreII=0.027122213, scoreRAM=0.027122213 diff=-0.028767452, query=term\~, scoreII=0.028767452, scoreRAM=0.028767452 diff=-0.023488527, query=Apache, scoreII=0.023488527, scoreRAM=0.023488527 diff=-0.043373547, query=Apach\~ AND Copy*, scoreII=0.043373547, scoreRAM=0.043373547

secs = 3.766 queries/sec= 2.655337 MB/sec = 0.083386995 No bug found. done.

Process finished with exit code 0

asfimport commented 17 years ago

Wolfgang Hoschek (@whoschek) (migrated from JIRA)

Ok. That means a basic test passes. For some more exhaustive tests, run all the queries in

src/test/org/apache/lucene/index/memory/testqueries.txt

against matching files such as

String[] files = listFiles(new String[] {
  "**.txt", //"**.html", "**.xml", "xdocs/**.xml", 
  "src/java/test/org/apache/lucene/queryParser/\*.java",
  "src/java/org/apache/lucene/index/memory/\*.java",
});

See testMany() for details. Repeat for various analyzer, stopword toLowerCase settings, such as

boolean toLowerCase = true;

// boolean toLowerCase = false; // Set stopWords = null; Set stopWords = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

Analyzer[] analyzers = new Analyzer[] {

// new SimpleAnalyzer(), // new StopAnalyzer(), // new StandardAnalyzer(), PatternAnalyzer.DEFAULT_ANALYZER, // new WhitespaceAnalyzer(), // new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, false, null), // new PatternAnalyzer(PatternAnalyzer.NON_WORD_PATTERN, true, stopWords),
// new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS), };

asfimport commented 17 years ago

Wolfgang Hoschek (@whoschek) (migrated from JIRA)

> diff=-0.024093388, query=term*, scoreII=0.024093388, scoreRAM=0.024093388

Actually, diff != 0 means the test fails, unless the diff is very small due too rounding error, say 10E-9. The driver should report a IllegalStateException("BUG DETECTED:"

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

> > diff=-0.024093388, query=term*, scoreII=0.024093388, scoreRAM=0.024093388 > > Actually, diff != 0 means the test fails, unless the diff is very small due too rounding error, say 10E-9. > The driver should report a IllegalStateException("BUG DETECTED:"

Right, that was a bug in my code. The diff /output/ was calculated on scoreMEM - scoreRAM (were scoreMEM is 0) and not scoreII - scoreRAM ; )

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

wolfgang hoschek [21/Nov/06 12:50 PM] > Ok. That means a basic test passes. For some more exhaustive tests, run all the queries in

All Lucene unit tests have been adapted to work with my alternate index. Everything but proximity queries pass. Have not looked in to why as I don't use them (yet). And I have written an in depth index comparator to make sure that an InstantiatedIndex equals a Directory implementation. Hence I have already verified that the index works as expected.

Todays postings from me is more to show that InstantiatedIndex is /almost/ as fast as MemoryIndex and could thus be an interesting replacement, as as it handles more than one document it might even be preferable in some cases.

I will however run your suggested tests tomorrow and report back. And post the latest patches, including my adaptation of your unit test, in case you want to explore it by your self.

asfimport commented 17 years ago

Wolfgang Hoschek (@whoschek) (migrated from JIRA)

> All Lucene unit tests have been adapted to work with my alternate index. Everything but proximity queries pass.

Sounds like you're almost there :-)

Regarding indexing performance with MemoryIndex: Performance is more than good enough. I've observed and measured that often the bottleneck is not the MemoryIndex itself, but rather the Analyzer type (e.g. StandardAnalayzer) or the I/O for the input files or term lower casing (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265809) or something else entirely.

Regarding query performance with MemoryIndex: Some queries are more efficient than others. For example, fuzzy queries are much less efficient than wild card queries, which in turn are much less efficient than simple term queries. Such effects seem partly inherent due too the nature of the query type, partly a function of the chosen data structure (RAMDirectory, MemoryIndex, II, ...), and partly a consequence of the overall Lucene API design.

The query mix found in testqueries.txt is more intended for correctness testing than benchmarking. Therein, certain query types dominate over others, and thus, conclusions about the performance of individual aspects cannot easily be drawn.

Wolfgang.

asfimport commented 17 years ago

Wolfgang Hoschek (@whoschek) (migrated from JIRA)

I've now checked in a version of MemoryIndexTest into contrib/memory that more easily allows to switch between measuring indexing or querying. Example output for measuring query throughput on simple term queries: \~500000 queries/sec on a MacBook Pro, jdk 1.5.0_06, server VM. As always, your mileage may vary.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

This is the current version of my local Lucene branch, including InstantiatedIndex. As I have not merged with the trunk for a while, it also features my locally patched version. It really is just a few small changes. Some classes are no longer final, plus I have introduced InterfaceIndexWriter and InterfaceIndexModifier.

/lucene2karl/lucene2-apache-karl-patched /lucene2karl/lucene2-karl/test <--- all (search) test cases adapted to run with instantiated index /lucene2karl/lucene2-karl/index /lucene2karl/lucene2-karl/instantiated /lucene2karl/lucene2-karl/searchfork <--- non important stuff /lucene2karl/lucene2-karl/analysis <--- just some stuff /lucene2karl/lucene2-karl/core <-- patches for the lucene trunk /memoryindex <--- stuff for wolfgang

All tests pass, except remote, multi and parallell searchers.

Jira admins: you are more than welcome to remove all old attachments, except images.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

> Jira admins: you are more than welcome to remove all old attachments, except images.

oh, i had no clue my status was upgraded. cool. fixed it my self.

asfimport commented 17 years ago

Doug Cutting (@cutting) (migrated from JIRA)

I don't see a patch file here. Your proposal would be easier to evaluate as a patch file. Also, a contribution like this will be easier to accept if your new classes are in the contrib tree. Then, if they prove popular, they can move into the core. Or perhaps folks will find them so obviously useful they'll want them in the core from the start, but contrib would require less convincing.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

Doug Cutting [12/Jan/07 10:16 AM] > I don't see a patch file here. Your proposal would be easier to evaluate as a patch file.

Attached!

> easier to accept if your new classes are in the contrib tree.

There are a couple of chages in the core, the rest has been moved to contrib/indexfacade and contrib/instantiated. There is some clean up to do: a couple of static tests in instantiated. And perhaps some common logging artifacts left from debugging.

I'm quite certain that both contrib/packages depends on java<1.5>. At least concurrency in instantiated.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

New patch has all assimilated test cases moved to a new non conflicting package.

Also contains contrib/cache that depends on everything else.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I've been trying to follow the work you've been doing Karl, but i must admit a lot of it is over my head – but since i've got a long weekend and your patch now makes so few changes to the core i could acctually make sense of that part, so here are some comments on those changes...

1) some of these changes seem to be duplicated in #1849 and #1850 ... just pointing that out for other people who might get confused.

2) since the new ScoreDoc.docComparator and ScoreDoc.scoreComparator are public, they should have some javadocs clarifing what they are for.

3) i don't think the Hits.setSearcher method you added is safe ... i believe that at a minimum hitDocs, first, last, and weight all need to be reset – weight's a tricky one since the instance doesn't currently hang on to the orriginal query.

4) I would personally prefer IndexWriterInterface and IndexModifierInterface over InterfaceIndexWriter and InterfaceIndexModifier – if for no other reason then so they sort together .. but that's a minor nit.

I've only briefly looked at the new stuff in contrib, because I got lost ... there isn't any package or class level javadocs or a build.xml in either contrib. A big thing i did notice is that the code in indexfacade puts things in the o.a.l.search and o.a.l.index packages, which is being discouraged for contribs (among other reasons it makes it confusing to understand where a class is coming form) ideally those classes should live under o.a.l.indexfacade.index and o.a.l.indexfacade.index (or maybe just o.a.l.facade - but you get the idea)

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I just realized that all of the tests in contrib/instantiated/src/test/java/org/apache/lucene/instantiated/assimilated/ are duplicates of tests from the core with a few line changes so they use an InstantiatedIndex to get a reader/writer/seracher etc.

I think it would be much better if we changed the orriginal versions of these tests to include an accessors for constructing/fetching those objects which could be subclassed by tests in your contrib – that way any bugs found/fixed in those test classes and any additional test methods added to those classes would automatically be inherited by your versions (instead of winding up with duplicate cut/paste test code)

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Karl: the trunk.diff i just attached fixes a small autoboxing dependency your patch introduced into the core (preventing compilation on java 1.4). I also added build.xml files to the new contrib dirs, rearanged the directory of the contribs so they match the default for contribs and the the build.xml files could be simple. Once i did this i discovered some unneccessary dependencies on commons-logging that i removed. Then i ran the tests, and got some errors – which are included in test-reports.zip so you can check them out.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

Thanks alot Hoss, for taking the time. I sure do appreciate it.

I'll get back on your comments.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

New sunday, new code.

Hoss Man [15/Jan/07 12:16 AM] > I've only briefly looked at the new stuff in contrib, because I got lost ... there isn't > any package or class level javadocs or a build.xml in either contrib.

Tried to do something about the java docs. Also made a new fresh class diagram with some comments in it. I can make it PDF or XUL if prefered.

That boxing error you fixed might be back. Where was it? Could not find it in the patch (all adding and no -+ fix) and it was too late to apply your patch on my local version..

> Hoss Man [15/Jan/07 12:16 AM] > > 1) some of these changes seem to be duplicated in #1849 and #1850 > ... just pointing that out for other people who might get confused.

Is it considered better practise to keep all my changes in this one huge issue? I thought it could be nice to pop in minor patches such as them.

> 4) I would personally prefer.. > but that's a minor nit.

There has been a lot of refactoring of packages and class names as suggested. (I'm still not happy with the notification listener classes.)

A few new changes to the core:

Lazy initialization of the fields collection in Document .

Some definalization to allow decoration of IndexReader. http://www.nabble.com/IndexReader-can-not-be-decorated-tf3041647.html#a8461125

> Hoss Man [15/Jan/07 12:16 AM] > > 3) i don't think the Hits.setSearcher method you added is safe

It smeared out on java-dev: http://www.nabble.com/Decorative-cache-%28and-Hits.setSearcher%29-tf3009848.html#a8428139

I did not investigate this any further with test code, but I have identitfied lazy fields as a problem. Instead I'm considering a supplementary decorated document cache on the IndexReader, and implementing a replacement for Hits.

Hoss Man [15/Jan/07 12:39 AM] > I just realized that all of the tests in contrib/instantiated/src/test/java/org/apache/lucene/ > instantiated/assimilated/ are duplicates of tests from the core with a few line changes > so they use an InstantiatedIndex to get a reader/writer/seracher etc.

This is not a bad idea at all, but I will not have time to do it right anytime soon. It would be a simpler task if the facade was a part of the core, as this is just the thing it was built for – unison index handling. ;-)

Hoss Man [15/Jan/07 01:35 AM] > Then i ran the tests, and got some errors – which are included in test-reports.zip so you can check them out.

What tool do you recommend to inspect these reports?

I know for a fact that remote searchable will fail. I hope for someone to show up, need it and fix it.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

Patch of the week.

Changes:

CachedSearcher – soft referenced hit collection-, TopDocs- and TopFieldDocs cache. Backed by NotifiableIndex.

Removed Hits cache due to uncertainty but introduced:

CachedIndexReader – soft referenced documents cache. Backed by NotifiableIndex.

TopDocs/TopFieldDocs- and IndexReader cache combined almost replace a fully cached Hits.

The number of unit tests and detail of them is increasing.

The plan is now to have the cached reader pre-loading documents to memory from an own thread when server load allows it.

Also added some abstractation levers used by above:

AutofreshedIndexReader – always up to date with the index.
ReadOnlyIndexReader – makes sure the user don't delete stuff with the decorated reader.

Had some problems with decorating the IndexModifierInterface against Directory in NotifiableIndex, so removed the Index.indexModifierFactory() and introduced a index facade backed version:

org.apache.lucene.index.facade.IndexModifier(myIndex, analyzer, create)

where all reader/writer creation is myIndex.indexReaderFactory() and indexWriterFactory();

Makes the Notifiable code a bit simpler.

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

new diagram with lots of notes (this is also available in the patch as an uxf-file for umlet)

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

Refactored the Term->Document relationships a bit for speed optimizations. It also resulted with getting all term frequency vector information except for offsets free of charge. More information on that in the class diagram.

Removed a whole bunch of todo:s in the writer and reader.

The current lock implementen is worthless. I need to read up on RentrentLock. Or should I perhaps use the lock Directory:s use?

(And that class diagram is of course granted for ASF, my misstake.)

asfimport commented 17 years ago

Karl Wettin (migrated from JIRA)

Added support for contrib/memory MemoryIndex, so now it works with readers and writers as if it was any other index.

Added a consumer level index implementation that handles cache, notifications, and all the stuff this issue is about:

// This is the instace one is supposed to use for all access against the index in this JVM. IndexFacade index = new IndexFacade(new RAMDirectoryIndex());

// Accessors IndexWriterInterface writer = index.indexWriterFactory(anayzler, true); Document doc = new Document(); doc.add(... writer.add(doc); writer.close(); IndexReader deleter = index.indexReaderFactory(); index.getSearcher().search(... index.getReader().doc(0) deleter.close(); assertEquals(0, index.getReader().numDocs());

public class IndexFacade {

/** wrapps any storage, optional cache settings */ public IndexFacade(I index, CachedSearcher.HitCollectionCacheState hitCollectionCache, boolean topDocsCache, boolean topFieldsCache, boolean documentsCache) throws IOException { public CachedSearcher getSearcher() throws IOException {

/** The general consumer searcher to be used when querying this index. Always fresh. */ public Searcher getSearcher() throws IOException {

/** The general consumer read only index reader to be used when inspecting this index. Always fresh. */ public IndexReader getReader() throws IOException {

Next