apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

Optimized iteration of finite strings [LUCENE-6365] #7425

Closed asfimport closed 9 years ago

asfimport commented 9 years ago

Replaced Operations.getFiniteStrings() by an optimized FiniteStringIterator.

Benefits: Avoid huge hash set of finite strings. Avoid massive object/array creation during processing.

"Downside": Iteration order changed, so when iterating with a limit, the result may differ slightly. Old: emit current node, if accept / recurse. New: recurse / emit current node, if accept.

The old method Operations.getFiniteStrings() still exists, because it eases the tests. It is now implemented by use of the new FiniteStringIterator.


Migrated from LUCENE-6365 by Markus Heiden, 1 vote, resolved Jul 06 2015 Attachments: FiniteStrings_noreuse.patch, FiniteStrings_reuse.patch, LUCENE-6365.patch

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

Patch containing the proposed changes.

asfimport commented 9 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I looked at the patch (briefly, without delving into all the details – I'm on vacation :).

I like the idea overall. The only thing that bugs me is the explicitness of FiniteStringsIterator. I would rather have something like:

Iterator<IntsRef> Operations.iterateFiniteStrings(...)

so that we can leave FiniteStringsIterator as a package-private implementation and not proliferate it around. If Iterator interface gives you a headache (which it can) then I'd leave FiniteStringsIterator as an explicit return type but leave the factory method inside Operations.

The assignments inside the conditional part of for loops are very likely to drive somebody crazy sooner or later.

Again, I only looked at the patch, I don't even have a proper environment with me to check on all the details.

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

I prefer a "standard" Iterator too. But by implementing it, my implementation loses some of its benefits, because an iterator needs a look-ahead for hasNext().

I chose the current implementation because: 1) I can reuse the iterator for different iterations (similar to Transition): See AnalyzingSuggester. So e.g. the internal stack does not need to be reallocated. This avoids not just the initial allocation, but the resizes during the iterations too. BTW: May we choose a bigger initial size for the stack? 2) By avoiding a look-ahead, my implementation can return the internal IntRefs (from the used IntsRefBuilder) in next(), without the need for a deep copy.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

BTW: May we choose a bigger initial size for the stack?

No, this is too much as a library, it impacts too many uses and makes the library too difficult to use (special jvm flags must be supplied).

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

I talked about the stack like structure inside the iterator not the jvm stack size. This stack holds the PathNodes of the current finite string. It has an initial size of just 4, so it has to be dynamicly resized in almost all use cases.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I like this change overall ... it means you can iterate over many more finite strings if you want. I think a dedicated iterator is OK; just mark it @lucene.experimental so we are free to improve it later.

Iteration order changed, so when iterating with a limit, the result may differ slightly.

I think that's fine and it's an impl detail (which strings you'll get when you hit the limit) ... can you update the javadocs to say so?

asfimport commented 9 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Oops, apologies Markus, this slipped my agenda somehow. As for implementing Iterator – ok, you don't have to stick to Java's one, I understand the lookahead is hairy. But I'd still change those for loops with compound assignment/ comparison to something that is less cryptic (like a simple while loop with the variable thrown out). The JIT doesn't care about it anyway I bet (variables undergo liveness analysis).

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

Are you talking about this?

for (IntsRef finiteString; (finiteString = iterator.next()) != null;)

For me it is the standard iteration pattern for non-lookahead iterations, like e.g. iterating over an input stream (see e.g. FileCopyUtils of Spring framework).

Does this one look better for you?

for (IntsRef finiteString = iterator.next(); finiteString != null; finiteString = iterator.next())

I like my version better, because it is shorter and the iterator.next() is not doubled, but I will you use it, if you like it better.

A simple while loop looks even more bloated to me. It unnecessarily widens the scope of finiteString and splits things which belong together, which both is error prone for coding:

IntsRef finiteString = iterator.next();
while (finiteString != null) {
   // do something

   finiteString = iterator.next();
}

Something different: I marked Operations.getFiniteStrings() as deprecated in my patch, because it should be replaced by the new iterator. But I consider to remove the deprecated, because this method is easier to use for single iterations of small finite strings sets and makes some tests cases simpler. What do you think?

Again something different: What about the initial stack size in the new iterator (which needs to be at least as big as the max. length of the iterated finite strings)? May I raise it from 4 to e.g. 16? In my opinion this would be needed for roughly 90% of all cases.

asfimport commented 9 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

for (IntsRef finiteString; (finiteString = iterator.next()) != null;)

Many people have strong feelings about assignments in conditional expressions. In fact I just recently stumbled upon an Eclipse JDT refactoring bug that was evaluating these (and refactoring these) incorrectly. My comment was actually meant to go together with the "why don't we make it an iterator" one... If you did that then the problem of what kind of loop it is pretty much disappears.

Anyway, this is really minor and a matter of style rather than correctness. It can go in as-is.

I marked Operations.getFiniteStrings() as deprecated in my patch [...] What do you think?

No strong opinion. If it's only used from tests then you can mark it as deprecated I guess; no need to support redundant code.

May I raise it from 4 to e.g. 16?

It very likely won't matter in practice at all. I think increasing it to 16 won't do anybody any harm (you could try to squeeze it into a single line cache, but I think it's an overkill and premature optimization; it'll vanish in other processing noise).

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Can you add in the javadocs of the new iterator that the order that accepted strings are returned is implementation dependent and free to change across releases?

I marked Operations.getFiniteStrings() as deprecated

Can we simply remove it? The Operations methods are experimental and dangerous ... it need not live on?

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

Sorry for the delay.

I moved Operations.getFiniteStrings() to TestOperations.getFiniteStrings(), because using the iterator for assertions is a pain. Production code using this method has been replaced by direct usage of the new iterator.

I got one problem with that: I am not sure if the implementation change of CompletionTokenStream is OK, because I set the position attribute at the end of the iteration instead of at the start of the iteration. The tests run fine, but someone should review that.

I marked the new iterator as @lucene.experimental and added a comment, that the iteration order may change.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Markus Heiden, new patch looks great.

Can we remove the limit to FiniteStringsIterator.init? Seems like this ("abort iteration after N items") should be the caller's job?

Can we just pass the automaton to FSI's ctor? I don't think we need a reuse API here...

I am not sure if the implementation change of CompletionTokenStream is OK, because I set the position attribute at the end of the iteration instead of at the start of the iteration. The tests run fine, but someone should review that.

It is weird that CompletionTokenStream hijacks PositionIncrementAttribute like that, and I can't see anywhere that reads from that (and indeed tests pass if I comment it out). Maybe @areek knows? I think we should just remove it?

asfimport commented 9 years ago

Areek Zillur (@areek) (migrated from JIRA)

+1 to removing PositionIncrementAttr from CompletionTokenStream.

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

Removed PositionIncrementAttribute from CompletionTokenStream.

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

Removing init(): One my design goals was to be able to reuse this iterator, mainly to optimize the performance of the AnalyzingSuggester. This mainly comes from avoiding the step by step allocation of the stack inside the iterator over and over again. So maybe I can provide an additional constructor by combining the default constructor with the init() method?

Removing limit support: I like the idea, but as I tried to implement the change, it seems that there are some downsides: At least the AnalyzingSuggester and the CompletionTokenStream are using this feature. Especially annoying is to change the implementation of CompletionTokenStream, because it has no for loop. Maybe I can add a second init() method without limit, to not need to pass -1 each time the limitation is not needed?

asfimport commented 9 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

> mainly to optimize the performance of the AnalyzingSuggester. This mainly comes from avoiding the step by step allocation of the stack inside the iterator over and over again.

I wouldn't worry about it, Markus. With a TLAB these allocations are fairly cheap. You could just add a reasonably large initial stack and that's it. If it needs to reallocate, so be it. Unless it becomes a real bottleneck in real life (which it won't, I assure you), there is no need for premature optimizations at the cost of making the code more complex.

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

For the normal use case I agree. But I had problems with long build times for lookups for big dictionaries (using the AnalyzingSuggester). I profiled the creation of the lookups and one hotspot was the allocation of the internal stack. One problem is, that the initial size of the internal stack is too small (4 entries), so the internal stack gets resized over and over again. I will increase its initial size to 16.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I would really prefer not make API compromises (reuse, init method) for such optimizations, nor for the "limit" case (this is really the caller's responsibility...).

Especially annoying is to change the implementation of CompletionTokenStream, because it has no for loop.

It's fine to just add a member to the class tracking how many strings it has pulled so far from the iterator...

Also, you can remove that //TODO: make this return a Iterator<IntsRef> instead? since you are doing exactly that, here...

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

The Iterator interface is not possible with the current implementation, because the lookahead of hasNext() would destroy the current value provided by the previous next(). Anyway, I removed that comment.

I provided now both interfaces (single use and multiple use) with the newest patch, so for the default case you can use the more intuitive one (single use). I hope you like it.

I kept the limit inside the iterator, but provided the new class LimitedFiniteStringsIterator for it. Because it was too complicated to transfer the limit into the yet complex iteration in AnalyzingSuggester etc.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The Iterator interface is not possible with the current implementation, because the lookahead of hasNext() would destroy the current value provided by the previous next(). Anyway, I removed that comment.

OK it's fine if we don't implement Java's Iterator; the spirit of that TODO can still be removed :)

I provided now both interfaces (single use and multiple use) with the newest patch, so for the default case you can use the more intuitive one (single use). I hope you like it.

Sorry, I really don't want to pollute such a low level API with reuse ... can you remove the init reuse method?

I kept the limit inside the iterator, but provided the new class LimitedFiniteStringsIterator for it. Because it was too complicated to transfer the limit into the yet complex iteration in AnalyzingSuggester etc.

OK that seems like a good compromise...

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

I adapted my patch to the latest changes in trunk.

I think the reuse of the iterator is one core part of this whole patch. I tried to rework the api of the iterator so that the reuse case and the no-reuse case are handled in a similar way. I hope you like it now (at least a bit). Lucene does this kind of reuse already, e.g. see Transition.

FuzzyCompletionQuery has been added lately and relies on the old big set of finite strings. I am not sure how to rework it. Currently it still uses the set, maybe it is better to use the iterator inside of FuzzyCompletionWeight, but this means recomputing the finite strings over and over again. What do you think?

BTW topoSortStates() is implemented by AnalyzingSuggester and CompletionTokenStream identically. Maybe it should be moved to one place, maybe to Operations?

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I adapted my patch to the latest changes in trunk.

Thanks.

I think the reuse of the iterator is one core part of this whole patch. I tried to rework the api of the iterator so that the reuse case and the no-reuse case are handled in a similar way. I hope you like it now (at least a bit).

Alas I still don't think this is an appropriate place for object reuse.

Lucene does this kind of reuse already, e.g. see Transition.

That's true: Lucene does reuse objects in many low-level places, but this is ugly and cancerous and dangerous (can easily cause bugs, e.g. accidentally reusing one iterator across threads) and anti-Java, etc., and it really should be used only sparingly, and should be the exception not the rule. I don't think this API qualifies, i.e. it's a bad tradeoff to pollute the API to eeek out a bit of GC perf gain that in real usage would be negligible because the cost of building an automaton and the cost of consuming each string that's iterated would normally dwarf the small GC cost of creating a new iterator per automaton. APIs are hard enough to "get right" as it is ...

FuzzyCompletionQuery has been added lately and relies on the old big set of finite strings. I am not sure how to rework it. Currently it still uses the set, maybe it is better to use the iterator inside of FuzzyCompletionWeight, but this means recomputing the finite strings over and over again. What do you think?

It's fine to leave this as the full Set<String> for now. It's no worse :) Progress not perfection...

BTW topoSortStates() is implemented by AnalyzingSuggester and CompletionTokenStream identically. Maybe it should be moved to one place, maybe to Operations?

Woops, I'll go move that to Operations now, good idea, thank you!

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1689046 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1689046

LUCENE-6365: add Operations.topoSort

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1689047 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1689047

LUCENE-6365: add Operations.topoSort

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

I see, I am not able to convince you ;-) So I attached a version of the patch with eliminated reuse api.

I agree that reuse is no good design, but the profiler pointed me to that spot. I already did a patch for Automaton (#7021) for the same reasons.

It would be nice, if Automaton knows the size of the longest word it produces. That would eliminated the resizing of the internal stack array inside the iterator and could convince me that reuse is not needed for the iterator.

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1689079 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1689079

LUCENE-6365: fix buggy Operations.topoSort; add test

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1689081 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1689081

LUCENE-6365: fix buggy Operations.topoSort; add test

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Markus Heiden, the patch looks good.

I do still think this is a very nice change even without the reuse!

Was it accidental to remove `@lucene.experimental` from CompletionTokenStream's javadocs? I put that back.

I also made a couple additional members of FiniteStringsIterator final, and fixed up the javadocs a bit ... new patch attached.

I'll commit soon, thank you!

asfimport commented 9 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I agree that reuse is no good design, but the profiler pointed me to that spot. I already did a patch for Automaton (#7021) for the same reasons.

The problem with profilers is: They don't run the code with all hotspot optimizations turned on (because they add extra commands into the loop) to take measurement points. In general Object creation is in most cases not bad, except when they tend to live long time (e.g., across several method). Short living objects never affect GC, because they are created on stack automatically (escape analysis) - and those are in most cases never shown correctly by profiler, because escape analysis is mostly broken with profilers.

Reuse is only suitable for cases where the object goes through many stages of a query and needs to be cached between method calls and involves disk I/O,... This is not the case here.

Patch without reuse looks fine, although I dont fully understand it (not my area of work).

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

@Michael: The removal of @lucene.experimental was a mistake of mine during merging.Thanks for your rework and your patience.

@Uwe: I measured the cpu runtime in sampling mode, so (almost) no additional overhead should occur. I did the reuse because there is not just one allocation of the array, but many. During runtime the array will be resized over and over again, because the initial size was rather small (4 entries). I changed that to 16 so the resizing occurs less frequent. My test case was the build of dictionary of 100000s of words, so even small things accumulate.

A better solution to that problem would be, if automatons know the length of their longest word. In that case that above mentioned array could initially be sized right. But I don't know, if that length is always known during construction of automatons.

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1689192 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1689192

LUCENE-6365: switch to iterator API to get all finite strings from an Automaton

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1689193 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1689193

LUCENE-6365: switch to iterator API to get all finite strings from an Automaton

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Markus Heiden maybe try the honest profiler: https://github.com/RichardWarburton/honest-profiler/wiki ?

I haven't tried it yet, but it apparently avoids the safepoint sampling bias ...

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm this failure just appeared on Jenkins:

   [junit4] Started J0 PID(3197@localhost).
   [junit4] Suite: org.apache.lucene.util.automaton.FiniteStringsIteratorTest
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=FiniteStringsIteratorTest -Dtests.method=testRandomFiniteStrings1 -Dtests.seed=4A938C5F6E728DCC -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es_CU -Dtests.timezone=America/Porto_Velho -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.26s | FiniteStringsIteratorTest.testRandomFiniteStrings1 <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: expected:<445> but was:<446>
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([4A938C5F6E728DCC:17BF51EE97CD2662]:0)
   [junit4]    >    at org.apache.lucene.util.automaton.FiniteStringsIteratorTest.assertFiniteStringsRecursive(FiniteStringsIteratorTest.java:204)
   [junit4]    >    at org.apache.lucene.util.automaton.FiniteStringsIteratorTest.testRandomFiniteStrings1(FiniteStringsIteratorTest.java:82)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene53): {}, docValues:{}, sim=RandomSimilarityProvider(queryNorm=true,coord=no): {}, locale=es_CU, timezone=America/Porto_Velho
   [junit4]   2> NOTE: Linux 3.13.0-46-generic amd64/Oracle Corporation 1.8.0_40 (64-bit)/cpus=8,threads=1,free=483134008,total=504889344
   [junit4]   2> NOTE: All tests run in this JVM: [FiniteStringsIteratorTest]
   [junit4] Completed [1/1] in 0.41s, 1 test, 1 failure <<< FAILURES!

It reproduces ... I haven't dug into it yet.

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

I have executed the tests in lucene/core again and they do not fail on the state of the trunk when I created the patch. There have to be changes elsewhere which lead to this failure.

asfimport commented 9 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I don't think so. The test in question (and the seed in question) fails because FiniteStringsIterator returns the same sequence twice ("h"), I just compared the outputs of the recursive and non-recursive outputs.

Looks like the seed in question adds the same string to the input automaton twice ("h") and then doesn't minimize/ determinize. getFiniteStringsRecursive makes sure no string is emitted twice, FiniteStringsIterator doesn't have this check.

Perhaps we should require that the input automaton doesn't have duplicates (Mike?). Alternatively, we could add compare-to-previous to the iterator and skip duplicates.

asfimport commented 9 years ago

Markus Heiden (migrated from JIRA)

Sorry, I missed to use the seed.

The de-duplication has prior been implemented by the big set containing the result, which has been removed. So no de-duplication takes places. Should this be the responsibility of the iterator at all?

asfimport commented 9 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I think it even makes more sense to leave the code as is (without removing duplicates). It should be an assertion that no duplicates occurred in the automaton, but let @mikemccand confirm this.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ahh thanks for digging @dweiss and Markus Heiden.

Markus Heiden usually all that's necessary to reproduce a test is to copy/past the exact text after "Reproduce with: ...", in this case:

ant test  -Dtestcase=FiniteStringsIteratorTest -Dtests.method=testRandomFiniteStrings1 -Dtests.seed=4A938C5F6E728DCC -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es_CU -Dtests.timezone=America/Porto_Velho -Dtests.asserts=true -Dtests.file.encoding=UTF-8

That usually reproduces the failure, though sometimes you'll need the exact JVM version, added JVM flags, etc.

I agree it should not be this iterator's job to deal with duplicates: I think if you pass a non-minimal automaton to it, it's fair game for it to return dups ... so this is a test bug.

asfimport commented 9 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'll fix the test to de-dup itself, and add a comment in FiniteStringsIterator's javadocs about this ...

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1689404 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1689404

LUCENE-6365: fix test to not add duplicate strings

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1689405 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1689405

LUCENE-6365: fix test to not add duplicate strings

asfimport commented 9 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Thanks Mike, thanks Markus.

asfimport commented 9 years ago

Shalin Shekhar Mangar (@shalinmangar) (migrated from JIRA)

Bulk close for 5.3.0 release