Closed asfimport closed 9 years ago
Markus Heiden (migrated from JIRA)
Patch containing the proposed changes.
Dawid Weiss (@dweiss) (migrated from JIRA)
I looked at the patch (briefly, without delving into all the details – I'm on vacation :).
I like the idea overall. The only thing that bugs me is the explicitness of FiniteStringsIterator. I would rather have something like:
Iterator<IntsRef> Operations.iterateFiniteStrings(...)
so that we can leave FiniteStringsIterator as a package-private implementation and not proliferate it around. If the Iterator interface gives you a headache (which it can) then I'd leave FiniteStringsIterator as an explicit return type but leave the factory method inside Operations.
The assignments inside the conditional part of for loops are very likely to drive somebody crazy sooner or later.
Again, I only looked at the patch, I don't even have a proper environment with me to check on all the details.
Markus Heiden (migrated from JIRA)
I would prefer a "standard" Iterator too, but implementing it would cost my implementation some of its benefits, because a Java Iterator needs a look-ahead to answer hasNext().
I chose the current implementation because: 1) I can reuse the iterator for different iterations (similar to Transition): see AnalyzingSuggester. So e.g. the internal stack does not need to be reallocated; this avoids not just the initial allocation, but the resizes during iteration too. BTW: may we choose a bigger initial size for the stack? 2) By avoiding a look-ahead, my implementation can return the internal IntsRefs (from the IntsRefBuilder in use) from next(), without needing a deep copy.
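The look-ahead-free pattern Markus describes can be sketched as follows (a minimal illustration only; ChunkIterator and its names are hypothetical stand-ins, not the actual Lucene API): next() returns null at the end, so no hasNext() look-ahead is needed, and an internal buffer can be handed out directly without a deep copy.

```java
// Sketch of the null-signaling, buffer-reusing iterator pattern discussed
// above. All names are illustrative, not Lucene's API.
final class ChunkIterator {
  private final String[] chunks;
  private int index = 0;
  // Internal buffer reused across next() calls, analogous to the
  // IntsRefBuilder-backed IntsRef in the patch.
  private final StringBuilder buffer = new StringBuilder();

  ChunkIterator(String[] chunks) {
    this.chunks = chunks;
  }

  /**
   * Returns the next value, or null when the iteration is exhausted.
   * The returned buffer is only valid until the next call to next().
   */
  StringBuilder next() {
    if (index >= chunks.length) {
      return null; // end marker instead of a hasNext() look-ahead
    }
    buffer.setLength(0);
    buffer.append(chunks[index++]);
    return buffer;
  }
}
```

Callers consume it with exactly the loop under discussion: `for (StringBuilder s; (s = it.next()) != null;) { ... }`.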
Robert Muir (@rmuir) (migrated from JIRA)
BTW: May we choose a bigger initial size for the stack?
No, for a library this is too much: it impacts too many uses and makes the library too difficult to use (special JVM flags must be supplied).
Markus Heiden (migrated from JIRA)
I was talking about the stack-like structure inside the iterator, not the JVM stack size. This stack holds the PathNodes of the current finite string. It has an initial size of just 4, so it has to be dynamically resized in almost all use cases.
Michael McCandless (@mikemccand) (migrated from JIRA)
I like this change overall ... it means you can iterate over many more finite strings if you want. I think a dedicated iterator is OK; just mark it @lucene.experimental so we are free to improve it later.
Iteration order changed, so when iterating with a limit, the result may differ slightly.
I think that's fine and it's an impl detail (which strings you'll get when you hit the limit) ... can you update the javadocs to say so?
Dawid Weiss (@dweiss) (migrated from JIRA)
Oops, apologies Markus, this somehow slipped off my agenda. As for implementing Iterator – OK, you don't have to stick to Java's one; I understand the look-ahead is hairy. But I'd still change those for loops with compound assignment/comparison to something less cryptic (like a simple while loop with the variable pulled out). I bet the JIT doesn't care either way (variables undergo liveness analysis).
Markus Heiden (migrated from JIRA)
Are you talking about this?
for (IntsRef finiteString; (finiteString = iterator.next()) != null;)
For me it is the standard iteration pattern for look-ahead-free iteration, like e.g. reading from an input stream (see e.g. FileCopyUtils in the Spring framework).
Does this one look better to you?
for (IntsRef finiteString = iterator.next(); finiteString != null; finiteString = iterator.next())
I like my version better, because it is shorter and iterator.next() is not duplicated, but I will use yours if you prefer it.
A simple while loop looks even more bloated to me. It unnecessarily widens the scope of finiteString and splits things that belong together; both are error prone when coding:
IntsRef finiteString = iterator.next();
while (finiteString != null) {
// do something
finiteString = iterator.next();
}
Something different: I marked Operations.getFiniteStrings() as deprecated in my patch, because it should be replaced by the new iterator. But I am considering dropping the deprecation, because this method is easier to use for single iterations over small finite string sets and makes some test cases simpler. What do you think?
Again something different: what about the initial stack size in the new iterator (which needs to be at least as big as the maximum length of the iterated finite strings)? May I raise it from 4 to e.g. 16? In my opinion that would cover roughly 90% of all cases.
Dawid Weiss (@dweiss) (migrated from JIRA)
for (IntsRef finiteString; (finiteString = iterator.next()) != null;)
Many people have strong feelings about assignments in conditional expressions. In fact I just recently stumbled upon an Eclipse JDT refactoring bug that evaluated (and refactored) these incorrectly. My comment was actually meant to go together with the "why don't we make it an iterator" one... if you did that, then the question of which kind of loop to use pretty much disappears.
Anyway, this is really minor and a matter of style rather than correctness. It can go in as-is.
I marked Operations.getFiniteStrings() as deprecated in my patch [...] What do you think?
No strong opinion. If it's only used from tests then you can mark it as deprecated I guess; no need to support redundant code.
May I raise it from 4 to e.g. 16?
It very likely won't matter in practice at all. I think increasing it to 16 won't do anybody any harm (you could try to squeeze it into a single cache line, but I think that's overkill and premature optimization; it'll vanish in other processing noise).
Michael McCandless (@mikemccand) (migrated from JIRA)
Can you add in the javadocs of the new iterator that the order that accepted strings are returned is implementation dependent and free to change across releases?
I marked Operations.getFiniteStrings() as deprecated
Can we simply remove it? The Operations methods are experimental and dangerous ... it need not live on?
Markus Heiden (migrated from JIRA)
Sorry for the delay.
I moved Operations.getFiniteStrings() to TestOperations.getFiniteStrings(), because using the iterator for assertions is a pain. Production code that used this method now uses the new iterator directly.
I have one problem with that: I am not sure whether the implementation change in CompletionTokenStream is OK, because I now set the position attribute at the end of the iteration instead of at the start. The tests run fine, but someone should review that.
I marked the new iterator as @lucene.experimental and added a comment that the iteration order may change.
Michael McCandless (@mikemccand) (migrated from JIRA)
Thanks Markus Heiden, new patch looks great.
Can we remove the limit from FiniteStringsIterator.init? Seems like this ("abort iteration after N items") should be the caller's job?
Can we just pass the automaton to FSI's ctor? I don't think we need a reuse API here...
I am not sure if the implementation change of CompletionTokenStream is OK, because I set the position attribute at the end of the iteration instead of at the start of the iteration. The tests run fine, but someone should review that.
It is weird that CompletionTokenStream hijacks PositionIncrementAttribute like that, and I can't see anywhere that reads from that (and indeed tests pass if I comment it out). Maybe @areek knows? I think we should just remove it?
Areek Zillur (@areek) (migrated from JIRA)
+1 to removing PositionIncrementAttr from CompletionTokenStream.
Markus Heiden (migrated from JIRA)
Removed PositionIncrementAttribute from CompletionTokenStream.
Markus Heiden (migrated from JIRA)
Removing init(): one of my design goals was to be able to reuse this iterator, mainly to optimize the performance of the AnalyzingSuggester. The gain mainly comes from not reallocating the stack inside the iterator, step by step, over and over again. So maybe I can provide an additional constructor that combines the default constructor with the init() method?
Removing limit support: I like the idea, but as I tried to implement the change, some downsides appeared: at least the AnalyzingSuggester and the CompletionTokenStream use this feature. Changing the implementation of CompletionTokenStream is especially annoying, because it has no for loop. Maybe I can add a second init() method without a limit, so callers don't need to pass -1 whenever no limit is needed?
Dawid Weiss (@dweiss) (migrated from JIRA)
> mainly to optimize the performance of the AnalyzingSuggester. This mainly comes from avoiding the step by step allocation of the stack inside the iterator over and over again.
I wouldn't worry about it, Markus. With a TLAB these allocations are fairly cheap. You could just add a reasonably large initial stack and that's it. If it needs to reallocate, so be it. Unless it becomes a real bottleneck in real life (which it won't, I assure you), there is no need for premature optimizations at the cost of making the code more complex.
Markus Heiden (migrated from JIRA)
For the normal use case I agree. But I had problems with long build times of lookups for big dictionaries (using the AnalyzingSuggester). I profiled the creation of the lookups, and one hotspot was the allocation of the internal stack. One problem is that the initial size of the internal stack is too small (4 entries), so it gets resized over and over again. I will increase its initial size to 16.
Michael McCandless (@mikemccand) (migrated from JIRA)
I would really prefer not make API compromises (reuse, init method) for such optimizations, nor for the "limit" case (this is really the caller's responsibility...).
Especially annoying is to change the implementation of CompletionTokenStream, because it has no for loop.
It's fine to just add a member to the class tracking how many strings it has pulled so far from the iterator...
Also, you can remove that //TODO: make this return a Iterator<IntsRef> instead? since you are doing exactly that, here...
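The caller-side counter Mike suggests might look roughly like this (a hedged sketch; LimitedConsumer and pullNext are made-up names standing in for CompletionTokenStream, which pulls one string per call rather than looping):

```java
import java.util.Iterator;

// Sketch of limiting on the caller side: a member counts how many values
// have been pulled so far, so the iterator itself stays limit-free.
final class LimitedConsumer {
  private final Iterator<String> source;
  private final int limit;
  private int pulled = 0; // strings pulled so far

  LimitedConsumer(Iterator<String> source, int limit) {
    this.source = source;
    this.limit = limit;
  }

  /** Returns the next string, or null once the limit is hit or input ends. */
  String pullNext() {
    if (pulled >= limit || !source.hasNext()) {
      return null;
    }
    pulled++;
    return source.next();
  }
}
```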
Markus Heiden (migrated from JIRA)
The Iterator interface is not possible with the current implementation, because the look-ahead in hasNext() would destroy the current value provided by the previous next(). Anyway, I removed that comment.
The newest patch now provides both interfaces (single use and multiple use), so for the default case you can use the more intuitive one (single use). I hope you like it.
I kept the limit inside the iterator, but moved it into the new class LimitedFiniteStringsIterator, because it was too complicated to transfer the limit into the already complex iteration in AnalyzingSuggester etc.
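A limiting decorator in the spirit of LimitedFiniteStringsIterator could be sketched as follows (hypothetical names; NextOnlyIterator stands in for the null-signaling contract of FiniteStringsIterator):

```java
// Sketch of a decorator that enforces a limit on top of an unlimited,
// null-signaling iterator, keeping the base iterator limit-free.
interface NextOnlyIterator<T> {
  T next(); // returns null when exhausted
}

final class LimitedIterator<T> implements NextOnlyIterator<T> {
  private final NextOnlyIterator<T> delegate;
  private int remaining;

  LimitedIterator(NextOnlyIterator<T> delegate, int limit) {
    this.delegate = delegate;
    this.remaining = limit;
  }

  @Override
  public T next() {
    if (remaining <= 0) {
      return null; // limit reached: behave as if exhausted
    }
    T value = delegate.next();
    if (value != null) {
      remaining--;
    }
    return value;
  }
}
```

The base iterator stays unaware of any limit; consumers that need one simply wrap it.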
Michael McCandless (@mikemccand) (migrated from JIRA)
The Iterator interface is not possible with the current implementation, because the look-ahead in hasNext() would destroy the current value provided by the previous next(). Anyway, I removed that comment.
OK it's fine if we don't implement Java's Iterator; the spirit of that TODO can still be removed :)
The newest patch now provides both interfaces (single use and multiple use), so for the default case you can use the more intuitive one (single use). I hope you like it.
Sorry, I really don't want to pollute such a low level API with reuse ... can you remove the init reuse method?
I kept the limit inside the iterator, but moved it into the new class LimitedFiniteStringsIterator, because it was too complicated to transfer the limit into the already complex iteration in AnalyzingSuggester etc.
OK that seems like a good compromise...
Markus Heiden (migrated from JIRA)
I adapted my patch to the latest changes in trunk.
I think the reuse of the iterator is one core part of this whole patch. I tried to rework the API of the iterator so that the reuse case and the no-reuse case are handled in a similar way. I hope you like it now (at least a bit). Lucene does this kind of reuse already, e.g. see Transition.
FuzzyCompletionQuery was added recently and relies on the old big set of finite strings. I am not sure how to rework it. Currently it still uses the set; maybe it would be better to use the iterator inside FuzzyCompletionWeight, but that means recomputing the finite strings over and over again. What do you think?
BTW, topoSortStates() is implemented identically in AnalyzingSuggester and CompletionTokenStream. Maybe it should be moved to one place, perhaps Operations?
Michael McCandless (@mikemccand) (migrated from JIRA)
I adapted my patch to the latest changes in trunk.
Thanks.
I think the reuse of the iterator is one core part of this whole patch. I tried to rework the API of the iterator so that the reuse case and the no-reuse case are handled in a similar way. I hope you like it now (at least a bit).
Alas I still don't think this is an appropriate place for object reuse.
Lucene does this kind of reuse already, e.g. see Transition.
That's true: Lucene does reuse objects in many low-level places, but this is ugly, cancerous and dangerous (it can easily cause bugs, e.g. accidentally reusing one iterator across threads), anti-Java, etc. It really should be used only sparingly: the exception, not the rule. I don't think this API qualifies, i.e. it's a bad tradeoff to pollute the API to eke out a bit of GC perf gain that in real usage would be negligible, because the cost of building an automaton and the cost of consuming each iterated string would normally dwarf the small GC cost of creating a new iterator per automaton. APIs are hard enough to "get right" as it is ...
FuzzyCompletionQuery has been added lately and relies on the old big set of finite strings. I am not sure how to rework it. Currently it still uses the set, maybe it is better to use the iterator inside of FuzzyCompletionWeight, but this means recomputing the finite strings over and over again. What do you think?
It's fine to leave this as the full Set<String> for now. It's no worse :) Progress not perfection...
BTW topoSortStates() is implemented by AnalyzingSuggester and CompletionTokenStream identically. Maybe it should be moved to one place, maybe to Operations?
Woops, I'll go move that to Operations now, good idea, thank you!
ASF subversion and git services (migrated from JIRA)
Commit 1689046 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1689046
LUCENE-6365: add Operations.topoSort
ASF subversion and git services (migrated from JIRA)
Commit 1689047 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1689047
LUCENE-6365: add Operations.topoSort
Markus Heiden (migrated from JIRA)
I see I am not able to convince you ;-) So I attached a version of the patch with the reuse API removed.
I agree that reuse is not good design, but the profiler pointed me to that spot. I already did a patch for Automaton (#7021) for the same reasons.
It would be nice if Automaton knew the length of the longest word it produces. That would eliminate the resizing of the internal stack array inside the iterator, and could convince me that reuse is not needed for the iterator.
ASF subversion and git services (migrated from JIRA)
Commit 1689079 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1689079
LUCENE-6365: fix buggy Operations.topoSort; add test
ASF subversion and git services (migrated from JIRA)
Commit 1689081 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1689081
LUCENE-6365: fix buggy Operations.topoSort; add test
Michael McCandless (@mikemccand) (migrated from JIRA)
Thanks Markus Heiden, the patch looks good.
I do still think this is a very nice change even without the reuse!
Was it accidental to remove `@lucene.experimental` from CompletionTokenStream's javadocs? I put that back.
I also made a couple additional members of FiniteStringsIterator final, and fixed up the javadocs a bit ... new patch attached.
I'll commit soon, thank you!
Uwe Schindler (@uschindler) (migrated from JIRA)
I agree that reuse is not good design, but the profiler pointed me to that spot. I already did a patch for Automaton (#7021) for the same reasons.
The problem with profilers is: they don't run the code with all HotSpot optimizations turned on, because they add extra instructions into loops to take measurement points. In general, object creation is not bad in most cases, except when objects tend to live a long time (e.g. across several methods). Short-lived objects never affect GC, because they are automatically allocated on the stack (escape analysis) - and those are in most cases not shown correctly by profilers, because escape analysis is mostly broken under profiling.
Reuse is only suitable for cases where the object goes through many stages of a query, needs to be cached between method calls, involves disk I/O, ... This is not the case here.
The patch without reuse looks fine, although I don't fully understand it (not my area of work).
Markus Heiden (migrated from JIRA)
@Michael: The removal of @lucene.experimental was a mistake of mine during merging. Thanks for your rework and your patience.
@Uwe: I measured the CPU runtime in sampling mode, so (almost) no additional overhead should occur. I did the reuse because there is not just one allocation of the array, but many: during runtime the array is resized over and over again, because the initial size was rather small (4 entries). I changed that to 16, so the resizing occurs less frequently. My test case was building a dictionary of 100000s of words, so even small things accumulate.
A better solution to that problem would be if automatons knew the length of their longest word. In that case the above-mentioned array could be sized correctly from the start. But I don't know whether that length is always known during construction of an automaton.
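The idea of knowing the longest word up front can be sketched as a longest-path pass over an acyclic automaton. This is a generic sketch on an adjacency list, not Lucene's Automaton transition API; it assumes states are numbered in topological order with state 0 as the start and all states reachable from it.

```java
import java.util.List;

final class LongestWord {
  /**
   * Length of the longest accepted string of an acyclic automaton.
   * edges.get(s) lists the target states of s (states in topological order,
   * state 0 is the start); accept[s] marks accept states.
   * Returns -1 if there is no accept state.
   */
  static int longestAcceptedLength(List<int[]> edges, boolean[] accept) {
    int n = edges.size();
    int[] longest = new int[n]; // longest path from the start to each state
    for (int s = 0; s < n; s++) {      // one pass in topological order
      for (int to : edges.get(s)) {
        longest[to] = Math.max(longest[to], longest[s] + 1);
      }
    }
    int best = -1;
    for (int s = 0; s < n; s++) {
      if (accept[s]) {
        best = Math.max(best, longest[s]);
      }
    }
    return best;
  }
}
```

Such a value, computed once per automaton, would let the iterator size its path stack exactly instead of growing it from 4.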
ASF subversion and git services (migrated from JIRA)
Commit 1689192 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1689192
LUCENE-6365: switch to iterator API to get all finite strings from an Automaton
ASF subversion and git services (migrated from JIRA)
Commit 1689193 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1689193
LUCENE-6365: switch to iterator API to get all finite strings from an Automaton
Michael McCandless (@mikemccand) (migrated from JIRA)
Markus Heiden maybe try the honest profiler: https://github.com/RichardWarburton/honest-profiler/wiki ?
I haven't tried it yet, but it apparently avoids the safepoint sampling bias ...
Michael McCandless (@mikemccand) (migrated from JIRA)
Hmm this failure just appeared on Jenkins:
[junit4] Started J0 PID(3197@localhost).
[junit4] Suite: org.apache.lucene.util.automaton.FiniteStringsIteratorTest
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=FiniteStringsIteratorTest -Dtests.method=testRandomFiniteStrings1 -Dtests.seed=4A938C5F6E728DCC -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es_CU -Dtests.timezone=America/Porto_Velho -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] FAILURE 0.26s | FiniteStringsIteratorTest.testRandomFiniteStrings1 <<<
[junit4] > Throwable #1: java.lang.AssertionError: expected:<445> but was:<446>
[junit4] > at __randomizedtesting.SeedInfo.seed([4A938C5F6E728DCC:17BF51EE97CD2662]:0)
[junit4] > at org.apache.lucene.util.automaton.FiniteStringsIteratorTest.assertFiniteStringsRecursive(FiniteStringsIteratorTest.java:204)
[junit4] > at org.apache.lucene.util.automaton.FiniteStringsIteratorTest.testRandomFiniteStrings1(FiniteStringsIteratorTest.java:82)
[junit4] > at java.lang.Thread.run(Thread.java:745)
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene53): {}, docValues:{}, sim=RandomSimilarityProvider(queryNorm=true,coord=no): {}, locale=es_CU, timezone=America/Porto_Velho
[junit4] 2> NOTE: Linux 3.13.0-46-generic amd64/Oracle Corporation 1.8.0_40 (64-bit)/cpus=8,threads=1,free=483134008,total=504889344
[junit4] 2> NOTE: All tests run in this JVM: [FiniteStringsIteratorTest]
[junit4] Completed [1/1] in 0.41s, 1 test, 1 failure <<< FAILURES!
It reproduces ... I haven't dug into it yet.
Markus Heiden (migrated from JIRA)
I ran the tests in lucene/core again and they do not fail on the state of trunk at the time I created the patch. There must be changes elsewhere that lead to this failure.
Dawid Weiss (@dweiss) (migrated from JIRA)
I don't think so. The test in question (and the seed in question) fails because FiniteStringsIterator returns the same sequence ("h") twice; I just compared the outputs of the recursive and non-recursive implementations.
It looks like the seed in question adds the same string ("h") to the input automaton twice and then doesn't minimize/determinize. getFiniteStringsRecursive makes sure no string is emitted twice; FiniteStringsIterator doesn't have this check.
Perhaps we should require that the input automaton doesn't contain duplicates (Mike?). Alternatively, we could add a compare-to-previous check to the iterator and skip duplicates.
Markus Heiden (migrated from JIRA)
Sorry, I forgot to use the seed.
The de-duplication was previously done by the big set containing the result, which has been removed, so no de-duplication takes place anymore. Should this be the responsibility of the iterator at all?
Dawid Weiss (@dweiss) (migrated from JIRA)
I think it makes even more sense to leave the code as is (without removing duplicates). There should be an assertion that no duplicates occur in the automaton, but let @mikemccand confirm this.
Michael McCandless (@mikemccand) (migrated from JIRA)
Ahh thanks for digging @dweiss and Markus Heiden.
Markus Heiden usually all that's necessary to reproduce a test failure is to copy/paste the exact text after "Reproduce with: ...", in this case:
ant test -Dtestcase=FiniteStringsIteratorTest -Dtests.method=testRandomFiniteStrings1 -Dtests.seed=4A938C5F6E728DCC -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es_CU -Dtests.timezone=America/Porto_Velho -Dtests.asserts=true -Dtests.file.encoding=UTF-8
That usually reproduces the failure, though sometimes you'll need the exact JVM version, added JVM flags, etc.
I agree it should not be this iterator's job to deal with duplicates: I think if you pass a non-minimal automaton to it, it's fair game for it to return dups ... so this is a test bug.
Michael McCandless (@mikemccand) (migrated from JIRA)
I'll fix the test to de-dup itself, and add a comment in FiniteStringsIterator's javadocs about this ...
ASF subversion and git services (migrated from JIRA)
Commit 1689404 from @mikemccand in branch 'dev/trunk' https://svn.apache.org/r1689404
LUCENE-6365: fix test to not add duplicate strings
ASF subversion and git services (migrated from JIRA)
Commit 1689405 from @mikemccand in branch 'dev/branches/branch_5x' https://svn.apache.org/r1689405
LUCENE-6365: fix test to not add duplicate strings
Dawid Weiss (@dweiss) (migrated from JIRA)
Thanks Mike, thanks Markus.
Shalin Shekhar Mangar (@shalinmangar) (migrated from JIRA)
Bulk close for 5.3.0 release
Replaced Operations.getFiniteStrings() with an optimized FiniteStringsIterator.
Benefits: avoids the huge hash set of finite strings and the massive object/array creation during processing.
"Downside": the iteration order changed, so when iterating with a limit, the result may differ slightly. Old: emit the current node if it accepts, then recurse. New: recurse, then emit the current node if it accepts.
The old method Operations.getFiniteStrings() still exists, because it eases testing. It is now implemented using the new FiniteStringsIterator.
Migrated from LUCENE-6365 by Markus Heiden, 1 vote, resolved Jul 06 2015 Attachments: FiniteStrings_noreuse.patch, FiniteStrings_reuse.patch, LUCENE-6365.patch