apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.63k stars 1.02k forks source link

Optimize TermsEnum.seek when caller doesn't need next term [LUCENE-3225] #4298

Closed asfimport closed 13 years ago

asfimport commented 13 years ago

Some codecs are able to save CPU if the caller is only interested in exact matches. EG, Memory codec and SimpleText can do more efficient FSTEnum lookup if they know the caller doesn't need to know the term following the seek term.

We have cases like this in Lucene, eg when IW deletes documents by Term, if the term is not found in a given segment then it doesn't need to know the ceiling term. Likewise when TermQuery looks up the term in each segment.

I had done this change as part of #4104, which is a new terms index that's able to save seeking for exact-only lookups, but now that we have Memory codec that can also save CPU I think we should commit this today.

The change adds a "boolean onlyExact" param to seek(BytesRef).


Migrated from LUCENE-3225 by Michael McCandless (@mikemccand), resolved Jun 26 2011 Attachments: LUCENE-3225.patch (versions: 2)

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch. All tests pass... I think it's ready!

asfimport commented 13 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

+1, looks good.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Mike this seems like a good improvement but I think letting a user change the behavior of method X by passing true / false to method Y is no good. I think this is kind of error prone plus its cluttering the seek method though. Once Boolean is enough here. I think we should rather restrict this to allow users to pull an exactMatchOnly TermsEnum which does only support exact matches and throws a clear exception if next is called. I know that makes things slightly harder especially to deal with our ThreadLocal cached TermsEnum instances but I think that is better here. Can we somehow leave the extra CPU work to the term() call and make this entirely lazy?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Mike this seems like a good improvement but I think letting a user change the behavior of method X by passing true / false to method Y is no good. I think this is kind of error prone plus its cluttering the seek method though. Once Boolean is enough here. I think we should rather restrict this to allow users to pull an exactMatchOnly TermsEnum which does only support exact matches and throws a clear exception if next is called. I know that makes things slightly harder especially to deal with our ThreadLocal cached TermsEnum instances but I think that is better here.

Well, it only means the enum is unpositioned if you get back NOT_FOUND? Ie, it's just like if you get back null from next(), or END from seek(): in these cases, the enum is unpositioned and you need to call seek again.

My worry if we force an up-front decision here ("exact only" enum vs "non-exact only enum") is we prevent legitimate use cases where the caller wants to mix & match with one enum.

EG, when AutomatonQuery intersects w/ the terms, when it hits are region where terms are denser than what the automaton will accept (such as an "infinite" part), it should use exact seeking, but then when it's in a region where terms are less dense (eg a "finite" part) it should use exact seeking.... I'll open a separate issue for this.

The TermsEnum impls can be efficient in this case, ie re-using internal seek state for the exat and non-exact cases (MemoryCodec does this).

But I agree another boolean to seek isn't great; maybe instead we can make a seperate seekExact method? Default impl would just call seek (and get no perf gains).

BTW, similarly, I think we have a missing API in DISI (for scoring): advance always does a next() if the target doc doesn't match. But we can get substantial performance gains in some cases (see LUCENE-1536) if we had an advanceExact that would not do the next and simply tell us if this doc matched or not.

Can we somehow leave the extra CPU work to the term() call and make this entirely lazy?

Not sure what you meant here?

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This patch gives nice gains for MemoryCodec: I did a quick test w/ my NRT stress test (reopen at 2X Twitter's peak indexing rate) and the reopen time dropped from \~49 msec to \~43 msec (\~12% faster). This is impressive because resolving deletes is just one part of opening the NRT reader, ie we also must write the new segment, open SegmentReader against it, etc.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

BTW, similarly, I think we have a missing API in DISI (for scoring): advance always does a next() if the target doc doesn't match. But we can get substantial performance gains in some cases (see LUCENE-1536) if we had an advanceExact that would not do the next and simply tell us if this doc matched or not.

+1!!

But I agree another boolean to seek isn't great; maybe instead we can make a seperate seekExact method? Default impl would just call seek (and get no perf gains).

thats another option and I like that better though. Yet the other should the be seekFloor no?

not sure what you meant here?

nevermind I only looked at the top of the patch and figured that we only safe the loading into bytesref but there is more about it...

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Yet the other should the be seekFloor no?

Ahhh right, we had discussed on the dev list. I agree!

But, we should do this in another issue. Though, I think we should rename the current seek to seekCeil; I'll do that here.

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, new patch: I added a new seekExact method (instead of new boolean to seek); renamed existing seek methods to either seekCeil or seekExact; changed seekExact(long ord) to not return a value (it's an error to pass out-of-bounds ord to this method). I think it's ready!

asfimport commented 13 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I like this one better. boolean args are cryptic (even if I do use them from time to time).

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

looks good +1 to commit! thanks for working on that