apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.69k stars 1.04k forks source link

Merge DocsEnum and DocsAndPositionsEnum into PostingsEnum [LUCENE-4524] #5590

Closed asfimport closed 9 years ago

asfimport commented 12 years ago

spinnoff from http://www.gossamer-threads.com/lists/lucene/java-dev/172261

hey folks, 

I have spend a hell lot of time on the positions branch to make 
positions and offsets working on all queries if needed. The one thing 
that bugged me the most is the distinction between DocsEnum and 
DocsAndPositionsEnum. Really when you look at it closer DocsEnum is a 
DocsAndFreqsEnum and if we omit Freqs we should return a DocIdSetIter. 
Same is true for 
DocsAndPostionsAndPayloadsAndOffsets*YourFancyFeatureHere*Enum. I 
don't really see the benefits from this. We should rather make the 
interface simple and call it something like PostingsEnum where you 
have to specify flags on the TermsIterator and if we can't provide the 
sufficient enum we throw an exception? 
I just want to bring up the idea here since it might simplify a lot 
for users as well for us when improving our positions / offset etc. 
support. 

thoughts? Ideas? 

simon 

Migrated from LUCENE-4524 by Simon Willnauer (@s1monw), 2 votes, resolved Feb 06 2015 Attachments: LUCENE-4524.patch (versions: 6) Linked issues:

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

here is an initial patch that moves this over. I really just did some initial porting and this patch has still some problems.

I removed DocsAndPosEnum entirely and changed how the DocsEnum Flags work such that we only have TermsEnum#docs and a simple sugar method for docsAndPos which should go away IMO. We need to figure out what kind of behavior those flags should trigger ie. if we have no freqs we still return and enum while no pos we return null.

anyway, most of the patch is rename etc. all test pass, comments welcome

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

new patch bringing back TermsEnum#docsAndPositions(...) this make this entire thing way simpler and I think this is how it should be. All tests pass and I think this is pretty close already.

asfimport commented 11 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

The javadocs on PostingsReaderBase are a bit funky, but otherwise this looks great. Simplifies things a lot.

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

alan, I agree this might make a lot of things simple in #3952 like passing instances down to the actual scorer. This might even help us to get the API straight.

I will go over the patch tomorrow again and straight out javadocs etc. I plan to commit this to trunk and then backport to 4.2 I think this this case we should really break BW compat and just go ahead and remove the DocsAndPositionsEnum class entirely. Any objections?

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I agree its a great! change, but perhaps this should be committed to the 2878 branch for now to see how it plays in practice?

Otherwise I have a few concerns about committing to trunk (none of which have to do with the particular patch, just problems in general)

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

this change is really unrelated to #3952 it removes a unnecessary duplication. The fact that Scorer extends DocEnum is not concerning me here since the main purpose of this class is not the Scorer API. This should really go on trunk since given the discussion on the list this is independent of #3952. If there are bugs in reuse we should catch them in the tests no? I mean we can add even more tests for this particular problem so we catch them quicker. I would be ok with makeing them abstract but this really is not a big deal here. I would want to move forward here quickly on trunk at least we can merge later into 4.x if needed since this might go out of date quickly.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

In an attempt to make #3952 a bit more manageable, I'm trying to split this patch back out again. In addition to merging DocsEnum and DocsAndPositionsEnum, I've removed TermsEnum.docsAndPositions(), moving all the functionality into TermsEnum.docs(). However, I'm bumping into an API issue here, because our previous guarantee was that docs() would never return null, while docsAndPositions() returns null if the relevant postings information wasn't indexed.

One option would be to add a postings() method to DocsEnum which returns what postings details are available. So instead of returning null, we return a DocsEnum that contains whatever postings the index supports, and clients then check postings() to see if it supports what they want to do.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Here's what I've got so far. Warning: tests fail, due to some things returning null when they're not expected to.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

In fact postings() turns out to be unnecessary, I just needed to fix MappingMultiDocsEnum to check for nulls properly. The API seems to work OK, actually, in that it works as before if you pass DocsEnum.FLAG_NONE or DocsEnum.FLAG_FREQ, and works as docsAndPositions() did before if you pass anything higher.

We should even be able to keep the API backwards-compatible if we rename the merged enum to PostingsEnum (as the issue title suggests), and then have deprecated DocsEnum and DocsAndPositionsEnum classes that extend it.

I'm going to add some more testing around re-use first.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

This patch merges the old DocsEnum and DocsAndPositionsEnum into a new PostingsEnum class (which is basically the old DaPE class), with DocsEnum extending it as a convenience class returning empty values for positions, offsets and payloads.

TermsEnum.docs() methods are renamed to TermsEnum.postings().

The old docs() and docsAndPositions() methods can be added back to keep backwards compatibility.

Next up: some basic re-use tests. I think we should be able to assert that things aren't reused when we have different postings requested for all postings formats, and check specific cases for those formats where re-use is actually implemented.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Patch adding a basic re-use test to BasePostingsFormatTestCase. The verifyEnum method already does a lot of randomized testing of reuse, so the new test just asserts that a TermsEnum is reused or not reused in a couple of cases.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I'm travelling for FOSDEM from tomorrow, but I'd like to get this into trunk early next week if possible. Comments welcome :-)

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

This is a better patch, the old one still had some of the Weight API changes from #3952 in it.

Scorer extends PostingsEnum directly at the moment, which means that there are lots of Scorer implementations that have to implement empty position, offset and payload methods. Might be worth having it extend DocsEnum instead.

asfimport commented 9 years ago

David Smiley (@dsmiley) (migrated from JIRA)

FWIW The main thing I'm looking for is a way to access the current position of a scorer, and then the ability to advance the scorer tree to the next position. With this, an accurate highlighter (what Robert calls a "query debugger") can be built. You've made references to having a highlighter using this code... is this true? Can you share more about what it's features are or at least point me at it?

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

This is a pre-requisite issue for that, but it doesn't implement it for now. All it does is merge the two different postings enumerations, which has a follow-on effect that Scorer now has position, offset and payload methods. For now they're not actually implemented for most Scorers though. That can be done in follow-up issues.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Alan: I benchmarked the current patch, i dont see any performance problems:

                    Task   QPS trunk      StdDev   QPS patch      StdDev                Pct diff
                  IntNRQ        8.73      (5.7%)        8.21      (8.8%)   -6.0% ( -19% -    8%)
               MedPhrase      261.88      (5.1%)      249.79      (4.6%)   -4.6% ( -13% -    5%)
                 Prefix3      218.79      (5.3%)      210.03      (6.0%)   -4.0% ( -14% -    7%)
              HighPhrase       18.04      (4.2%)       17.35      (3.3%)   -3.8% ( -10% -    3%)
                Wildcard       46.18      (3.3%)       44.65      (5.0%)   -3.3% ( -11% -    5%)
               LowPhrase       34.95      (2.2%)       34.35      (1.5%)   -1.7% (  -5% -    1%)
                HighTerm      119.48      (3.7%)      117.73      (5.2%)   -1.5% ( -10% -    7%)
                 MedTerm      175.53      (3.5%)      173.14      (5.0%)   -1.4% (  -9% -    7%)
                 LowTerm      931.30      (2.9%)      924.38      (4.5%)   -0.7% (  -7% -    6%)
            HighSpanNear      143.52      (4.7%)      142.48      (3.5%)   -0.7% (  -8% -    7%)
             LowSpanNear       27.97      (3.4%)       27.79      (2.6%)   -0.7% (  -6% -    5%)
              AndHighLow     1167.87      (2.0%)     1161.24      (2.1%)   -0.6% (  -4% -    3%)
             MedSpanNear      143.90      (4.1%)      143.30      (3.6%)   -0.4% (  -7% -    7%)
            OrNotHighLow      953.80      (2.0%)      951.43      (1.7%)   -0.2% (  -3% -    3%)
         LowSloppyPhrase      119.56      (3.0%)      119.79      (2.7%)    0.2% (  -5% -    6%)
                  Fuzzy1      107.94      (2.7%)      108.20      (3.0%)    0.2% (  -5% -    6%)
                 Respell       88.19      (3.3%)       88.51      (3.1%)    0.4% (  -5% -    6%)
            OrNotHighMed      182.77      (2.6%)      183.48      (2.1%)    0.4% (  -4% -    5%)
         MedSloppyPhrase       15.83      (4.8%)       15.91      (4.5%)    0.5% (  -8% -   10%)
                  Fuzzy2       66.59      (2.9%)       66.96      (3.1%)    0.5% (  -5% -    6%)
             AndHighHigh       87.34      (1.9%)       88.01      (1.6%)    0.8% (  -2% -    4%)
              AndHighMed      122.26      (2.0%)      123.39      (1.5%)    0.9% (  -2% -    4%)
           OrNotHighHigh       48.70      (3.6%)       49.29      (4.4%)    1.2% (  -6% -    9%)
           OrHighNotHigh       29.09      (3.7%)       29.44      (4.5%)    1.2% (  -6% -    9%)
               OrHighLow       55.62      (7.6%)       56.35      (9.5%)    1.3% ( -14% -   19%)
            OrHighNotMed       87.78      (3.9%)       88.99      (5.0%)    1.4% (  -7% -   10%)
            OrHighNotLow      106.31      (4.1%)      107.84      (5.4%)    1.4% (  -7% -   11%)
               OrHighMed       57.15      (7.7%)       58.06      (9.5%)    1.6% ( -14% -   20%)
              OrHighHigh       26.80      (8.3%)       27.26     (10.1%)    1.7% ( -15% -   21%)
        HighSloppyPhrase       13.10     (11.3%)       13.43     (12.1%)    2.5% ( -18% -   29%)

I will try to go thru it today and review the changes.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

This is kind of a pain to keep up-to-date with trunk. If everybody's happy, I'll commit on Friday.

asfimport commented 9 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I quickly glanced thru, my only comment is, can we defer the startPosition() and endPosition()? These are not yet used by anything, and we may want to think of the semantics before putting these into all of our postings list apis.

asfimport commented 9 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Committing this once I've gone through and changed back all the places that my IDE renamed /docs directories to /postings...

I've also removed startPosition() and endPosition()

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1657800 from @romseygeek in branch 'dev/trunk' https://svn.apache.org/r1657800

LUCENE-4524: Replace DocsEnum and DocsAndPositionsEnum with PostingsEnum

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1657801 from @romseygeek in branch 'dev/trunk' https://svn.apache.org/r1657801

LUCENE-4524: Revert some bogus text changes

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1657880 from @romseygeek in branch 'dev/trunk' https://svn.apache.org/r1657880

LUCENE-4524: Fix javadocs

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1657890 from @romseygeek in branch 'dev/branches/branch_5x' https://svn.apache.org/r1657890

LUCENE-4524: Replace DocsEnum and DocsAndPositionsEnum with PostingsEnum

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1659018 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1659018

LUCENE-4524: remove fixed @Seed

asfimport commented 9 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1659021 from @rmuir in branch 'dev/branches/branch_5x' https://svn.apache.org/r1659021

LUCENE-4524: remove fixed @Seed

asfimport commented 9 years ago

Timothy Potter (@thelabdude) (migrated from JIRA)

Bulk close after 5.1 release