Closed asfimport closed 9 years ago
Simon Willnauer (@s1monw) (migrated from JIRA)
here is an initial patch that moves this over. I really just did some initial porting and this patch has still some problems.
I removed DocsAndPosEnum entirely and changed how the DocsEnum Flags work such that we only have TermsEnum#docs and a simple sugar method for docsAndPos which should go away IMO. We need to figure out what kind of behavior those flags should trigger ie. if we have no freqs we still return and enum while no pos we return null.
anyway, most of the patch is rename etc. all test pass, comments welcome
Simon Willnauer (@s1monw) (migrated from JIRA)
new patch bringing back TermsEnum#docsAndPositions(...) this make this entire thing way simpler and I think this is how it should be. All tests pass and I think this is pretty close already.
Alan Woodward (@romseygeek) (migrated from JIRA)
The javadocs on PostingsReaderBase are a bit funky, but otherwise this looks great. Simplifies things a lot.
Simon Willnauer (@s1monw) (migrated from JIRA)
alan, I agree this might make a lot of things simple in #3952 like passing instances down to the actual scorer. This might even help us to get the API straight.
I will go over the patch tomorrow again and straight out javadocs etc. I plan to commit this to trunk and then backport to 4.2 I think this this case we should really break BW compat and just go ahead and remove the DocsAndPositionsEnum class entirely. Any objections?
Robert Muir (@rmuir) (migrated from JIRA)
I agree its a great! change, but perhaps this should be committed to the 2878 branch for now to see how it plays in practice?
Otherwise I have a few concerns about committing to trunk (none of which have to do with the particular patch, just problems in general)
Simon Willnauer (@s1monw) (migrated from JIRA)
this change is really unrelated to #3952 it removes a unnecessary duplication. The fact that Scorer extends DocEnum is not concerning me here since the main purpose of this class is not the Scorer API. This should really go on trunk since given the discussion on the list this is independent of #3952. If there are bugs in reuse we should catch them in the tests no? I mean we can add even more tests for this particular problem so we catch them quicker. I would be ok with makeing them abstract but this really is not a big deal here. I would want to move forward here quickly on trunk at least we can merge later into 4.x if needed since this might go out of date quickly.
Steven Rowe (@sarowe) (migrated from JIRA)
Bulk move 4.4 issues to 4.5 and 5.0
Uwe Schindler (@uschindler) (migrated from JIRA)
Move issue to Lucene 4.9.
Alan Woodward (@romseygeek) (migrated from JIRA)
In an attempt to make #3952 a bit more manageable, I'm trying to split this patch back out again. In addition to merging DocsEnum and DocsAndPositionsEnum, I've removed TermsEnum.docsAndPositions(), moving all the functionality into TermsEnum.docs(). However, I'm bumping into an API issue here, because our previous guarantee was that docs() would never return null, while docsAndPositions() returns null if the relevant postings information wasn't indexed.
One option would be to add a postings() method to DocsEnum which returns what postings details are available. So instead of returning null, we return a DocsEnum that contains whatever postings the index supports, and clients then check postings() to see if it supports what they want to do.
Alan Woodward (@romseygeek) (migrated from JIRA)
Here's what I've got so far. Warning: tests fail, due to some things returning null when they're not expected to.
Alan Woodward (@romseygeek) (migrated from JIRA)
In fact postings() turns out to be unnecessary, I just needed to fix MappingMultiDocsEnum to check for nulls properly. The API seems to work OK, actually, in that it works as before if you pass DocsEnum.FLAG_NONE or DocsEnum.FLAG_FREQ, and works as docsAndPositions() did before if you pass anything higher.
We should even be able to keep the API backwards-compatible if we rename the merged enum to PostingsEnum (as the issue title suggests), and then have deprecated DocsEnum and DocsAndPositionsEnum classes that extend it.
I'm going to add some more testing around re-use first.
Alan Woodward (@romseygeek) (migrated from JIRA)
This patch merges the old DocsEnum and DocsAndPositionsEnum into a new PostingsEnum class (which is basically the old DaPE class), with DocsEnum extending it as a convenience class returning empty values for positions, offsets and payloads.
TermsEnum.docs() methods are renamed to TermsEnum.postings().
The old docs() and docsAndPositions() methods can be added back to keep backwards compatibility.
Next up: some basic re-use tests. I think we should be able to assert that things aren't reused when we have different postings requested for all postings formats, and check specific cases for those formats where re-use is actually implemented.
Alan Woodward (@romseygeek) (migrated from JIRA)
Patch adding a basic re-use test to BasePostingsFormatTestCase. The verifyEnum method already does a lot of randomized testing of reuse, so the new test just asserts that a TermsEnum is reused or not reused in a couple of cases.
Alan Woodward (@romseygeek) (migrated from JIRA)
I'm travelling for FOSDEM from tomorrow, but I'd like to get this into trunk early next week if possible. Comments welcome :-)
Alan Woodward (@romseygeek) (migrated from JIRA)
This is a better patch, the old one still had some of the Weight API changes from #3952 in it.
Scorer extends PostingsEnum directly at the moment, which means that there are lots of Scorer implementations that have to implement empty position, offset and payload methods. Might be worth having it extend DocsEnum instead.
David Smiley (@dsmiley) (migrated from JIRA)
FWIW The main thing I'm looking for is a way to access the current position of a scorer, and then the ability to advance the scorer tree to the next position. With this, an accurate highlighter (what Robert calls a "query debugger") can be built. You've made references to having a highlighter using this code... is this true? Can you share more about what it's features are or at least point me at it?
Alan Woodward (@romseygeek) (migrated from JIRA)
This is a pre-requisite issue for that, but it doesn't implement it for now. All it does is merge the two different postings enumerations, which has a follow-on effect that Scorer now has position, offset and payload methods. For now they're not actually implemented for most Scorers though. That can be done in follow-up issues.
Robert Muir (@rmuir) (migrated from JIRA)
Thanks Alan: I benchmarked the current patch, i dont see any performance problems:
Task QPS trunk StdDev QPS patch StdDev Pct diff
IntNRQ 8.73 (5.7%) 8.21 (8.8%) -6.0% ( -19% - 8%)
MedPhrase 261.88 (5.1%) 249.79 (4.6%) -4.6% ( -13% - 5%)
Prefix3 218.79 (5.3%) 210.03 (6.0%) -4.0% ( -14% - 7%)
HighPhrase 18.04 (4.2%) 17.35 (3.3%) -3.8% ( -10% - 3%)
Wildcard 46.18 (3.3%) 44.65 (5.0%) -3.3% ( -11% - 5%)
LowPhrase 34.95 (2.2%) 34.35 (1.5%) -1.7% ( -5% - 1%)
HighTerm 119.48 (3.7%) 117.73 (5.2%) -1.5% ( -10% - 7%)
MedTerm 175.53 (3.5%) 173.14 (5.0%) -1.4% ( -9% - 7%)
LowTerm 931.30 (2.9%) 924.38 (4.5%) -0.7% ( -7% - 6%)
HighSpanNear 143.52 (4.7%) 142.48 (3.5%) -0.7% ( -8% - 7%)
LowSpanNear 27.97 (3.4%) 27.79 (2.6%) -0.7% ( -6% - 5%)
AndHighLow 1167.87 (2.0%) 1161.24 (2.1%) -0.6% ( -4% - 3%)
MedSpanNear 143.90 (4.1%) 143.30 (3.6%) -0.4% ( -7% - 7%)
OrNotHighLow 953.80 (2.0%) 951.43 (1.7%) -0.2% ( -3% - 3%)
LowSloppyPhrase 119.56 (3.0%) 119.79 (2.7%) 0.2% ( -5% - 6%)
Fuzzy1 107.94 (2.7%) 108.20 (3.0%) 0.2% ( -5% - 6%)
Respell 88.19 (3.3%) 88.51 (3.1%) 0.4% ( -5% - 6%)
OrNotHighMed 182.77 (2.6%) 183.48 (2.1%) 0.4% ( -4% - 5%)
MedSloppyPhrase 15.83 (4.8%) 15.91 (4.5%) 0.5% ( -8% - 10%)
Fuzzy2 66.59 (2.9%) 66.96 (3.1%) 0.5% ( -5% - 6%)
AndHighHigh 87.34 (1.9%) 88.01 (1.6%) 0.8% ( -2% - 4%)
AndHighMed 122.26 (2.0%) 123.39 (1.5%) 0.9% ( -2% - 4%)
OrNotHighHigh 48.70 (3.6%) 49.29 (4.4%) 1.2% ( -6% - 9%)
OrHighNotHigh 29.09 (3.7%) 29.44 (4.5%) 1.2% ( -6% - 9%)
OrHighLow 55.62 (7.6%) 56.35 (9.5%) 1.3% ( -14% - 19%)
OrHighNotMed 87.78 (3.9%) 88.99 (5.0%) 1.4% ( -7% - 10%)
OrHighNotLow 106.31 (4.1%) 107.84 (5.4%) 1.4% ( -7% - 11%)
OrHighMed 57.15 (7.7%) 58.06 (9.5%) 1.6% ( -14% - 20%)
OrHighHigh 26.80 (8.3%) 27.26 (10.1%) 1.7% ( -15% - 21%)
HighSloppyPhrase 13.10 (11.3%) 13.43 (12.1%) 2.5% ( -18% - 29%)
I will try to go thru it today and review the changes.
Alan Woodward (@romseygeek) (migrated from JIRA)
This is kind of a pain to keep up-to-date with trunk. If everybody's happy, I'll commit on Friday.
Robert Muir (@rmuir) (migrated from JIRA)
I quickly glanced thru, my only comment is, can we defer the startPosition() and endPosition()? These are not yet used by anything, and we may want to think of the semantics before putting these into all of our postings list apis.
Alan Woodward (@romseygeek) (migrated from JIRA)
Committing this once I've gone through and changed back all the places that my IDE renamed /docs directories to /postings...
I've also removed startPosition() and endPosition()
ASF subversion and git services (migrated from JIRA)
Commit 1657800 from @romseygeek in branch 'dev/trunk' https://svn.apache.org/r1657800
LUCENE-4524: Replace DocsEnum and DocsAndPositionsEnum with PostingsEnum
ASF subversion and git services (migrated from JIRA)
Commit 1657801 from @romseygeek in branch 'dev/trunk' https://svn.apache.org/r1657801
LUCENE-4524: Revert some bogus text changes
ASF subversion and git services (migrated from JIRA)
Commit 1657880 from @romseygeek in branch 'dev/trunk' https://svn.apache.org/r1657880
LUCENE-4524: Fix javadocs
ASF subversion and git services (migrated from JIRA)
Commit 1657890 from @romseygeek in branch 'dev/branches/branch_5x' https://svn.apache.org/r1657890
LUCENE-4524: Replace DocsEnum and DocsAndPositionsEnum with PostingsEnum
ASF subversion and git services (migrated from JIRA)
Commit 1659018 from @rmuir in branch 'dev/trunk' https://svn.apache.org/r1659018
LUCENE-4524: remove fixed @Seed
ASF subversion and git services (migrated from JIRA)
Commit 1659021 from @rmuir in branch 'dev/branches/branch_5x' https://svn.apache.org/r1659021
LUCENE-4524: remove fixed @Seed
Timothy Potter (@thelabdude) (migrated from JIRA)
Bulk close after 5.1 release
spinnoff from http://www.gossamer-threads.com/lists/lucene/java-dev/172261
Migrated from LUCENE-4524 by Simon Willnauer (@s1monw), 2 votes, resolved Feb 06 2015 Attachments: LUCENE-4524.patch (versions: 6) Linked issues:
3952