Allow Scorer to expose positions and payloads aka. nuke spans [LUCENE-2878]

asfimport commented 13 years ago

Currently we have two somewhat separate types of queries: those that can make use of positions (mainly spans) and payloads (spans), and everything else. Yet Span*Query doesn't really do scoring comparable to what other queries do, and at the end of the day spans duplicate a lot of code all over Lucene. Span*Queries are also limited to other Span*Query instances, so you cannot use a TermQuery or a BooleanQuery with SpanNear or anything like that. Besides the Span*Query limitation, other queries lack a quite interesting feature: they cannot score based on term proximity, since scorers don't expose any positional information.

All those problems have bugged me for a while now, so I started working on this using the bulkpostings API. I would have done the first cut on trunk, but TermScorer there works on a BlockReader that does not expose positions, while the one in this branch does. I started by adding a new Positions class which users can pull from a scorer. To prevent unnecessary positions enums I added ScorerContext#needsPositions, and eventually Scorer#needsPayloads, to create the corresponding enum on demand. Currently only TermQuery / TermScorer implements this API; the others simply return null instead.

To show that the API really works, and that our BulkPostings work fine with positions too, I cut over SpanTermQuery to use a TermScorer under the hood and nuked TermSpans entirely. A nice side effect of this was that the Position BulkReading implementations got some exercise, and they now all work with positions :), while Payloads for bulkreading are kind of experimental in the patch and only work with the Standard codec.

So all spans now work on top of TermScorer (I truly hate spans since today), including the ones that need Payloads (StandardCodec ONLY)!! I didn't bother to implement the other codecs yet, since I want to get feedback on the API and on this first cut before I go on with it. I will upload the corresponding patch in a minute.

I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk first but after that pain today I need a break first :).

The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't look into the MemoryIndex BulkPostings API yet)

Migrated from LUCENE-2878 by Simon Willnauer (@s1monw), 11 votes, resolved Apr 11 2018

Attachments: LUCENE-2878_trunk.patch (versions: 2), LUCENE-2878.patch (versions: 30), LUCENE-2878-OR.patch, LUCENE-2878-vs-trunk.patch, PosHighlighter.patch (versions: 2)

Linked issues:
- 5590

Sub-tasks:
- 4391
- 4392
- 4393
- 4394
- 5617
- 5618
- 5621
- 5638

asfimport commented 12 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

This patch removes the abstract BooleanIntervalIterator, as it doesn't seem to gain us anything.

Other than writing javadocs, we need to replace PayloadTermQuery and PayloadNearQuery, I think. I'll work on that next.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Alan, +1 to the patch. BooleanIntervalIterator is a relic. I will go ahead and commit it.

> Other than writing javadocs, we need to replace PayloadTermQuery and PayloadNearQuery, I think. I'll work on that next.

Honestly, fuck it! PayloadTermQuery and PayloadNearQuery are so exotic I'd leave them out, move them into a separate issue, and maybe add them once we are on trunk. We can still just convert them to pos iters eventually. For now that is not important; we should focus on getting this on trunk.

asfimport commented 12 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

OK! I think we're nearly there...

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Alan, FYI - I committed some refactorings (renamed Scorer#positions to Scorer#intervals etc.), so you should update. I also committed your latest patch.

asfimport commented 12 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I've committed a whole bunch more javadocs, and a package.html.

There's still a big nocommit in SloppyPhraseScorer, but other than that we're looking good. We could probably do with more test coverage, but then that's never not the case, so...

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Alan, I just committed some more javadocs, including more content for the package.html (review would be appreciated). I also fixed all nocommits in the latest commit. I removed the nocommit in SloppyPhraseScorer last week or so, though I cheated a bit there: I currently throw an UnsupportedOperationException if there are multiple terms per position, since I think it's not crucial for us to have this for now. I really want it, since it's the entire point of this feature, but moving towards trunk is really what we want, so other people get into it too. Being on trunk is very helpful.

Regarding tests - I agree we should have more tests especially with bigger documents. I might add a couple of random tests next week but feel free to jump on it. Next step would also be to run ant precommit on top level and see where it barfs. Other than that I really need other committers to look over the API but if nobody does we just gonna put up a patch and tell them we gonna reintegrate in X days :)

simon

asfimport commented 12 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I committed a few more javadoc fixes. Ant precommit passes when run from the top level. Let's get this in trunk!

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

> I committed a few more javadoc fixes. Ant precommit passes when run from the top level. Let's get this in trunk!

good stuff! lets give other folks the chance to jump on it / comment on what we have and then move forward! BTW. I'd rename the package o.a.l.s.positions to o.a.l.s.intervals what do you think?

asfimport commented 12 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

> I'd rename the package o.a.l.s.positions to o.a.l.s.intervals what do you think?

+1

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I renamed the package and fixed the package html.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

here is a diff against trunk for better reviewing

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm still trying to catch up here (net/net this looks awesome!), but here's some minor stuff I noticed:

Instead of PostingFeatures.isProximityFeature, can we just use X.compareTo(PostingsFeatures.POSITIONS) >= 0? (We do this for IndexOptions).

Should we move PostingFeatures to its own source instead of hiding it in Weight.java?

Can we put back single imports (not wildcard, eg "import org.apache.lucene.index.*")?

PostingFeatures is very similar to FieldInfo.IndexOptions (except the latter does not cover payloads) ... would be nice if we could somehow combine them ...

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm confused on how one uses IntervalIterator along with the Scorer it "belongs" to. Say I want to visit all Intervals for a given TermQuery ... do I first get the TermScorer and then call .intervals, up front? And then call TermScorer.nextDoc(), but then how to iterate over all intervals for that one document? EG, is the caller supposed to call IntervalIterator.scorerAdvanced for each next'd doc?

Or ... am I supposed to call .intervals() after each .nextDoc() (but that looks rather costly/wasteful since it's a newly alloc'd TermIntervalIterator each time).

I'm also confused why TermIntervalIterator.collect only collects one interval (I think?). Shouldn't it collect all intervals for the current doc?

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Thanks Mike for taking a look at this. It still has its rough edges, so every review is very valuable.

> Instead of PostingFeatures.isProximityFeature, can we just use X.compareTo(PostingsFeatures.POSITIONS) >= 0? (We do this for IndexOptions).

Sure, I was actually thinking about this for a while. After a day of playing with different ways of doing it, I really asked myself why we have 2 different docs enums, and why not just one enum and one set of features / flags; this would make a lot of things easier. Different discussion / progress over perfection.

> Should we move PostingFeatures to its own source instead of hiding it in Weight.java?

Along the same lines, sure, let's move it out.

> Can we put back single imports (not wildcard, eg "import org.apache.lucene.index.*")?

Yeah, I saw that when I created the diff. I will go over it and bring them back.

> PostingFeatures is very similar to FieldInfo.IndexOptions (except the latter does not cover payloads) ... would be nice if we could somehow combine them ...

I agree it would be nice to unify all of this. Lets open another issue - we have a good set of usecases now.

> I'm confused on how one uses IntervalIterator along with the Scorer it "belongs" to. Say I want to visit all Intervals for a given TermQuery ... do I first get the TermScorer and then call .intervals, up front? And then call TermScorer.nextDoc(), but then how to iterate over all intervals for that one document? EG, is the caller supposed to call IntervalIterator.scorerAdvanced for each next'd doc?

So my major goal here was to make this totally detached, optional and lazy, i.e. no additional code in the scorer except for IntervalIterator creation on demand. Once you have a scorer you can call intervals() and get an iterator. This instance can and should be reused while docs are collected / scored / matched on a given reader. For each doc whose intervals I need to iterate over, I call scorerAdvanced and update the internal structures; this prevents any additional work if it is not really needed, i.e. on a complex query/scorer tree. Once the iterator is set up (scorerAdvanced is called) you can just call next() on it in a loop --> while ((interval = iter.next()) != null) and get all the intervals. Makes sense?
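A minimal sketch of this usage pattern, for illustration only. The types below are hypothetical stand-ins that borrow the names used in this thread (IntervalIterator, scorerAdvanced, next); they are not the branch's actual classes, and the array of matching docs stands in for stepping a real Scorer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins that mimic the pattern discussed here; not the branch's real classes. */
final class IntervalUsageSketch {

  /** A match interval given by its begin and end positions. */
  static final class Interval {
    final int begin, end;
    Interval(int begin, int end) { this.begin = begin; this.end = end; }
  }

  /** Assumed shape of the iterator: reset per document, then drained with next(). */
  interface IntervalIterator {
    void scorerAdvanced(int docId); // position the iterator on docId's intervals
    Interval next();                // null once the current document is exhausted
  }

  /** Toy iterator backed by in-memory positions, one int[] per document. */
  static final class ArrayIntervalIterator implements IntervalIterator {
    private final int[][] positionsByDoc;
    private int[] current = new int[0];
    private int upto;

    ArrayIntervalIterator(int[][] positionsByDoc) { this.positionsByDoc = positionsByDoc; }

    @Override public void scorerAdvanced(int docId) {
      current = positionsByDoc[docId];
      upto = 0;
    }

    @Override public Interval next() {
      if (upto == current.length) return null;
      int pos = current[upto++];
      return new Interval(pos, pos); // a single term covers exactly one position
    }
  }

  /** Reuses one iterator and resets it for every matching document. */
  static List<String> collectAll(int[] matchingDocs, IntervalIterator iter) {
    List<String> out = new ArrayList<>();
    for (int doc : matchingDocs) { // stands in for Scorer.nextDoc() / advance()
      iter.scorerAdvanced(doc);    // must happen before reading this doc's intervals
      Interval interval;
      while ((interval = iter.next()) != null) {
        out.add(doc + ":" + interval.begin + "-" + interval.end);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    int[][] positions = { {1, 4}, {}, {7} };
    IntervalIterator iter = new ArrayIntervalIterator(positions); // created once, up front
    System.out.println(collectAll(new int[] {0, 2}, iter)); // prints [0:1-1, 0:4-4, 2:7-7]
  }
}
```

The point of the design is that nothing position-related happens unless the caller asks for it: a consumer that never calls scorerAdvanced pays nothing beyond the one-time iterator creation.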

> Or ... am I supposed to call .intervals() after each .nextDoc() (but that looks rather costly/wasteful since it's a newly alloc'd TermIntervalIterator each time).

No, that is not what you should do. I think the Scorer#intervals method javadoc makes this clear, no? If not, I should make it clear!

> I'm also confused why TermIntervalIterator.collect only collects one interval (I think?). Shouldn't it collect all intervals for the current doc?

The collect method is special. It's an interface that allows you to collect the "current" interval, or all "current" intervals that contributed to a higher-level interval. For each next() call you should call collect if you need all of the subtree's intervals or the leaves. One use case where we do this right now is highlighting: you can highlight based on phrases, i.e. if you collect on a BQ, or you can do individual terms, i.e. collect leaves. Makes sense?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This is what I had in mind to remove PostingFeatures.isProximityFeature (it's only used in one place...).

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I'm confused on how one uses IntervalIterator along with the Scorer it "belongs" to. Say I want to visit all Intervals for a given TermQuery ... do I first get the TermScorer and then call .intervals, up front? And then call TermScorer.nextDoc(), but then how to iterate over all intervals for that one document? EG, is the caller supposed to call IntervalIterator.scorerAdvanced for each next'd doc?

> so my major goal here was to make this totally detached, optional and lazy ie no additional code in scorer except of IntervalIterator creation on demand. once you have a scorer you can call intervals() and get an iterator. This instance can and should be reused while docs are collected / scored / matched on a given reader. For each doc I need to iterate over intervals I call scorerAdvanced and update the internal structures this prevents any additional work if it is not really needed ie. on a complex query/scorer tree. Once the iterator is setup (scorerAdvanced is called) you can just call next() on it in a loop --> while ((interval = iter.next) != null) and get all the intervals. makes sense?

OK so it sounds like I pull one IntervalIterator up front (and use it for the whole time), and it's my job to call .scorerAdvanced(docID) every time I either .nextDoc or .advance the original Scorer? Ie this "resets" my IntervalIterator onto the current doc's intervals.

> I think the scorer#interval method javadoc make this clear, no? if not I should make it clear!

I was still confused :) I'll take a stab at improving it ... also we should add @experimental...

> I'm also confused why TermIntervalIterator.collect only collects one interval (I think?). Shouldn't it collect all intervals for the current doc?

> the collect method is special. It's an interface that allows to collect the "current" interval or all "current" intervals that contributed to a higher level interval. For each next call you should call collect if you need all the subtrees intervals or the leaves. one usecase where we do this right now is highlighing. you can highlight based on phrases ie. if you collect on a BQ or you can do individual terms ie. collect leaves. makes sense?

Ahhh .... so it visits all intervals in the query tree leading up to the current match interval (of the top query) that you've iterated to? OK. Maybe we can find a better name ... can't think of one now :)
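To make that reading of collect() concrete, here is a rough sketch under stated assumptions. Node, Leaf, Composite and IntervalCollector are hypothetical names invented for illustration, and whether a composite reports its own interval before recursing is also an assumption; only the idea of visiting the "current" intervals down the query tree comes from the discussion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-ins illustrating collect(); not the branch's real API. */
final class CollectSketch {

  /** Receives one callback per interval that contributed to the current match. */
  interface IntervalCollector {
    void collect(String source, int begin, int end);
  }

  /** A node in the query tree that knows the interval it is currently positioned on. */
  interface Node {
    void collect(IntervalCollector collector);
  }

  /** Leaf node, e.g. a single term. */
  static final class Leaf implements Node {
    final String term;
    final int begin, end;
    Leaf(String term, int begin, int end) { this.term = term; this.begin = begin; this.end = end; }

    @Override public void collect(IntervalCollector collector) {
      collector.collect(term, begin, end);
    }
  }

  /** Composite node: reports its own current interval, then recurses into its children. */
  static final class Composite implements Node {
    final String name;
    final int begin, end;
    final List<Node> children;
    Composite(String name, int begin, int end, List<Node> children) {
      this.name = name; this.begin = begin; this.end = end; this.children = children;
    }

    @Override public void collect(IntervalCollector collector) {
      collector.collect(name, begin, end);
      for (Node child : children) {
        child.collect(collector);
      }
    }
  }

  /** Gathers everything visited for the current match, top-level interval first. */
  static List<String> collectCurrent(Node root) {
    List<String> seen = new ArrayList<>();
    root.collect((source, begin, end) -> seen.add(source + "[" + begin + "," + end + "]"));
    return seen;
  }

  public static void main(String[] args) {
    Node near = new Composite("near", 3, 5,
        Arrays.<Node>asList(new Leaf("quick", 3, 3), new Leaf("fox", 5, 5)));
    System.out.println(collectCurrent(near)); // prints [near[3,5], quick[3,3], fox[5,5]]
  }
}
```

A highlighter that wants whole phrases would keep only the top-level callback, while one that wants individual terms would keep only the leaves.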

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Thanks Mike for the commits! Much appreciated!

> OK so it sounds like I pull one IntervalIterator up front (and use it for the whole time), and it's my job to call .scorerAdvanced(docID) every time I either .nextDoc or .advance the original Scorer? Ie this "resets" my IntervalIterator onto the current doc's intervals.

exactly!

> Ahhh .... so it visits all intervals in the query tree leading up to the current match interval (of the top query) that you've iterated to? OK. Maybe we can find a better name ... can't think of one now

a better name for "collect"?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> a better name for "collect"?

Yeah, to somehow reflect that it's visiting/collecting/recursing on the full interval tree ... but nothing comes to mind ...

When I first saw it / read the docs I thought this was analogous to Scorer.score(Collector), ie that it would "bulk collect" all intervals from the iterator.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think the patch for review is incomplete? e.g. I see PostingsFeatures was added to the Scorer api but its not in the patch.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

> I think the patch for review is incomplete? e.g. I see PostingsFeatures was added

It's an inner class of Weight in that patch, but we might move it out!

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> Instead of PostingFeatures.isProximityFeature, can we just use X.compareTo(PostingsFeatures.POSITIONS) >= 0? (We do this for IndexOptions).

I think this is a confusing part of the current patch. For example:

// Check if we can return a BooleanScorer
-      if (!scoreDocsInOrder && topScorer && required.size() == 0) {
+      if (!scoreDocsInOrder && flags == PostingFeatures.DOCS_AND_FREQS && topScorer && required.size() == 0) {

I don't think we should be doing these == comparisons. What if someone sends DOCS_ONLY? (which really ConstantScoreQuery i think should pass down to its subs and so on, so they can skip freq blocks, but thats another thing to tackle).
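The difference between the equality check and an ordinal comparison can be sketched as follows. The enum below is hypothetical: only DOCS_ONLY, DOCS_AND_FREQS and POSITIONS are named in this thread, and both the fourth constant and the declaration order are assumptions made for illustration. The methods model only the flags part of the condition in the quoted diff.

```java
/** Hypothetical enum and checks; a sketch of the comparison style, not the patch's code. */
final class PostingFeaturesSketch {

  /** Assumed ordering: each constant includes everything declared before it. */
  enum PostingFeatures { DOCS_ONLY, DOCS_AND_FREQS, POSITIONS, POSITIONS_AND_PAYLOADS }

  /** Ordinal-based check that could replace a dedicated isProximityFeature() method. */
  static boolean needsPositions(PostingFeatures flags) {
    return flags.compareTo(PostingFeatures.POSITIONS) >= 0;
  }

  /** Equality check in the style of the quoted diff: silently excludes DOCS_ONLY. */
  static boolean canUseBooleanScorerByEquality(PostingFeatures flags) {
    return flags == PostingFeatures.DOCS_AND_FREQS;
  }

  /** Range check: anything below POSITIONS qualifies, including DOCS_ONLY. */
  static boolean canUseBooleanScorerByRange(PostingFeatures flags) {
    return flags.compareTo(PostingFeatures.POSITIONS) < 0;
  }

  public static void main(String[] args) {
    for (PostingFeatures f : PostingFeatures.values()) {
      System.out.println(f
          + " needsPositions=" + needsPositions(f)
          + " byEquality=" + canUseBooleanScorerByEquality(f)
          + " byRange=" + canUseBooleanScorerByRange(f));
    }
  }
}
```

Enum.compareTo orders constants by declaration, so this style only works if the constants are declared from least to most demanding.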

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

There seems to be a lot of unrelated formatting changes in important classes like TermWeight.java etc

Can we factor these out and do any formatting changes separately?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't like the general style of things like Collector.postingsFeatures()

From the naming, you cant tell this is a "getter". In general I think methods like this should be getPostingsFeatures() ?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It's sort of disturbing that if you iterate over intervals for a PhraseQuery we pull two DocsAndPositionsEnums per term in the phrase ...

But then ... this would "typically" be used to find the locations to hilite, right? Ie not for the "main" query? Because if you wanted to do this for the main query you should really use one of the oal.search.intervals.* queries instead, and those pull only a single D&PEnum per Term I think?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It seems like the new oal.search.interval queries are meant to replace spans? So ... should we remove spans? Or is there functionality missing in intervals?

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

> But then ... this would "typically" be used to find the locations to hilite, right? Ie not for the "main" query? Because if you wanted to do this for the main query you should really use one of the oal.search.intervals.* queries instead, and those pull only a single D&PEnum per Term I think?

correct.

> It seems like the new oal.search.interval queries are meant to replace spans? So ... should we remove spans? Or is there functionality missing in intervals?

eventually yes. Currently they don't score based on positions they only filter. My plan was to bring this on trunk including spans. Once on trunk move spans to a module and cut over the functionality query by query. We are currently missing payload support which I think we should add once we are on trunk. makes sense?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

To me Interval.java looks a lot like a span. I think it would be good to resolve this before landing on trunk. If we cant score based on positions, it seems to me the api is not fully baked? e.g. i think it would be better to score based on positions and run benchmarks and so on first.

here we also get a lot more index option flags. I dont like how many of these we have:

- indexoptions
- flags on docsenum
- flags on docsandpositionsenum
- now here, flags on scorer/collector/etc

I am working on some ideas to clean some of this up in trunk separate from this branch. i think this makes the apis really confusing.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

> Interval.java looks a lot like a span. I think it would be good to resolve this before landing on trunk.

what exactly did you expect? I mean its basically the same thing but reusing the name sucks. what do you wanna resolve here?

Regarding your comments, could you put your ideas up here in a somewhat more compact form than 5 1-line comments? This would be great, especially what kind of ideas you have and what you resolve / work on on trunk, so we can maybe be more productive here.

thanks

asfimport commented 12 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Where do we stand on this now? It sounds as though we need to get more implemented before everyone's happy with it being merged in. I can make a start at cutting TestSpansAdvanced and TestSpansAdvanced2 over to intervals tests this week (at a glance they're the only tests we have for Span scoring at the moment), although I guess things are going to go on hold a bit for the ApacheCon.

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

+1 to add scoring! go ahead this would be great.

will you be at apache con?

asfimport commented 12 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

> will you be at apache con?

Not this year :-( Will try and make one next year, though!

asfimport commented 12 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Here's a first attempt at duplicating the Span scoring in IntervalFilterQuery. It needs more tests, and the Explanation needs to be modified, but it's something :-)

One thing I'm uncertain of is the algorithm being used for scoring here. It treats all matching 'spans' (or 'intervals' now, I suppose) in a document as equivalent for the purposes of scoring, weighting only by match distance, but this seems to throw away certain bits of possibly useful info - for example, if we're filtering over a BooleanQuery with a number of SHOULD clauses, then an interval that contains all of them is going to score the same as an interval that contains only two, but with the same overall match distance. This seems wrong... Or maybe I'm just misunderstanding how this gets put together.
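The concern can be illustrated with a small sketch. It assumes the per-match weight has the 1 / (distance + 1) form of Lucene's default sloppyFreq; the class and method names here are hypothetical and this is not the code from the patch.

```java
/** Hypothetical sketch of distance-only scoring; not the patch's implementation. */
final class DistanceScoringSketch {

  /** Per-match weight, assumed to follow the 1 / (distance + 1) form. */
  static float sloppyFreq(int distance) {
    return 1.0f / (distance + 1);
  }

  /** Within-document frequency: the sum of the weights of all matching intervals. */
  static float freq(int[] matchDistances) {
    float sum = 0f;
    for (int distance : matchDistances) {
      sum += sloppyFreq(distance);
    }
    return sum;
  }

  public static void main(String[] args) {
    // Document A: one interval covering two SHOULD clauses, match distance 3.
    // Document B: one interval covering four SHOULD clauses, match distance 3.
    float a = freq(new int[] {3});
    float b = freq(new int[] {3});
    // The number of clauses inside the interval never enters the computation.
    System.out.println(a == b); // prints true
  }
}
```

Any fix along the lines Alan hints at would need the interval to carry extra information, such as how many sub-clauses it covers, into the weight function.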

asfimport commented 12 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I've started to use this branch in an (experimental!) system I'm developing for a client. The good news is that performance is generally much better than the existing system that uses SpanQueries - faster query time and smaller memory footprint, and also nicer GC behaviour (I can't give exact numbers, but suffice to say that where the previous system regularly ran out of memory, this one hasn't yet)!

There are definitely some rough edges, though, which I'll try and smooth out and add as patches.

1) There isn't a replacement for SpanNotQueries - the BrouwerianIterator comes close, but doesn't quite cover all the use cases. In this instance, I need to have the equivalent of a 'not within' operator - match intervals that do not fall within another given interval. I've written a new iterator, which I've called an 'InverseBrouwerianIntervalIterator' for want of a better name, but it definitely could do with some more eyes on it...

2) The API is not very nice when it comes to subclassing Iterators. For example, I have 'anchor' terms at the start and end of documents, which allow users to query for terms within a certain distance from them. These shouldn't be highlighted, so I created an AnchorTermQuery which returned a different type of IntervalIterator that didn't do anything in its collect() method. To do this, I had to create an AnchorTermWeight, an AnchorTermScorer and an AnchorTermIntervalIterator, all of which were more or less copy-pastes of the equivalent Term* classes; it would be nice to make this easier...

3) MultiTermQueries don't return iterators unless you set their rewrite policies to something other than CONSTANT_SCORE_REWRITE.

4) I found a bug in the iterators() method of DisjunctionSumScorer - if all subscorers are PositionFilterScorers, then you can get NPEs if the subscorers have matches that don't pass the filters. I'll add a test case shortly

5) I had to run this without my scoring patch (this case doesn't actually use scoring, so it doesn't matter that much), because MultiTermQueries can blow up in scoring if they get rewritten into blank queries; I guess this wasn't a problem with Span* queries, but I haven't had a chance to work out how to get round it. Will add another test case for this as well.
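For item 1, the 'not within' semantics can be sketched eagerly over in-memory lists. This is a simplified illustration with hypothetical names; the iterator described above works lazily over postings, and this sketch does not reproduce it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical eager version of a 'not within' filter; not the InverseBrouwerianIntervalIterator. */
final class NotWithinSketch {

  static final class Interval {
    final int begin, end;
    Interval(int begin, int end) { this.begin = begin; this.end = end; }

    /** True if this interval lies entirely inside the other one. */
    boolean within(Interval other) {
      return other.begin <= begin && end <= other.end;
    }

    @Override public String toString() { return "[" + begin + "," + end + "]"; }
  }

  /** Keeps the candidates that do not fall within any of the excluding intervals. */
  static List<Interval> notWithin(List<Interval> candidates, List<Interval> excluding) {
    List<Interval> result = new ArrayList<>();
    for (Interval candidate : candidates) {
      boolean contained = false;
      for (Interval outer : excluding) {
        if (candidate.within(outer)) {
          contained = true;
          break;
        }
      }
      if (!contained) {
        result.add(candidate);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Interval> candidates =
        Arrays.asList(new Interval(2, 2), new Interval(6, 7), new Interval(12, 12));
    List<Interval> excluding = Arrays.asList(new Interval(5, 9));
    System.out.println(notWithin(candidates, excluding)); // prints [[2,2], [12,12]]
  }
}
```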

All in all, though, these are looking much better than the equivalent SpanQueries. Position filters on boolean queries in particular work much better - the semantics of SpanQueries are completely wrong for this, and involved generating very heavy queries for pretty simple cases. Nice work!

asfimport commented 12 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

hey alan,

> I've started to use this branch in an (experimental!) system I'm developing for a client.

very good news! cool stuff - can you provide more infos what you are doing there? Do you highlight too?

regarding your latest patch - commit it!

> 1) There isn't a replacement for SpanNotQueries - the BrouwerianIterator comes close, but doesn't quite cover all the use cases. I

can you provide a testcase what it doesn't cover? you can go ahead and commit it even if you don't have a fix.

> 2) The API is not very nice when it comes to subclassing Iterators. For example, I have 'anchor' terms at the start and end of documents, which allow

I am not sure I understand this. if you have marker terms how do they differ from ordinary terms can't you just do a nearOrdered("X", "ENDMARKER") query? I don't see where you need to subclass here. can you elaborate?

> 4) I found a bug in the iterators() method of DisjunctionSumScorer

great, can you submit the testcase?

> 3) MultiTermQueries don't return iterators unless you set their rewrite policies to something other than CONSTANT_SCORE_REWRITE.

Yeah, the problem here is that we use a filter instead of a scorer; you should see an exception, right? I think it would make sense to have a MTQ rewrite a query on a ConstantScoreQuery instead of a filter - we can't get an interval iter from a filter :/

I think overall we should move out of this issue and create separate issues for all your cases. Also for the things Robert mentioned, like exploring "Scorer extends DocsAndPosEnum".

asfimport commented 12 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Hi Simon,

I'll open separate sub-tasks for the issues.

The system I'm building is basically an equivalent of the elasticsearch percolator - we register a bunch of queries, and then run them all against individual documents passing through the system. We then emit the exact positions which have matched, which is a type of highlighting, I guess. The point of the anchor terms is that we don't want to highlight them - if you're searching for a term within five positions of the start of a document, you don't want the first term of the document highlighted as well.

asfimport commented 12 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Just committed a massive test refactoring, which should show where the problems are in DisjunctionIntervalScorer and MultiTermQuery. Lots of the tests fail now, as the previous ones weren't necessarily picking up false positives (UnorderedNearQuery is particularly bad for this).

asfimport commented 11 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I want to get this moving again - will get the branch up to date tomorrow and then iterate from there.

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

YEAH!!!

asfimport commented 11 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Since the last patch went up I've fixed a bunch of bugs (BrouwerianQuery works properly now, as do various nested IntervalQuery subtypes that were throwing NPEs), as well as adding Span-type scoring and fleshing out the explain() methods. The only Span functionality that's missing I think is payload queries. If we want to have all the span functionality in here before it can land on trunk I can work on that next.

It would also be good to do some proper benchmarking. Do we already have something that can compare sets of queries?

asfimport commented 11 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

So at the moment, IntervalFilterScorer doesn't consume all the intervals on a given document when advancing, it just checks if the document has any matching intervals at all. Which is great for speed, but bad for scoring - you want to iterate through the intervals on a document to get the within-doc frequency, which can then be passed to the docscorer. You also need to iterate through everything to deal with payloads.

Is it worth specialising here? Have two query types (or maybe just a flag on the query), so you can optimize for query speed or for scoring. SpanScorer always iterates over all spans, by comparison.
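The trade-off between the two modes can be sketched like this. Intervals are modelled as plain start positions and all names are hypothetical; the counter only exists to show how many intervals each mode has to pull.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Hypothetical illustration of filter-only versus scoring mode; not the branch's scorer. */
final class MatchModeSketch {

  /** Wraps an iterator and counts how many intervals were pulled from it. */
  static final class CountingIterator implements Iterator<Integer> {
    private final Iterator<Integer> delegate;
    int pulled;

    CountingIterator(Iterator<Integer> delegate) { this.delegate = delegate; }

    @Override public boolean hasNext() { return delegate.hasNext(); }

    @Override public Integer next() {
      pulled++;
      return delegate.next();
    }
  }

  /** Filter-only mode: pull at most one interval, enough to know that the document matches. */
  static boolean matches(Iterator<Integer> intervals) {
    if (!intervals.hasNext()) {
      return false;
    }
    intervals.next();
    return true;
  }

  /** Scoring mode: consume every interval to get the within-document frequency. */
  static int freq(Iterator<Integer> intervals) {
    int count = 0;
    while (intervals.hasNext()) {
      intervals.next();
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    CountingIterator fast = new CountingIterator(Arrays.asList(4, 9, 15).iterator());
    CountingIterator full = new CountingIterator(Arrays.asList(4, 9, 15).iterator());
    System.out.println(matches(fast) + ", pulled " + fast.pulled); // prints true, pulled 1
    System.out.println(freq(full) + ", pulled " + full.pulled);    // prints 3, pulled 3
  }
}
```

Whether this becomes two query types or a flag on one query is exactly the open question in the comment above; the sketch only shows why the two modes have different costs.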

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

> The only Span functionality that's missing I think is payload queries. If we want to have all the span functionality in here before it can land on trunk I can work on that next.

I really think we can skip that for now.

> It would also be good to do some proper benchmarking. Do we already have something that can compare sets of queries?

We do have LuceneUtil, but it's not exactly straightforward. I will take a look at what we can do here.

> Is it worth specialising here? Have two query types (or maybe just a flag on the query), so you can optimize for query speed or for scoring. SpanScorer always iterates over all spans, by comparison.

I think we should specialize the Scorer here. Visiting the least amount of intervals possible is maybe worth it.

So from my perspective, what we should try exploring is making the scorer a DocsAndPosEnum in the branch and see if we can remove the Interval API mostly in favor of the DocsAndPos API. The only problem I have with this is really that if a given scorer consumes intervals from a subscorer, it needs to buffer all of them if its parent needs all of them too. Not sure if it is worth it at this point. Ideally I would want to have DocsAndPosEnum folded into DocsEnum first too.

asfimport commented 11 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I'm going to try applying the patch from #5590 here and see if that helps. Next step would be to add startPosition() and endPosition() to DocsEnum, and try re-implementing the filter queries using methods directly on child scorers, rather than pulling a separate interval iterator.

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Alan, I don't think you can cut over to DocsEnum or DocsAndPositionsEnum. DocsAndPosEnum has a significant problem that doesn't allow an efficient PosIterator impl underneath it. It defines that DocsAndPosEnum#nextPosition should be called at most DocsEnum#freq() times, which is fine on a low level but bogus for lazy pos iterators, since we don't know ahead of time how many intervals we might have. I think we first need to fix this problem before we can go and do this refactoring, makes sense? PhraseQuery currently only knows its freq because it's greedy and pulls all intervals at once.

asfimport commented 11 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

Hm, OK. So can we change the nextPosition() API to return -1 once the positions have been exhausted, rather than becoming undefined? So a consumer would look something like:

    int pos;
    while ((pos = dp.nextPosition()) != -1) {
      // do stuff here
    }

Implementations that need to know the frequency call freq(), others can iterate lazily.

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Alan: it's a nice idea... we should seriously consider something like this (-1 or NO_MORE_POSITIONS or whatever) if it would allow this stuff to just work over the existing D&P api.

I dont know what the cost would be to existing impls (e.g. Lucene41PostingsReader would need some code changes), but hopefully small or nil.

And of course having an API like this would be well worth any small performance hit.

asfimport commented 11 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

> Hm, OK. So can we change the nextPosition() API to return -1 once the positions have been exhausted, rather than becoming undefined? So a consumer would look something like:

yeah I was saying the same thing yesterday when I talked about this to rob. This would make stuff more consistent too. I will open an issue

asfimport commented 11 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think this should be explored in the branch versus a separate issue E.g. we shouldnt impose this on postings implementations unless it sorta works with the whole design here.

I'd also really recommend NO_MORE_POSITIONS not -1. -1 currently means "invalid" (e.g. you should not have called nextPosition).

Its not like any Scorer would need to check for this, because if you try to do prox operations on a field that omits position information, the user should be getting an exception up-front from the Weight.
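A sketch of what a sentinel-terminated positions API makes possible, under stated assumptions: PositionsEnum is a hypothetical stand-in rather than the real DocsAndPositionsEnum, and the value chosen for NO_MORE_POSITIONS (Integer.MAX_VALUE, by analogy with DocIdSetIterator.NO_MORE_DOCS) is an assumption, not something decided in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sentinel-based positions API; a sketch, not Lucene's actual classes. */
final class NoMorePositionsSketch {

  /** Assumed sentinel value, chosen here by analogy with DocIdSetIterator.NO_MORE_DOCS. */
  static final int NO_MORE_POSITIONS = Integer.MAX_VALUE;

  /** Minimal stand-in for the positions part of a postings enum. */
  interface PositionsEnum {
    int nextPosition(); // returns NO_MORE_POSITIONS once exhausted
  }

  /** Eager source that knows all of its positions up front, like a postings reader. */
  static PositionsEnum of(int... positions) {
    return new PositionsEnum() {
      private int upto;

      @Override public int nextPosition() {
        return upto < positions.length ? positions[upto++] : NO_MORE_POSITIONS;
      }
    };
  }

  /** Lazy filter: it cannot report a freq() without consuming its input first. */
  static PositionsEnum within(PositionsEnum in, int min, int max) {
    return () -> {
      int pos;
      while ((pos = in.nextPosition()) != NO_MORE_POSITIONS) {
        if (pos >= min && pos <= max) {
          return pos;
        }
      }
      return NO_MORE_POSITIONS;
    };
  }

  /** Consumer loop in the style proposed above, with the sentinel instead of -1. */
  static List<Integer> drain(PositionsEnum dp) {
    List<Integer> out = new ArrayList<>();
    int pos;
    while ((pos = dp.nextPosition()) != NO_MORE_POSITIONS) {
      out.add(pos);
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(drain(within(of(1, 5, 8, 20), 4, 10))); // prints [5, 8]
  }
}
```

Because the consumer stops on the sentinel, the filtering enum never has to know in advance how many positions it will produce, which is the property the freq()-bounded contract does not give lazy iterators.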

asfimport commented 11 years ago

Alan Woodward (@romseygeek) (migrated from JIRA)

I've been chipping away at this for a bit. Here's a summary of what I've done:

- Applied #5590, and also added startPosition() and endPosition() to DocsEnum
- Changed the postings readers to return NO_MORE_POSITIONS once nextPosition() has been called freq() times
- Extended the ConjunctionScorer and DisjunctionScorer implementations to return positions
- Added an abstract PositionFilteredScorer with reset(int doc) and doNextPosition() methods
- Added a bunch of concrete implementations (ExactPhraseQuery, NotWithinQuery, OrderedNearQuery, UnorderedNearQuery, RangeFilterQuery) with tests - these are all in the posfilter package

I still need to implement SloppyPhraseQuery and MultiPhraseQuery, but I actually think these won't be too difficult with this API. Plus there are a bunch of nocommits regarding freq() calculations, and this doesn't work at all with BooleanScorer - we'll probably need a way to tell the scorer that we do or don't want position information.

@s1monw and I talked about this on IRC the other day, about resolving collisions in ExactPhraseQuery, but I think that problem may go away doing things this way. I may have misunderstood though - if so, could you add a test to TestExactPhraseQuery showing what I'm missing, Simon?

asfimport commented 11 years ago

ASF subversion and git services (migrated from JIRA)

Commit 1535436 from @romseygeek in branch 'dev/branches/LUCENE-2878' https://svn.apache.org/r1535436

LUCENE-2878: Merge from trunk

asfimport commented 10 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Now we are talking....

apache / lucene

Allow Scorer to expose positions and payloads aka. nuke spans [LUCENE-2878] #3952