apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Allow Scorer to expose positions and payloads aka. nuke spans [LUCENE-2878] #3952

Closed asfimport closed 6 years ago

asfimport commented 13 years ago

Currently we have two somewhat separate types of queries: those that can make use of positions and payloads (mainly the Span*Queries) and everything else. Yet Span*Query doesn't really do scoring comparable to what other queries do, and at the end of the day they duplicate a lot of code all over Lucene. Span*Queries are also limited to other Span*Query instances, so you cannot use a TermQuery or a BooleanQuery with SpanNear or anything like that. Besides the Span*Query limitation, other queries lack a quite interesting feature: they cannot score based on term proximity, since scorers don't expose any positional information.

All those problems have bugged me for a while now, so I started working on this using the bulkpostings API. I would have done the first cut on trunk, but TermScorer there works on a BlockReader that does not expose positions, while the one in this branch does. I started by adding a new Positions class which users can pull from a scorer; to prevent unnecessary positions enums I added ScorerContext#needsPositions and eventually ScorerContext#needsPayloads to create the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this API and others simply return null instead.

To show that the API really works, and that our BulkPostings work fine with positions too, I cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A nice side effect of this was that the positions bulk-reading implementations got some exercise and now :) all work with positions, while payloads for bulk reading are kind of experimental in the patch and only work with the Standard codec.
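To make the intended usage a bit more concrete, here is a minimal caller-side sketch, assuming the names mentioned above (ScorerContext#needsPositions, a Positions enum pulled via Scorer#positions()) plus hypothetical start()/end() accessors; none of this is the final API, just an illustration of the idea:

// Hypothetical caller-side usage; method names follow the description above
// (ScorerContext#needsPositions, Scorer#positions) and may differ in the patch.
Scorer scorer = weight.scorer(readerContext, ScorerContext.def().needsPositions(true));
int doc;
while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  float score = scorer.score();              // normal scoring is unchanged
  Positions positions = scorer.positions();  // non-null only because we asked for it
  while (positions.next()) {                 // step through this doc's positions
    int start = positions.start();           // hypothetical accessors
    int end = positions.end();
    // use [start, end] for proximity checks, highlighting, etc.
  }
}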

So all spans now work on top of TermScorer (I truly hate spans since today), including the ones that need payloads (StandardCodec ONLY)!! I didn't bother to implement the other codecs yet since I want to get feedback on the API and on this first cut before I go on with it. I will upload the corresponding patch in a minute.

I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk first, but after today's pain I need a break first :).

The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't look into the MemoryIndex BulkPostings API yet)


Migrated from LUCENE-2878 by Simon Willnauer (@s1monw), 11 votes, resolved Apr 11 2018 Attachments: LUCENE-2878_trunk.patch (versions: 2), LUCENE-2878.patch (versions: 30), LUCENE-2878-OR.patch, LUCENE-2878-vs-trunk.patch, PosHighlighter.patch (versions: 2) Linked issues:

Sub-tasks:

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

here is the patch

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I took a quick glance at the issue (not a proper review though: it's too easy to get lost in the spans!) Here are my thoughts/questions:

Seems like if we could get the API how we like it on this issue, and fix all codecs' bulk-positions implementations to work with payloads, this would be a nice step we could make...

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I don't like PayloadSpanUtil either, I think it's a huge trap and it should move to src/test or be removed.

It's not that bad of a trap :) It has warnings all over it - some people use it out there. I think it's more useful than trapful - if not just as demo code.

It is src/test quality at best though. I never would have committed it to core - I expected to put it in contrib at best. But someone else committed it at the time, so I just did a 'meh' and put warnings about experimental on it. I forgot to take it out of the patch on the issue or something. I never even used it or tested it much myself - I just wanted to put in some example/test code for this use case - I know some users on the list have used it for some form of simple highlighting, where the coords to highlight are in the payloads. I wouldn't trust it these days though :)

It will be found less if it's in src/test, but that is probably a fine place for it - it would need some love/support above what it has to really make sense in any module/contrib. I've never really meant to maintain it myself - and I've never really felt it belonged in core (just never had the itch to squash it either).

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It's not that bad of a trap - it has warnings all over it - some people use it out there. I think it's more useful than trapful - if not just as demo code.

In this case maybe contrib? And if we were careful/re-organized tests a bit, MultiSpansWrapper could be pkg-private.

asfimport commented 13 years ago

Paul Elschot (migrated from JIRA)

A first impression, no time to read the whole patch yet.

I like Positions being abstract, instead of the current Spans interface. The name Positions is understated though. It has a begin and an end that can be iterated in place, so it is actually "a" spans. I wish I had a better name. I hope that the addition to TermScorer does not slow it down.

The patch has some duplicates (see Scorer for example), but that is easy to fix. It also contains a 2005 copyright for a new file. Time goes by.

More later...

asfimport commented 13 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

In this case maybe contrib?

I dunno - I think I would be for it if I was willing to work on it and maintain it. But honestly, I am not. Some users have figured out how to use it for something, but I never really used it for anything. So my interest is low, and the list of other stuff I'd love to tackle is long. I'm still pro removing some stuff from contrib, so I don't want to be responsible for more cruft going in.

Perhaps anyone using it should just copy the code out to their project ... that's really the spirit it was written in anyway - here is sample code to get you started, showing what you can do with spans and payloads.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

First of all, thanks for the reviews so far... seems like the overall feedback is positive. I should have made it clearer that most of this patch is a PoC more than anything else. The changes to spans are there rather to prove correctness and give the bulk API a little exercise.

I don't like the MultiSpansWrapper... it only exists to support PayloadSpanUtil and tests? I don't like PayloadSpanUtil either, I think it's a huge trap and it should move to src/test or be removed.

The reason for MultiSpansWrapper is pretty simple. MultiBulkEnum is not entirely implemented yet, so I had to move the Spans to per-segment using AtomicReaderContext. What I needed was an easy way to cut over all the tests, and that seemed to be straightforward. It's really just a testing class and should go away over time.

any ideas of the performance comparison of this patch as-is (SpanScorerWrapper over the Scorer/bulk enum) versus the existing Spans which doesn't use the bulk API?

not yet, once I get back to this I will run a benchmark

A first impression, no time to read the whole patch yet.

My plan is to cut over spans to AtomicReaderContext on trunk, so this patch will be way, way smaller and easier to read. Maybe you should just skip all the span stuff for now; those are just ARC-based changes.

PayloadSpanUtil

<dream> Spans as they exist today should go away anyway so this is not much of a deal for now.</dream>

The name Positions is understated though. It has a begin and an end that can be iterating/iterated in place, so it is actually "a" spans.

I agree, we might need to move away from this start/end thing and have a wrapper on top of Position(s).

I hope that the addition to TermScorer does not slow it down.

Those positions == null checks are well predictable and should be optimized away - if HotSpot plays tricks on us we can still specialize a scorer....

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Attaching my current state. I committed #3956 today, which contained the cut-over to SpanQuery#getSpans(AtomicReaderContext); that made this patch much smaller and easier to read now. I also fixed MemoryIndex such that now all tests pass with -Dtests.codec=Standard.

Not much progress other than that - more tomorrow....

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I haven't looked at the patch, but one of the biggest issues with Spans is the duality within the spans themselves. The whole point of spans is that you care about position information. However, in order to get both the search results and the positions, you have to, effectively, execute the query twice, once to get the results and once to get the positions. A Collector like interface, IMO, would be ideal because it would allow applications to leverage position information as the queries are being scored and hits being collected. In other words, if we are rethinking how we handle position based queries, let's get it right this time and make it so it is actually useful for people who need the functionality.

As for PayloadSpanUtil, I think that was primarily put in to help w/ highlighting at the time, but if it has outlived its usefulness, then dump it. If we are consolidating all queries to support positions and payloads, then it shouldn't be needed, right?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I haven't looked at the patch, but one of the biggest issues with Spans is the duality within the spans themselves. The whole point of spans is that you care about position information. However, in order to get both the search results and the positions, you have to, effectively, execute the query twice, once to get the results and once to get the positions. A Collector like interface, IMO, would be ideal because it would allow applications to leverage position information as the queries are being scored and hits being collected. In other words, if we are rethinking how we handle position based queries, let's get it right this time and make it so it is actually useful for people who need the functionality.

Grant, I completely agree! Any help here is very much welcome. I am so busy fixing all the BulkEnums and spin-offs from this issue, but I hope to have a first sketch of how I think this should work by the end of the week!

As for PayloadSpanUtil, I think that was primarily put in to help w/ highlighting at the time, but if it has outlived its usefulness, then dump it. If we are consolidating all queries to support positions and payloads, then it shouldn't be needed, right?

Yeah!

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Attaching my current state - still rough & work in progress though.... In this patch I added a PositionIntervalIterator returned from Scorer#positions() and implemented some use cases like filtering Near / Within (unordered) with Term & BooleanScorer2. BooleanScorer2, if in conjunction mode, now has a PositionIntervalIterator implementation that returns the minimal interval of its unordered query terms and can easily be filtered within a range of positions (range, first) or within relative positions (near) simply by wrapping it in a PositionFilterQuery. An example of this is in TestBooleanQuery. The PositionIntervalIterator decouples positional operations nicely from scoring / matching, so a Boolean AND query gets the normal query score but can be restricted further based on positions. Even positional scoring can simply be plugged on top of it.
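For readers not looking at TestBooleanQuery, a rough sketch of the kind of usage described above; the PositionFilterQuery constructor arguments here are guesses, not the patch's actual signature:

// Illustrative only: an ordinary Boolean AND query whose matches are further
// restricted by position. The PositionFilterQuery constructor is assumed.
BooleanQuery inner = new BooleanQuery();
inner.add(new TermQuery(new Term("body", "apache")), BooleanClause.Occur.MUST);
inner.add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST);

// keep only documents where the clauses match within 5 positions of each other
Query near = new PositionFilterQuery(inner, /* maxGap */ 5);
TopDocs hits = searcher.search(near, 10);
// scoring is still the plain BooleanQuery score; positions only restrict matches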

Further, with a specialized Collector implementation - positions could be fetched only if the score is within the top N to prevent positions matching for all documents.

One big problem is BooleanScorer, which does the bucket-based scoring - I will ignore that one for now. I need to run some benchmarks to see if this does any good, but I haven't had time to do so. If somebody has time to take a look at this patch, feedback would be very much appreciated.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

just attaching my current state

asfimport commented 13 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This patch looks awesome (and, enormous)! Finally we are making progress merging Span* into their corresponding non-positional queries :)

I like how you added payloads to the BulkPostings API, and how someone is finally testing the bulk positions code.

So now I can run any Query, and ask it to enumerate its positions (PositionInterval iterator), but not paying any price if I don't want positions. And it's finally single source... caller must say up-front (when pulling the scorer) if it will want positions (and, separately, also payloads – great).

It's great that you can just emulate spans on the new api with SpanScorerWrapper/MockSpanQuery, and use PositionFilterQuery to filter positions from a query, eg to turn a BooleanQuery into whatever SpanQuery is needed – very nice!

How does/should scoring work? EG do the SpanQueries score according to the details of which position intervals match?

The part I'm wondering about is what API we should use for communicating positions of the sub scorers in a BooleanQuery to consumers like position filters (for matching) or eg Highlighter (which really should be a core functionality that works w/ any query). Multiplying out ("denormalizing") all combinations (into a flat stream of PositionIntervals) is going to be too costly in general, I think?

Maybe, instead of the denormalized stream, we could present a UnionPositionsIntervalIterator, which has multiple subs, where each sub is its own PositionIntervalIterator? This way eg a NEAR query could filter these subs in parallel (like a merge sort) looking for a match, and (I think) then presenting its own union iterator to whoever consumes it? Ie it'd only let through those positions of each sub that satisfied the NEAR constraint.

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

And it's finally single source... caller must say up-front (when pulling the scorer) if it will want positions (and, separately, also payloads – great).

Does it make sense that we could just want AttributeSources as we go here?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

How does/should scoring work? EG do the SpanQueries score according to the details of which position intervals match?

I didn't pay any attention to scoring yet. IMO scoring should be left to the query which is using the positions, so a higher-level query like a NearQuery could just put a custom scorer on top of a boolean conjunction and apply its own proximity-based score. This should be done after we have the infrastructure for it. I think that opens up some nice scoring improvements. I am not sure if we should add proximity scoring to existing queries; I'd rather aim towards making it easy to customize.

The part I'm wondering about is what API we should use for communicating positions of the sub scorers in a BooleanQuery to consumers like position filters (for matching) or eg Highlighter (which really should be a core functionality that works w/ any query). Multiplying out ("denormalizing") all combinations (into a flat stream of PositionIntervals) is going to be too costly in general, I think?

I thought about that for a while and I think we should enrich the PositionIntervalIterator API to enable the caller to pull the actual sub-intervals instead of an interval from the next method. Something like this:

public abstract class PositionIntervalIterator implements Serializable {

  public abstract PositionInterval next() throws IOException;

  /**
   * Returns all sub-intervals for the next accepted interval.
   */
  public abstract PositionIntervalIterator nextSubIntervals() throws IOException;

  public abstract PositionIntervalIterator[] subs(boolean inOrder);
}

so that if you are interested in the plain positions of each term, e.g. for highlighting, you can pull them per match occurrence. That way you have positional matching and you can still iterate the subs.
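A hypothetical consumer of that enriched API might then look roughly like this (PositionInterval's begin/end fields are assumed here):

// Sketch of a consumer of the proposed API: for each accepted interval,
// drill into the per-term sub-intervals, e.g. to collect highlight offsets.
PositionIntervalIterator it = scorer.positions();        // assumed accessor
PositionInterval interval;
while ((interval = it.next()) != null) {                  // each positional match
  PositionIntervalIterator subs = it.nextSubIntervals();  // per-term intervals of that match
  PositionInterval term;
  while ((term = subs.next()) != null) {
    // record term.begin / term.end for highlighting
  }
}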

Maybe, instead of the denormalized stream, we could present a UnionPositionsIntervalIterator, which has multiple subs, where each sub is its own PositionIntervalIterator? This way eg a NEAR query could filter these subs in parallel (like a merge sort) looking for a match, and (I think) then presenting its own union iterator to whoever consumes it? Ie it'd only let through those positions of each sub that satisfied the NEAR constraint.

I don't get that entirely ;)

Does it make sense that we could just want AttributeSources as we go here?

You mean like we don't extend Scorer but add an AttributeSource to it? I think this is really a core API and should be supported directly.

asfimport commented 13 years ago

Liwei (migrated from JIRA)

What should I do for it?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

It doesn't seem worth keeping this issue so tightly coupled to the bulkpostings branch. I originally did this on bulkpostings since it had support for positions in TermScorer (new bulk API), whereas on trunk we don't have positions there at all. Yet, I think bulkpostings should be sorted out separately and we should rather move this over to trunk. On trunk we can get rid of all the low-level bulk API hacks in the patch; the only thing that is missing here is a TermScorer that can score based on positions / payloads. I think, since we have ScorerContext and given how this works in the patch, we can simply implement a TermScorer that works on DocsAndPositionsEnum and swap it in once positions are requested.

I think I can move this over to trunk soon.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Here is a patch that applies to trunk. I added a simple, maybe slowish, PositionTermScorer that is used when positions are required. This is really work in progress, but I am uploading it just in case somebody is interested.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

some more cleanups, all tests pass now on trunk

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I've been fiddling with highlighter performance, and this looks like a great step towards being able to do highlighting in an integrated way that doesn't require a lot of post-hoc recalculation etc. I worked up a hackish highlighter that uses it as a POC, partly just as a way of understanding what you've done, but this could eventually become usable.

Here are a few comments:

I found it convenient to add: boolean Collector.needsPositions() and needsPayloads() and modified IndexSearcher.search(AtomicReaderContext[] leaves, Weight weight, Filter filter, Collector collector) to set up the ScorerContext accordingly

And then I am accessing the scorer.positions() from Collector.collect(), which I think is a very natural use of this API? At least it was intuitive for me, and I am pretty new to all this.
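For readers following along, a minimal sketch of what such a collector could look like under those assumptions (needsPositions()/needsPayloads() are the proposed hooks from the comment above; the rest is illustrative, not the patch's code):

// Illustrative collector that asks for positions and reads them per hit.
class PositionCachingCollector extends Collector {
  private Scorer scorer;

  @Override public void setScorer(Scorer scorer) { this.scorer = scorer; }

  public boolean needsPositions() { return true; }   // proposed hook
  public boolean needsPayloads()  { return false; }  // proposed hook

  @Override public void collect(int doc) throws IOException {
    PositionIntervalIterator positions = scorer.positions(); // view for the current doc
    PositionInterval interval;
    while ((interval = positions.next()) != null) {
      // cache interval for doc, e.g. for a later highlighting pass
    }
  }

  @Override public void setNextReader(AtomicReaderContext context) { }
  @Override public boolean acceptsDocsOutOfOrder() { return true; }
}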

I think that when it comes to traversing the tree of PositionsIntervalIterators, the API you propose above might have some issues. What would the status of the returned iterators be? Would they have to be copies of some sort in order to preserve the state of the iteration (so scoring isn't impacted by some other consumer of position intervals)? The iterators that are currently in flight shouldn't be advanced by the caller usually (ever?), or else the state of the dependent iterator (the parent) won't be updated correctly, I think? I wonder if (1) you could add PositionInterval PositionIntervalIterator.current() and (2) return from subs() and nextSubIntervals() some unmodifiable wrappers - maybe a superclass of PII that would only provide current() and subs(), but not allow advancing the iterator.

I hope you'll be able to pick it up again soon, Simon!

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Hey Mike, great to see interest here! :)

boolean Collector.needsPositions() and needsPayloads()

+1 that makes lots of sense.

Let me give you some insight into the current patch's state. This whole thing is still a prototype and needs lots of cleanups all over the place. I moved it to trunk lately since I don't want to wait for bulkpostings to move forward. I think there are lots of perf impacts in its current state, but eventually I think it will be much better, more powerful and cleaner than spans after all.

And then I am accessing the scorer.positions() from Collector.collect(), which I think is a very natural use of this API? At least it was intuitive for me, and I am pretty new to all this.

This is one way of doing it, for sure. The other way would be to wrap the top-level scorer and do your work in there with a PositionScoringQueryWrapper or something like that which would set up the ScorerContext for you. The main question is what you want to do with positions. For matching based on positions you have to use some scorer, I guess, since you need to check for every document whether it is within your position constraints, something like near(a AND b). If you want to boost based on your positions I think you need to do a two-phase collection: Phase 1 simply runs the query collecting n + X results, and Phase 2 re-ranks the results from Phase 1 by pulling the positions.
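A rough sketch of that two-phase idea (illustrative only; the proximity boost helper is hypothetical):

// Phase 1: run the query the normal, fast way, over-collecting a little.
TopDocs phase1 = searcher.search(query, n + extra);

// Phase 2: revisit only those hits, pull positions, and re-rank.
ScoreDoc[] hits = phase1.scoreDocs;
for (ScoreDoc hit : hits) {
  hit.score += proximityBoost(query, hit.doc);  // hypothetical helper that reads positions for hit.doc
}
Arrays.sort(hits, new Comparator<ScoreDoc>() {
  public int compare(ScoreDoc a, ScoreDoc b) {
    return Float.compare(b.score, a.score);     // descending by adjusted score
  }
});
// keep the first n entries of hits as the final ranking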

I think that when it comes to traversing the tree of PositionsIntervalIterators, the API you propose above might have some issues

I agree this is very flaky right now, and I only tried to mimic the spans behavior here to show that this is as powerful as spans for now. But eventually we need a better API for this, so it's good you are jumping in with a use case!

What would the status of the returned iterators be?

Currently, if you pull an iterator you are depending on the state of your scorer. Let me give you an example with TermScorer: if you are on document X you can iterate the positions for this document; whether you exhaust them or not, once the scorer is advanced your PositionIterator points to the positions of the document you advanced to. The same is true for all other scorers that expose positions. Yet, some problems arise here with BooleanScorer (in contrast to BooleanScorer2) since it reads documents in blocks, which makes it very hard (nearly impossible) to get efficient positions for this scorer (it's used for OR queries only, with NOT clauses < 32). So PositionIterators never preserve positions for a document you pulled the interval for. You can basically pull the iterator only once and keep it until your scorer is exhausted. Bottom line here is that you are depending on the DocsAndPositionsEnum your TermScorer is using; once this is advanced your positions are advanced too. We could think of a separate enum here that advances independently - hmm, that could actually work too, let's keep that in mind.

(so scoring isn't impacted by some other consumer of position intervals)

There should be only one consumer really. Which use case do you have in mind where multiple consumers are using the iterator?

PositionInterval PositionIntervalIterator.current()

What is the returned PI here again? In the TermScorer case that is trivial, but what would a BooleanScorer return here?

(2) return from subs() and nextSubIntervals() some unmodifiable wrappers - maybe a superclass of PII that would only provide current() and subs(), but not allow advancing the iterator.

I think that could make sense, but let me explain why this is there right now. Currently a scorer has a defined PositionIterator, which could be a problem later. For instance, if I want the minimal position interval (ordered) of all boolean clauses for query X, but for query Y I want the same interval unordered (out of order), I need to replace the logic in the scorer somehow. So to make that more flexible I exposed all the subs here so you can run your own algorithm. I would love to see better solutions since I only hacked this up in a couple of days though.

Currently this patch provides an AND (ordered & unordered) and a BLOCK PositionIterator based on this paper: http://vigna.dsi.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics The OR implementation is still missing, so if you want to jump on that issue and help, there is tons of space for improvements.
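For readers who don't want to dig through the paper, the unordered-AND case boils down to finding, per document, the smallest window that covers at least one position of every term. A standalone sketch over plain sorted position arrays (not the patch's lazy iterator) looks like this:

// Smallest window covering one position from each term's sorted position list.
// Standalone illustration of the minimal-interval idea; not the patch's code.
static int[] smallestWindow(int[][] positionsPerTerm) {
  int k = positionsPerTerm.length;
  int[] idx = new int[k];                       // cursor into each term's list
  int bestStart = -1, bestEnd = -1;
  while (true) {
    int minTerm = -1, minPos = Integer.MAX_VALUE, maxPos = Integer.MIN_VALUE;
    for (int t = 0; t < k; t++) {
      if (idx[t] >= positionsPerTerm[t].length) {
        return bestStart < 0 ? null : new int[] { bestStart, bestEnd };
      }
      int p = positionsPerTerm[t][idx[t]];
      if (p < minPos) { minPos = p; minTerm = t; }
      if (p > maxPos) { maxPos = p; }
    }
    if (bestStart < 0 || maxPos - minPos < bestEnd - bestStart) {
      bestStart = minPos;
      bestEnd = maxPos;
    }
    idx[minTerm]++;                             // advance the leftmost term to try a tighter window
  }
}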

Eventually I think we can leave spans as they are right now and concentrate on the API / functionality; making things fast under the hood can be done later, but getting things right and flexible is the most important part here.

Mike, would you be willing to upload a patch for your hacked collector etc to see what you have done?

I hope you'll be able to pick it up again soon, Simon!

I would love to ASAP; currently I have so much DocValues stuff to do that this might take a while until I get back to this.

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

there should be only one consumer really. Which usecase have you in mind where multiple consumers are using the iterator?

I guess I am coming at this from the perspective of a Highlighter; the Highlighter wants to iterate over the top-level Scorer, finding each of its matching positions, and then for each of those, it wants to iterate over all the individual terms' positions. Possibly some clever HL of the future will be interested in intermediate-level nodes in the tree as well, like highlighting a near-span, or coalescing phrases. The problem I see is that with the current API the only way to retrieve the lower-level positions is to advance their iterators, but if that is done directly (without the knowledge of the enclosing scorer and its iterator), the scoring will be messed up. I guess that's what I meant by multiple consumers - of course you are right, there should be only one "writer" consumer that can advance the iteration. My idea is that there could be many readers, though. In any case, I think it is typical for an iterator that you can read the current position as many times as you want, rather than "read once" and expect the caller to cache the value?

what is the returned PI here again? In the TermScorer case that is trivial but what would a BooleanScorer return here?

It has its own PI right? I think it is the minimum interval containing some terms that satisfy the boolean conditions.

I think that could make sense, but let me explain why this is there right now. Currently a scorer has a defined PositionIterator, which could be a problem later. For instance, if I want the minimal position interval (ordered) of all boolean clauses for query X, but for query Y I want the same interval unordered (out of order), I need to replace the logic in the scorer somehow. So to make that more flexible I exposed all the subs here so you can run your own algorithm. I would love to see better solutions since I only hacked this up in a couple of days though.

Hmm, I haven't yet looked at how BooleanScorer2 and BooleanScorer work, but I understand there is some additional complexity there. Perhaps if the only distinction is ordered/unordered there might be a special case for that when you create the Scorer, rather than exposing internals to the caller? But I don't know - I would have to understand this better. Maybe there are other cases where that could be needed.

Mike, would you be willing to upload a patch for your hacked collector etc to see what you have done?

The PosHighlighter is a bit messy - filled with debugging and testing code and so on, and it's also slow because of the need to match positions->offsets in kind of a gross way. Robert M had an idea for storing this mapping in the index, which would improve things there, but I haven't done that. In any case, I'll be happy to share the patch when I get back home and can clean it up a bit. Maybe if I have a chance I will look into implementing OR queries - I stumbled on that limitation right away!

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Attaching a patch with a simple highlighter using the positions() iterators. Includes a PositionTreeIterator for pulling out leaf positions. The names are getting a bit long: I almost wrote PositionIntervalIteratorTree? Maybe a PositionIntervalIterator could just be a Positions? The variables are all called positions...

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

updated patch provides positions() for Boolean OR (disjunction) queries, but only with n should match = 1. I refactored the ConjunctionPositionIterator and its Queue in order to do this without a lot of cut-and-paste.

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Updated PosHighlighter patch that actually works :) PosCollector now caches positions when collecting docs. This way highlighting can be decoupled, and doesn't have to be performed on docs that will eventually drop off the top n list. This seems to work ok, but it would be nice to come up with a way to make this usable with existing collectors.

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

So PositionIterators never preserve positions for a document you pulled the interval for. You can basically pull the iterator only once and keep it until your scorer is exhausted. Bottom line here is that you are depending on the DocsAndPositionsEnum your TermScorer is using; once this is advanced your positions are advanced too. We could think of a separate enum here that advances independently - hmm, that could actually work too, let's keep that in mind.

So after working with this a bit more (and reading the paper), I see now that it's really not necessary to cache positions in the iterators. So never mind all that! In the end, for some uses like highlighting I think somebody needs to cache positions (I put it in a ScorePosDoc created by the PosCollector), but I agree that doesn't belong in the "lower level" iterator.

Eventually I think we can leave spans as they are right now and concentrate on the API / functionality, making things fast under the hood can be done later but getting things right to be flexible is the most important part here.

As I'm learning more, I am beginning to see this is going to require sweeping updates. Basically everywhere we currently create a DocsEnum, we might now want to create a DocsAndPositionsEnum, and then the options (needs positions/payloads) have to be threaded through all the surrounding APIs. I wonder if it wouldn't make sense to encapsulate those options (needsPositions/needsPayloads) in some kind of EnumConfig object. Just in case, down the line, there is some other information that gets stored in the index, and wants to be made available during scoring, then the required change would be much less painful to implement.

I'm thinking for example (Robert M's idea), that it might be nice to have a positions->offsets map in the index (this would be better for highlighting than term vectors). Maybe this would just be part of payload, but maybe not? And it seems possible there could be other things like that we don't know about yet?

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Mike, it's so awesome that you're helping here. I will be back on Wednesday and post comments / suggestions then.

simon

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Yeah that would be great - I hope I'm not being too irritating – enjoy your vacation! I worked on using the standard Highlighter with a TokenStream built on a PosIterator, and the results are promising in terms of performance, but I think I am still grappling with the right way to retrieve positions.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

Hey Mike, I applied all your patches and walked through them; this looks great. I mean, this entire thing is far from committable, but I think we should take this further and open a branch for it. I want to commit both your latest patch and the highlighter prototype and work from there.

So after working with this a bit more (and reading the paper), I see now that it's really not necessary to cache positions in the iterators. So never mind all that! In the end, for some uses like highlighting I think somebody needs to cache positions (I put it in a ScorePosDoc created by the PosCollector), but I agree that doesn't belong in the "lower level" iterator.

After looking into your patch I think I understand now what is needed to enable low-level stuff like highlighting. What is missing here is a positions collector interface that you can pass in and that collects positions at the lowest levels, like for phrases or simple terms. The PositionIterator itself (btw, I think we should call it Positions or something along those lines - try not to introduce spans in the name :) ) should accept this collector and simply call back for each low-level position if needed. For highlighting I think we should also go with a two-stage approach: the first stage does the matching (with or without positions) and the second stage takes the first stage's results and does the highlighting. That way we don't slow down the query, and the second stage can even choose a different rewrite method (for MTQ this is needed as we don't have positions on filters).

As I'm learning more, I am beginning to see this is going to require sweeping updates. Basically everywhere we currently create a DocsEnum, we might now want to create a DocsAndPositionsEnum, and then the options (needs positions/payloads) have to be threaded through all the surrounding APIs. I wonder if it wouldn't make sense to encapsulate those options (needsPositions/needsPayloads) in some kind of EnumConfig object. Just in case, down the line, there is some other information that gets stored in the index, and wants to be made available during scoring, then the required change would be much less painful to implement.

What do you mean by sweeping updates? For the enum config I think we only have 2 or 3 places where we need to make the decision: 1. TermScorer, 2. PhraseScorer (maybe 2 goes away anyway) - so this is not needed for now, I think?

I'm thinking for example (Robert M's idea), that it might be nice to have a positions->offsets map in the index (this would be better for highlighting than term vectors). Maybe this would just be part of payload, but maybe not? And it seems possible there could be other things like that we don't know about yet?

yeah this would be awesome... next step :)

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

For highlighting I think we should also go with a two-stage approach: the first stage does the matching (with or without positions) and the second stage takes the first stage's results and does the highlighting. That way we don't slow down the query, and the second stage can even choose a different rewrite method (for MTQ this is needed as we don't have positions on filters).

I think this would be a good approach; it's really the same algorithm that you generally want for positional scoring: score all the docs the 'fast' way, then reorder only the top N (e.g. the first two pages of results), which will require using the position iterator and doing some calculation that you typically add to the score.

So if we can generalize this in a way where you can do this in your collector, I think it would be reusable for this as well.

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

what do you mean by sweeping updates?

I meant adding positions to filters would be a sweeping update. But it sounds as if the idea of rewriting differently is a better approach (certainly much less change).

For highlighting I think we should also go a two stage approach.

I think I agree. The only possible trade-off that goes the other way is in the case where you have the positions available already during initial search/scoring, and there is not too much turnover in the TopDocs priority queue during hit collection. Then a Highlighter might save some time by not re-scoring and re-iterating the positions if it accumulated them up front (even for docs that were eventually dropped off the queue). I think it should be possible to test out both approaches given the right API here though?

The callback idea sounds appealing, but I still think we should also consider enabling the top-down approach: especially if this is going to run in two passes, why not let the highlighter drive the iteration? Keep in mind that positions consumers (like highlighters) may possibly be interested in more than just the lowest-level positions (they may want to see phrases, eg, and near-clauses - trying to avoid the s-word).

Another consideration is ordering. I think ? that positions are retrieved from the index in document order. This could be a natural order for many cases, but score order will also be useful. I'm not sure whose responsibility the sorting should be. Highlighters will want to be able to optimize their work (esp for very large documents) by terminating after considering only the first N matches, where the ordering could either be score or document-order.

I'm glad you will create a branch - this patch is getting a bit unwieldy. I think the PosHighlighter code should probably ? end up as test code only - I guess we'll see. It seems like we could get further faster using the existing Highlighter, with a positions-based TokenStream; I'll post a patch once the branch is in place.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

I think I agree. The only possible trade-off that goes the other way is in the case where you have the positions available already during initial search/scoring, and there is not too much turnover in the TopDocs priority queue during hit collection. Then a Highlighter might save some time by not re-scoring and re-iterating the positions if it accumulated them up front (even for docs that were eventually dropped off the queue). I think it should be possible to test out both approaches given the right API here though?

Yes, I think we should go and provide both possibilities here.

The callback idea sounds appealing, but I still think we should also consider enabling the top-down approach: especially if this is going to run in two passes, why not let the highlighter drive the iteration? Keep in mind that positions consumers (like highlighters) may possibly be interested in more than just the lowest-level positions (they may want to see phrases, eg, and near-clauses - trying to avoid the s-word).

I am not sure if I understand this correctly. I think the collector should be some kind of a visitor that walks down the query/scorer tree, and each scorer can ask if it should pass the current positions to the collector, something like this:

class PositionCollector {

  public boolean register(Scorer scorer) {
    if (interestedInScorer(scorer)) {
      // store info about the scorer
      return true;
    }
    return false;
  }

  /*
   * Called by a registered scorer for each position change
   */
  public void nextPosition(Scorer scorer) {
    // collect positions for the current scorer
  }
}

that way the iteration process is still driven by the top-level consumer but if you need information about intermediate positions you can collect them.

Another consideration is ordering. I think that positions are retrieved from the index in document order. This could be a natural order for many cases, but score order will also be useful. I'm not sure whose responsibility the sorting should be. Highlighters will want to be able to optimize their work (esp for very large documents) by terminating after considering only the first N matches, where the ordering could either be score or document-order.

So the order here depends on the first collector, I figure. The usual case is that you do your search and retrieve the top N documents (those are also the top N you want to highlight, right?), then you pass in your top N and do the highlighting collection based on those top N. In that collection you are not interested in all matches but only in the top N from the previous collection. The simplest, yet maybe not the best, way to do this is using a simple filter that is built from the top N docs.
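A minimal sketch of such a filter, assuming the trunk API of the time (per-segment getDocIdSet with a docBase on the reader context); illustrative only:

// Filter over the docids gathered in the first pass, for the second
// (highlighting) pass. Field names like context.docBase follow trunk of the
// time approximately; treat this as a sketch, not the patch's code.
class TopDocsFilter extends Filter {
  private final Set<Integer> topLevelDocs = new HashSet<Integer>();

  TopDocsFilter(TopDocs topDocs) {
    for (ScoreDoc sd : topDocs.scoreDocs) {
      topLevelDocs.add(sd.doc);                 // top-level docids from pass one
    }
  }

  @Override
  public DocIdSet getDocIdSet(AtomicReaderContext context) throws IOException {
    OpenBitSet bits = new OpenBitSet(context.reader.maxDoc());
    for (int doc : topLevelDocs) {
      int local = doc - context.docBase;        // map top-level docid into this segment
      if (local >= 0 && local < context.reader.maxDoc()) {
        bits.set(local);
      }
    }
    return bits;                                // OpenBitSet doubles as a DocIdSet
  }
}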

I will go ahead and create the branch now

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I am not sure if I understand this correctly. I think the collector should be some kind of a visitor that walks down the query/scorer tree and each scorer can ask if it should pass the current positions to the collector something like this:

Yes that sounds right

Re: ordering; I was concerned about the order in which the positions are iterated within each document, not so much the order in which the documents are returned. I think this is an issue for the highlighter mostly, which can "score" position-ranges in the document so as to return the "best" snippet. This kind of score may be built up from tfidf scores for each term, proximity, length of the position-ranges and so on.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

mike I created a branch here: https://svn.apache.org/repos/asf/lucene/dev/branches/positions

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

Not sure what to call these patch files now? This is relative to the positions branch...

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

To make further progress, I think we need to resolve the position API. The testMultipleDocumentsOr test case illustrates the problem with the approach I was trying: walking the PositionIterator tree when collecting documents. Something like the PositionCollector API could work, but I think we still need to solve the problem Mike M alluded to back at the beginning:

... a NEAR query could filter these subs in parallel (like a merge sort) looking for a match, and (I think) then presenting its own union iterator to whoever consumes it? Ie it'd only let through those positions of each sub that satisfied the NEAR constraint.

We want to highlight positions that explain why the document matches the query. Not all terms that match the term queries will count - some of them should be "filtered out" by near-conditions; ie in a PhraseQuery, matching terms not in the phrase should not be highlighted. I think if I just register a callback with the sub-scorers (scoring terms), I would see all the terms, right?

Also, I wonder if callers could use the existing Scorer.ScorerVisitor to walk the Scorer tree and register interest in positions with a scorer?

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

I made some progress with a Collector-based positions API; my tests (and existing tests) are passing now. I am getting all the right positions back for highlighting from a variety of queries, so I think the basic idea is workable. But there are a bunch of things I don't like still (and more queries/scorers to instrument); needs more work, for sure.

The basic idea is to extend Collector so it can accept positions. Collectors simply return true from Collector.needsPositions() and implement collectPositions(Scorer, PositionIntervalIterator). The way of hooking this up is the easiest possible for now; I added Scorer.setPositionCollector(Collector); Collectors would call it in Collector.setScorer(). This approach forced one change that probably isn't good for performance, so we might want to revisit this, and "hook up" the Scorer.positionCollector during construction, say with the register() idea you mentioned. The problem currently is that in DisjunctionSumScorer, there is some initialization that happens (advancing all the iterators) during the constructor, before setScorer() is ever called. I've changed it to happen lazily, but this introduces an extra if(initialized) check into every call to advance() and nextDoc().

Another problem with this callback approach (from the performance perspective) is that you end up paying the cost to check whether there is a positionCollector registered in a fairly tight inner loop (while matching/scoring), I think, which you'd rather not do if there is no interest in positions. It would probably be better to introduce some kind of Scorer wrapper that calls collectPositions() and only gets created if needed. Another thought I had was to push the callback into PositionIntervalIterator, but somehow someone has to know to actually iterate over those iterators; if it's the Scorer, you have the same problem as now, but the Collector can't do it (in collect(), say) because it is often too late then; the Scorer will have advanced its internal state (and thus its positions() iterator) to the next document.

The core problem solved here is how to report positions that are not consumed during scoring, and also those that are, but not to report positions that don't contribute to the query match (terms outside their phrase). In order to do this, different Scorers do different things: the Boolean family of scorers (and wrapping scorers like ConstantScorer) simply pass any positionCollector down to their child Scorers, since they effectively want to report all positions found. PositionTermScorer reports its positions, naturally. The interesting case is PositionFilterScorer, which filters its child Scorers. I added PositionIntervalIterator.getTermPositions() to enable this; this walks the tree of position iterators and returns a snapshot of their current state (as another iterator) so the consumer can retrieve all the term positions as filtered by intermediate iterators without advancing them.

There are a few (11) failing tests with this branch+patch (I ran the Lucene tests only), but they seem unrelated (TestFlushByRamOrCountsPolicy has 5, e.g.), so I am ignoring them?

Simon - thanks for setting up the positions branch - I would really appreciate it if you have the time to review this because I think there is headway, but I'm also sure there is lots of room for improvement.

asfimport commented 13 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

We want to highlight positions that explain why the document matches the query. Not all terms that match the term queries will count - some of them should be "filtered out" by near-conditions; ie in a PhraseQuery, matching terms not in the phrase should not be highlighted. I think if I just register a callback with the sub-scorers (scoring terms), I would see all the terms, right?

This is why I think we should add a dedicated collector API (i.e. not part of Collector - maybe an interface?). The current API gives you a "view" for each match, meaning that once you advance the iterator you get the positions for the "current" positional match. I think the caller should also drive the collection of intermediate positions / intervals. The big challenge here is to collect the positions you are interested in efficiently. I agree that the if (foo == null) is a problem as long as foo is not final, so maybe we should try to make them final and make the position collector part of the scorer setup (just a thought); we could do that using a ScorerContext for instance.

To make further progress, I think we need to resolve the position API. The testMultipleDocumentsOr test case illustrates the problem with the approach I was trying: walking the PositionIterator tree when collecting documents. Something like the PositionCollector API could work, but I think we still need to solve the problem Mike M alluded to back at the beginning:

Agreed, we should work on the API. I looked at your patch and some changes appear to be unnecessary IMO; for instance, the problems in testMultipleDocumentsOr are not actually a problem if we sketch this out properly. As I said above, if the collector is part of the initialization we can simply pass it to the leaves or intermediate scorers and collect safely even if scorers are advanced, since during document collection the view should be stable, right? So the bottom line here is that we need an API that is capable of collecting fine-grained parts of the scorer tree. The only way I see to do this is 1. have a subscribe / register method and 2. do this subscription during scorer creation. Once we have this we can implement very simple collect methods that only collect positions for the current match, like in a near query: while the current matching document is collected, all contributing TermScorers have their position interval ready for collection. The collect method can then be called from the consumer instead of in the loop; this way we only get the positions we need, since we know the document we are collecting.

The core problem solved here is how to report positions that are not consumed during scoring, and also those that are,

this can be solved by my comment above?

The interesting case is PositionFilterScorer, which filters its child Scorers. I added PositionIntervalIterator.getTermPositions() to enable this; this walks the tree of position iterators and returns a snapshot of their current state (as another iterator) so the consumer can retrieve all the term positions as filtered by intermediate iterators without advancing them.

This would work the same way, eh? We register during setup, something like:

void PositionCollector#registerScorer(Scorer)

Then we can decide whether we need that scorer, or rather its positions, for collection or not. The entire iteration should only be driven by the top-level consumer; if you advance an intermediate iterator you might break some higher-level algorithms like conjunction / disjunction though. So let's drive this further: let's say we have all the collectors we are interested in - when should we collect positions? I think the top-level consumer should 1. advance the positions and 2. call collect on the scorers we are interested in. While I talk about this I start realizing that it might even be easier than this if we walk the PositionIterator tree rather than the scorer tree and collect the position iterators from there. This is already possible with the subs() call, right? What we essentially need is a method that returns the current interval for each of the iterators. It still might be needed to have a collect method on the iterator so that something like conjunctions can call collect on the subs if needed?

Oh man this is all kind of tricky ey :)

There are a few (11) failing tests with this branch+patch (ran lucene tests only), but they seem unrelated (TestFlushByRamOrCountsPolicy has 5, eg) I am ignoring?

I don't see anything failing... can you attach a file with the failures?

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

if (foo == null) is a problem as long as foo is not final, so maybe we should try to make them final and make the position collector part of the scorer setup (just a thought); we could do that using a ScorerContext for instance.

Yes, agreed. I just wanted to implement something simple first. I think we can fix the setup problem separately from the actual collection/reporting of intervals. We can eventually undo the changes to DisjunctionSumScorer, and get rid of those extra if()s. Also, as you say the other tests can be made final if we do them during setup.

While I talk about this I start realizing that it might even be easier than this if we walk the PositionIterator tree rather than the scorer tree and collect the position iterators from there.

Did you look at SubTermPositionIterator (I think that's what I called it) and getTermPositions() yet? They are supposed to be providing pretty much just that capability.

Oh man this is all kind of tricky ey

I tore my hair out all weekend and lost sleep! But I think it actually is where we want it now, aside from the registration piece.

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

OK I think I brushed by some of your comments, Simon, in my hasty response, sorry. Here's a little more thought, I hope:

So bottom line here is that we need an api that is capable of collecting fine grained parts of the scorer tree. The only way I see doing this is 1. have a subscribe / register method and 2. do this subscription during scorer creation. Once we have this we can implement very simple collect methods that only collect positions for the current match like in a near query, while the current matching document is collected all contributing TermScorers have their positioninterval ready for collection. The collect method can then be called from the consumer instead of in the loop this way we only get the positions we need since we know the document we are collecting.

I think it's necessary to have both a callback from within the scoring loop, and a mechanism for iterating over the current state of the iterator. For boolean queries, the positions will never be iterated in the scoring loop (all you care about is the frequencies, positions are ignored), so some new process: either the position collector (highlighter, say), or a loop in the scorer that knows positions are being consumed (needsPositions==true) has to cause the iteration to be performed. But for position-aware queries (like phrases), the scorer will iterate over positions, and in order to score properly, I think the Scorer has to drive the iteration? I tried a few different approaches at this before deciding to just push the iteration into the Scorer, but none of them really worked properly.

Let's say, for example, that a document is collected. Then the position consumer comes in to find out what positions were matched - it may already be too late, because during scoring some of the positions may have been consumed (e.g. to score phrases)? It's possible I may be suffering from some delusion, though :) But if I'm right, then it means there has to be some sort of callback mechanism in place during scoring, or else we have to resign ourselves to scoring first, and then re-setting and iterating positions in a second pass.

I actually think that if we follow through with the registration-during-construction idea, we can have the tests done in an efficient way during scoring (with final boolean properties of the scorers), and it can be OK to have them in the scoring loop.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But if I'm right, then it means there has to be some sort of callback mechanism in place during scoring, or else we have to resign ourselves to scoring first, and then re-setting and iterating positions in a second pass.

But I think this is what we want? If there are 10 million documents that match a query, but our priority queue size is 20 (1 page), we only want to do the expensive highlighting on those 20 documents.

Its the same for the positional scoring, its too expensive to look at positions for all documents, so you re-order maybe the top 100 or so.

Or maybe I'm totally confused by the comments!

asfimport commented 13 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

But I think this is what we want? If there are 10 million documents that match a query, but our priority queue size is 20 (1 page), we only want to do the expensive highlighting on those 20 documents.

Yes - the comments may be getting lost in the weeds a bit here; sorry. I've been assuming you'd search once to collect documents and then search again with the same query plus a constraint that limits it to the gathered docids, with an indication that positions are required - this pushes you towards some sort of collector-style callback API. Maybe life would be simpler if instead you could just say getPositionIterator(docid, query). But that would force you into a third pass (I think), if you wanted positional scoring too, wouldn't it?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But that would force you actually into a third pass (I think), if you wanted positional scoring too, wouldn't it?

I think that's OK, because the two things are different:

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

FWIW, I do think there are use cases where one wants positions over all hits (or most such that you might as well do all), so if it doesn't cause problems for the main use case, it would be nice to support it. In fact, in these scenarios, you usually care less about the PQ and more about the positions.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

FWIW, I do think there are use cases where one wants positions over all hits (or most such that you might as well do all), so if it doesn't cause problems for the main use case, it would be nice to support it. In fact, in these scenarios, you usually care less about the PQ and more about the positions.

I don't think this issue should try to solve that problem: if you are doing that, it sounds like you are using the wrong Query!

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I don't think this issue should try to solve that problem: if you are doing that, it sounds like you are using the wrong Query!

It's basically a boolean match on any arbitrary Query where you care about the positions. Pretty common in e-discovery and other areas. You have a query that tells you all the matches and you want to operate over the positions. Right now, it's a pain as you have to execute the query twice. Once to get the scores and once to get the positions/spans. If you have a callback mechanism, one can do both at once.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't understand the exact use case... it still sounds like the wrong query? What "operating" over the positions do you need to do?

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

In the cases where I've both done this and seen it done, you often have an arbitrary query that matches X docs. You then want to know where exactly the matches occur, and then you often want to do something in a window around those matches. Right now, w/ Spans, you have to run the query once to get the scores and then run it a second time to get the windows. The times I've seen it, the result is most often given to some downstream process that does deeper analysis of the window, so in these cases X can be quite large (thousands, if not more). In those cases, some people care about the score, some do not. For instance, if one is analyzing all the words around the name of a company, your search term would be the company name and you want to iterate over all the positions where it matched, looking for other words near it (perhaps sentiment words or other things).

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

In those cases, some people care about the score, some do not. For instance, if one is analyzing all the words around the name of a company, your search term would be the company name and you want to iterate over all the positions where it matched, looking for other words near it

Grant, I'm not sure an inverted index is even the best data structure for what you describe.

I just don't want us to confuse the issue with the nuking of spans/speeding up highlighting/enabling positional scoring use cases which are core to search.

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I'm not sure an inverted index is even the best data structure for what you describe

The key is you usually have a fairly complex Query to begin with, so I do think it is legitimate and it is the right data structure. It is always driven by the search results. I've seen this use case multiple times, where multiple is more than 10, so I am pretty convinced it is beyond just me. I think if you are taking away the ability to create windows around a match (if you read my early comments on this issue, I brought it up from the beginning), that is a pretty big loss. I don't think the two things are mutually exclusive. As long as I have a way to get at the positions for all matches, I don't much care how. A "collector"-type callback interface or a way for one to iterate all positions for a given match should be sufficient.

That being said, if Mike's comments about a collector-like API are how it is implemented, I think it should work. In reality, I think one would just need a way to, for whatever number of results, be told about positions as they happen. Naturally, the default should be to only do this after the top X are retrieved, when X is small, but I could see implementing it in the scoring loop on certain occasions (and I'm not saying Lucene need have first-order support for that). As long as you don't preclude me from doing that, it should be fine.

I'll try to find time to review the patch in more depth in the coming day or so.