Closed (asfimport closed this issue 6 years ago)
Robert Muir (@rmuir) (migrated from JIRA)
The key is you usually have a fairly complex Query to begin with, so I do think it is legitimate and it is the right data structure.
Really, just because it's complicated? Accessing other terms 'around the position' seems like accessing the document in a non-inverted way.
I've seen this use case multiple times, where multiple is more than 10, so I am pretty convinced it is beyond just me.
Really? If this is so common, why do the spans get so little attention? If the queries are so complex, how is this even possible now, given that spans have so many problems, even basic ones (e.g. discarding boosts)?
If performance here is so important towards looking at these 'windows around a match' (which is gonna be slow as shit via term vectors), why don't I see codecs that e.g. deduplicate terms and store pointers to the term windows around themselves in payloads, and things like that for this use case?
I don't think we need to lock ourselves into a particular solution (such as a per-position callback API) for something that sounds like it's really slow already.
Grant Ingersoll (@gsingers) (migrated from JIRA)
Really, just because it's complicated? Accessing other terms 'around the position' seems like accessing the document in a non-inverted way.
Isn't that what highlighting does? This is just highlighting on a much bigger set of documents. I don't see why we should prevent users from doing it just b/c you don't see the use case.
Really? If this is so common, why do the spans get so little attention? If the queries are so complex, how is this even possible now, given that spans have so many problems, even basic ones (e.g. discarding boosts)?
Isn't that the point of this whole patch? To bring "spans" into the fold and treat as first class citizens? I didn't say it happened all the time. I just said it happened enough that I think it warrants being covered before one "nukes spans".
If performance here is so important towards looking at these 'windows around a match' (which is gonna be slow as shit via term vectors),
why don't I see codecs that e.g. deduplicate terms and store pointers to the term windows around themselves in payloads, and things like that for this use case?
Um, b/c it's open source and not everything gets implemented the minute you think of it?
I don't think we need to lock ourselves into a particular solution (such as a per-position callback API) for something that sounds like it's really slow already.
Never said we did.
Robert Muir (@rmuir) (migrated from JIRA)
Isn't that what highlighting does? This is just highlighting on a much bigger set of documents. I don't see why we should prevent users from doing it just b/c you don't see the use case.
Well, it is different: I'm not saying we should prevent users from doing it, but we shouldn't slow down normal use cases either. I think it's fine for this to be a 2-pass operation, because any performance differences from it being 2-pass across many documents are going to be completely dwarfed by the term vector access!
Grant Ingersoll (@gsingers) (migrated from JIRA)
Yeah, I agree. I don't want to block the primary use case, I'm just really hoping we can have a solution for the second one that elegantly falls out of the primary one and doesn't require a two pass solution. You are correct on the Term Vec access, but for large enough sets, the second search isn't trivial, even if it is dwarfed. Although, I think it may be possible to at least access them in document order.
Michael Sokolov (@msokolov) (migrated from JIRA)
I hope you all will review the patch and see what you think. My gut at the moment tells me we can have it both ways with a bit more tinkering. I think that as it stands now, if you ask for positions you get them in more or less the most efficient way we know how. At the moment there is some performance hit when you don't want positions, but I think we can deal with that. Simon had the idea we could rely on the JIT compiler to optimize away the test we have if we set it up as a final false boolean (totally do-able if we set up the state during Scorer construction), which would be great and convenient. I'm no compiler expert, so not sure how reliable that is - is it? But we could also totally separate the two cases (say with a wrapping Scorer? - no need for compiler tricks) while still allowing us to retrieve positions while querying, collecting docs, and scoring.
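To make the final-boolean idea concrete, here is a toy sketch (hypothetical names, not the patch's actual Scorer code) of the pattern: a final boolean fixed at Scorer construction guarding all position work. On the reliability question: as far as I know, HotSpot only reliably constant-folds static finals, but for an instance field the branch is usually still eliminated by inlining and branch prediction, so the cost should be near zero either way.

```java
// Toy sketch (not the branch's code): a Scorer-like class whose
// position-collection check is a final boolean fixed at construction.
// Once the JIT inlines score() for a given instance, the dead branch
// can be dropped entirely when needsPositions is false.
public class FinalFlagDemo {

    static final class ToyScorer {
        private final boolean needsPositions; // decided once, at "Scorer construction"
        private int positionsRecorded;

        ToyScorer(boolean needsPositions) {
            this.needsPositions = needsPositions;
        }

        float score(int freq) {
            float score = (float) Math.sqrt(freq);
            if (needsPositions) {     // constant per instance; elidable when false
                positionsRecorded += freq;
            }
            return score;
        }

        int positionsRecorded() {
            return positionsRecorded;
        }
    }

    public static void main(String[] args) {
        ToyScorer plain = new ToyScorer(false);
        ToyScorer positional = new ToyScorer(true);
        plain.score(4);
        positional.score(4);
        System.out.println(plain.positionsRecorded());      // 0
        System.out.println(positional.positionsRecorded()); // 4
    }
}
```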
Simon Willnauer (@s1monw) (migrated from JIRA)
Yeah, I agree. I don't want to block the primary use case, I'm just really hoping we can have a solution for the second one that elegantly falls out of the primary one and doesn't require a two pass solution. You are correct on the Term Vec access, but for large enough sets, the second search isn't trivial, even if it is dwarfed. Although, I think it may be possible to at least access them in document order.
Grant, as far as I understand your concerns, I think they are addressed already. If you want to do span-like matching (what spans do today) you can already do that: you can simply advance the iterator during search and get the matches / positions. Or do I misunderstand what you are saying...
Grant Ingersoll (@gsingers) (migrated from JIRA)
Cool. I think as positions become first class citizens and as this stuff gets faster, we're going to see more and more use of positional information in apps, so it will likely become more common.
Simon Willnauer (@s1monw) (migrated from JIRA)
Mike & all other interested users :) I think I got around all the pre-scorer-creation collector setup etc. by detaching Scorer from Positions (and its iteration + collection) entirely. On the lowest level, TermScorer now uses two enums: one for scoring (no positions) and one for the position iterator, if needed. This change required some more upstream changes, since the consumer now has to advance the positions to the next doc the scorer points to. Yet this gives us some more freedom in how and when to consume the positions. A wrapping scorer can still consume the positions, or we can simply leave this to the collector.
I think this gets reasonably close to what we need, while still pretty rough. Mike, what do you think, does that help with highlighting?
I also added two types of collectors: one collects only leaves (term positions) and the other collects the intermediate (composite) intervals. I call them by default without null checking; the default is simply an empty collector, so hopefully the compiler will no-op this.
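Roughly this pattern, sketched with toy names (not the actual patch classes): install a shared no-op collector instead of null-checking on every position, so the hot loop carries no branch.

```java
// Toy sketch of the "default empty collector" idea: instead of
// null-checking a possibly-absent PositionCollector on every position,
// install a shared no-op instance so the hot loop has no branch.
public class EmptyCollectorDemo {

    interface PositionCollector {
        void collect(int doc, int position);

        // Shared no-op default: always safe to call, trivially inlinable.
        PositionCollector EMPTY = (doc, position) -> {};
    }

    static final class CountingCollector implements PositionCollector {
        int count;
        public void collect(int doc, int position) { count++; }
    }

    // The "scorer" always calls the collector, never checks for null.
    static void scoreDoc(int doc, int[] positions, PositionCollector collector) {
        for (int pos : positions) {
            collector.collect(doc, pos);
        }
    }

    public static void main(String[] args) {
        scoreDoc(1, new int[] {3, 7, 11}, PositionCollector.EMPTY); // no-op path
        CountingCollector c = new CountingCollector();
        scoreDoc(1, new int[] {3, 7, 11}, c);
        System.out.println(c.count); // 3
    }
}
```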
Michael Sokolov (@msokolov) (migrated from JIRA)
Looks good, Simon!
So - working with two enums; one for basic scoring w/o positions, and one for gathering positions allows additional flexibility and cleaner separation between the position-aware code and the scorers, and makes it more straightforward to implement the desired API.
We can now set up a PositionCollector (it's good to keep this separate from Collector) that collects both term positions and (separately) composite position intervals (like phrases, intervals containing conjoined terms, etc).
Some will be reported naturally if a position-aware scorer consumes them; the collector can iterate through the remainder by calling collect(). Actually, I might suggest renaming PositionIntervalIterator.collect() to distribute(), to distinguish it from its counterpart, PositionCollector.collect().
Do you have any concern about the two iterators getting out of sync? I noticed the nocommit, I guess that's what you meant? What's the scope for mischief - should we be thinking about making it impossible for the user of the API to get themselves in trouble? Say, for example, I call advanceTo(randomDocID) - I could cause my PositionFilterQuery to get out of whack, maybe?
I am going to clean up the PosHighlighter tests a bit, get rid of dead code, etc., possibly add some tests for the composite interval stuff, and do a little benchmarking.
Simon Willnauer (@s1monw) (migrated from JIRA)
actually I might suggest renaming PositionIntervalIterator.collect() to distribute(), to distinguish it from its counterpart, PositionCollector.collect().
how about gatherPositions() ?
Do you have any concern about the two iterators getting out of sync? I noticed the nocommit, I guess that's what you meant?
Actually, I am not super concerned about this; it's all up to the API consumer. The nocommit is just a reminder that we need to fix this method (PII#doc()) to return the actual document the DocsAndPositionsEnum, or rather the iterator, points to right now. I think we should start sketching out the API and write some javadoc to make clear how things work. Besides working on highlighting, I think we should also cut over the remaining queries to positions and copy some of the span tests to positions (a dedicated issue for this would be helpful; this gets a little big).
should we be thinking about making it impossible for the user of the API to get themselves in trouble? Say, for example, I call advanceTo(randomDocID) - I could cause my PositionFilterQuery to get out of whack, maybe?
Phew, I think we can work around this, but we need to make sure we don't lose flexibility. Maybe we need to rethink how PositionFilterQuery works. Let's leave that for later :)
For spans, I think we should move them to the new queries module and eventually out of core (we should have a new issue for this, no?). For the position iterating stuff, I think we can mainly concentrate on getting positions to work and leave payloads for later.
Further I think we should also open a ticket for highlighting as well as for positional scoring where we can add the 2 stage collector stuff etc.
I will create a "positions branch" version so we can flag issues correctly.
I am going to clean up the PosHighlighter tests a bit, get rid of dead code, etc., possibly add some tests for the composite interval stuff, and do a little benchmarking.
awesome, if you clean up the patch make sure we have the right headers in all new files and add @lucene.experimental to the classes. I want to commit our stage soonish (once you've cleaned it up) and continue with fine-grained issues.
I am glad that you spend so much time on this, man! Making positions first-class citizens is very important and it will pave the way to getting rid of spans eventually.
Michael Sokolov (@msokolov) (migrated from JIRA)
how about gatherPositions() ?
Seems OK; anything but collect!
I want to commit our stage soonish (once you cleaned it up) and continue with fine grained issues.
Good - yes it would be helpful to split out some issues now: finish up API, more queries, positional scoring and highlighting? Do you have a plan for PhraseQuery? It looks scary to me!
API note: I wonder if it still makes sense to use Collector.needsPositions() as the trigger for requesting positions - if Collectors are not really what end up doing the gathering of positions?
I ran some quick benchmarks, and the results are promising - highlighting with positions is more than 10x faster than regular highlighting and slightly (10-15%?) faster than fast vector highlighter. While doing this I found a bug in DisjunctionPositionIterator.advanceTo() - it could return a docId that none of its subqueries matched, so you'd eventually get a NPE. Fix is in the patch I'm uploading. Oh yes - also added SingleMatchScorer.positions()
I am glad that you spend so much time on this, man! Making positions first-class citizens is very important and it will pave the way to getting rid of spans eventually.
This is exciting, I think! Glad you are able to work on it again. I will probably slow down a bit since I am traveling for a few days, but I'll be back next week.
Simon Willnauer (@s1monw) (migrated from JIRA)
Good - yes it would be helpful to split out some issues now: finish up API, more queries, positional scoring and highlighting? Do you have a plan for PhraseQuery? It looks scary to me!
I committed the latest patch with some more cleanups, headers, tests etc. I also started working on the PhraseQuery; the exact case works already, but I need some more time and brain cycles for the sloppy part (it is scary). I am going to open a new issue for this now.
I ran some quick benchmarks, and the results are promising - highlighting with positions is more than 10x faster than regular highlighting and slightly (10-15%?) faster than fast vector highlighter
Awesome! I can't wait to do some more benchmarks, though.
This is exciting, I think! Glad you are able to work on it again. I will probably slow down a bit since I am traveling for a few days, but I'll be back next week.
Same here, I was traveling this week and will be again next week, so let's see how much progress we can make here. :) Looks good so far.
Simon Willnauer (@s1monw) (migrated from JIRA)
mike, I created subtasks (listed below the attached files) for this issue since this gets kind of huge. I also made you a JIRA contributor so you can assign issues to yourself. Please don't hesitate to open further issues / subtasks as we proceed.
Simon Willnauer (@s1monw) (migrated from JIRA)
hey folks,
Due to heavy modifications on trunk I had almost no choice but to create a new branch and manually move over the changes via selective diffs. The branch is now here: https://svn.apache.org/repos/asf/lucene/dev/branches/LUCENE-2878
the current state of the branch is: it compiles :)
lots of nocommits / todos and several tests failing due to not implemented stuff on new specialized boolean scorers. Happy coding everybody!
Alan Woodward (@romseygeek) (migrated from JIRA)
Patch changing the Scorer#positions() signature to Scorer#positions(needsPayloads, needsOffsets), and implementing the payload passing functionality. All Span payload tests now pass.
Simon Willnauer (@s1monw) (migrated from JIRA)
Alan this is awesome. I fixed some compile errors in solr and modules land and test-core passes! I will go ahead and commit this to the branch. I think next is either fixing all missing queries (PhraseScorer and friends) or exposing offsets. Feel free to create a subtask for the offsets though. My first step here would be to put the offsets next to PositionInterval#begin/end as offsetBegin/End. This is more of a coding task than anything else in the beginning since this info needs to be transported up the PositionIntervalIterator "tree" during execution. on the lowest level (TermPositions) you can simply assign it via DocsAndPositionsEnum#start/endOffset() since that returns -1 if offsets are not indexed.
thanks & good job!
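A minimal sketch of that propagation idea, with made-up classes (not the branch's actual PositionInterval): composite intervals take the min/max of their children's offsets and keep the -1 sentinel whenever any child has no offsets indexed.

```java
// Toy sketch of propagating offsets up an interval "tree" using the
// -1 sentinel that the postings enum returns when offsets are not
// indexed. A composite interval takes the min/max of its children's
// offsets, but only when every child actually has them.
public class OffsetPropagationDemo {

    static final class Interval {
        final int begin, end;             // positions
        final int offsetBegin, offsetEnd; // character offsets, or -1 if unavailable

        Interval(int begin, int end, int offsetBegin, int offsetEnd) {
            this.begin = begin; this.end = end;
            this.offsetBegin = offsetBegin; this.offsetEnd = offsetEnd;
        }
    }

    // Combine two child intervals into a composite (e.g. conjunction) interval.
    static Interval compose(Interval a, Interval b) {
        boolean haveOffsets = a.offsetBegin >= 0 && b.offsetBegin >= 0;
        return new Interval(
            Math.min(a.begin, b.begin),
            Math.max(a.end, b.end),
            haveOffsets ? Math.min(a.offsetBegin, b.offsetBegin) : -1,
            haveOffsets ? Math.max(a.offsetEnd, b.offsetEnd) : -1);
    }

    public static void main(String[] args) {
        Interval porridge = new Interval(2, 2, 10, 18); // made-up offsets
        Interval nine = new Interval(6, 6, 35, 39);
        Interval both = compose(porridge, nine);
        System.out.println(both.begin + ".." + both.end);             // 2..6
        System.out.println(both.offsetBegin + ".." + both.offsetEnd); // 10..39
    }
}
```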
Alan Woodward (@romseygeek) (migrated from JIRA)
I'll start on the offsets - some relatively mindless coding is probably about where I'm at today, and the brief look I had at ExactPhraseScorer scared me a bit.
Robert Muir (@rmuir) (migrated from JIRA)
It's fantastic how you guys have brought this back from the dead!
Alan Woodward (@romseygeek) (migrated from JIRA)
Patch against the branch head, adding offsets to PositionInterval. Includes a couple of test cases showing that it works for basic TermQueries.
Alan Woodward (@romseygeek) (migrated from JIRA)
The patch also includes an @Ignored test case for BooleanQueries, as this didn't behave in the way I expected it to. At the moment, ConjunctionPositionIterator returns PositionIntervals that span all the parent query's subclauses. So searching for 'porridge' and 'nine' returns an Interval that starts at 'porridge' and ends at 'nine'. I would have expected this instead to return two separate intervals - if we want phrase-type intervals, then we can combine the individual intervals with a Filter of some kind. But I may just be misunderstanding how this is supposed to work.
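A toy model of the spanning behavior (made-up positions, and a simplification of the real minimal-interval algorithm, not the branch's iterator): over sorted term-position lists, each conjunction match is reported as one interval spanning an occurrence of every term, rather than as separate single-term intervals.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of conjunction interval semantics over sorted
// term-position lists: each emitted interval spans one occurrence of
// every term. A candidate is dropped when the next candidate ends at
// or before the same hi (i.e. the next one is contained in it).
public class ConjunctionIntervalDemo {

    static List<int[]> minimalSpanningIntervals(int[][] lists) {
        int[] idx = new int[lists.length];
        List<int[]> candidates = new ArrayList<>();
        while (true) {
            int lo = Integer.MAX_VALUE, hi = Integer.MIN_VALUE, loList = -1;
            for (int i = 0; i < lists.length; i++) {
                int p = lists[i][idx[i]];
                if (p < lo) { lo = p; loList = i; }
                if (p > hi) { hi = p; }
            }
            candidates.add(new int[] {lo, hi});
            if (++idx[loList] >= lists[loList].length) break; // one list exhausted
        }
        List<int[]> minimal = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            if (i == candidates.size() - 1
                || candidates.get(i)[1] < candidates.get(i + 1)[1]) {
                minimal.add(candidates.get(i));
            }
        }
        return minimal;
    }

    public static void main(String[] args) {
        int[] porridge = {2, 8}; // positions of 'porridge' in one doc (made up)
        int[] nine = {6};        // positions of 'nine'
        for (int[] iv : minimalSpanningIntervals(new int[][] {porridge, nine})) {
            System.out.println("[" + iv[0] + "," + iv[1] + "]"); // [2,6] then [6,8]
        }
    }
}
```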
Simon Willnauer (@s1monw) (migrated from JIRA)
hey alan,
Great job, you are getting up to speed. I fixed that testcase (the boolean one), since in the conjunction case you have to consume the conjunction positions/offsets, i.e. the intervals given by the term matches. I also fixed the license header in that file and brought the highlighter prototype test back. I will commit this to the branch now.
wow man this makes me happy! Good job.
Simon Willnauer (@s1monw) (migrated from JIRA)
I messed up the last patch - here is the actual patch.
Simon Willnauer (@s1monw) (migrated from JIRA)
oh btw. All tests on the branch pass now :)
Alan Woodward (@romseygeek) (migrated from JIRA)
I think my next step is to have a go at implementing ReqOptSumScorer and RelExclScorer, so that all the BooleanQuery cases work. Testing it via the PosHighlighter seems to be the way to go as well.
This might take a little longer, in that it will require me to actually think about what I'm doing...
Simon Willnauer (@s1monw) (migrated from JIRA)
This might take a little longer, in that it will require me to actually think about what I'm doing...
No worries, good job so far. Did the updated patch make sense to you? I think you had a good warmup phase; now we can go somewhat deeper!
Alan Woodward (@romseygeek) (migrated from JIRA)
New patch, implementing positions() for ReqExclScorer and ReqOptSumScorer, with a couple of basic tests.
These just return Conj/Disj PositionIterators, ignoring the excluded Scorers. It works in the simple cases that I've got here, but they may need to be made more complex when we take proximity searches into account.
Simon Willnauer (@s1monw) (migrated from JIRA)
New patch, implementing positions() for ReqExclScorer and ReqOptSumScorer, with a couple of basic tests.
Looks good! I will commit it and we can iterate further. Good to see those additional tests! Proximity searches are a different story and I will leave that for later; we can even add that once this is in trunk. In general we need to add a ton of testcases and straighten out the API at some point, but let's get all queries supporting this stuff first.
Alan Woodward (@romseygeek) (migrated from JIRA)
I've spent a bit of time on ExactPhraseScorer this weekend, and I think I'm going to need some pointers on how to proceed. BlockPositionIterator expects all component terms in its target phrase to have their own subscorers, but ExactPhraseScorer uses a different algorithm that doesn't use subscorers at all. Are we happy with the positions() algorithm being completely separate from the algorithm used by the initial search? Or should I be looking at refactoring PhraseQuery to create subscorers and pass them down to the various Scorer implementations?
Alan Woodward (@romseygeek) (migrated from JIRA)
Ah, never mind, I'm an idiot. I can extend BlockPositionIterator to take an array of TermPositions. Patch will follow later today.
Alan Woodward (@romseygeek) (migrated from JIRA)
Patch implementing positions() for ExactPhraseScorer.
I've had to make some changes to PhraseQuery#scorer() to get this to work, and MultiPhraseQuery is now failing some tests, but it's a start.
Simon Willnauer (@s1monw) (migrated from JIRA)
I've had to make some changes to PhraseQuery#scorer() to get this to work, and MultiPhraseQuery is now failing some tests, but it's a start.
That's fine, changes are necessary to make this work. I updated your patch with some quick fixes to give you some more insight into how I'd do it. The core tests pass now, but I am not sure if that is really the way to go or if we need to smooth some edges. I didn't have too much time, so I hacked it together to get you going. We can iterate on that a little during the week.
thanks for the patch man!
Simon Willnauer (@s1monw) (migrated from JIRA)
I committed the latest patch. thanks alan
Alan Woodward (@romseygeek) (migrated from JIRA)
Patch adding positions() support to SloppyPhraseScorer.
Some tests fail here, as MultiPhraseQuery doesn't create a TermDocsEnumFactory in certain circumstances yet. I'll get working on that next.
The meaty bit of the patch is a refactoring of SloppyPhraseScorer#phraseFreq() to use an iterator when calculating phrase frequencies. We can then reuse this logic when finding the phrase positions.
I think we can probably simplify the PostingsAndFreq and TermDocsEnumFactory constructors as well now - for example, we don't need the TermState in TDEF because we want to wind back the DocsAndPositionsIterators to their initial positions. I think. (I'm still getting my head round some of these internal APIs, can you tell?)
Alan Woodward (@romseygeek) (migrated from JIRA)
This fixes the MultiPhraseQuery tests. Simplifies the TermDocsEnumFactory interface considerably, and implements a new version that can return a UnionDocsAndPositionsEnum.
MPQ still doesn't support positions() completely, because UnionDocsAndPE doesn't return offsets yet. That'll be in the next patch!
Alan Woodward (@romseygeek) (migrated from JIRA)
Updated patch implementing startOffset and endOffset on UnionDocsAndPositionsEnum. MultiPhraseQuery can now return its positions properly.
Simon Willnauer (@s1monw) (migrated from JIRA)
hey alan, I won't be able to look at this this week but will do early next week! good stuff on a brief look!
Alan Woodward (@romseygeek) (migrated from JIRA)
Hi Simon, I'm going to be away for the rest of the month, but will hopefully be able to work more on this in a couple of weeks. Let me know if there's more I can do.
Alan Woodward (@romseygeek) (migrated from JIRA)
Patch incorporating my previous uncommitted patches, but catching up with changes in trunk.
Alan Woodward (@romseygeek) (migrated from JIRA)
positions() is now implemented on all the various different types of query, I think, with the exception of the BlockJoin queries.
Next step is to try and reimplement the various SpanQuery tests using the position filter queries in their place.
Simon Willnauer (@s1monw) (migrated from JIRA)
hey alan,
I merged the branch up with trunk and applied your patch (latest-1). Your changes to PhraseQuery are tricky. The SloppyPhraseScorer now uses the same DocsAndPositions enums as the PosIterator in there. That is unfortunately not how it should be: if you pull a PosIterator from a scorer, there should not be any side-effect on the score or the iterator if you advance one or the other. Currently I see a failure in PosHighlighterTest:
[junit4:junit4] Suite: org.apache.lucene.search.poshighlight.PosHighlighterTest
[junit4:junit4] FAILURE 0.22s J0 | PosHighlighterTest.testSloppyPhraseQuery
[junit4:junit4] > Throwable #1: java.lang.AssertionError: nextPosition() was called too many times (more than freq() times) posPendingCount=-1
[junit4:junit4] > at __randomizedtesting.SeedInfo.seed([C966081DA1EFC306:32134412A4F88738]:0)
[junit4:junit4] > at org.apache.lucene.codecs.lucene40.Lucene40PostingsReader$SegmentFullPositionsEnum.nextPosition(Lucene40PostingsReader.java:1127)
[junit4:junit4] > at org.apache.lucene.search.PhrasePositions.nextPosition(PhrasePositions.java:76)
[junit4:junit4] > at org.apache.lucene.search.PhrasePositions.firstPosition(PhrasePositions.java:65)
[junit4:junit4] > at org.apache.lucene.search.SloppyPhraseScorer.initSimple(SloppyPhraseScorer.java:230)
[junit4:junit4] > at org.apache.lucene.search.SloppyPhraseScorer.initPhrasePositions(SloppyPhraseScorer.java:218)
[junit4:junit4] > at org.apache.lucene.search.SloppyPhraseScorer.access$800(SloppyPhraseScorer.java:28)
[junit4:junit4] > at org.apache.lucene.search.SloppyPhraseScorer$SloppyPhrasePositionIntervalIterator.advanceTo(SloppyPhraseScorer.java:533)
[junit4:junit4] > at org.apache.lucene.search.poshighlight.PosCollector.collect(PosCollector.java:53)
[junit4:junit4] > at org.apache.lucene.search.Scorer.score(Scorer.java:62)
[junit4:junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:574)
[junit4:junit4] > at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:287)
[junit4:junit4] > at org.apache.lucene.search.poshighlight.PosHighlighterTest.doSearch(PosHighlighterTest.java:161)
[junit4:junit4] > at org.apache.lucene.search.poshighlight.PosHighlighterTest.doSearch(PosHighlighterTest.java:147)
[junit4:junit4] > at org.apache.lucene.search.poshighlight.PosHighlighterTest.testSloppyPhraseQuery(PosHighlighterTest.java:378)
...
[junit4:junit4] 2> NOTE: reproduce with: ant test -Dtestcase=PosHighlighterTest -Dtests.method=testSloppyPhraseQuery -Dtests.seed=C966081DA1EFC306 -Dtests.slow=true -Dtests.locale=sr
-Dtests.timezone=Africa/Harare -Dtests.file.encoding=UTF-8
[junit4:junit4] 2>
[junit4:junit4] > (`@AfterClass` output)
[junit4:junit4] 2> NOTE: test params are: codec=Lucene40: {}, sim=RandomSimilarityProvider(queryNorm=true,coord=false): {f=DFR I(ne)B3(800.0)}, locale=sr, timezone=Africa/Harare
[junit4:junit4] 2> NOTE: Linux 2.6.38-15-generic amd64/Sun Microsystems Inc. 1.6.0_26 (64-bit)/cpus=12,threads=1,free=327745216,total=379322368
[junit4:junit4] 2> NOTE: All tests run in this JVM: [PosHighlighterTest]
[junit4:junit4] 2>
This makes the entire sloppy case very tricky, though. Even further, I don't think sloppy phrase is really correct; if I recall correctly there are some issues with it that haven't been resolved for years now. I am not sure how we should proceed with that one. I will need to think about it further.
Next step is to try and reimplement the various SpanQuery tests using the position filter queries in their place.
please go ahead!
Alan Woodward (@romseygeek) (migrated from JIRA)
Patch with a couple of new tests that exercise the SpanNearQuery-like functions.
Simon Willnauer (@s1monw) (migrated from JIRA)
Alan! I am so glad you are still sticking around!
thanks for your patch, I already committed it together with some additions I added today. I saw your comment in the test
//TODO: Subinterval slops - should this work with a slop of 6 rather than 11?
I fixed this today since it has bugged me for a long time. I basically use the same function that the sloppy phrase scorer uses to figure out the matchDistance of the current interval. The test now passes with slop = 6. I also fixed all the tests in TestSimplePositions that did this weird slop manipulation. I also added a new operator based on the Brouwerian difference (here is the crazy paper if you are interested: http://vigna.dsi.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics)
SloppyPhraseScorer now works with a new position iterator for the single-term case, i.e. when not created through MultiPhraseQuery, and all tests pass. I still need to find a good way to fix the multi-term case. What I think is a good plan for the next iteration is to create more tests. What I did with TestSimplePositions is that I copied TestSpans and modified the tests to use PositionIterators instead of spans. If you are keen, go ahead and grab some of those tests, copy them to the positions package, and port them over.
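For reference, the Brouwerian difference can be sketched in a few lines. This is a brute-force toy over made-up [begin, end] pairs, not the paper's linear-time algorithm and not the branch's classes: A minus B keeps the intervals of A that do not contain an interval of B.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the Brouwerian difference operator from
// minimal-interval semantics: keep the intervals of A that do NOT
// contain any interval of B. Intervals are [begin, end] position pairs.
public class BrouwerianDifferenceDemo {

    static boolean contains(int[] outer, int[] inner) {
        return outer[0] <= inner[0] && inner[1] <= outer[1];
    }

    static List<int[]> difference(List<int[]> a, List<int[]> b) {
        List<int[]> result = new ArrayList<>();
        for (int[] iv : a) {
            boolean containsExcluded = false;
            for (int[] ex : b) {
                if (contains(iv, ex)) { containsExcluded = true; break; }
            }
            if (!containsExcluded) result.add(iv);
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> a = List.of(new int[] {1, 4}, new int[] {6, 9});
        List<int[]> b = List.of(new int[] {2, 3});
        for (int[] iv : difference(a, b)) {
            System.out.println("[" + iv[0] + "," + iv[1] + "]"); // only [6,9] survives
        }
    }
}
```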
I will soon refactor some classnames, since IMO PositionIntervalIterator and PositionInterval are 1. too long and 2. not true anymore; we also have offsets in there, so for now I will just call them IntervalIterator. Since those are all svn moves I will commit them directly.
looking forward to your next patch!
Simon Willnauer (@s1monw) (migrated from JIRA)
FYI - refactoring is done & committed. Alan, I might not be very responsive in the next 2 weeks so happy coding! :)
Alan Woodward (@romseygeek) (migrated from JIRA)
New patch, does a few things:
I'll commit this shortly.
Simon Willnauer (@s1monw) (migrated from JIRA)
ALAN! You have no idea how happy I am that you are picking this up again. I put a lot of work into this already and I really think we are close. Only MultiTermSloppyPhrase doesn't work at this point, and I honestly think we can just mark it as unsupported (what a crazy scorer) anyway. We really need to clean this stuff up, and you basically did the first step towards this. +1 to commit! :)
Alan Woodward (@romseygeek) (migrated from JIRA)
Heh, it was a long two weeks :-)
As another step towards making the API prettier, I'd like to rename the queries:
And maybe add an UnorderedNearQuery that just wraps a BooleanQuery and a WithinIntervalFilter. These names are probably a bit more intuitive to people unversed in IR theory...
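Conceptually, the unordered case is just a width check on the conjunction's spanning intervals; a hypothetical sketch (class and method names made up, not the branch's API):

```java
// Hypothetical sketch of what an UnorderedNearQuery could reduce to:
// take the spanning intervals of a conjunction and keep only those
// whose width fits within a slop, regardless of term order.
public class UnorderedNearDemo {

    // Does the spanning interval [lo, hi] over one occurrence of each
    // term fit within the slop (extra positions beyond the terms themselves)?
    static boolean withinSlop(int[] interval, int termCount, int slop) {
        int width = interval[1] - interval[0] + 1;
        return width - termCount <= slop;
    }

    public static void main(String[] args) {
        int[] tight = {4, 5};  // two adjacent terms, either order
        int[] loose = {4, 12}; // terms 8 positions apart
        System.out.println(withinSlop(tight, 2, 0)); // true
        System.out.println(withinSlop(loose, 2, 3)); // false
    }
}
```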
Simon Willnauer (@s1monw) (migrated from JIRA)
+1 to the renaming. I still think we should document the actual algorithm used (i.e. for BrouwerianQuery) with references to the paper, though. Please go ahead and add this. I will need to bring this branch up to date; will do once you've committed these changes.
Alan Woodward (@romseygeek) (migrated from JIRA)
OK, done, added some more javadocs as well. Next cleanup is to make the distinction between iterators and filters a bit more explicit, I think. We've got some iterators that also act as filters, and some which are distinct. I think they should all be separate classes - filters are a public API that clients can use to create queries, whereas Iterators are an implementation detail.
Simon Willnauer (@s1monw) (migrated from JIRA)
Alan, I merged up with trunk and fixed some small bugs. +1 to all the cleanups
Currently we have two somewhat separate types of queries: those which can make use of positions (mainly spans) and payloads (spans). Yet Span*Query doesn't really do scoring comparable to what other queries do, and at the end of the day they duplicate a lot of code all over Lucene. Span*Queries are also limited to other Span*Query instances, such that you cannot use a TermQuery or a BooleanQuery with SpanNear or anything like that. Besides the Span*Query limitation, other queries lack a quite interesting feature: they cannot score based on term proximity, since scorers don't expose any positional information. All those problems bugged me for a while now, so I started working on this using the bulkpostings API. I would have done the first cut on trunk, but TermScorer there works on BlockReaders that do not expose positions, while the one in this branch does. I started adding a new Positions class which users can pull from a scorer; to prevent unnecessary positions enums I added ScorerContext#needsPositions and eventually Scorer#needsPayloads to create the corresponding enum on demand. Yet, currently only TermQuery / TermScorer implements this API, and others simply return null instead. To show that the API really works, and that our BulkPostings work fine with positions too, I cut over TermSpanQuery to use a TermScorer under the hood and nuked TermSpans entirely. A nice side effect of this was that the Position BulkReading implementation got some exercise; it now works entirely with positions :), while payloads for bulk reading are kind of experimental in the patch and only work with the Standard codec.
So all spans now work on top of TermScorer (I truly hate spans since today), including the ones that need payloads (Standard codec ONLY)!! I didn't bother to implement the other codecs yet, since I want to get feedback on the API and on this first cut before I go on with it. I will upload the corresponding patch in a minute.
I also had to cut over SpanQuery.getSpans(IR) to SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk first, but after that pain today I need a break first :).
The patch passes all core tests (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't look into the MemoryIndex BulkPostings API yet)
Migrated from LUCENE-2878 by Simon Willnauer (@s1monw), 11 votes, resolved Apr 11 2018 Attachments: LUCENE-2878_trunk.patch (versions: 2), LUCENE-2878.patch (versions: 30), LUCENE-2878-OR.patch, LUCENE-2878-vs-trunk.patch, PosHighlighter.patch (versions: 2) Linked issues:
5590
Sub-tasks:
4391
4392
4393
4394
5617
5618
5621
5638