apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.59k stars 1.01k forks source link

SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds. [LUCENE-3821] #4894

Open asfimport opened 12 years ago

asfimport commented 12 years ago

The general bug is a case where a phrase with no slop is found, but if you add slop its not.

I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case, jenkins just hasn't had enough time to chew on it.

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x


Migrated from LUCENE-3821 by Naomi Dushay, updated Mar 13 2012 Attachments: LUCENE-3821_test.patch, LUCENE-3821.patch (versions: 4), LUCENE-3821-SloppyDecays.patch, schema.xml, solrconfig-test.xml

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here's Naomi's original description (moved from description section): including the queries, however I can't reproduce with just that phrase of the document (I tried, it passes).

I think its a more complex bug in SloppyPhraseScorer... and I can reproduce similar behavior with another test.

In upgrading from Solr 1.4 to Solr 3.5, the following phrase searches stopped working in dismax:
"The Beatles as musicians : Revolver through the Anthology"
"Color-blindness [print/digital]; its dangers and its detection"
Both of these queries have a repeated work, and have many terms. It's not the number of terms or the colon surrounded by spaces, because the following phrase search works in Solr 3.5 (and Solr 1.4):
"International encyclopedia of revolution and protest : 1500 to the present"

With Robert Muir's help, we have narrowed the problem down to slop (proximity in lucene QueryParser, query slop in dismax). I have included debugQuery details for the Beatles search; I confirmed the same behavior with the color-blindness search.

Solr 3.5: it fails when (query) slop setting isn't 0.

lucene QueryParser with proximity set to 1 (or anything > 0) : no match
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"\~1
final query: all_search:"the beatl as musician revolv through the antholog"\~1

lucene QueryParser with proximity set to 0: result!
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"

6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
<snip>
48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
<snip>

dismax QueryParser with qs=1: no match
ps=0
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
final query: +(all_search:"the beatl as musician revolv through the antholog"\~1)\~0.01 (all_search:"the beatl as musician revolv through the antholog")\~0.01
ps=1
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
final query: +(all_search:"the beatl as musician revolv through the antholog"\~1)\~0.01 (all_search:"the beatl as musician revolv through the antholog"\~1)\~0.01

dismax QueryParser with qs=0: result!
ps=0
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
final query: +(all_search:"the beatl as musician revolv through the antholog")\~0.01 (all_search:"the beatl as musician revolv through the antholog")\~0.01
ps=1
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
final query: +(all_search:"the beatl as musician revolv through the antholog")\~0.01 (all_search:"the beatl as musician revolv through the antholog"\~1)\~0.01

8.564867 = (MATCH) sum of:
4.2824335 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
<snip>
48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
<snip>

Solr 1.4: it works regardless of slop settings

lucene QueryParser with any proximity value: result!
\~0
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"
\~1
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"\~1
final query: all_search:"the beatl as musician revolv through the antholog"\~1

5.2672544 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
<snip>
48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
<snip>

dismax QueryParser with any qs: result!
qs=0, ps=0
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
final query: +(all_search:"the beatl as musician revolv through the antholog")\~0.01 (all_search:"the beatl as musician revolv through the antholog")\~0.01
qs=0, ps=1
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
final query: +(all_search:"the beatl as musician revolv through the antholog")\~0.01 (all_search:"the beatl as musician revolv through the antholog"\~1)\~0.01
dismax QueryParser with qs=0: result!
qs=1, ps=0
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
final query: +(all_search:"the beatl as musician revolv through the antholog"\~1)\~0.01 (all_search:"the beatl as musician revolv through the antholog")\~0.01
qs=1, ps=1
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
final query: +(all_search:"the beatl as musician revolv through the antholog"\~1)\~0.01 (all_search:"the beatl as musician revolv through the antholog"\~1)\~0.01

7.4490223 = (MATCH) sum of:
3.7245111 = weight(all_search:"the beatl as musician revolv through the antholog"\~1 in 3469163), product of:
<snip>
48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
<snip>

More information:

schema.xml:
<field name="all_search" type="text" indexed="true" stored="false" />

solr 3.5:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldtype>

solr1.4:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true" />
<filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldtype>

And the analysis page shows the same results for Solr 3.5 and 1.4

Solr 3.5:

position 1 2 3 4 5 6 7 8
term text the beatl as musician revolv through the antholog
keyword false false false false false false false false
startOffset 0 4 12 15 27 36 44 48
endOffset 3 11 14 24 35 43 47 57
type word word word word word word word word

Solr 1.4:

term position 1 2 3 4 5 6 7 8
term text the beatl as musician revolv through the antholog
term type word word word word word word word word
source start,end 0,3 4,11 12,14 15,24 27,35 36,43 44,47 48,57

For debug purposes, we can consider the Solr document as:

<doc>
<str name="all_search">The Beatles as musicians : Revolver through the Anthology</str>
</doc>

I can't attached the full SolrDoc as all_search is indexed, but not stored, and I use SolrJ to write to the index from java objects ... plus our objects have a zillion fields (I work in a library with very rich metadata and very exacting solr fields). I have attached the Solr 3.5 schema and solrconfig, but they are big and ugly for the same reasons.

For more details, see the erroneously titled email thread "result present in Solr 1.4 but missing in Solr 3.5, dismax only" started on 2012-02-22 on solr-user@lucene.apache.org.

    Naomi
asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Moving to comments...

asfimport commented 12 years ago

Naomi Dushay (migrated from JIRA)

How's this for a document:

<doc> <str name="id">1</str> <str name="title_245a_display">The Beatles as musicians</str> <str name="title_245a_search">The Beatles as musicians :</str> <str name="title_245c_display">Walter Everett</str> <str name="title_display">The Beatles as musicians : Revolver through the Anthology</str> <str name="title_full_display">The Beatles as musicians : Revolver through the Anthology / Walter Everett.</str> <str name="title_245_search">The Beatles as musicians : Revolver through the Anthology / Walter Everett.</str> <str name="title_sort">Beatles as musicians Revolver through the Anthology</str> <str name="all_search">The Beatles as musicians : Revolver through the Anthology</str> </doc>

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here's an improved test that fails every time.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I reviewed the random failures: in all cases it fails, repeated terms are in the query.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I just @Ignore'd this test... it's creating a lot of Jenkins noise... but we should fix this bug!!

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

Fails here too like this:

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtestmethod=testRandomIncreasingSloppiness -Dtests.seed=-171bbb992c697625:203709d611c854a5:1ca48cb6d33b3f74 -Dargs="-Dfile.encoding=UTF-8"

I'll look into it

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

I understand the problem.

It all has to do - as Robert mentioned - with the repeating terms in the phrase query. I am working on a patch - it will change the way that repeats are handled.

Repeating PPs require additional computation - and current SloppyPhraseScorer attempts to do that additional work efficiently, but over simplifies in that and fail to cover all cases.

In the core of things, each time a repeating PP is selected (from the queue) and propagated, all its sibling repeaters are propagated as well, to prevent a case that two repeating PPs point to the same document position (which was the bug that originally triggered handling repeats in this code).

But this is wrong, because it propagates all siblings repeaters, and misses some cases.

Also, the code is hard to read, as Mike noted in #3485 (this comment) ).

So this is a chance to also make the code more maintainable.

I have a working version which is not ready to commit yet, and all the tests pass, except for one which I think is a bug in ExactPhraseScorer, but maybe i am missing something.

The case that fails is this:

AssertionError: Missing in super-set: doc 706
q1: field:"(j o s) (i b j) (t d)"
q2: field:"(j o s) (i b j) (t d)"\~1
td1: [doc=706 score=7.7783184 shardIndex=-1, doc=175 score=6.222655 shardIndex=-1]
td2: [doc=523 score=5.5001016 shardIndex=-1, doc=957 score=5.5001016 shardIndex=-1, doc=228 score=4.400081 shardIndex=-1, doc=357 score=4.400081 shardIndex=-1, doc=390 score=4.400081 shardIndex=-1, doc=503 score=4.400081 shardIndex=-1, doc=602 score=4.400081 shardIndex=-1, doc=757 score=4.400081 shardIndex=-1, doc=758 score=4.400081 shardIndex=-1]
doc 706: Document<stored,indexed,tokenized<field:s o b h j t j z o>>

It seems that q1 too should not match this document?

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Doron do you have the seed for that failure? I can dig on the ExactPhraseScorer...

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

Patch with fix for this problem. I would expect SloppyPhrase scoring performance to degrade, though I did not measure it.

The single test that still fails (and I think the bug is in ExactPhraseScorer) is testRandomIncreasingSloppiness, and can be recreated like this:

ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtestmethod=testRandomIncreasingSloppiness -Dtests.seed=47267613db69f714:-617bb800c4a3c645:-456a673444fdc184 -Dargs="-Dfile.encoding=UTF-8"
asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Hmm patch has this:

import backport.api.edu.emory.mathcs.backport.java.util.Arrays;
asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks for digging into the problem Doron!

I'm going to be ecstatic if that crazy test finds bugs both in exact and sloppy phrase scorers :)

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

Hmm patch has this: ... import backport.api...

Oops, here's a fixed patch, also added some comments, and removed the @Ignore from the test

I'm going to be ecstatic if that crazy test finds bugs both in exact and sloppy phrase scorers :)

It is a great test! Interestingly one thing it exposed is the dependency of the SloppyPhraseScorer in the order of PPs in PhraseScorer when phraseFreq() is invoked. The way things work in the super class, that order depends on the content of previously processed documents. This fix removes that wrong dependency, of course. The point is that deliberately devising a test that exposes such a bug seems almost impossible: first you need to think about such a case, and if you did, writing a test that would create this specific scenario is buggy by itself. Praise to random testing, and this random test in particular.

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

The remaining failure still happens with the updated patch (same seed), and still seems to me like an ExactPhraseScorer bug.

However, it is probably not a simple one I think, because when adding to TestMultiPhraseQuery, it passes, that is, no documents are matched, as expected, although this supposedly created the exact scenario that failed above.

Perhaps ExactPhraseScorer behavior too is influenced by other docs processed so far.

  public void test_LUCENE_XYZ() throws Exception {
    Directory indexStore = newDirectory();
    RandomIndexWriter writer = new RandomIndexWriter(random, indexStore);
    add("s o b h j t j z o", "LUCENE-XYZ", writer);

    IndexReader reader = writer.getReader();
    IndexSearcher searcher = newSearcher(reader);

    MultiPhraseQuery q = new MultiPhraseQuery();
    q.add(new Term[] {new Term("body", "j"), new Term("body", "o"), new Term("body", "s")});
    q.add(new Term[] {new Term("body", "i"), new Term("body", "b"), new Term("body", "j")});
    q.add(new Term[] {new Term("body", "t"), new Term("body", "d")});
    assertEquals("Wrong number of hits", 0,
        searcher.search(q, null, 1).totalHits);

    // just make sure no exc:
    searcher.explain(q, 0);

    writer.close();
    searcher.close();
    reader.close();
    indexStore.close();
  }
asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

Update: apparently MultiPhraseQuery.toString does not print its "holes".

So the query that failed was not:

field:"(j o s) (i b j) (t d)"

But rather:

"(j o s) ? (i b j) ? ? (t d)"

Which is a different story: this query should match the document

s o b h j t j z o

There is a match for ExactPhraseScorer, but not for Sloppy with slope 1. So there is still work to do on SloppyPhraseScorer...

(I'll fix MFQ.toString() as well)

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

updated patch with fixed MFQ.toString(), which prints the problematic doc and queries in case of failure.

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

I think I understand the cause.

In current implementation there is an assumption that once we landed on the first candidate document, it is possible to check if there are repeating pps, by just comparing the in-doc-positions of the terms.

Tricky as it is, while this is true for plain PhrasePositions, it is not true for MultiPhrasePositions - assume to MPPs: (a m n) and (b x y), and first candidate document that starts with "a b". The in-doc-positions of the two pps would be 0,1 respectively (for 'a' and 'b') and we would not even detect the fact that there are repetitions, not to mention not putting them in the same group.

MPPs conflicts with current patch in an additional manner: It is now assumed that each repetition can be assigned a repetition group.

So assume these PPs (and query positions): 0:a 1:b 3:a 4:b 7:c There are clearly two repetition groups {0:a, 3:a} and {1:b, 4:b}, while 7:c is not a repetition.

But assume these PPs (and query positions): 0:(a b) 1:(b x) 3:a 4:b 7:(c x) We end up with a single large repetition group: {0:(a b) 1:(b x) 3:a 4:b 7:(c x)}

I think if the groups are created correctly at the first candidate document, scorer logic would still work, as a collision is decided only when two pps are in the same in-doc-position. The only impact of MPPs would be performance cost: since repetition groups are larger, it would take longer to check if there are repetitions.

Just need to figure out how to detect repetition groups without relying on in-(first-)doc-positions.

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

Attached updated patch.

Repeating PPs with multi-Phrase-query is handled as well.

This called for more cases in the sloppy phrase scorer and more code, and, although I think the code is cleaner now, I don't know to what extent is it easier to maintain.

It definitely fixes wrong behavior that exists in current 3x and trunk (patch is for 3x).

However, although the random test passes for me even with -Dtests.iter=2000, it is possible to "break the scorer" - that is, create a document and a query which should match each other but would not.

The patch adds just such a case as an @Ignored test case: TestMultiPhraseQuery.testMultiSloppyWithRepeats().

I don't see how to solve this specific case in the context of current sloppy phrase scorer.

So there are 3 options:

  1. leave things as they are
  2. commit this patch and for now document the failing scenario (also keep it in the ignored test case).
  3. devise a different algorithm for this.

I would love it to be the 3rd if I just knew how to do it. Otherwise I like the 2nd, just need to keep in mind that the random test might from time to time create this scenario and so there will be noise in the test builds.

Preferences?

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I would love it to be the 3rd if I just knew how to do it. Otherwise I like the 2nd, just need to keep in mind that the random test might from time to time create this scenario and so there will be noise in the test builds.

I think there is no problem in fixing "some of the bugs" to improve the behavior, even if its still not perfect.

we can take our time thinking of how to handle the remaining scenarios... either way I think we should just go with your judgement call on this one, since you obviously understand it better than anyone else.

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

Thanks Robert, okay, I'll continue with option 2 then.

In addition, perhaps should try harder for a sloppy version of current ExactPhraseScorer, for both performance and correctness reasons.

In ExactPhraseScorer, the increment of count[posIndex] is by 1, each time the conditions for a match (still) holds.

A sloppy version of this, with N terms and slop=S could increment differently:

1 + N*S        at posIndex
1 + N*S - 1    at posIndex-1 and posIndex+1
1 + N*S - 2 at posIndex-2 and posIndex+3
...
1 + N*S - S at posIndex-S and posIndex+S

For S=0, this falls back to only increment by 1 and only at posIndex, same as the ExactPhraseScorer, which makes sense.

Also, the success criteria in ExactPhraseScorer, when checking term k, is, before adding up 1 for term k:

In the sloppy version the criteria (after adding up term k) would be:

Again, for S=0 this falls to the ExactPhraseScorer logic:

Mike (and all), correctness wise, what do you think?

If you are wondering why the increment at the actual position is (1 + N*S) - it allows to match only posIndexes where all terms contributed something.

I drew an example with 5 terms and slop=2 and so far it seems correct.

Also tried 2 terms and slop=5, this seems correct as well, just that, when there is a match, several posIndexes will contribute to the score of the same match. I think this is not too bad, as for these parameters, same behavior would be in all documents. I would be especially forgiving for this if we this way get some of the performance benefits of the ExactPhraseScorer.

If we agree on correctness, need to understand how to implement it, and consider the performance effect. The tricky part is to increment at posIndex-n. Say there are 3 terms in the query and one of the terms is found at indexes 10, 15, 18. Assume the slope is 2. Since N=3, the max increment is:

So the increments for this term would be (pos, incr):

Pos   Increment
---   ---------
 8  ,  5
 9  ,  6
10  ,  7
11  ,  6
12  ,  5
13  ,  5
14  ,  6
15  ,  7   =  max(7,5)
16  ,  6   =  max(6,5)
17  ,  6   =  max(5,6)
18  ,  7
19  ,  6
20  ,  5

So when we get to posIndex 17, we know that posIndex 15 contributes 5, but we do not know yet about the contribution of posIndex 18, which is 6, and should be used instead of 5. So some look-ahead (or some fix-back) is required, which will complicate the code.

If this seems promising, should probably open a new issue for it, just wanted to get some feedback first.

asfimport commented 12 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Cool!

I haven't fully thought this out (sloppy phrase matching is hard to think about!), but, tentatively, I think this is correct...?

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

OK great!

If you did not point a problem with this up front there's a good chance it will work and I'd like to give it a try.

I have a first patch - not working or anything - it opens ExactPhraseScorer a bit for extensions and adds a class (temporary name) - NonExactPhraseScorer.

The idea is to hide in the ChunkState the details of decaying frequencies due to slops. I will try it over the weekend. If we can make it this way, I'd rather do it in this issue rather than committing the other new code for the fix and then replacing it. If that won't go quick, I'll commit the (other) changes to SloppyPhraseScorer and start a new issue.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

sounds interesting: ExactPhraseScorer really has a lot of useful recent heuristics and optimizations, especially about when to next() versus advance() and such?

net/net this idea could possibly improvement the performance overall of SloppyPhraseScorer

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

Patch adds NonExactPhraseScorer (temporary name) as discussed above - work in progress, it does not yet do any sloppy matching or scoring.

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

sounds interesting: ExactPhraseScorer really has a lot of useful recent heuristics and optimizations, especially about when to next() versus advance() and such?

next()/advance() will remain, but it would still be more costly than exact - score cache won't play, because freqs really are float in this case, and also there would be more computations on the way. But let's see it working first...

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

I'm afraid it won't solve the problem.

The complicity of SloppyPhraseScorer stems firstly from the slop. That part is handled in the scorer for long time.

Two additional complications are repeating terms, and multi-term phrases. Each one of these, separately, is handled as well. Their combination however, is the cause for this discussion.

To prevent two repeating terms from landing on the same document position, we propagate the smaller of them (smaller in its phrase-position, which takes into account both the doc-position and the offset of that term in the query).

Without this special treatment, a phrase query "a b a"\~2 might match a document "a b", because both "a"'s (query terms) will land on the same document's "a". This is illegal and is prevented by such propagation.

But when one of the repeating terms is a multi-term, it is not possible to know which of the repeating terms to propagate. This is the unsolved bug.

Now, back to current ExactPhraseScorer. It does not have this problem with repeating terms. But not because of the different algorithm - rather because of the different scenario. It does not have this problem because exact phrase scoring does not have it. In exact phrase scoring, a match is declared only when all PPs are in the same phrase position. Recall that phrase position = doc-position - query-offset, it is visible that when two PPs with different query offset are in the same phrase-position, their doc-position cannot be the same, and therefore no special handling is needed for repeating terms in exact phrase scorers.

However, once we will add that slopy-decaying frequency, we will match in certain posIndex, different phrase-positions. This is because of the slop. So they might land on the same doc-position, and then we start again...

This is really too bad. Sorry for the lengthy post, hopefully this would help when someone wants to get into this.

Back to option 2.

asfimport commented 12 years ago

Naomi Dushay (migrated from JIRA)

I would be glad to try out a nightly build with the patch as is against our tests - I would be glad to get the 80% solution if it fixes my bug. I haven't compiled from source yet, though, so am inclined to wait for the patch getting posted to the nightly.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

To prevent two repeating terms from landing on the same document position, we propagate the smaller of them (smaller in its phrase-position, which takes into account both the doc-position and the offset of that term in the query).

Without this special treatment, a phrase query "a b a"\~2 might match a document "a b", because both "a"'s (query terms) will land on the same document's "a". This is illegal and is prevented by such propagation.

But when one of the repeating terms is a multi-term, it is not possible to know which of the repeating terms to propagate. This is the unsolved bug.

Not understanding really how SloppyPhraseScorer works now, but not trying to add confusion to the issue, I can't help but think this problem is a variant on LevensteinAutomata... in fact that was the motivation for the new test, i just stole the testing methodology from there and applied it to this!

It seems many things are the same but with a few twists:

I'm not suggesting we try to re-use any of that code at all, i don't think it will work. But I wonder if we can re-use even some of the math to redefine the problem more formally to figure out what minimal state/lookahead we need for example...

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

Not understanding really how SloppyPhraseScorer works now, but not trying to add confusion to the issue, I can't help but think this problem is a variant on LevensteinAutomata... in fact that was the motivation for the new test, i just stole the testing methodology from there and applied it to this!

Interesting! I was not aware of this. I went and read some about this automaton, It is relevant.

It seems many things are the same but with a few twists:

  • fundamentally we are interleaving the streams from the subscorers into the 'index automaton' 'query automaton' is produced from the user-supplied terms

True. In fact, the current code works hard to decide on the "correct interleaving order" - while if we had a "Perfect Levenstein Automaton" that took care of the computation we could just interleave, in the term position order (forget about phrase position and all that) and let the automaton compute the distance.

This might capture the difficulty in making the sloppy phrase scorer correct: it started with the algorithm that was optimized for exact matching, and adopted (hacked?) it for approximate matching.

Instead, starting with a model that fits approximate matching, might be easier and cleaner. I like that.

  • our 'alphabet' is the terms, and holes from position increment are just an additional symbol.
  • just like the LevensteinAutomata case, repeats are problematic because they are different characteristic vectors
  • stacked terms at the same position (index or query) just make the automata more complex (so they arent just strings)

I'm not suggesting we try to re-use any of that code at all, i don't think it will work. But I wonder if we can re-use even some of the math to redefine the problem more formally to figure out what minimal state/lookahead we need for example...

I agree. I'll think of this.

In the meantime I'll commit this partial fix.

asfimport commented 12 years ago

Doron Cohen (migrated from JIRA)

Committed:

I would be glad to try out a nightly build with the patch as is against our tests - I would be glad to get the 80% solution if it fixes my bug.

It's in now...

But I wonder if we can re-use even some of the math to redefine the problem more formally to figure out what minimal state/lookahead we need for example...

Robert, this gave me an idea... currently, in case of "collision" between repeaters, we compare them and advance the "lesser" of them (SloppyPhraseScorer.lesser(PhrasePositions, PhrasePositions)) - it should be fairly easy to add lookahead to this logic: if one of the two is multi-term, lesser can also do a lookahead. The amount of lookahead can depend on the slop. I'll give it a try before closing this issue.

asfimport commented 12 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Robert, this gave me an idea... currently, in case of "collision" between repeaters, we compare them and advance the "lesser" of them (SloppyPhraseScorer.lesser(PhrasePositions, PhrasePositions)) - it should be fairly easy to add lookahead to this logic: if one of the two is multi-term, lesser can also do a lookahead. The amount of lookahead can depend on the slop. I'll give it a try before closing this issue.

Interesting... its hard to think about for me since the edit distance is a little different, but at least in the levAutomata case the maximum 'context' the thing ever needs is 2n+1, where n is the distance/slop. I don't know if it applies here... but seems like it should be at least an upperbound.

Speaking of which on a related note, I think its possible we can implement a more... exhaustive test for SloppyPhraseScorer (rather than relying so much on a random one). The idea would be a twist on TestLevenshteinAutomata.assertCharVectors: using an alphabet of terms={0,1} the idea is to test all possible 'automaton structures', for sloppyphrasescorer, the idea would be we have the minimal test method that tests all the cases...

I'll think on this one...

asfimport commented 12 years ago

Naomi Dushay (migrated from JIRA)

The commits from March 10 fix my two failing tests - huzzah! Thank you so much! - Naomi