TestBasicModelIne.testRandomScoring failure [LUCENE-8015]

asfimport commented 7 years ago

reproduce with: ant test -Dtestcase=TestBasicModelIne -Dtests.method=testRandomScoring -Dtests.seed=86E85958B1183E93 -Dtests.slow=true -Dtests.locale=vi-VN -Dtests.timezone=Pacific/Tongatapu -Dtests.asserts=true -Dtests.file.encoding=UTF8

Migrated from LUCENE-8015 by Adrien Grand (@jpountz), resolved Dec 06 2017 Attachments: LUCENE-8015_test_fangs.patch, LUCENE-8015.patch, LUCENE-8015-test.patch

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I looked into it, this similarity ends up doing something like that:

double tfn = // non-decreasing function of tf
return (tfn * C1) * (C2 / (tfn + 1)); // C1 and C2 are some constants

The issue is that even if tfn increases, the result might decrease if tfn * C1 is rounded down and/or C2/(tfn + 1) is rounded up. One way to fix it that I can think of is to make the value of tfn more discrete by doing eg.

diff --git a/lucene/core/src/java/org/apache/lucene/search/similarities/DFRSimilarity.java b/lucene/core/src/java/org/apache/lucene/search/similarities/DFRSimilarity.java
index aacd246..554d12f 100644
--- a/lucene/core/src/java/org/apache/lucene/search/similarities/DFRSimilarity.java
+++ b/lucene/core/src/java/org/apache/lucene/search/similarities/DFRSimilarity.java
`@@` -108,7 +108,7 `@@` public class DFRSimilarity extends SimilarityBase {

   `@Override`
   protected double score(BasicStats stats, double freq, double docLen) {
-    double tfn = normalization.tfn(stats, freq, docLen);
+    double tfn = (float) normalization.tfn(stats, freq, docLen); // cast to float on purpose to introduce gaps between consecutive values and prevent double rounding errors to make the score decrease when tfn increases
     return stats.getBoost() *
         basicModel.score(stats, tfn) * afterEffect.score(stats, tfn);
   }

Opinions?

asfimport commented 7 years ago

Robert Muir (@rmuir) (migrated from JIRA)

lets take a step back first. which 3 DFR components are involved? Can you include the test output?

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

   [junit4] <JUnit4> says 你好! Master seed: 86E85958B1183E93
   [junit4] Executing 1 suite with 1 JVM.
   [junit4] 
   [junit4] Started J0 PID(22203@localhost).
   [junit4] Suite: org.apache.lucene.search.similarities.TestBasicModelIne
   [junit4]   1> 7.0E-45 = score(DFRSimilarity, doc=0, freq=0.99999994), computed from:
   [junit4]   1>   1.4E-45 = boost
   [junit4]   1>   3.46341352E16 = NormalizationH1, computed from: 
   [junit4]   1>     0.99999994 = tf
   [junit4]   1>     2.09433728E9 = avgFieldLength
   [junit4]   1>     26.0 = len
   [junit4]   1>   1.03902406E17 = BasicModelIne, computed from: 
   [junit4]   1>     11.0 = numberOfDocuments
   [junit4]   1>     1.0 = totalTermFreq
   [junit4]   1>   4.3309873E-17 = AfterEffectB, computed from: 
   [junit4]   1>     3.46341352E16 = tfn
   [junit4]   1>     1.0 = totalTermFreq
   [junit4]   1>     1.0 = docFreq
   [junit4]   1> 
   [junit4]   1> 5.6E-45 = score(DFRSimilarity, doc=0, freq=1.0), computed from:
   [junit4]   1>   1.4E-45 = boost
   [junit4]   1>   3.46341374E16 = NormalizationH1, computed from: 
   [junit4]   1>     1.0 = tf
   [junit4]   1>     2.09433728E9 = avgFieldLength
   [junit4]   1>     26.0 = len
   [junit4]   1>   1.03902414E17 = BasicModelIne, computed from: 
   [junit4]   1>     11.0 = numberOfDocuments
   [junit4]   1>     1.0 = totalTermFreq
   [junit4]   1>   4.330987E-17 = AfterEffectB, computed from: 
   [junit4]   1>     3.46341374E16 = tfn
   [junit4]   1>     1.0 = totalTermFreq
   [junit4]   1>     1.0 = docFreq
   [junit4]   1> 
   [junit4]   1> DFR I(ne)B1
   [junit4]   1> field="field",maxDoc=16,docCount=11,sumTotalTermFreq=23037710092,sumDocFreq=1421016222
   [junit4]   1> term="term",docFreq=1,totalTermFreq=1
   [junit4]   1> norm=26 (doc length \~ 26)
   [junit4]   1> freq=1.0
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestBasicModelIne -Dtests.method=testRandomScoring -Dtests.seed=86E85958B1183E93 -Dtests.slow=true -Dtests.locale=vi-VN -Dtests.timezone=Pacific/Tongatapu -Dtests.asserts=true -Dtests.file.encoding=UTF8
   [junit4] FAILURE 3.13s | TestBasicModelIne.testRandomScoring <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: score(0.99999994)=7.0E-45 > score(1.0)=5.6E-45
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([86E85958B1183E93:D7700EAAB6FD899]:0)
   [junit4]    >    at org.apache.lucene.search.similarities.BaseSimilarityTestCase.doTestScoring(BaseSimilarityTestCase.java:405)
   [junit4]    >    at org.apache.lucene.search.similarities.BaseSimilarityTestCase.testRandomScoring(BaseSimilarityTestCase.java:357)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {field=PostingsFormat(name=Memory)}, docValues:{}, maxPointsInLeafNode=285, maxMBSortInHeap=6.307483399953041, sim=RandomSimilarity(queryNorm=false): {field=IB LL-DZ(0.3)}, locale=vi-VN, timezone=Pacific/Tongatapu
   [junit4]   2> NOTE: Linux 4.4.0-97-generic amd64/Oracle Corporation 1.8.0_102 (64-bit)/cpus=8,threads=1,free=241459528,total=344457216
   [junit4]   2> NOTE: All tests run in this JVM: [TestBasicModelIne]
   [junit4] Completed [1/1 (1!)] in 3.79s, 1 test, 1 failure <<< FAILURES!
   [junit4] 
   [junit4] 
   [junit4] Tests with failures [seed: 86E85958B1183E93]:
   [junit4]   - org.apache.lucene.search.similarities.TestBasicModelIne.testRandomScoring
   [junit4] 
   [junit4] 
   [junit4] JVM J0:     0.40 ..     4.74 =     4.34s
   [junit4] Execution time total: 4.75 sec.
   [junit4] Tests summary: 1 suite, 1 test, 1 failure

asfimport commented 7 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks I will look in. Its hard to debug it specifically without fixing explains first (we really need to do that, then you can "see" what goes wrong from test fails like this). Separately the test is inefficient in that this only comes out with beasting many iterations. We should improve the test to more often enumerate edges (e.g. min/max values) that look like this so that its more efficient.

at a glance it looks like small collection with mostly super-huge docs but then one tiny doc. So it may stress some extremes in computations like dl/avgdl type stuff, and expose a hazard in one of the components here. I have to look more...

asfimport commented 7 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Maybe the issue is better fixed in after-effect B? Instead of (F+1)/(n * (tf + 1)) we can do (F+1)/n * 1/(tf+1). Keep in mind F and n are presumably large, as they are the term's totalTermFreq and docFreq although not in this particular failure.

asfimport commented 7 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Two reproducing Jenkins failures, of a different test suite (*In vs. *Ine): first, from https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20789/:

Checking out Revision 39376cd8b5ef03b3338c2e8fa31dce732749bcd7 (refs/remotes/origin/master)
[...]
   [junit4] Suite: org.apache.lucene.search.similarities.TestBasicModelIn
   [junit4]   1> 1.27634828E18 = score(DFRSimilarity, doc=0, freq=1.18869171E9), computed from:
   [junit4]   1>   2.14748365E9 = boost
   [junit4]   1>   1.18869171E9 = NormalizationZ, computed from: 
   [junit4]   1>     1.18869171E9 = tf
   [junit4]   1>     6.0234362E8 = avgFieldLength
   [junit4]   1>     76.0 = len
   [junit4]   1>   1.18869171E9 = BasicModelIn, computed from: 
   [junit4]   1>     2.0 = numberOfDocuments
   [junit4]   1>     1.0 = docFreq
   [junit4]   1>   0.50000006 = AfterEffectB, computed from: 
   [junit4]   1>     1.18869171E9 = tfn
   [junit4]   1>     1.18869184E9 = totalTermFreq
   [junit4]   1>     1.0 = docFreq
   [junit4]   1> 
   [junit4]   1> 1.27634814E18 = score(DFRSimilarity, doc=0, freq=1.18869184E9), computed from:
   [junit4]   1>   2.14748365E9 = boost
   [junit4]   1>   1.18869184E9 = NormalizationZ, computed from: 
   [junit4]   1>     1.18869184E9 = tf
   [junit4]   1>     6.0234362E8 = avgFieldLength
   [junit4]   1>     76.0 = len
   [junit4]   1>   1.18869184E9 = BasicModelIn, computed from: 
   [junit4]   1>     2.0 = numberOfDocuments
   [junit4]   1>     1.0 = docFreq
   [junit4]   1>   0.5 = AfterEffectB, computed from: 
   [junit4]   1>     1.18869184E9 = tfn
   [junit4]   1>     1.18869184E9 = totalTermFreq
   [junit4]   1>     1.0 = docFreq
   [junit4]   1> 
   [junit4]   1> DFR I(n)BZ(1.4E-45)
   [junit4]   1> field="field",maxDoc=2,docCount=2,sumTotalTermFreq=1204687257,sumDocFreq=2
   [junit4]   1> term="term",docFreq=1,totalTermFreq=1188691903
   [junit4]   1> norm=53 (doc length \~ 76)
   [junit4]   1> freq=1.18869184E9
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestBasicModelIn -Dtests.method=testRandomScoring -Dtests.seed=4210BC5FDD9E3841 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=pt -Dtests.timezone=AET -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 6.16s J1 | TestBasicModelIn.testRandomScoring <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: score(1.18869171E9)=1.27634828E18 > score(1.18869184E9)=1.27634814E18
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([4210BC5FDD9E3841:C98FE5EDC7E9DE4B]:0)
   [junit4]    >    at org.apache.lucene.search.similarities.BaseSimilarityTestCase.doTestScoring(BaseSimilarityTestCase.java:405)
   [junit4]    >    at org.apache.lucene.search.similarities.BaseSimilarityTestCase.testRandomScoring(BaseSimilarityTestCase.java:357)
   [junit4]    >    at java.lang.Thread.run(Thread.java:748)
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {field=PostingsFormat(name=LuceneVarGapDocFreqInterval)}, docValues:{}, maxPointsInLeafNode=839, maxMBSortInHeap=6.659456353481144, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@19821e9), locale=pt, timezone=AET
   [junit4]   2> NOTE: Linux 4.10.0-37-generic i386/Oracle Corporation 1.8.0_144 (32-bit)/cpus=8,threads=1,free=227959720,total=316669952

And from https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20744/ (output below is from my reproduction, since the job output is no longer accessible - git sha is 95d287e):

   [junit4] Suite: org.apache.lucene.search.similarities.TestBasicModelIn
   [junit4]   1> 8.0517238E9 = score(DFRSimilarity, doc=0, freq=1.86950656E9), computed from:
   [junit4]   1>   1.6103447E9 = boost
   [junit4]   1>   2.6727952E22 = NormalizationH1, computed from: 
   [junit4]   1>     1.86950656E9 = tf
   [junit4]   1>     1.4181463E9 = avgFieldLength
   [junit4]   1>     213016.0 = len
   [junit4]   1>   1.3363976E23 = BasicModelIn, computed from: 
   [junit4]   1>     79.0 = numberOfDocuments
   [junit4]   1>     2.0 = docFreq
   [junit4]   1>   3.7414016E-23 = AfterEffectL, computed from: 
   [junit4]   1>     2.6727952E22 = tfn
   [junit4]   1> 
   [junit4]   1> 8.0517233E9 = score(DFRSimilarity, doc=0, freq=1.86950669E9), computed from:
   [junit4]   1>   1.6103447E9 = boost
   [junit4]   1>   2.6727954E22 = NormalizationH1, computed from: 
   [junit4]   1>     1.86950669E9 = tf
   [junit4]   1>     1.4181463E9 = avgFieldLength
   [junit4]   1>     213016.0 = len
   [junit4]   1>   1.3363977E23 = BasicModelIn, computed from: 
   [junit4]   1>     79.0 = numberOfDocuments
   [junit4]   1>     2.0 = docFreq
   [junit4]   1>   3.7414013E-23 = AfterEffectL, computed from: 
   [junit4]   1>     2.6727954E22 = tfn
   [junit4]   1> 
   [junit4]   1> DFR I(n)L1
   [junit4]   1> field="field",maxDoc=852,docCount=79,sumTotalTermFreq=112033561766,sumDocFreq=79
   [junit4]   1> term="term",docFreq=2,totalTermFreq=2487761701
   [junit4]   1> norm=149 (doc length \~ 213016)
   [junit4]   1> freq=1.86950669E9
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestBasicModelIn -Dtests.method=testRandomScoring -Dtests.seed=56DDF2F2BA9390E3 -Dtests.slow=true -Dtests.locale=cs -Dtests.timezone=Mexico/BajaNorte -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 3.91s | TestBasicModelIn.testRandomScoring <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: score(1.86950656E9)=8.0517238E9 > score(1.86950669E9)=8.0517233E9
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([56DDF2F2BA9390E3:DD42AB40A0E476E9]:0)
   [junit4]    >        at org.apache.lucene.search.similarities.BaseSimilarityTestCase.doTestScoring(BaseSimilarityTestCase.java:405)
   [junit4]    >        at org.apache.lucene.search.similarities.BaseSimilarityTestCase.testRandomScoring(BaseSimilarityTestCase.java:357)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
   [junit4]   2> NOTE: test params are: codec=DummyCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=DUMMY, chunkSize=4442, maxDocsPerChunk=146, blockSize=5), termVectorsFormat=CompressingTermVectorsFormat(compressionMode=DUMMY, chunkSize=4442, blockSize=5)), sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@146b4f40), locale=cs, timezone=Mexico/BajaNorte

asfimport commented 7 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Another reproducing failure, from my Jenkins; it's a different test suite, but looks similar enough to me to comment on this issue:

Checking out Revision b44497fdb721fb67c3c8f20dd0a781f6beaaa8a6 (refs/remotes/origin/master)
[...]
   [junit4] Suite: org.apache.lucene.search.similarities.TestBasicModelG
   [junit4]   1> 5.9448525E9 = score(DFRSimilarity, doc=0, freq=0.99999994), computed from:
   [junit4]   1>   1.98161741E9 = boost
   [junit4]   1>   1.49336593E16 = NormalizationH1, computed from: 
   [junit4]   1>     0.99999994 = tf
   [junit4]   1>     1.05701216E9 = avgFieldLength
   [junit4]   1>     152.0 = len
   [junit4]   1>   4.4800976E16 = BasicModelG, computed from: 
   [junit4]   1>     12.0 = numberOfDocuments
   [junit4]   1>     1.0 = totalTermFreq
   [junit4]   1>   6.6962825E-17 = AfterEffectL, computed from: 
   [junit4]   1>     1.49336593E16 = tfn
   [junit4]   1> 
   [junit4]   1> 5.944852E9 = score(DFRSimilarity, doc=0, freq=1.0), computed from:
   [junit4]   1>   1.98161741E9 = boost
   [junit4]   1>   1.49336603E16 = NormalizationH1, computed from: 
   [junit4]   1>     1.0 = tf
   [junit4]   1>     1.05701216E9 = avgFieldLength
   [junit4]   1>     152.0 = len
   [junit4]   1>   4.480098E16 = BasicModelG, computed from: 
   [junit4]   1>     12.0 = numberOfDocuments
   [junit4]   1>     1.0 = totalTermFreq
   [junit4]   1>   6.696282E-17 = AfterEffectL, computed from: 
   [junit4]   1>     1.49336603E16 = tfn
   [junit4]   1> 
   [junit4]   1> DFR GL1
   [junit4]   1> field="field",maxDoc=50,docCount=12,sumTotalTermFreq=12684145308,sumDocFreq=12
   [junit4]   1> term="term",docFreq=1,totalTermFreq=1
   [junit4]   1> norm=64 (doc length \~ 152)
   [junit4]   1> freq=1.0
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestBasicModelG -Dtests.method=testRandomScoring -Dtests.seed=4B5C3926B202A201 -Dtests.slow=true -Dtests.locale=en-IE -Dtests.timezone=Pacific/Bougainville -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 7.31s J0 | TestBasicModelG.testRandomScoring <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: score(0.99999994)=5.9448525E9 > score(1.0)=5.944852E9
   [junit4]    >    at __randomizedtesting.SeedInfo.seed([4B5C3926B202A201:C0C36094A875440B]:0)
   [junit4]    >    at org.apache.lucene.search.similarities.BaseSimilarityTestCase.doTestScoring(BaseSimilarityTestCase.java:405)
   [junit4]    >    at org.apache.lucene.search.similarities.BaseSimilarityTestCase.testRandomScoring(BaseSimilarityTestCase.java:357)
   [junit4]    >    at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {field=PostingsFormat(name=LuceneFixedGap)}, docValues:{}, maxPointsInLeafNode=68, maxMBSortInHeap=6.052983739984725, sim=RandomSimilarity(queryNorm=false): {field=DFR I(ne)B3(800.0)}, locale=en-IE, timezone=Pacific/Bougainville
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=293394976,total=351797248

asfimport commented 7 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Steve: I am not sure if the 3 failures represent just one bug, but its very relevant.

Adrien's suggested fix alone will fix #1 and #3 but not #2. #2 is very clearly the hazard in AfterEffectB that I described (you can see it from the explain). If you combine both of our suggested fixes, all 3 cases will pass.

We should first maybe make the test more efficient.

asfimport commented 7 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I tested your last failure of GL2 (#4) and its also covered by adrien's fix.

asfimport commented 7 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here is a patch improving the test. It just tries to hit the edge cases with more probability.

It now seems to generally fail with only 1 or 2 rounds of

ant beast -Dbeast.iters=10 -Dtests.class="org.apache.lucene.search.similarities.Test*"

I'd rather it fail every time of course, but I think its still an improvement.

asfimport commented 7 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I dug into the I(n) and I(ne) failures here via the new test, their biggest problem is in the BasicModel itself.

These idf-like functions have the "log1p" trap due to the formulas in use. Note their formula is log2((maxDoc + 1) / (x + 0.5)) where x is docFreq for I(n), expected docFreq for I(ne), and totalTermFreq for I(F). So the worst case (e.g. term in every doc) gets even worse as collection size increases, because we take log of values increasingly closer to 1.

BasicModel I(F) never fails because we added a floor in its log: we had to, since it would otherwise go negative when totalTermFreq exceeds maxDoc, which can easily happen. But we should fix the other two in the same way, I think. It does not change retrieval quality in my tests.

If I floor them to avoid this issue like this, it fixes all their fails here and they survive hundred rounds of beasting by my new test:

--- a/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java
+++ b/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java
`@@` -33,7 +33,7 `@@` public class BasicModelIn extends BasicModel {
   public final double score(BasicStats stats, double tfn) {
     long N = stats.getNumberOfDocuments();
     long n = stats.getDocFreq();
-    return tfn * log2((N + 1) / (n + 0.5));
+    return tfn * log2(1 + (N + 1) / (n + 0.5));
   }

   `@Override`
--- a/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIne.java
+++ b/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIne.java
`@@` -34,7 +34,7 `@@` public class BasicModelIne extends BasicModel {
     long N = stats.getNumberOfDocuments();
     long F = stats.getTotalTermFreq();
     double ne = N * (1 - Math.pow((N - 1) / (double)N, F));
-    return tfn * log2((N + 1) / (ne + 0.5));
+    return tfn * log2(1 + (N + 1) / (ne + 0.5));
   }

Model G failures are separate, I have not looked into it yet.

asfimport commented 7 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

+1 to giving it a try

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I have been looking into the following model G failure.

7.0E-45 = score(DFRSimilarity, doc=0, freq=0.99999994), computed from:
  1.4E-45 = boost
  3.09640771E16 = NormalizationH1, computed from: 
    0.99999994 = tf
    1.61490253E9 = avgFieldLength
    112.0 = len
  9.2892231E16 = BasicModelG, computed from: 
    12.0 = numberOfDocuments
    1.0 = totalTermFreq
  4.8443234E-17 = AfterEffectB, computed from: 
    3.09640771E16 = tfn
    1.0 = totalTermFreq
    1.0 = docFreq

5.6E-45 = score(DFRSimilarity, doc=0, freq=1.0), computed from:
  1.4E-45 = boost
  3.09640792E16 = NormalizationH1, computed from: 
    1.0 = tf
    1.61490253E9 = avgFieldLength
    112.0 = len
  9.289224E16 = BasicModelG, computed from: 
    12.0 = numberOfDocuments
    1.0 = totalTermFreq
  4.844323E-17 = AfterEffectB, computed from: 
    3.09640792E16 = tfn
    1.0 = totalTermFreq
    1.0 = docFreq

DFR GB1
field="field",maxDoc=46519,docCount=12,sumTotalTermFreq=19378830951,sumDocFreq=19378830951
term="term",docFreq=1,totalTermFreq=1
norm=59 (doc length \~ 112)
freq=1.0
NOTE: reproduce with: ant test  -Dtestcase=TestBasicModelG -Dtests.method=testRandomScoring -Dtests.seed=3C22B051C61EEC84 -Dtests.locale=cs-CZ -Dtests.timezone=Atlantic/Madeira -Dtests.asserts=true -Dtests.file.encoding=UTF-8

In short, the scoring formula here looks like (A + B * tfn) * (C / (tfn + 1)) where A, B and C are constants. This function increases when tfn increases when B > A, which is always the case. The problem is that tfn is so large (ulp(tfn) = 4) , that tfn+1 always returns tfn and A + B * tfn always returns the same as B * tfn. So when tfn gets high, the formula is effectively (B * tfn) * (C / tfn). This is a constant, but since we compute the left and right parts independently, this might decrease when tfn increases about half the time.

Even though I triggered it with BasicModelG, I suspect it affects almost all DFRSimilarity impls.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

thanks for the analysis! I still don't even really want to commit the "floor" modifications for In and Ine because i dont like it: really a scoring formula should be able to return a tiny tiny value for a stopword, that should be ok. It shouldnt have to be a number between 1 and 43 or whatever to work with lucene.

For model IF its justifiable, just like its justifiable in the BM25 case, because the formula is fundamentally broken you know, i mean we dont want negative scores for stopwords.

But your analysis suggests maybe we can look at a more surgical fix, like a nextUp/nextDown somewhere?

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I don't think we can fix this with a nextUp/nextDown? One way we could fix it for sure would be by implementing the basic model and the after effect in a single method. For instance (A + B * tfn) * (C / (tfn + 1)) could be rewritten as (A - B + B * (1 + tfn))) * C / (tfn + 1) = (A - B) * C / (tfn + 1) + B * C. Since there is only one occurrence of tfn in the latter, it would be guaranteed to be non-decreasing when tfn increases. Fixing it in the general case looks challenging however?

Maybe one reasonable way to avoid this issue would be to bound the values that tfn may take? This isn't nice but it wouldn't affect the general case, only when freq, avgdl, or some other stats have extreme values?

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Given that this bug is not easy to reproduce due to how we finally cast double scores to floats, which often returns the same value for consecutive values of freq, I tried to hack the test framework to compare the produced doubles (see attached patch - note this is for testing purpose only, I don't plan/want to merge it). My assumption is that if we can reproduce the issue with doubles, it means it can happen after a float cast as well, the scorer just needs to produce a value that is close enough from the boundary between two floats so that both values would round to different floats. And indeed tests fail systematically with this patch. The bad news is that I can't think of a way to fix the formula. Even if I put quite severe restrictions on the values that tfn may take, there are still some special freq values that manage to prove the score is not monotonic. Good news is that it doesn't make some other SimilarityBase impls fail like the axiomatic ones.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

I think our best option is to specialize some combinations. We should be able to specialize basic models G, IF, I(n) and I(ne) with after effects B, L and NoAfterEffect and make them pass tests. For instance, I tested out this specialization of model G and after effect L to make sure it actually passes the tests:

/** BasicModel G + AfterEffect L */
public class DFRSimilarityGL extends SimilarityBase {

  private final Normalization normalization;

  public DFRSimilarityGL(Normalization normalization) {
    this.normalization = Objects.requireNonNull(normalization);
  }

  `@Override`
  protected double score(BasicStats stats, double freq, double docLen) {
    double tfn = normalization.tfn(stats, freq, docLen);

    // approximation only holds true when F << N, so we use lambda = F / (N + F)
    double F = stats.getTotalTermFreq() + 1;
    double N = stats.getNumberOfDocuments();
    double lambda = F / (N + F);

    // -log(1 / (lambda + 1)) -> log(lambda + 1)
    double A = log2(lambda + 1);
    double B = log2((1 + lambda) / lambda);

    // basic model G uses (A + B * tfn)
    // after effect L takes the result and divides it by (1 + tfn)
    // so in the end we have (A + B * tfn) / (1 + tfn)
    // which we change to B - (B - A) / (1 + tfn) to reduce floating-point accuracy issues
    // (since tfn appears only once it is guaranteed to be non decreasing with tfn
    return B - (B - A) / (1 + tfn);
  }

  `@Override`
  public String toString() {
    return "DFR GL" + normalization.toString();
  }

}

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Adrien it should reproduce every time with the test changes i made on this issue? Its just a bug in the test that it doesn't explicitly test the extremes but instead relies on randomness.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think I like the proposed solution. Lets drop NoAfterEffect though, i'm not sure its even theoretical: I don't see it in the DFR paper (http://theses.gla.ac.uk/1570/1/2003amatiphd.pdf). That would yield 8 solid combinations which seems manageable. There are also some "+1"'s that maybe are no longer necessary (I don't know if it makes this task easier, just mentioning it: LUCENE-8023)

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Adrien it should reproduce every time with the test changes i made on this issue?

It doesn't because the fact we always compute scores as doubles then cast to a float hides the issue: even if score the score of Math.nextDown(freq) is more than the score of freq, the float cast rounds both values to the same float almost all the time.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

If we are fine with removing support for NoAfterEffect, maybe we could change the contract of AfterEffect to return the product of the after effect with (1+tfn), which has the nice property of not depending on tfn for both B and L. It makes the internal API less nice but has the benefit of keeping the basic model and after effect plugged in separately, similarly to the Terrier API.

The attached patch implements this idea and removes after effects BE, D and P, which I couldn't fix to produce scores that do not decrease when tfn increases. Tests pass for all combinations of the DFRSimilarity with this patch.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Took a glance, I am good with this approach, thank you! I would like to combine your patch with my test patch (attached to this issue) though, because it makes the test much better for all sims not just this particular case by exercising the extremes explicitly.

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 63b63c573487fe6b054afb6073c057a88a15288f in lucene-solr's branch refs/heads/master from @jpountz https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=63b63c5

LUCENE-8015: Fixed DFR similarities' scores to not decrease when tfn increases.

asfimport commented 6 years ago

Adrien Grand (@jpountz) (migrated from JIRA)

Done, I combined both patches and beasting didn't find any failures so I merged. Thank you!

apache / lucene

TestBasicModelIne.testRandomScoring failure [LUCENE-8015] #9063