apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.64k stars 1.02k forks source link

improve lucene's similarity algorithm defaults [LUCENE-2187] #3263

Open asfimport opened 14 years ago

asfimport commented 14 years ago

First things first: I am not an IR guy. The goal of this issue is to make 'surgical' tweaks to lucene's formula to bring its performance up to that of more modern algorithms such as BM25.

In my opinion, the concept of having some 'flexible' scoring with good speed across the board is an interesting goal, but not practical in the short term.

Instead here I propose incorporating some work similar to lnu.ltc and friends, but slightly different. I noticed this seems to be in line with that paper published before about the trec million queries track...

Here is what I propose in pseudocode (overriding DefaultSimilarity):

  `@Override`
  public float tf(float freq) {
    return 1 + (float) Math.log(freq);
  }

  `@Override`
  public float lengthNorm(String fieldName, int numTerms) {
    return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
  }

Where slope is a constant (I used 0.25 for all relevance evaluations: the goal is to have a better default), and pivot is the average field length. Obviously we shouldnt make the user provide this but instead have the system provide it.

These two pieces do not improve lucene much independently, but together they are competitive with BM25 scoring with the test collections I have run so far.

The idea here is that this logarithmic tf normalization is independent of the tf / mean TF that you see in some of these algorithms, in fact I implemented lnu.ltc with cosine pivoted length normalization and log(tf)/log(mean TF) stuff and it did not fare as well as this method, and this is simpler, we do not need to calculate this mean TF at all.

The BM25-like "binary" pivot here works better on the test collections I have run, but of course only with the tf modification.

I am uploading a document with results from 3 test collections (Persian, Hindi, and Indonesian). I will test at least 3 more languages... yes including English... across more collections and upload those results also, but i need to process these corpora to run the tests with the benchmark package, so this will take some time (maybe weeks)

so, please rip it apart with scoring theory etc, but keep in mind 2 of these 3 test collections are in the openrelevance svn, so if you think you have a great idea, don't hesitate to test it and upload results, this is what it is for.

also keep in mind again I am not a scoring or IR guy, the only thing i can really bring to the table here is the willingness to do a lot of relevance testing!


Migrated from LUCENE-2187 by Robert Muir (@rmuir), 2 votes, updated May 09 2016 Attachments: LUCENE-2187.patch, scoring.pdf (versions: 3) Linked issues:

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

document with some simple results from the 3 collections i tested thus far.

i chose to display simple graphs with descriptions of the collections and some of their peculiarities.

if you want submission.txt dumps or verbose output from trec_eval, I can do that too, but I think its less useful to start with.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

attaching updated document with results for a 4th test collection, on english. for this one BM25 did not fare so well.

For the lazy, here are the MAP values:

StandardAnalyzer Default Scoring: 0.3837 BM25 Scoring: 0.3580 Improved Scoring: 0.3994

StandardAnalyzer + Porter Default Scoring: 0.4333 BM25 Scoring: 0.4131 Improved Scoring: 0.4515

StandardAnalyzer + Porter + MoreLikeThis (top 5 docs) Default Scoring: 0.5234 BM25 Scoring: 0.5087 Improved Scoring: 0.5474

Note that 0.5572 was the highest performing MAP on this corpus (Microsoft Research) in FIRE 2008: http://www.isical.ac.in/\~fire/paper/Udupa-mls-fire2008.pdf

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

sorry, correct some transposition of axes labels and some grammatical mistakes :)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

attached is a patch with the Similarity impl. of course you have to manually supply this pivot value (avg doc. length), for now.

asfimport commented 11 years ago

Tom Burton-West (migrated from JIRA)

Hi Robert,

Is this implementation made moot by the new GSOC work, or would it still be worth testing this as well as BM25, DFR and INF?

I can't seem to find a link to the ORP collections. Can you point me to it? (I plan to test with our long docs, but thought I would try out some of the ORP collections as well)

Tom

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.