Implement various ranking models as Similarities [LUCENE-3220]

asfimport commented 13 years ago

With #4247 done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu.

Done:

EasyStats: contains all statistics that might be relevant for a ranking algorithm
EasySimilarity: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible
BM25: the current "mock" implementation might be OK
LM
DFR
The so-called Information-Based Models

Migrated from LUCENE-3220 by David Mark Nemeskey, resolved Aug 12 2011
Parent: #4033
Attachments: LUCENE-3220.patch (versions: 24)
Linked issues: #4430

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

EasyStats object added.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

a few comments (it generally looks close to me):

maybe we should use 'numberOfDocuments' instead of 'docNo' and same with 'numberOfFieldTokens'? this might make the naming more clear
i'm worried about 'uniqueTermCount', do you know of which implementations require this? this number is not accurate if the index has more than one segment.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

I was wondering about that too – actually docNo is a mistake, it should have been noDocs or noOfDocs anyway, but I guess I'll just go with numberOfDocuments.

I'll put a nocommit there for the time being, and if no sims use it, I'll just remove it from the Stats. Terrier has it, though, so I guess there should be at least one method that depends on it.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

"I'll put a nocommit there for the time being, and if no sims use it, I'll just remove it from the Stats. Terrier has it, though, so I guess there should be at least one method that depends on it."

I've never seen one that did... I don't imagine us ever implementing this efficiently given that we support incremental indexing.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

oh two more nitpicky comments:

can you update the patch to use two spaces instead of tabs? if you use eclipse, you can download this and configure this as your default codestyle: http://people.apache.org/~rmuir/Eclipse-Lucene-Codestyle.xml
can you also remove the @author? For legal reasons (i think actually for your protection!) we omit these from new files.
it might be a good idea to use the tag @lucene.experimental also for new classes: this is a template that 'ant-javadocs' replaces with "WARNING: This API is experimental and might change in incompatible ways in the next release." to tell users that it's very new and not to expect precise backwards compatibility.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Oh, sorry, how lame of me :( Actually I am working now on a different machine than the one I usually do, so that's why I made those mistakes. Anyhow, I have fixed them.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

one last thing, can we do 'numberOfFieldTokens' instead of noFieldTokens?

then I think we can commit this as a step, should make things a lot easier for experimentation, if you are new to lucene it will make life much easier.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Done.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

EasySimilarity added. Lots of questions and nocommit in the code.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Just took a look, a few things that might help:

yes the maxdoc does not reflect deletions, but neither do things like totalTermFreq or docFreq... so it's best not to worry about deletions in the scoring, and to be consistent and use the stats (e.g. maxDoc, not numDocs) that do not take deletions into account.
for the computeStats(TermContext... termContexts) it's weird to sum the DF across the different terms in this case? But i don't honestly have any suggestions here... maybe in this case we should make an EasyPhraseStats that computes the EasyStats for each term, so it's not hiding anything or limiting anyone? and you could then do an instanceof check and have a different method like scorePhrase() that it forwards to in case it's an EasyPhraseStats? In general i'm not sure how other ranking systems tend to handle this case; the phrase estimation for IDF in lucene's formula is done by summing the IDFs.
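To make the contrast concrete, here is a minimal sketch of the two options: summing the document frequencies and computing one IDF, versus computing an IDF per term and summing those. The class and method names are hypothetical and not part of the patch, and the IDF formula (1 + ln(maxDoc / (docFreq + 1)), the classic DefaultSimilarity shape) is an assumption used only for illustration.

```java
// Hypothetical helper, not part of the patch: compares two ways of
// estimating an IDF for a multi-term (phrase) query.
final class PhraseIdfSketch {

  // Classic Lucene-style IDF; assumed here for illustration only.
  static double idf(long docFreq, long maxDoc) {
    return 1.0 + Math.log((double) maxDoc / (double) (docFreq + 1));
  }

  // Option 1: sum the document frequencies, then compute a single IDF.
  static double idfOfSummedDocFreq(long[] docFreqs, long maxDoc) {
    long sum = 0;
    for (long df : docFreqs) {
      sum += df;
    }
    return idf(sum, maxDoc);
  }

  // Option 2: compute an IDF per term and sum the results.
  static double sumOfIdfs(long[] docFreqs, long maxDoc) {
    double sum = 0.0;
    for (long df : docFreqs) {
      sum += idf(df, maxDoc);
    }
    return sum;
  }

  public static void main(String[] args) {
    long[] docFreqs = {9, 99}; // made-up document frequencies
    long maxDoc = 1000;        // made-up collection size
    System.out.println("IDF of summed DF: " + idfOfSummedDocFreq(docFreqs, maxDoc));
    System.out.println("Sum of IDFs:      " + sumOfIdfs(docFreqs, maxDoc));
  }
}
```

Summing the DFs makes a phrase look like one fairly common term, while summing the IDFs rewards every rare term in the phrase, so the two estimates can differ considerably.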

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Implementation of the DFR framework added. Lots of nocommits, though. Things to think about:

lots of (float) conversions. Maybe the inner API (BasicModel, etc.) could use doubles? According to my experience, double is faster anyway, at least on 64bit architectures
I am not overly happy about the naming scheme, i.e. BasicModelBE, etc. Maybe we should do it the same way as in Terrier, with a basicmodel package and class names like BE?
A regular SimilarityProvider implementation won't play well with DFRSimilarity, in case the user wants to use several different setups. Actually, this is a problem for all similarities that have parameters (e.g. BM25 with b and k).

Also, I think we need that NormConverter we talked about earlier on irc, so that the Similarities can run on any index.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Made the signature of EasySimilarity.score() a bit saner.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Explanation-handling added to EasySimilarity and DFRSimilarity.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Information-based model framework due to Clinchant and Gaussier added.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Fixed a few things in MockBM25Similarity.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

log2() moved from DFRSimilarity to EasySimilarity,

changed DFRSimilarity so that its constructor does not use reflection.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi David: I had some ideas on stats to simplify some of these sims:

I think we can use an easier way to compute average document length: sumTotalTermFreq() / maxDoc(). This way the average is 'exact' and not skewed by index-time-boosts, smallfloat quantization, or anything like that.
To support pivoted unique normalization like lnu.ltc, I think we can solve this in a similar way: add sumDocFreq(), which is just a single long, and divide this by maxDoc. this gives us avg # of unique terms. I think terrier might have a similar stat (#postings or #pointers or something)?

so i think this could make for nice simplifications: especially for switching norms completely over to docvalues: we should be able to do #1 immediately right now, change the way we compute avgdoclen for e.g. BM25 and DFR.

then in a separate issue i could revert this norm summation stuff to make the docvalues integration simpler, and open a new issue for sumDocFreq()
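The two averages proposed above are plain divisions over collection-level counters. A minimal sketch follows, assuming the three statistics are already available as numbers; the class and method names are hypothetical, and only sumTotalTermFreq, sumDocFreq and maxDoc come from the discussion.

```java
// Hypothetical helper, not part of the patch: derives the two averages
// from collection-level statistics.
final class AverageLengthSketch {

  // Average document length: total number of tokens in the field
  // divided by the number of documents.
  static double averageDocumentLength(long sumTotalTermFreq, int maxDoc) {
    return maxDoc == 0 ? 0.0 : (double) sumTotalTermFreq / maxDoc;
  }

  // Average number of unique terms per document: sumDocFreq counts one
  // posting per (term, document) pair, so dividing by maxDoc gives the mean.
  static double averageUniqueTerms(long sumDocFreq, int maxDoc) {
    return maxDoc == 0 ? 0.0 : (double) sumDocFreq / maxDoc;
  }

  public static void main(String[] args) {
    // Made-up numbers: 4 documents, 1000 tokens, 600 postings.
    System.out.println(averageDocumentLength(1000L, 4));
    System.out.println(averageUniqueTerms(600L, 4));
  }
}
```

Because neither value goes through the norm byte, neither is affected by index-time boosts or SmallFloat quantization, which is the point of the suggestion.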

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Fixed #1

Added a totalBoost to EasySimilarity, and a getter method – no one uses it yet
Added basic implementations for the Jelinek-Mercer and the Dirichlet LM methods.

As for the last one: the implementation is very basic now, I want to factor a few things out (e.g. p(w|C) to LMStats, possibly in a pluggable way so ppl can implement it however they want). It also doesn't seem right to have the same LM method implemented twice (both as MockLMSimilarity and here), so I'll take a look to see if I can merge those two. Finally, I am wondering whether I should implement the absolute discounting method, which, according to the paper, seems inferior to the Jelinek-Mercer and Dirichlet methods. Right now I am more on the "no" side.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Added LMSimilarity so that the two LM methods have a common parent. It also defines the CollectionModel interface which computes p(w|C) in a pluggable way (and only once per query, though I am not sure this improves performance as I need a cast in score()).
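For readers unfamiliar with the two smoothing methods, here is a minimal sketch of the textbook Jelinek-Mercer and Dirichlet estimates with a pluggable collection model. All names are hypothetical apart from CollectionModel, the default p(w|C) = totalTermFreq / numberOfFieldTokens is an assumption, and the methods return the smoothed probability p(w|d), which is not necessarily the score the patch computes.

```java
// Hypothetical sketch, not the patch's code: textbook LM smoothing with a
// pluggable collection model p(w|C).
final class LmSmoothingSketch {

  // Computes p(w|C) from collection statistics; implementations may differ.
  interface CollectionModel {
    double probability(long totalTermFreq, long numberOfFieldTokens);
  }

  // Assumed default: relative frequency of the term in the collection.
  static final CollectionModel DEFAULT_MODEL =
      (totalTermFreq, numberOfFieldTokens) -> (double) totalTermFreq / numberOfFieldTokens;

  // Jelinek-Mercer: linear interpolation between document and collection models.
  static double jelinekMercer(double tf, double docLen, double pwc, double lambda) {
    return (1.0 - lambda) * (tf / docLen) + lambda * pwc;
  }

  // Dirichlet: the collection model acts as mu pseudo-counts added to the document.
  static double dirichlet(double tf, double docLen, double pwc, double mu) {
    return (tf + mu * pwc) / (docLen + mu);
  }

  public static void main(String[] args) {
    // p(w|C) depends only on the term and the collection, so it can be
    // computed once per query term and reused for every document.
    double pwc = DEFAULT_MODEL.probability(50L, 10_000L);
    System.out.println(jelinekMercer(2.0, 100.0, pwc, 0.5));
    System.out.println(dirichlet(2.0, 100.0, pwc, 1000.0));
  }
}
```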

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Explanation added to LM models; query boost added.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Made the score() and explain() methods in Similarity components final.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi David, this is looking really good! The patch is quite large so what i did was:

re-sync flexscoring branch to trunk
commit your patch as is (i did a tiny tweak for LUCENE-3299)

I saw a couple things we should address (full review will really mean i have to take quite a bit of time for each model!) But we can take care of some of this easy stuff first!

numberOfFieldTokens seems to be the same as sumOfTotalTF? we should only have one name for this stat i think
i like the idea of NoAfterEffect/NoNormalization in DFR, maybe we should make these ordinary classes, and in DFR we just don't allow null for any of the components? just thought it might look cleaner.
some of the files in .similarities need apache license header.
because we don't need the norm for averaging, maybe we should use lucene's encoding? we can pre-build the decode table like TF-IDF similarity, except our decode table is basically 1 / decode(float)^2 to give us the quantized doc length. from a practical perspective, this would mean someone could use this stuff with existing lucene indexes (once they upgrade their segments to 4.0's format), and easily switch between things without reindexing.

if you want, you can do these things on this issue or open separate issues, whichever is easiest. but i think looking at smaller patches at this point will make iteration easier!

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Fixed two of the issues you mentioned:

Apache license header added to all files in the similarities package;
cleaned up the constructor of DFRSimilarity and added a few new ones.

I have not yet moved the NoNormalization and NoAfterEffect classes to their own files, because I feel a bit uncomfortable about the naming, since it's different from that of the other classes, e.g. NormalizationH2 vs NoNormalization.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks David: i committed this.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

I think I realized what I wanted with numberOfFieldTokens. I was afraid that sumTotalTermFreq is affected by norms / index time boost / etc, and I wanted to make numberOfFieldTokens unaffected by those (I don't know now how); only I forgot to do so.

But if sumTotalTermFreq is really just the number of tokens in the field, I will delete one of them. Not sure which, because for me numberOfFieldTokens seems a more descriptive name than sumTotalTermFreq, but the latter is used everywhere in Lucene. May I ask your opinion on this question?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

"Not sure which, because for me numberOfFieldTokens seems a more descriptive name than sumTotalTermFreq"

I think I agree with you: in the context of stats for scoring this might be the way to go, as it's easier to understand.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Added norm decoding table to EasySimilarity, and removed sumTotalFreq. Sorry I could only upload this patch now but I didn't have time to work on Lucene the last week.

As I see, all the problems you mentioned have been corrected, so maybe we can go on with the review?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi David, i was thinking for the norm, we could store it like DefaultSimilarity. this would make it especially convenient, as you could easily use these similarities with the same exact index as one using Lucene's default scoring. Also I think (not sure!) by using 1/sqrt we will get better quantization from smallfloat?

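  // DefaultSimilarity's norm: boost * 1/sqrt(length), quantized into a single byte.
  public byte computeNorm(FieldInvertState state) {
    final int numTerms;
    if (discountOverlaps) {
      // Do not count tokens with a position increment of 0 (e.g. synonyms).
      numTerms = state.getLength() - state.getNumOverlap();
    } else {
      numTerms = state.getLength();
    }
    return encodeNormValue(state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms))));
  }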

for computations, you have to 'undo' the sqrt() to get the quantized length, but that's ok since it's only done up-front a single time and tableized, so it won't slow anything down.
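The "undo the sqrt once and tableize it" idea can be sketched as below. The encode/decode pair is a stand-alone reimplementation modelled on Lucene's SmallFloat 3-bit-mantissa byte encoding as I understand it, so treat it as an assumption rather than the library's code; the class, field and method names are hypothetical.

```java
// Hypothetical sketch: pre-computes quantized document lengths for all 256
// possible norm bytes, so scoring needs only a table lookup.
final class NormTableSketch {

  // Assumed stand-in for the norm encoder: keeps 3 mantissa bits of the float.
  static byte encode(float f) {
    int bits = Float.floatToRawIntBits(f);
    int small = bits >> (24 - 3);
    if (small <= ((63 - 15) << 3)) {
      return (bits <= 0) ? (byte) 0 : (byte) 1; // zero, negative or underflow
    }
    if (small >= ((63 - 15) << 3) + 0x100) {
      return -1; // overflow: largest representable value
    }
    return (byte) (small - ((63 - 15) << 3));
  }

  // Assumed stand-in for the matching decoder.
  static float decode(byte b) {
    if (b == 0) {
      return 0.0f;
    }
    int bits = (b & 0xff) << (24 - 3);
    bits += (63 - 15) << 24;
    return Float.intBitsToFloat(bits);
  }

  // LENGTH_TABLE[i] = 1 / decode(i)^2, i.e. the sqrt() undone once, up front.
  static final float[] LENGTH_TABLE = new float[256];

  static {
    for (int i = 0; i < 256; i++) {
      float norm = decode((byte) i);
      LENGTH_TABLE[i] = 1.0f / (norm * norm); // byte 0 decodes to 0, giving Infinity
    }
  }

  static float quantizedLength(byte norm) {
    return LENGTH_TABLE[norm & 0xff];
  }

  public static void main(String[] args) {
    byte norm = encode((float) (1.0 / Math.sqrt(100))); // a 100-token document, boost 1
    System.out.println("quantized length: " + quantizedLength(norm));
  }
}
```

The recovered length is only approximate because of the one-byte quantization. If an index-time boost was multiplied into the norm, the same lookup yields length / boost^2 rather than the raw length, since 1 / (boost / sqrt(length))^2 = length / boost^2.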

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

EasySimilarity now computes norms in the same way as DefaultSimilarity.

Actually not exactly the same way, as I have not yet added the discountOverlaps property. I think it would be a good idea for EasySimilarity as well (it is for phrases, right), what do you reckon?

I also wrote a quick test to see which norm (length directly or 1/sqrt) is closer to the original value and it seems that the direct one is usually much closer (RMSE is 0.09689688608375747 vs 0.23787634482532286). Of course, I know it is much more important that the new Similarities can use existing indices.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Deleted the accidentally forgotten abstract modifier from the Distribution classes.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Removed reflection from IBSimilarity.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks, I committed your latest patch, some ideas just perusing:

we can move the calculations currently in decodeNormValue into the static table, this way we aren't doing these per-document multiplications and divisions... so decodeNormValue just returns the document length.
should easysim change its score method from score(Stats stats, float freq, byte norm) to score(Stats stats, float freq, int documentLength) ? then it could encapsulate this encoding/decoding.
I think we should try to factor in the index-time boost in computeNorm here if we can, e.g. just divide the document length by it? So documents with a higher index-time boost have a shorter length.
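A rough sketch of how these three suggestions could fit together: the base class owns the decode table, subclasses only ever see a decoded document length, and the index-time boost is folded into the length. Every name here is hypothetical, the Stats argument is omitted for brevity, and a float length is used instead of the int mentioned above so that the boost division stays exact.

```java
// Hypothetical sketch of a base class that hides norm decoding from subclasses.
abstract class LengthAwareSimilaritySketch {

  // One pre-computed document length per possible norm byte (256 entries).
  private final float[] lengthTable;

  LengthAwareSimilaritySketch(float[] lengthTable) {
    if (lengthTable.length != 256) {
      throw new IllegalArgumentException("expected 256 entries");
    }
    this.lengthTable = lengthTable;
  }

  // Subclasses implement the ranking formula in terms of the decoded length.
  protected abstract float score(float freq, float documentLength);

  // Called by the scoring machinery: decodes the norm with a single lookup.
  final float scoreEncoded(float freq, byte norm) {
    return score(freq, lengthTable[norm & 0xff]);
  }

  // Index-time boost handling as suggested: a higher boost gives a shorter length.
  static float boostedLength(int numTerms, float boost) {
    return numTerms / boost;
  }
}

// Toy model for demonstration only: relative term frequency.
final class RelativeFrequencySketch extends LengthAwareSimilaritySketch {

  RelativeFrequencySketch(float[] lengthTable) {
    super(lengthTable);
  }

  @Override
  protected float score(float freq, float documentLength) {
    return freq / documentLength;
  }
}
```

With this split a new model only implements score(), and a change to the norm encoding touches the base class alone.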

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Done. Actually, I wanted to implement the norm table in the way you said, but somehow forgot about it.

Two questions remain on my side:

the one about discountOverlaps (see above)
what kind of index-time boosts do people usually use? Too big a boost might cause problems if we just divide the length by it. Maybe we should take the logarithm or something like that?

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Added a short explanation on the parameter for the Jelinek-Mercer method.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Added discountOverlaps to EasySimilarity.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Got rid of all but one nocommit.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks David: I committed this.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Robert: Since we use #4430 for testing & bug fixing, I propose we close this issue. If we decide to implement other methods as well, we can do it under a new issue. Or do you have something else in mind (such as to rename EasySimilarity to SimilarityBase)?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1, I do think we should consider naming and stuff (I sorta like SimilarityBase but we can discuss it)... but we should just open separate issues for that after we have worked out all the technical details first, it's easy to refactor naming.

And we also want to at the same time move it into src/java, we can open a separate issue for all of this "integrate new similarities" or something. Let's close this one!

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks David! Awesome work :)

asfimport commented 12 years ago

Fis Ka (migrated from JIRA)

Hi All,

pardon my ignorance, I'm new to this. I need to use BM25 in my current project (bachelor thesis), and I'm using Lucene 3.0.2. Can you tell me what I need to do to add BM25 to my project? Do I get a jar, or do I need to compile everything on my own? Furthermore, do I need to re-index my sources in order to have BM25 working?

best,

fiska
