apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Implement various ranking models as Similarities [LUCENE-3220] #4293

Closed asfimport closed 13 years ago

asfimport commented 13 years ago

With #4247 done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu.

Done:


Migrated from LUCENE-3220 by David Mark Nemeskey, resolved Aug 12 2011

Parent: #4033

Attachments: LUCENE-3220.patch (versions: 24)

Linked issues:

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

EasyStats object added.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

a few comments (it generally looks close to me):

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

* I was wondering about that too – actually docNo is a mistake, it should have been noDocs or noOfDocs anyway, but I guess I'll just go with numberOfDocuments.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> I'll put a nocommit there for the time being, and if no sims use it, I'll just remove it from the Stats. Terrier has it, though, so I guess there should be at least one method that depends on it.

I've never seen one that did... I don't imagine us ever implementing this efficiently given that we support incremental indexing.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

oh two more nitpicky comments:

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Oh, sorry, how lame of me :( I am currently working on a different machine than my usual one, which is why I made those mistakes. Anyhow, I have fixed them.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

one last thing, can we do 'numberOfFieldTokens' instead of noFieldTokens?

then I think we can commit this as a step. It should make things a lot easier for experimentation, especially if you are new to Lucene.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Done.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

EasySimilarity added. Lots of questions and nocommit in the code.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Just took a look, a few things that might help:

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Implementation of the DFR framework added. Lots of nocommits, though. Things to think about:

Also, I think we need that NormConverter we talked about earlier on IRC, so that the Similarities can run on any index.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Made the signature of EasySimilarity.score() a bit saner.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Explanation-handling added to EasySimilarity and DFRSimilarity.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Information-based model framework due to Clinchant and Gaussier added.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Fixed a few things in MockBM25Similarity.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

* log2() moved from DFRSimilarity to EasySimilarity,

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi David: I had some ideas on stats to simplify some of these sims:

  1. I think we can use an easier way to compute average document length: sumTotalTermFreq() / maxDoc(). This way the average is 'exact' and not skewed by index-time-boosts, smallfloat quantization, or anything like that.
  2. To support pivoted unique normalization like lnu.ltc, I think we can solve this in a similar way: add sumDocFreq(), which is just a single long, and divide this by maxDoc. this gives us avg # of unique terms. I think terrier might have a similar stat (#postings or #pointers or something)?

so i think this could make for nice simplifications, especially for switching norms completely over to docvalues. we should be able to do item 1 right away: change the way we compute avgdoclen for e.g. BM25 and DFR.

then in a separate issue i could revert this norm summation stuff to make the docvalues integration simpler, and open a new issue for sumDocFreq()
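As a rough sketch of the two ratios proposed above, the class below computes the average document length and the average number of unique terms from three index-wide counters. The class name `FieldStatsSketch` and its fields are hypothetical and only mirror the statistic names used in this comment; this is not the Lucene API.

```java
/** Hypothetical holder for the index statistics discussed above (not a Lucene class). */
final class FieldStatsSketch {
    final long sumTotalTermFreq; // total number of tokens in the field
    final long sumDocFreq;       // total number of postings (term-document pairs)
    final int maxDoc;            // number of documents in the index

    FieldStatsSketch(long sumTotalTermFreq, long sumDocFreq, int maxDoc) {
        this.sumTotalTermFreq = sumTotalTermFreq;
        this.sumDocFreq = sumDocFreq;
        this.maxDoc = maxDoc;
    }

    /** Average document length: tokens per document, independent of norms or boosts. */
    double avgDocumentLength() {
        return maxDoc == 0 ? 0.0 : (double) sumTotalTermFreq / maxDoc;
    }

    /** Average number of unique terms per document, as needed for pivoted unique normalization. */
    double avgUniqueTerms() {
        return maxDoc == 0 ? 0.0 : (double) sumDocFreq / maxDoc;
    }
}
```

Because both values come from plain counters, neither is affected by norm quantization.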

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

* Fixed #1

As for the last one: the implementation is very basic now, I want to factor a few things out (e.g. p(w|C) to LMStats, possibly in a pluggable way so ppl can implement it however they want). It also doesn't seem right to have the same LM method implemented twice (both as MockLMSimilarity and here), so I'll take a look to see if I can merge those two. Finally, I am wondering whether I should implement the absolute discounting method, which, according to the paper, seems inferior to the Jelinek-Mercer and Dirichlet methods. Right now I am more on the "no" side.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Added LMSimilarity so that the two LM methods have a common parent. It also defines the CollectionModel interface which computes p(w|C) in a pluggable way (and only once per query, though I am not sure this improves performance as I need a cast in score()).
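A minimal sketch of what a pluggable p(w|C) estimator combined with Jelinek-Mercer smoothing could look like. The names `CollectionModelSketch` and `LmSketch`, the method signatures, and the add-one default estimator are all assumptions made for illustration; the patch's actual interface may differ.

```java
/** Hypothetical pluggable estimator for p(w|C); the name and signature are illustrative only. */
interface CollectionModelSketch {
    double probability(long totalTermFreq, long numberOfFieldTokens);
}

final class LmSketch {
    /** Assumed default: relative frequency with add-one smoothing, so the result is never zero. */
    static final CollectionModelSketch DEFAULT =
        (totalTermFreq, numberOfFieldTokens) ->
            (totalTermFreq + 1.0) / (numberOfFieldTokens + 1.0);

    /**
     * Jelinek-Mercer smoothed score in its rank-equivalent form:
     * log(1 + ((1 - lambda) * tf / docLen) / (lambda * p(w|C))).
     */
    static double jelinekMercer(double tf, double docLen, double lambda, double collectionProb) {
        return Math.log(1.0 + ((1.0 - lambda) * tf / docLen) / (lambda * collectionProb));
    }
}
```

The collection probability depends only on the term and the index, so it can be computed once per query and passed into the per-document scoring call.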

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Explanation added to LM models; query boost added.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Made the score() and explain() methods in Similarity components final.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi David, this is looking really good! The patch is quite large so what i did was:

  1. re-sync flexscoring branch to trunk
  2. commit your patch as is (i did a tiny tweak for LUCENE-3299)

I saw a couple things we should address (full review will really mean i have to take quite a bit of time for each model!) But we can take care of some of this easy stuff first!

if you want, you can do these things on this issue or open separate issues, whichever is easiest. but i think looking at smaller patches at this point will make iteration easier!

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Fixed two of the issues you mentioned:

I have not yet moved the NoNormalization and NoAfterEffect classes to their own files, because I feel a bit uncomfortable about the naming, since it's different from that of the other classes, e.g. NormalizationH2 vs NoNormalization.
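To illustrate the two kinds of component being named here, the sketch below places a do-nothing normalization next to the standard DFR H2 normalization, tfn = tf * log2(1 + c * avgdl / dl). The interface and class names are hypothetical and do not reflect how the patch structures its classes.

```java
/** Hypothetical DFR term-frequency normalization component (illustrative name). */
interface NormalizationSketch {
    double tfn(double tf, double docLen, double avgDocLen);
}

final class DfrNormalizationSketch {
    /** Base-2 logarithm helper. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** The "no normalization" case: the raw term frequency is passed through unchanged. */
    static final NormalizationSketch NONE = (tf, docLen, avgDocLen) -> tf;

    /** H2 normalization with free parameter c: tf * log2(1 + c * avgDocLen / docLen). */
    static NormalizationSketch h2(double c) {
        return (tf, docLen, avgDocLen) -> tf * log2(1.0 + c * avgDocLen / docLen);
    }
}
```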

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks David: i committed this.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

I think I realized what I wanted with numberOfFieldTokens. I was afraid that sumTotalTermFreq is affected by norms / index time boost / etc, and I wanted to make numberOfFieldTokens unaffected by those (I don't know how right now); I just forgot to do so.

But if sumTotalTermFreq is really just the number of tokens in the field, I will delete one of them. Not sure which, because for me numberOfFieldTokens seems a more descriptive name than sumTotalTermFreq, but the latter is used everywhere in Lucene. May I ask your opinion on this question?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> Not sure which, because for me numberOfFieldTokens seems a more descriptive name than sumTotalTermFreq

I think I agree with you: in the context of stats for scoring this might be the way to go, as it's easier to understand.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Added norm decoding table to EasySimilarity, and removed sumTotalFreq. Sorry I could only upload this patch now; I didn't have time to work on Lucene last week.

As I see, all the problems you mentioned have been corrected, so maybe we can go on with the review?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi David, i was thinking for the norm, we could store it like DefaultSimilarity. this would make it especially convenient, as you could easily use these similarities with the same exact index as one using Lucene's default scoring. Also I think (not sure!) by using 1/sqrt we will get better quantization from smallfloat?

  public byte computeNorm(FieldInvertState state) {
    final int numTerms;
    if (discountOverlaps)
      numTerms = state.getLength() - state.getNumOverlap();
    else
      numTerms = state.getLength();
    return encodeNormValue(state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms))));
  }

for computations, you have to 'undo' the sqrt() to get the quantized length, but that's ok since it's only done up-front a single time and tableized, so it won't slow anything down.
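The "undo the sqrt and tableize" step might look roughly like the sketch below: every possible norm byte is decoded once, and 1 / norm² recovers the quantized length. `NormDecoderSketch` is a hypothetical stand-in for whatever byte-to-float decoding the index uses, and the sketch assumes an index-time boost of 1.

```java
/** Hypothetical stand-in for the index's byte-to-float norm decoding. */
interface NormDecoderSketch {
    float decode(int normByte);
}

final class NormTableSketch {
    /**
     * Precomputes the document length for each of the 256 possible norm bytes.
     * The norm is assumed to be 1 / sqrt(length), so length = 1 / (norm * norm).
     * A decoded norm of 0 yields positive infinity.
     */
    static float[] buildLengthTable(NormDecoderSketch decoder) {
        float[] lengths = new float[256];
        for (int i = 0; i < 256; i++) {
            float norm = decoder.decode(i);
            lengths[i] = 1.0f / (norm * norm);
        }
        return lengths;
    }
}
```

At scoring time the length is then a single array lookup indexed by the norm byte (masked to the range 0 to 255), so the square is never recomputed per document.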

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

EasySimilarity now computes norms in the same way as DefaultSimilarity.

Actually not exactly the same way, as I have not yet added the discountOverlaps property. I think it would be a good idea for EasySimilarity as well (it is for phrases, right?). What do you reckon?

I also wrote a quick test to see which norm (length directly or 1/sqrt) is closer to the original value and it seems that the direct one is usually much closer (RMSE is 0.09689688608375747 vs 0.23787634482532286). Of course, I know it is much more important that the new Similarities can use existing indices.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Deleted the accidentally forgotten abstract modifier from the Distribution classes.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Removed reflection from IBSimilarity.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks, I committed your latest patch, some ideas just perusing:

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Done. Actually, I wanted to implement the norm table in the way you said, but somehow forgot about it.

Two questions remain on my side:

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Added a short explanation on the parameter for the Jelinek-Mercer method.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Added discountOverlaps to EasySimilarity.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Got rid of all but one nocommits.

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks David: I committed this.

asfimport commented 13 years ago

David Mark Nemeskey (migrated from JIRA)

Robert: Since we use #4430 for testing & bug fixing, I propose we close this issue. If we decide to implement other methods as well, we can do it under a new issue. Or do you have something else in mind (such as to rename EasySimilarity to SimilarityBase)?

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

+1, I do think we should consider naming and stuff (I sorta like SimilarityBase but we can discuss it)... but we should just open separate issues for that after we have worked out all the technical details first. It's easy to refactor naming.

And we also want to at the same time move it into src/java, we can open a separate issue for all of this "integrate new similarities" or something. Let's close this one!

asfimport commented 13 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks David! Awesome work :)

asfimport commented 12 years ago

Fis Ka (migrated from JIRA)

Hi All,

pardon my ignorance, I'm new to this. I need to use BM25 in my current project (bachelor thesis), and I'm using Lucene 3.0.2. Can you tell me what I need to do to add BM25 to my project? Do I get a jar, or do I need to compile everything on my own? Furthermore, do I need to re-index my sources in order to have BM25 working?

best,

fiska