Closed asfimport closed 13 years ago
David Mark Nemeskey (migrated from JIRA)
EasyStats object added.
Robert Muir (@rmuir) (migrated from JIRA)
a few comments (it generally looks close to me):
David Mark Nemeskey (migrated from JIRA)
* I was wondering about that too – actually docNo is a mistake, it should have been noDocs or noOfDocs anyway, but I guess I'll just go with numberOfDocuments.
Robert Muir (@rmuir) (migrated from JIRA)
I'll put a nocommit there for the time being, and if no sims use it, I'll just remove it from the Stats. Terrier has it, though, so I guess there should be at least one method that depends on it.
I've never seen one that did... I don't imagine us ever implementing this efficiently given that we support incremental indexing.
Robert Muir (@rmuir) (migrated from JIRA)
oh two more nitpicky comments:
* @author? For legal reasons (I think actually for your protection!) we omit these from new files.
* @lucene.experimental also for new classes: this is a template that 'ant-javadocs' replaces with "WARNING: This API is experimental and might change in incompatible ways in the next release." to tell users that it's very new and not to expect precise backwards compatibility.
David Mark Nemeskey (migrated from JIRA)
Oh, sorry, how lame of me :( Actually I am working now on a different machine than the one I usually do, so that's why I made those mistakes. Anyhow, I have fixed them.
Robert Muir (@rmuir) (migrated from JIRA)
one last thing, can we do 'numberOfFieldTokens' instead of noFieldTokens?
then I think we can commit this as a step, should make things a lot easier for experimentation, if you are new to lucene it will make life much easier.
David Mark Nemeskey (migrated from JIRA)
Done.
David Mark Nemeskey (migrated from JIRA)
EasySimilarity added. Lots of questions and nocommit in the code.
Robert Muir (@rmuir) (migrated from JIRA)
Just took a look, a few things that might help:
yes the maxDoc does not reflect deletions, but neither do things like totalTermFreq or docFreq... so it's best not to worry about deletions in the scoring and to be consistent and use the stats (e.g. maxDoc, not numDocs) that do not take deletions into account.
for the computeStats(TermContext... termContexts) it's weird to sum the DF across the different terms in this case? But I don't honestly have any suggestions here... maybe in this case we should make an EasyPhraseStats that computes the EasyStats for each term, so it's not hiding anything or limiting anyone? and you could then do an instanceof check and have a different method like scorePhrase() that it forwards to in case it's an EasyPhraseStats? In general I'm not sure how other ranking systems tend to handle this case; the phrase estimation for IDF in Lucene's formula is done by summing the IDFs.
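To make the difference concrete, here is a minimal sketch comparing the two options: summing the per-term IDFs versus summing the document frequencies first and taking one IDF of the result. The class and method names are hypothetical, and the IDF formula (1 + ln(numDocs / (docFreq + 1))) is assumed to match the classic default scoring; treat it as an illustration, not the actual patch code.

```java
// Hypothetical sketch: two ways of deriving a phrase-level IDF from per-term stats.
final class PhraseIdfSketch {

    /** Assumed classic-style IDF: 1 + ln(numDocs / (docFreq + 1)). */
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    /** Option 1: compute an IDF per term and sum them. */
    static double summedIdf(long[] docFreqs, long numDocs) {
        double sum = 0.0;
        for (long df : docFreqs) {
            sum += idf(df, numDocs);
        }
        return sum;
    }

    /** Option 2: sum the document frequencies, then take a single IDF. */
    static double idfOfSummedDocFreq(long[] docFreqs, long numDocs) {
        long total = 0;
        for (long df : docFreqs) {
            total += df;
        }
        return idf(total, numDocs);
    }
}
```

With summed DFs, every extra term in the phrase makes the phrase look more common and lowers its weight, whereas summed IDFs make a longer phrase weigh more. That is one reason keeping the per-term stats visible (the EasyPhraseStats idea) may be the less surprising choice.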
David Mark Nemeskey (migrated from JIRA)
Implementation of the DFR framework added. Lots of nocommits, though. Things to think about:
Also, I think we need that NormConverter we talked about earlier on IRC, so that the Similarities can run on any index.
David Mark Nemeskey (migrated from JIRA)
Made the signature of EasySimilarity.score() a bit saner.
David Mark Nemeskey (migrated from JIRA)
Explanation-handling added to EasySimilarity and DFRSimilarity.
David Mark Nemeskey (migrated from JIRA)
Information-based model framework due to Clinchant and Gaussier added.
David Mark Nemeskey (migrated from JIRA)
Fixed a few things in MockBM25Similarity.
David Mark Nemeskey (migrated from JIRA)
* log2() moved from DFRSimilarity to EasySimilarity,
Robert Muir (@rmuir) (migrated from JIRA)
Hi David: I had some ideas on stats to simplify some of these sims:
so I think this could make for nice simplifications, especially for switching norms completely over to docvalues: we should be able to do #1 immediately right now, change the way we compute avgdoclen for e.g. BM25 and DFR.
then in a separate issue i could revert this norm summation stuff to make the docvalues integration simpler, and open a new issue for sumDocFreq()
David Mark Nemeskey (migrated from JIRA)
* Fixed #1
As for the last one: the implementation is very basic now, I want to factor a few things out (e.g. p(w|C) to LMStats, possibly in a pluggable way so ppl can implement it however they want). It also doesn't seem right to have the same LM method implemented twice (both as MockLMSimilarity and here), so I'll take a look to see if I can merge those two. Finally, I am wondering whether I should implement the absolute discounting method, which, according to the paper, seems inferior to the Jelinek-Mercer and Dirichlet methods. Right now I am more on the "no" side.
David Mark Nemeskey (migrated from JIRA)
Added LMSimilarity so that the two LM methods have a common parent. It also defines the CollectionModel interface which computes p(w|C) in a pluggable way (and only once per query, though I am not sure this improves performance as I need a cast in score()).
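A minimal sketch of the idea, assuming a stats holder with totalTermFreq and numberOfFieldTokens: the collection model is an interface, and the smoothing methods only see the probability it returns. All names here are illustrative rather than the patch's actual API, the add-one estimate is an assumption, and the two smoothing functions are the textbook Jelinek-Mercer and Dirichlet estimates of p(w|d), not the exact scores the Similarity classes produce.

```java
// Hypothetical sketch of a pluggable collection model for LM similarities.
final class LmSketch {

    /** Stand-in for the statistics an LM similarity would need (names assumed). */
    static final class Stats {
        final long totalTermFreq;       // occurrences of the term in the field, collection-wide
        final long numberOfFieldTokens; // total number of tokens in the field, collection-wide

        Stats(long totalTermFreq, long numberOfFieldTokens) {
            this.totalTermFreq = totalTermFreq;
            this.numberOfFieldTokens = numberOfFieldTokens;
        }
    }

    /** Pluggable estimate of p(w|C); computed once per query term. */
    interface CollectionModel {
        double computeProbability(Stats stats);
    }

    /** Assumed default: add-one estimate so unseen terms never get probability 0. */
    static final CollectionModel ADD_ONE =
        s -> (s.totalTermFreq + 1.0) / (s.numberOfFieldTokens + 1.0);

    /** Jelinek-Mercer smoothing: linear interpolation with weight lambda on p(w|C). */
    static double jelinekMercer(double tf, double docLen, double pwc, double lambda) {
        return (1.0 - lambda) * (tf / docLen) + lambda * pwc;
    }

    /** Dirichlet smoothing: mu pseudo-tokens drawn from the collection model. */
    static double dirichlet(double tf, double docLen, double pwc, double mu) {
        return (tf + mu * pwc) / (docLen + mu);
    }
}
```

Because p(w|C) depends only on collection-level stats, it can be computed once when the stats are built and reused for every document, which is the "only once per query" point above.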
David Mark Nemeskey (migrated from JIRA)
Explanation added to LM models; query boost added.
David Mark Nemeskey (migrated from JIRA)
Made the score() and explain() methods in Similarity components final.
Robert Muir (@rmuir) (migrated from JIRA)
Hi David, this is looking really good! The patch is quite large so what i did was:
I saw a couple things we should address (full review will really mean i have to take quite a bit of time for each model!) But we can take care of some of this easy stuff first!
if you want, you can do these things on this issue or open separate issues, whichever is easiest. but i think looking at smaller patches at this point will make iteration easier!
David Mark Nemeskey (migrated from JIRA)
Fixed two of the issues you mentioned:
I have not yet moved the NoNormalization and NoAfterEffect classes to their own files, because I feel a bit uncomfortable about the naming, since it's different from that of the other classes, e.g. NormalizationH2 vs NoNormalization.
Robert Muir (@rmuir) (migrated from JIRA)
Thanks David: i committed this.
David Mark Nemeskey (migrated from JIRA)
I think I realized what I wanted with numberOfFieldTokens. I was afraid that sumTotalTermFreq is affected by norms / index time boost / etc, and I wanted to make numberOfFieldTokens unaffected by those (I don't know now how); only I forgot to do so.
But if sumTotalTermFreq is really just the number of tokens in the field, I will delete one of them. Not sure which, because for me numberOfFieldTokens seems a more descriptive name than sumTotalTermFreq, but the latter is used everywhere in Lucene. May I ask your opinion on this question?
Robert Muir (@rmuir) (migrated from JIRA)
Not sure which, because for me numberOfFieldTokens seems a more descriptive name than sumTotalTermFreq
I think I agree with you: in the context of stats for scoring this might be the way to go, as its easier to understand.
David Mark Nemeskey (migrated from JIRA)
Added norm decoding table to EasySimilarity, and removed sumTotalFreq. Sorry I could only upload this patch now but I didn't have time to work on Lucene the last week.
As I see, all the problems you mentioned have been corrected, so maybe we can go on with the review?
Robert Muir (@rmuir) (migrated from JIRA)
Hi David, i was thinking for the norm, we could store it like DefaultSimilarity. this would make it especially convenient, as you could easily use these similarities with the same exact index as one using Lucene's default scoring. Also I think (not sure!) by using 1/sqrt we will get better quantization from smallfloat?
    public byte computeNorm(FieldInvertState state) {
      final int numTerms;
      if (discountOverlaps)
        numTerms = state.getLength() - state.getNumOverlap();
      else
        numTerms = state.getLength();
      return encodeNormValue(state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms))));
    }
for computations, you have to 'undo' the sqrt() to get the quantized length, but that's ok since it's only done up-front a single time and tableized, so it won't slow anything down.
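A rough sketch of that table, under the assumption that the boost is 1 so the decoded norm is just 1/sqrt(length): decode each of the 256 possible norm bytes once, invert the square root, and look the length up by byte at scoring time. NormDecoder is a hypothetical stand-in for whatever decodes the norm byte to a float; it is not a Lucene interface.

```java
// Hypothetical sketch: precompute the quantized document length for every norm byte.
final class NormTableSketch {

    /** Stand-in for the real byte-to-float norm decoder (assumed, not a Lucene type). */
    interface NormDecoder {
        double decode(int unsignedByte);
    }

    /** Builds the 256-entry table once, up front: length = 1 / norm^2. */
    static float[] buildLengthTable(NormDecoder decoder) {
        float[] lengths = new float[256];
        for (int i = 0; i < 256; i++) {
            double norm = decoder.decode(i);
            // A zero norm yields Infinity here; callers may want to special-case it.
            lengths[i] = (float) (1.0 / (norm * norm));
        }
        return lengths;
    }

    /** Scoring-time lookup: a single array access, no sqrt or division. */
    static float length(float[] table, byte normByte) {
        return table[normByte & 0xFF]; // mask to treat the byte as unsigned
    }
}
```

If an index-time boost other than 1 was applied, the recovered value is length / boost^2 rather than the plain length, so the table only approximates the field length in that case.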
David Mark Nemeskey (migrated from JIRA)
EasySimilarity now computes norms in the same way as DefaultSimilarity.
Actually not exactly the same way, as I have not yet added the discountOverlaps property. I think it would be a good idea for EasySimilarity as well (it is for phrases, right), what do you reckon?
I also wrote a quick test to see which norm (length directly or 1/sqrt) is closer to the original value and it seems that the direct one is usually much closer (RMSE is 0.09689688608375747 vs 0.23787634482532286). Of course, I know it is much more important that the new Similarities can use existing indices.
David Mark Nemeskey (migrated from JIRA)
Deleted the accidentally forgotten abstract modifier from the Distribution classes.
David Mark Nemeskey (migrated from JIRA)
Removed reflection from IBSimilarity.
Robert Muir (@rmuir) (migrated from JIRA)
Thanks, I committed your latest patch, some ideas just perusing:
David Mark Nemeskey (migrated from JIRA)
Done. Actually, I wanted to implement the norm table in the way you said, but somehow forgot about it.
Two questions remain on my side:
David Mark Nemeskey (migrated from JIRA)
Added a short explanation on the parameter for the Jelinek-Mercer method.
David Mark Nemeskey (migrated from JIRA)
Added discountOverlaps to EasySimilarity.
David Mark Nemeskey (migrated from JIRA)
Got rid of all but one nocommit.
Robert Muir (@rmuir) (migrated from JIRA)
Thanks David: I committed this.
David Mark Nemeskey (migrated from JIRA)
Robert: Since we use #4430 for testing & bug fixing, I propose we close this issue. If we decide to implement other methods as well, we can do it under a new issue. Or do you have something else in mind (such as to rename EasySimilarity to SimilarityBase)?
Robert Muir (@rmuir) (migrated from JIRA)
+1, I do think we should consider naming and stuff (I sorta like SimilarityBase but we can discuss it)... but we should just open separate issues for that after we have worked out all the technical details first; it's easy to refactor naming.
And we also want to at the same time move it into src/java, we can open a separate issue for all of this "integrate new similarities" or something. Let's close this one!
Robert Muir (@rmuir) (migrated from JIRA)
Thanks David! Awesome work :)
Fis Ka (migrated from JIRA)
Hi All,
pardon my ignorance, I'm new to this. What I need is to use BM25 in my current project (bachelor thesis); I'm using Lucene 3.0.2. Can you tell me what I need to do to add BM25 to my project? Do I get a jar, or do I need to compile everything on my own? Furthermore, do I need to re-index my sources in order to have BM25 working?
best,
fiska
With #4247 done, we can finally work on implementing the standard ranking models. Currently DFR, BM25 and LM are on the menu.
Done:
* EasyStats: contains all statistics that might be relevant for a ranking algorithm
* EasySimilarity: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible

Migrated from LUCENE-3220 by David Mark Nemeskey, resolved Aug 12 2011
Parent: #4033
Attachments: LUCENE-3220.patch (versions: 24)
Linked issues: #4430
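
For readers who have not seen the patch, the division of labour between EasyStats and EasySimilarity described in this issue can be sketched roughly as follows. This is a simplified illustration with hypothetical names and a toy scoring formula, not the committed code: the stats object carries collection-level numbers, and a concrete model only implements one score() method.

```java
// Hypothetical sketch of the stats holder plus "implement one method" base class.
final class EasyStatsSketch {
    final long numberOfDocuments;   // documents in the index (maxDoc-style, ignores deletions)
    final long numberOfFieldTokens; // total tokens in the field
    final long docFreq;             // documents containing the term
    final long totalTermFreq;       // occurrences of the term in the field

    EasyStatsSketch(long numberOfDocuments, long numberOfFieldTokens,
                    long docFreq, long totalTermFreq) {
        this.numberOfDocuments = numberOfDocuments;
        this.numberOfFieldTokens = numberOfFieldTokens;
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
    }

    /** Average field length derived from the token and document counts. */
    double avgFieldLength() {
        return (double) numberOfFieldTokens / (double) numberOfDocuments;
    }
}

abstract class EasySimilaritySketch {
    /** The only method a concrete ranking model has to provide. */
    protected abstract double score(EasyStatsSketch stats, double freq, double docLen);

    /** Base-2 logarithm, shared by the models. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }
}

/** Toy model (not DFR, BM25 or LM): length-normalized tf times a log-scaled rarity. */
final class ToyModelSketch extends EasySimilaritySketch {
    @Override
    protected double score(EasyStatsSketch stats, double freq, double docLen) {
        double normalizedTf = freq / (freq + docLen / stats.avgFieldLength());
        double rarity = log2(1.0 + (double) stats.numberOfDocuments / (double) stats.docFreq);
        return normalizedTf * rarity;
    }
}
```

The point of the design is that everything index-specific (scorers, norm decoding, stats collection) stays in the base class, so experimenting with a new model means writing a single formula.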