Doron Cohen (migrated from JIRA)
Attached patch adds explanations of how the current scoring function is derived from VSM. I hope it is not too "colorful", and that it indeed clarifies rather than confuses. I marked it for 2.9 because, if others like it, I think it would be good to include it; but if there is no consensus on this modification, let's move it to a future release.
Doron Cohen (migrated from JIRA)
just relates to, not blocking...
Doron Cohen (migrated from JIRA)
The suggested Similarity javadoc can be reviewed here: http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html
Mark Miller (@markrmiller) (migrated from JIRA)
Looks great!
Document Euclidean norm |V(d)| is excluded by Lucene, for indexing performance considerations (?).
Hmm - I'm not sure if that is right either. Are we not replacing the |V(d)| normalization factor with our document length factor?
That's how it appears to me anyway -
For |V(d)| you have many options, right?
The cosine normalization (your standard Euclidean length, |V(d)|), or none (e.g. 1), or pivoted normalized doc length, or SweetSpotSimilarity's formula, or the quick, dirty, easy, not-great doc length normalization that Lucene uses by default, or Okapi's formula, or ...
So we are replacing (which everyone generally does) not dropping right?
And I don't think we are replacing for performance reasons (though it is complicated to calculate) - we are replacing because it's not a great normalization factor. Using |V(d)| eliminates info on the length of the original document - but longer documents will still have higher tf's and more distinct terms - so it unnaturally gives them an advantage (most long docs will be repeated pieces or cover multiple topics). So it's not generally a good normalization factor, and we have chosen another (the one we have chosen isn't great either - long docs are punished too much and short ones preferred too much).
Again, I'm not an IR guy, but that's my modest take.
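To make the "replacing, not dropping" point concrete, here is a rough sketch of where the factor sits (my own simplified notation; field boosts, coord(), and the queryNorm details are glossed over):

```latex
% Classic VSM cosine similarity:
\[ \mathrm{cosine}(q,d) = \frac{V(q)\cdot V(d)}{|V(q)|\,|V(d)|} \]
% What is being discussed above: keep the dot product and the query
% normalization, but use some doc-length normalization in place of 1/|V(d)|:
\[ \mathrm{score}(q,d) \approx \frac{V(q)\cdot V(d)}{|V(q)|} \cdot \mathrm{doc\_len\_norm}(d) \]
```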
Shai Erera (@shaie) (migrated from JIRA)
This looks great! Really helpful. A few comments:
Doron Cohen (migrated from JIRA)
Mark and Shai, thanks for reviewing!
Mark, I think you have a point here (and I am definitely no more an IR guy than you are :)).
Truth is, I was surprised to find out (through your comments in LUCENE-1896) that this component of the score is "missing", and I indeed thought that the "right thing to do" (if there is such a thing as "right") really is to do both: normalize to the unit vector, and then normalize by length to compensate for the "unfair" advantage of long documents.
But you're right, and the way I presented V(d) normalization and doc-length normalization is incorrect, as if it is the right thing to do both, and that presentation does not do justice to Lucene. I will change the writing.
Interestingly, for a document containing N distinct terms, the 1/Euclidean-norm and Lucene's default similarity's length norm are the same: 1/sqrt(N). But if you double that doc to have two occurrences of each of the N distinct terms, its length would be 2N, 1/Euclidean-norm would be 1/sqrt(4N) but Lucene's default similarity's length norm would be 1/sqrt(2N). So we will punish documents with duplicate terms less than would the Euclidean norm...
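Spelling that arithmetic out (assuming raw tf weights in V(d), and DefaultSimilarity's 1/sqrt(length) lengthNorm):

```latex
% doc1: N distinct terms, each with tf = 1
\[ |V(d_1)| = \sqrt{\textstyle\sum_{i=1}^{N} 1^2} = \sqrt{N}
   \quad\Rightarrow\quad \tfrac{1}{|V(d_1)|} = \tfrac{1}{\sqrt{N}} = \mathrm{lengthNorm}(d_1) \]
% doc2: the same N terms, each with tf = 2 (length 2N)
\[ |V(d_2)| = \sqrt{\textstyle\sum_{i=1}^{N} 2^2} = \sqrt{4N} = 2\sqrt{N}
   \qquad\mbox{vs.}\qquad \mathrm{lengthNorm}(d_2) = \tfrac{1}{\sqrt{2N}} \]
% Since 1/sqrt(2N) > 1/(2*sqrt(N)), the duplicated document is indeed
% punished less by the default lengthNorm than by the Euclidean norm.
```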
I am not aware of an evaluation or discussion of this - I mean, of why this approach was selected - and so I assumed (with a question mark) that it was merely for performance considerations. You said in LUCENE-1896:
not just similar properties - but many times better properties - the standard normalization would not factor in document length at all - it essentially removes it.
Is it really better? It seems to "punish" the same for length due to distinct terms, and to punish less for length due to duplicate terms. Is this really a desired behavior? My intuition says no, but I am not sure.
Anyhow, this issue is more about describing what Lucene does today than about what Lucene should do, so I think I have the correct picture now (except for the historical justification, which is interesting but not a show stopper).
Shai, thanks for the fixes.
(updated patch to follow).
Mark Miller (@markrmiller) (migrated from JIRA)
Is it really better? It seems to "punish" the same for length due to distinct terms, and to punish less for length due to duplicate terms. Is this really a desired behavior? My intuition says no, but I am not sure.
It's only desired behavior if you have a corpus that favors it, but most do. I think you can think of |V(d)| as taking out information about the document length - you start with an m-space vector, with each term given a weight depending on how many times it occurs - in other words, there is information about the length of that document there, and when you normalize by |V(d)|, you take out that information - but a document will appear more similar the more unique terms it started with and the higher the tf's. So that method favors long docs, which will naturally have more of both, though of course they will not always be more similar.
All of the normalizations I have seen are in the vein of |V(d)| - e.g. 1/sqrt(something). All of the others also try to make up for this doc length problem - by adjusting the curve so that ultra-long docs are not favored too highly. Our default method is a known not-very-good one - but it's also very fast and efficient compared to the better ones. SweetSpot is much better and I think still efficient - but you need to tune the right params, I believe.
I'm still a little confused I guess :) What I meant is that longer docs will naturally have larger weights, but larger weights actually hurt in the cosine normalization - so it actually over-punishes, I guess? I dunno - all of this over-punish/under-punish is in comparison to a relevancy curve they figure out (a probability of relevance as a function of document length), and to how the pivoted cosine curves compare against it. I'm just reading across random interweb PDFs from Google. Apparently our pivot also over-punishes large docs and over-favors small ones, the same as this one, but perhaps not as badly? I'm seeing that in a Lucene/Juru research PDF. This stuff is hard to grok on first pass.
Doron Cohen (migrated from JIRA)
Attached fixes according to comments by Shai and Mark:
For convenient review of the patch see http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html
Doron Cohen (migrated from JIRA)
I'm still a little confused I guess
That makes two of us... :)
What I meant is that longer docs will naturally have larger weights, but larger weights actually hurt in the cosine normalization - so it actually over-punishes, I guess? I dunno - all of this over-punish/under-punish is in comparison to a relevancy curve they figure out (a probability of relevance as a function of document length), and to how the pivoted cosine curves compare against it. I'm just reading across random interweb PDFs from Google. Apparently our pivot also over-punishes large docs and over-favors small ones, the same as this one, but perhaps not as badly? I'm seeing that in a Lucene/Juru research PDF. This stuff is hard to grok on first pass.
In that work we got best results from Lucene (for TREC) with SweetSpot similarity and normalizing tf by average term-freq in doc.
For me it was mainly experimental and intuitive, but I was not able to convince Hoss (or even convince myself, once I read Hoss' comments) that this was justified theoretically.
At that time I was not aware of the V(d) normalization subtlety we are discussing now. I think I understand things better now, and still I am not sure. Need to read http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html and sleep on it...
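For reference, this is roughly the pivoted scheme that chapter describes (my paraphrase - the exact symbols there may differ): instead of dividing by |V(d)|, divide by a linear blend of |V(d)| and a collection-wide pivot value:

```latex
\[ \mathrm{piv}(d) = (1 - s)\cdot p \;+\; s\cdot |V(d)|, \qquad 0 < s < 1 \]
% p (the "pivot") is typically the average of |V(d)| over the collection, and
% 1/piv(d) replaces 1/|V(d)| in the score. With s = 1 this degenerates to the
% plain cosine normalization; smaller s punishes long documents less.
```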
Mark Miller (@markrmiller) (migrated from JIRA)
In that work we got best results from Lucene
That's funny - I didn't notice you were an author on that one! Small world.
The original idea of why the cosine normalization for the doc vector is bad I got from the free intro to information retrieval book that's out there - but what it says doesn't fully jibe with the info I am finding elsewhere, or with my own common sense. That's what has me most confused at the moment - the intro to IR book appears to break it down so that you can explain it with the math (why going into the unit vector space favors longer docs) - but other work I am seeing says the math tells you no such thing, and it's just comparing it to the computed relevancy curve that tells you it's not great.
I dunno :) - though at least I know a lot more than I did a few days ago - it never even occurred to me how the scoring we did equated to any kind of dot product before this - I used to read Lucene's scoring algorithm and then look at the code and it was like .... okay - sure ... - so I've come a long way.
Doron Cohen (migrated from JIRA)
The intro to IR book appears to break it down so that you can explain it with the math (why going into the unit vector space favors longer docs) - but other work I am seeing says the math tells you no such thing, and it's just comparing it to the computed relevancy curve that tells you it's not great.
To my (current) understanding it goes like this: normalizing all V(d)'s to the unit vector loses all information about the lengths of documents. For a large document made by duplicating a smaller one this is probably OK. For a large document which just contains lots of "unique" text this is probably wrong. To solve this, a different normalization is sometimes preferred, one that would not normalize V(d) to the unit vector. (Very much in line with what you (Mark) wrote above - finally I understand this...)
The pivoted length normalization which you mentioned is one such different normalization. Juru in fact uses this document length normalization. In our TREC experiments with Lucene we tried this approach (we modified Lucene indexing such that all required components were indexed as stored/cached fields, and at search time we could try various scoring algorithms). It is interesting that pivoted length normalization did not work well - in our experiments - with Lucene for TREC.
The document length normalization of Lucene's DefaultSimilarity (DS) now seems to me - intuitively - not so good, at least for the previously mentioned two edge cases, where doc1 is made of N distinct terms, and doc2 is made of the same N distinct terms but its length is 2N because each term appears twice. For doc1, DS will normalize to the unit vector, the same as EN (the Euclidean norm); for doc2, DS will normalize to a vector larger than the unit vector. However I think the desired behavior is the other way around - for doc2 to be the same as EN, and for doc1 to be normalized to a vector larger than the unit vector.
Back to the documentation patch: again, it seems wrong to present it as if both EN and some additional doc length normalization are required - fixed patch to follow...
Doron Cohen (migrated from JIRA)
New patch is more accurate about doc-length-normalization choices.
For easy reviewing I updated http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html accordingly.
I plan to commit this in a day or two if there are no objections.
Ted Dunning (@tdunning) (migrated from JIRA)
I apologize for complaining here without actually editing this text to illustrate what I would rather see, but the new text seems to say things like "the scoring function is like this (formula) except that it isn't because it is really like this (other-formula) but that isn't really right either because it is like this (still-another-formula) which actually isn't right because of fields and <mumble>".
There are also many small errors, such as claiming that tf is proportional to term frequency and idf is proportional to the inverse of document frequency. Proportional means that there is a linear relationship, which is definitely not the case here. It would be better to say that tf usually increases with increasing term frequency, although occasionally a constant might be used. IDF, on the other hand, decreases with increasing document frequency.
Doron Cohen (migrated from JIRA)
Thanks for reviewing this Ted.
the new text seems to say things like "the scoring function is like this (formula) except that it isn't because it is really like this (other-formula) but that isn't really right either because it is like this (still-another-formula) which actually isn't right because of fields and <mumble>".
I see what you mean.
I tried to take the reader of this text from VSM to the actual elements computed and aggregated in Lucene's scoring code. This would also answer a question asked several times on the lists - "but what is the scoring model of Lucene?" - since it is not that straightforward to tell why a certain method is called during scoring.
But I think you have a good point - the reader is told "this is the scoring formula" just to discover 20 lines ahead that in fact "that is the formula" and yet again the same thing in another paragraph.
I think all 3 formulas are required; just the gluing text should improve. Might have helped to have better English than mine for this, but I'll give it a try - I think I know how to write it better in this sense.
There are also many small errors, such as claiming that tf is proportional to term frequency and idf is proportional to the inverse of document frequency. Proportional means that there is a linear relationship, which is definitely not the case here. It would be better to say that tf usually increases with increasing term frequency, although occasionally a constant might be used. IDF, on the other hand, decreases with increasing document frequency.
I agree. "Proportional" is wrong - thanks for catching this! In fact it appears wrongly in two other places in Similarity - idf() and idfExplain(). In these two other places I think replacing it with "related" would be correct, i.e. like this:
Note that Searcher.maxDoc() is used instead of
org.apache.lucene.index.IndexReader.numDocs()
because it is related to Searcher.docFreq(Term),
i.e., when one is inaccurate, so is the other, and
in the same direction.
For tf and idf I think this will do (?):
Tf and Idf are described in more detail below, but for now, for completeness,
let's just say that for a given term t and document (or query) x,
Tf(t,x) is related to the number of occurrences of term t in x - when one
increases so does the other - and idf(t) is similarly related to the inverse
of the number of index documents containing term t.
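As a concrete illustration of "related, but not proportional", here is roughly what DefaultSimilarity computes for these two factors (paraphrased from memory of the 2.9-era code - a sketch, not the patch text):

```java
// Sketch of DefaultSimilarity-style tf/idf, only to illustrate why a
// "related"/"varies" wording fits and "proportional" does not.
public class TfIdfSketch {

    // Grows with the raw term frequency, but sub-linearly (sqrt):
    // monotone increasing, yet clearly not a linear relationship.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Decreases as docFreq grows, logarithmically - again not linear.
    // numDocs plays the role of Searcher.maxDoc() in the javadoc text above.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // Quadrupling freq only doubles tf: prints 1.0, 2.0, 4.0
        System.out.println(tf(1) + " " + tf(4) + " " + tf(16));
        // Doubling docFreq does not halve idf:
        System.out.println(idf(10, 1000000) + " " + idf(20, 1000000));
    }
}
```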
Ted Dunning (@tdunning) (migrated from JIRA)
Thanks for reviewing this Ted. ... Might have helped to have better English than mine for this
No problem reviewing this. Thanks to you for doing the hard part of actually writing it. Being a critic is easy in comparison.
And your English is doing fine. If you get the ideas down, a hundred people can chime in with grammatical and spelling fixes. And frankly, your English is so much better than any of my other languages that I would never be brave enough to complain.
I think all 3 formulas are required, just the gluing text should improve. ... I think I know how to write it better in this sense.
Good point. This is the essence of my grumpiness.
Note that Searcher.maxDoc() is used instead of org.apache.lucene.index.IndexReader.numDocs() because it is related to Searcher.docFreq(Term), i.e., when one is inaccurate, so is the other, and in the same direction.
Actually, in this case, I think that the motivation is that maxDoc is very commonly exactly correct and typically vastly more efficient. As you say, when it is wrong, docFreq can also be wrong the same way. My suggestion would be this:
Note that Searcher.maxDoc() is used instead of org.apache.lucene.index.IndexReader.numDocs() because it is more efficient to compute and is pretty much correct except for when many documents have been deleted. In any case Searcher.docFreq(Term) is likely to have a similar problem.
Regarding the proportional/related issue, I think that your language is fine. At most, I would suggest "varies with" instead of "related" because it is slightly stronger, but you make the relationship abundantly clear in your text.
Doron Cohen (migrated from JIRA)
Thanks Ted (for both your comments and kindness).
I modified it to "varies" as you suggested, and added the note about the efficient computation of maxDoc(), but omitted the "pretty much correct except" part because this might not be correct for certain applications.
Patch to follow...
Doron Cohen (migrated from JIRA)
Updated patch according to comments by Ted.
To review without applying the patch see http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html
Jiri Kuhn (migrated from JIRA)
The last paragraph starts with:
However the resulting norm value is encoded as a single byte before being stored.
Maybe this is an ideal time to document reasons why norms are encoded as bytes.
Doron Cohen (migrated from JIRA)
However the resulting norm value is encoded as a single byte before being stored.
Maybe this is an ideal time to document reasons why norms are encoded as bytes.
Perhaps something like: "Once a field is referenced at search time, its norms - for all documents - are maintained in memory, hence it is important to keep the norms footprint low." ...?
Jiri Kuhn (migrated from JIRA)
Yes, sounds great.
The consequence of encoding norms as bytes is that norms can't be used to fine-tune a computed document score, as I tried to (and failed). I was forced to use a function query to achieve my goals.
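For what it's worth, a sketch of the kind of function-query workaround this could look like, against the 2.9-era org.apache.lucene.search.function package (class and method names from memory, and the "quality" field is hypothetical - check against your Lucene version):

```java
import org.apache.lucene.search.Query;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

// Fine-tune per-document scores with full float precision,
// instead of relying on the single-byte norms.
public class FunctionQueryScoring {
    static Query boostByQuality(Query textQuery) {
        // "quality" would be an indexed field holding a per-document float boost.
        FieldScoreQuery quality =
            new FieldScoreQuery("quality", FieldScoreQuery.Type.FLOAT);
        return new CustomScoreQuery(textQuery, quality) {
            @Override
            public float customScore(int doc, float subQueryScore, float valSrcScore) {
                return subQueryScore * valSrcScore;
            }
        };
    }
}
```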
Marvin Humphrey (migrated from JIRA)
The rationale behind the coarseness of the norms is that since the accuracy of search engines in retrieving the documents that the user really wants is so poor, only big differences matter. (It's not just poor "recall" against a given query, but the difficulty that the user experiences in formulating a proper query to express what they're really looking for in the first place.)
Doug wrote at least once about this some years back, but I haven't been able to track down the post.
Doron Cohen (migrated from JIRA)
The rationale behind the coarseness of the norms is that since the accuracy of search engines in retrieving the documents that the user really wants is so poor, only big differences matter. (It's not just poor "recall" against a given query, but the difficulty that the user experiences in formulating a proper query to express what they're really looking for in the first place.)
Doug wrote at least once about this some years back, but I haven't been able to track down the post.
Thanks! I too failed to find that post.
I like the part about users' difficulty in expressing their information need in the query.
So I am updating like this:
However the resulting norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.
Compression of norm values to a single byte saves memory at search time, because once a field is referenced at search time, its norms - for all documents - are maintained in memory.
The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) with which users express their true information need in a query, only big differences matter.
Last, note that search time is too late to modify this norm part of scoring, e.g. by using a different Similarity for search.
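A tiny demo of that precision loss, assuming the 2.9-era static Similarity.encodeNorm/decodeNorm helpers (if your version names them differently, the point is just the lossy byte round-trip):

```java
import org.apache.lucene.search.Similarity;

// Demonstrates the single-byte norm encoding described in the text above.
public class NormPrecisionDemo {
    public static void main(String[] args) {
        float norm = 0.89f;
        byte encoded = Similarity.encodeNorm(norm);     // what gets stored: one byte per doc per field
        float decoded = Similarity.decodeNorm(encoded); // what search time sees
        // Per the paragraph above, this prints "0.89 -> 0.75":
        System.out.println(norm + " -> " + decoded);
    }
}
```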
Doron Cohen (migrated from JIRA)
Updated patch according to the comments on the norms section.
To review without applying the patch see http://people.apache.org/~doronc/sim-docs/org/apache/lucene/search/Similarity.html
Doron Cohen (migrated from JIRA)
Committed. Thanks Mark, Shai, Ted, Jiri, and Marvin!
See discussion in the related issue.
Migrated from LUCENE-1908 by Doron Cohen, resolved Sep 15 2009.
Attachments: LUCENE-1908.patch (versions: 5).
Linked issues: 2971