apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Implement a state-of-the-art retrieval function in Lucene [LUCENE-965] #2040

Closed. asfimport closed this issue 12 years ago.

asfimport commented 17 years ago

We implemented the axiomatic retrieval function, a state-of-the-art retrieval function, to replace the default similarity function in Lucene. We compared the performance of the two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. The report shows that the axiomatic retrieval function performs much better than the default function: it finds more relevant documents, and users see more relevant documents among the top-ranked results. Incorporating such a state-of-the-art retrieval function could improve the search performance of all applications built on Lucene.

Most changes related to the implementation are in AXSimilarity, TermScorer, and TermQuery.java. However, many test cases are hand-coded to check that the implementation of the default function is correct, so I also had to modify many test files to make the new retrieval function pass those cases. In fact, we found that some old test cases are not reasonable. For example, in testQueries02 of TestBoolean2.java, the query is "+w3 xx" and we have two documents, "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3". The second document should be more relevant than the first, because it has more occurrences of the query term "w3". But the original test case requires us to rank the first document higher than the second, which is not reasonable.
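For reference, here is a minimal sketch of F2-EXP, the axiomatic function from Fang and Zhai's SIGIR 2005 work that this patch appears to implement; the exact variant and the parameter values used by the patch are assumptions here, not taken from the patch itself:

  // Sketch of the F2-EXP axiomatic function (Fang & Zhai, SIGIR 2005).
  // Whether the patch uses exactly these parameter values is an assumption.
  class F2Exp {
    static final float S = 0.5f;   // length-normalization parameter, a typical default
    static final float K = 0.35f;  // idf exponent fixed by F2-EXP

    // Per-term contribution; a document's score is the sum over query terms.
    static float termScore(float tfInQuery, float tfInDoc, int docFreq,
                           int numDocs, float docLen, float avgDocLen) {
      float idf = (float) Math.pow((double) numDocs / docFreq, K);
      return tfInQuery * idf * tfInDoc / (tfInDoc + S + S * docLen / avgDocLen);
    }
  }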


Migrated from LUCENE-965 by Hui Fang, 2 votes, resolved Dec 06 2011 Attachments: axiomaticFunction.patch

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

What do people make of this? Interesting claims. I haven't looked at the patch yet or read up on the axiomatic retrieval model, but the precision numbers in the report are impressive. I think it dovetails nicely with Doron and Chris's discussions on retrieval performance and making better efforts to gauge Lucene's retrieval effectiveness. These numbers are for TREC, and that doesn't always correlate to the real world, but still, they're not to be discounted either.

I think it would be cool to see a couple of things come out of this (at least):

  1. contrib/benchmark algorithms that can be applied before and after, including #1911. This would give everyone a way of evaluating easily (assuming they have TREC data). I would wait for 836 to be committed, though, as it is not final yet.
  2. Search speed numbers comparing the two approaches. That is, if it is significantly slower, then it probably isn't going to be the default way of doing things.

My gut reaction, if everything checks out of course, would be to see how to factor it in as a separate querying mechanism, if possible like the Spans functionality, to give people the option of using it; then, if the claims hold up in the wild and feedback is positive, we could look at making it the default approach. Not sure how feasible this is, though.

In the meantime, looks like I've got some reading to do...

Cheers, Grant

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Thanks for contributing this, Hui Fang! Very interesting. I agree with Grant that we should be able to evaluate this in the context of #1911 - I hope to finalize it pretty soon. I looked into the patch and read the short paper referenced, and I have a few comments:

1) Interestingly, this too makes use of the average document length, as discussed in http://www.gossamer-threads.com/lists/lucene/java-dev/50537
2) The current patch seems outdated compared to trunk and also contains many changes that are not part of the proposed improvement. You need to run "svn update" to sync with trunk (but do "svn stat -u" beforehand to see what is going to be updated and check that there are no conflicts, and back up your code first, just in case...).
3) The AXSimilarity class itself is not included in the patch (note that you need to "svn add" new files in order for "svn diff" to include them in the patch).
4) On first reading of the patch, it seems that the average length is computed at search time for each scored term... right? This is good enough for evaluating this Similarity function, but a running solution would need a better-performing method, like the one Hoss suggested in http://www.gossamer-threads.com/lists/lucene/java-dev/5053

asfimport commented 17 years ago

Hui Fang (migrated from JIRA)

Hi Grant and Doron,

Thank you very much for your comments! They are very useful. I agree that it would be interesting to evaluate this in the context of Lucene-836, which is a very nice idea. Actually, my advisor and I also discussed putting some evaluation scripts in Lucene so that others could easily evaluate retrieval performance. I hope that Lucene-836 will be finalized soon; please let me know if there is anything I can help with.

Regarding speed, the axiomatic retrieval function should have the same computational complexity as the default function if we compute the average document length at indexing time instead of search time. As Doron pointed out, my current implementation is not optimal. I will fix this problem and the other svn-related problems as soon as possible and resubmit a new patch.

Thanks, -Hui

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

It does seem like calculating the average field length at index time should be relatively cheap. I haven't seen the Similarity implementation, but the axiomatic TermScorer.score() will be somewhat slower than Lucene's due to the necessary divisions (all but one can be turned into a multiply, I think).
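A hedged illustration of that point, reusing the F2-EXP sketch above: if the caller precomputes s / avgDocLen once per query, only the single unavoidable division remains in each score() call.

  // sOverAvgDL = s / avgDocLen, computed once per query rather than per doc.
  static float termScoreFast(float tfInQuery, float tfInDoc, float idf,
                             float s, float docLen, float sOverAvgDL) {
    return tfInQuery * idf * tfInDoc / (tfInDoc + s + sOverAvgDL * docLen);
  }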

asfimport commented 17 years ago

Charlie Zhao (migrated from JIRA)

Hello Hui:

Thank you for contributing your axiomatic retrieval function to Lucene. I cannot wait to take it for a test drive.

Would you please report the settings for the experiment behind this row of your results:

  Collection   Function         MAP     P5     P10    P20    P100   NumRR
  ROBUST04     Lucene Default   0.048   0.12   0.09   0.08   0.05   21

There are disparities compared with my results, shown below.

  num_q           249
  num_ret         239436
  num_rel         17412
  num_rel_ret     9780
  map             0.2076
  gm_ap           0.1049
  R-prec          0.2551
  bpref           0.2189
  recip_rank      0.5684
  ircl_prn.0.00   0.6288
  ircl_prn.0.10   0.4459
  ircl_prn.0.20   0.3562
  ircl_prn.0.30   0.2864
  ircl_prn.0.40   0.2289
  ircl_prn.0.50   0.1925
  ircl_prn.0.60   0.145
  ircl_prn.0.70   0.1062
  ircl_prn.0.80   0.0702
  ircl_prn.0.90   0.0461
  ircl_prn.1.00   0.0261
  P5              0.3944
  P10             0.3598
  P15             0.3307
  P20             0.307
  P30             0.2657
  P100            0.1618
  P200            0.1117
  P500            0.0635
  P1000           0.0393

Before we go further, let us make sure we are on the same page.

Here is my setting:

Data: TREC Disk 4 & 5; 528,155 documents; 1,904 MB of text.

Query Number: TREC Query Number 301-700

Query Field: <title> only

IR Engine: Lucene 2.0 (needs double-checking, but close)

Note: default Lucene similarity function, using title words only.

If we are on the same page, then a MAP score of 0.048 is terribly low for queries 301-700, whereas mine is 0.2076.

Still, your axiomatic retrieval function outperformed the default in many other respects. So if you would check your experimental setting, and if my result is closer to the real default, then we might discover an even further improvement from the axiomatic retrieval function. That is my hope.

Charlie Zhao

asfimport commented 17 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> It does seem like calculating the average field length at index time should be relatively cheap.

Yes, it should. But if average norm suffices, that can be computed on demand and cached in the Searcher without significantly impacting performance. The norms need to be read anyway, and averaging them adds only a small constant factor to the cost of reading them.

asfimport commented 17 years ago

Charlie Zhao (migrated from JIRA)

Document length and average document length are something of a speed bottleneck in Lucene implementations of some IR models, like the axiomatic retrieval function we just saw and a language model I have extended in Lucene. I say speed, instead of performance, because Lucene's performance measures (in the sense of recall and precision) are relatively low compared with other IR models in my experimental results, and since early Lucene we have never updated the kernel of the similarity algorithm. Do general users value (recall+precision) more than (speed)?

How can we conveniently store and retrieve "field length", "document length", "average document length", etc.? Can they be payload data at the document level and index level, so that we can say goodbye to their overhead at query time?

I used to leverage TermFreqVector's getTermFrequencies() to obtain the field length (size() only returns the number of unique terms). But shall I just invert that field's norm value back to its length as (1/norm)^2, which might be faster? Can someone confirm this?
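A small sketch of the recovery being asked about here, assuming DefaultSimilarity's lengthNorm = 1/sqrt(numTerms) and no boosts; note that the one-byte norm encoding makes the recovered length approximate:

  import org.apache.lucene.search.Similarity;

  class NormToLength {
    // norm = 1/sqrt(len) under the default similarity, so len = (1/norm)^2;
    // e.g. a decoded norm of 0.25f recovers a field length of 16 terms.
    static float approxFieldLength(byte encodedNorm) {
      float norm = Similarity.decodeNorm(encodedNorm);
      return 1f / (norm * norm);
    }
  }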

BTW, I need help understanding the claim of "a small constant factor to the cost of reading them" in Doug's comment. The average norm does not give us the average field length; we need to recover the individual field lengths to get the average field length, which involves a great deal of floating-point operations. Did I miss something?

Can we store the "document length" (with multiple fields) and "average document length" as payload data at the document level and index level respectively? The current payload mechanism is designed at the term level, is that right? If we want to store something at the document and index level, do we necessarily have to change the Lucene file format?

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

> Can we store the "document length" (with multiple fields) and "average document length" as payload data at the document level and index level respectively? The current payload mechanism is designed at the term level, is that right? If we want to store something at the document and index level, do we necessarily have to change the Lucene file format?

You are right: currently we can only store payloads per term occurrence, not at the doc level. However, it is possible to simply add a special term to every document that has only one occurrence carrying a payload; then you have one payload per doc.
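A hedged sketch of that workaround against the 2.x-era TokenStream/Payload API, written from memory rather than taken from any patch; the "_doclen_" and "_stats_" names are made up:

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.index.Payload;

  // Emits exactly one token per document; its single posting carries the
  // doc-level payload.
  class SingleTokenWithPayload extends TokenStream {
    private final byte[] bytes;
    private boolean done = false;

    SingleTokenWithPayload(byte[] bytes) { this.bytes = bytes; }

    public Token next() throws IOException {
      if (done) return null;
      done = true;
      Token t = new Token("_doclen_", 0, 0); // the special term
      t.setPayload(new Payload(bytes));      // doc-level data rides here
      return t;
    }
  }
  // Usage sketch: doc.add(new Field("_stats_", new SingleTokenWithPayload(data)));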

Coincidentally, I am currently testing how search performance would suffer if we stored the norms as payloads in the posting lists. At search time we would then not buffer the norms but read them on demand from the prx file. This is probably somewhat slower than buffering the norms, but it has a lot of advantages, such as much simpler code and less memory consumption by the IndexReader. Since all norms are then stored in a single posting list, I'm hoping the FS cache will keep search performance from suffering too much, and multi-level skipping should help too. Well, let's see; I'm currently building an index with norms as payloads, and I should have some numbers tonight or tomorrow.

asfimport commented 17 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

I guess I would not be in favor of a special term; I would rather see it integrated into the file format somehow. Special terms get deleted, misused, etc. Plus, the avg. doc length is going to need to be updated frequently, right?

Since we are talking about Lucene 3.x fairly soon anyway (assuming the JDK 1.5 vote passes), this would allow us to make the file format change as well, as long as we can still read prior versions.

Charlie, as for your question about what users value in Lucene, speed or recall and precision, the answer is both. :-) Some people care more about speed while others care about p/r. I think most people who use Lucene feel that the results are good enough in production environments and that we don't always have to worry about getting every last bit out of TREC (especially since we can't, as a group, test against TREC). That being said, I would bet most users would be willing to trade off a few percentage points of speed in exchange for the kind of MAP improvements we are talking about here, especially since we can probably figure out a way to make it just as fast eventually, or at least find other things we can speed up.

Correct me if I am wrong, but there are other IR strategies that can use the avg. doc. length too, right? So, not to sidetrack too much, but if we do this right, maybe we can also open the door to other scoring strategies without much downside. Just something to consider.

asfimport commented 17 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> Did I miss something?

What I meant is that the loops this patch adds to compute the average document length per query term could be computed more efficiently once per field in a searcher. They could then be cached in, e.g., a WeakHashMap<Searcher,Map<String,float[]>>.

The cost of computing these is proportional to the size of the norms, which means it is proportional to the cost of reading the norms. Computing them on demand when a searcher is opened would not be as fast as pre-computing them, but it might not be prohibitively slow either, and it would be simple to implement without other changes to Lucene.
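A minimal sketch of that caching idea against the 2.x-era API, from memory; this caches a single per-field average rather than the float[] mentioned above, and it ignores boosts:

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.WeakHashMap;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Searcher;
  import org.apache.lucene.search.Similarity;

  class AvgNormCache {
    // Weak keys let cached entries disappear when a Searcher is discarded.
    private static final Map<Searcher, Map<String, Float>> CACHE =
        new WeakHashMap<Searcher, Map<String, Float>>();

    static synchronized float avgNorm(Searcher searcher, IndexReader reader,
                                      String field) throws IOException {
      Map<String, Float> byField = CACHE.get(searcher);
      if (byField == null) {
        byField = new HashMap<String, Float>();
        CACHE.put(searcher, byField);
      }
      Float avg = byField.get(field);
      if (avg == null) {
        byte[] norms = reader.norms(field); // one encoded norm per document
        float sum = 0f;
        for (int i = 0; i < norms.length; i++) {
          sum += Similarity.decodeNorm(norms[i]);
        }
        avg = norms.length == 0 ? 0f : sum / norms.length;
        byField.put(field, avg);
      }
      return avg;
    }
  }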

By "average norm" I guess I really meant "easily computable from norms". This may not always be possible, since, e.g., with boosting, document lengths may not be recoverable from the norms. But, in many cases, it might suffice.

Does that help?

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

> I guess I would not be in favor of a special term; I would rather see it integrated into the file format somehow. Special terms get deleted, misused, etc.

Well, yes, I would also prefer to have real per-doc payloads in the file format, but until we have that we can use this workaround to try things out, since the performance should be comparable to real per-doc payloads.

asfimport commented 17 years ago

Charlie Zhao (migrated from JIRA)

Regarding the approach used to compute avgDL: the patch averages the stored norm values directly (scoring with norm/avgNorm, as shown in the toy example below).

But may I suggest an alternative?

  // Recover each document's length from its stored norm: under the default
  // similarity norm = 1/sqrt(length), so length = (1/norm)^2. Average the
  // recovered lengths, not the norms themselves.
  float CL = 0.0f;     // collection length: sum of recovered doc lengths
  float avgDL = 0.0f;  // average document length
  float aDL;           // one recovered document length
  for (int i = 0; i < norms.length; i++) {
    aDL = 1.0f / normDecoder[norms[i] & 0xFF]; // decode the byte norm, invert
    aDL *= aDL;                                // square: (1/norm)^2 = length
    CL += aDL;
  }
  avgDL = CL / norms.length;

Let us see a toy example:

2 docs in index

  Doc   |D|   avgD   |D|/avgD   norm   avgNorm   norm/avgNorm
  D1      4     10      2/5      1/2     3/8         4/3
  D2     16     10      8/5      1/4     3/8         2/3

norm/avgNorm is what we get from the patch code, and it ranks D1 > D2.

|D|/avgD is what we get from the suggested alternative code, and it ranks D1 < D2.

They have completely flipped the relationship between D1 and D2.

My impression of the axiomatic retrieval function is that it still tries to penalize longer documents, so maybe the alternative code is what we need?

By the same token, |D| != Similarity.decodeNorm(fieldNorms[doc]).

Note: since we are recovering from the norm, avgDL and DL differ from their original absolute values, but they suffice for scoring purposes.

Based on Doug's previous comment, I completely agree that avgDL should be pre-computed and cached in the searcher before the rubber meets the road, and the cost might be invisible if we warm up the searcher first. Thanks for explaining.

I am not sure where Doron implemented 1 / sqrt((1 - Slope) * Pivot + (Slope) * Doclen). Since #1911 looks like it will be committed soon, I am really excited to see which similarity function will prevail in this era.

BTW, would anyone like to share how to read Lucene patches more efficiently? I have a hard time making sense of those +s and -s apart from their source files. Is there a way to apply a patch to my local source repository, so I can diff with my favorite diff tool? Thanks in advance.

asfimport commented 17 years ago

Doug Cutting (@cutting) (migrated from JIRA)

> Is there a way to apply a patch to my local source repository, so I can diff with my favorite diff tool?

  patch -p 0 < foo.patch

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

> Is there a way to apply a patch to my local source repository, so I can diff with my favorite diff tool?

> patch -p 0 < foo.patch

Try it with --dry-run first. Another convenient way, if you are using Eclipse, is the Subclipse plugin, which lets you visually diff patches just before applying them.

> But may I suggest the alternative?

I think you have a valid point here. I too don't understand the proposed "Axiomatic Retrieval Function" (ARF) in this regard: in Lucene, the norm value stored for a document (assuming all boosts are 1) is

  norm(D) = 1 / sqrt(numTerms(D))

This value is ready to use at scoring time, multiplied with

  tf(t in d) * idf(t)^2

as described in http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html
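For reference, the practical scoring formula on that javadoc page reads roughly as follows (reconstructed from memory of the 2.x documentation, so treat it as a paraphrase rather than the exact text):

  score(q, d) = coord(q, d) * queryNorm(q)
                * sum over t in q of ( tf(t in d) * idf(t)^2 * boost(t) * norm(t, d) )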

Now, the ARF paper at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf describes Lucene scoring using |D| in place of norm(D) above, and describes ARF scoring using |D| again, which is also how it seems to be implemented in this patch, e.g. in TermScorer. However, the paper defines |D| as "the length of D". I find this confusing: usually "|D|" means the number of words in a document, and "avgdl" means the average of all |D|'s in the collection (see, for instance, "Okapi BM25" on Wikipedia).

Now, your proposed change is something I can understand: it first translates norm(D) back into length(D) (ignoring boosts), and only then averages.

In any case - I mean, whether this gets fixed, or I am wrong and an explanation shows why no fix is needed - I have to admit I still don't understand the logic behind ARF intuitively. Why would it be better? I guess provable search-quality results can help in persuading... (LUCENE-836 is resolved, btw.)

asfimport commented 17 years ago

Hui Fang (migrated from JIRA)

Hi Charlie,

I am sorry for the late reply; I just saw your message. I am not sure why your results are different from mine, since your problem setting is the same as mine. Did you use any document preprocessing?

asfimport commented 16 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Hui - would it be possible to bring this patch up to date, so it's in sync with Lucene 2.3?

Mike McCandless & Co. have made so many changes to the Lucene index format that I get the feeling avg. doc. length could also make it into the index format at the segment/index level if this patch is revived.

asfimport commented 16 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Let's not forget about trying to get avg. doc. length in by 3.0 - if it can be done with little to no impact on those who don't use it, it would be really cool to have.

asfimport commented 15 years ago

Ian Holsman (migrated from JIRA)

It's a bit late over here, but when I try to apply the patch, it doesn't seem to have the AXSimilarity class in it. Is there a file missing, or should I not be looking at applying patches late at night?

asfimport commented 15 years ago

Hui Fang (migrated from JIRA)

Hello everyone,

We have re-implemented the retrieval functions in a very different way. The main differences are (1) the average document length is no longer computed in the retrieval process as it was in the previous implementation, which makes retrieval more efficient, and (2) instead of modifying the existing search-related classes, we integrate the new retrieval functions through two new classes, AXTermQuery and AXTermScorer, which extend the TermQuery and TermScorer classes. I think the current implementation addresses most concerns raised in this discussion thread.
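A hedged sketch of the integration pattern described above; the real AXTermQuery and AXTermScorer live in the tarball linked below, and nothing here is taken from that code:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.TermQuery;

  // Sketch only: the idea is to leave the stock classes untouched and plug
  // the axiomatic formula in through subclasses.
  public class AXTermQuery extends TermQuery {
    public AXTermQuery(Term term) {
      super(term);
    }
    // createWeight() would be overridden so that the returned Weight builds
    // an AXTermScorer (a TermScorer subclass whose score() applies the
    // axiomatic formula with a precomputed average document length) instead
    // of the stock TermScorer.
  }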

The source code and the updated report on our implementation are now available at http://www.ece.udel.edu/~hfang/LuceneAX.html. We have implemented two slightly different versions, for lucene-2.4.1 and lucene-2.9-dev. We hope that the implementation of the axiomatic retrieval function can be integrated into a Lucene release. Please feel free to let me know if you have any questions or comments.

Thanks, -Hui

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

The link http://www.ece.udel.edu/~hfang/lucene/lucene-2.9-dev-AX-contrib.tar.gz doesn't work?

asfimport commented 15 years ago

Hui Fang (migrated from JIRA)

Jason, the problem has been fixed. Please try again. Thanks.

asfimport commented 14 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Hi Hui,

I see you updated your paper on this. Have you looked at how this might be implemented given the flexible indexing work under way?

asfimport commented 12 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

This seems to have gone silent and has likely been superseded by the pluggable similarity options.