aaronbinns / tnh

(T)he (N)ew (H)otness. Improved full-text search of archival web data.
Apache License 2.0

TNH returns values for totalResults which depend on input parameters (p, n) #6

Open mpearson-nla opened 11 years ago

mpearson-nla commented 11 years ago

Hi Aaron,

My name is Mark Pearson and I am the Web Archiving Engineer at the National Library of Australia. I'm currently working on a project to build a web interface for the NLA's Government Web Archive, using TNH to provide full-text and URL search capabilities and Wayback to play back documents.

During this project I came across (what I believe is) a bug in TNH. The value for totalResults returned by TNH is not consistent but changes as the value of the input parameter p increases. I have documented an example of this in a PDF document which I will try to attach to this issue.

Oops, no attachment facility - I'll have to find another way... Here is the content pasted and formatted for GitHub:

Value of totalResults returned by TNH is dependent on input parameters (p & n)

This document details my investigations into a bug in TNH that causes inconsistent values for the totalResults element to be returned in the XML response (although the bug also shows up when not using the OpenSearch API).

For values of the input parameter p from 0 up to a certain threshold, the value of totalResults remains constant (284 in the example below), but as soon as p crosses that threshold it suddenly changes to a completely different value (44 in the example below).

The following snippets show the input parameters and XML output for values of p just under and just over the threshold at which the bug shows up. (Obviously you won't be able to reproduce this example directly, since it relies on the particular set of indexes I am querying, but the bug shows up on many if not all queries to TNH.)

Query URL

http://.../search/opensearch?q=terry+cutler&n=1&p=43&h=1&t=application%2Fpdf

XML Response (for p = 43 totalResults = 284)

<rss version="2.0">
  <channel>
    <title>terry cutler</title>
    <description>terry cutler</description>
    <link/>
    <totalResults>284</totalResults>
    <startIndex>43</startIndex>
    <itemsPerPage>1</itemsPerPage>
    <query>terry cutler</query>
    <index/>
    <urlParams>
      <param name="t" value="application/pdf"/>
      <param name="q" value="terry cutler"/>
      <param name="p" value="43"/>
      <param name="n" value="1"/>
      <param name="h" value="1"/>
    </urlParams>
    <item>
      <title/>
      <link>http://www.acip.gov.au/library/ACIP_postgrant_enforcement_strategies.pdf</link>
      <docId>922719</docId>
      <score>0.013092474</score>
      <site>acip.gov.au</site>
      <length>404340</length>
      <type>application/pdf</type>
      <boost/>
      <collection/>
      <index>001</index>
      <date>20110601172822</date>
      <description/>
    </item>
    <responseTime>0.041</responseTime>
  </channel>
</rss>

Query URL

http://.../search/opensearch?q=terry+cutler&n=1&p=44&h=1&t=application%2Fpdf

XML Response (for p = 44 totalResults = 44)

<rss version="2.0">
  <channel>
    <title>terry cutler</title>
    <description>terry cutler</description>
    <link/>
    <totalResults>44</totalResults>
    <startIndex>44</startIndex>
    <itemsPerPage>1</itemsPerPage>
    <query>terry cutler</query>
    <index/>
    <urlParams>
      <param name="t" value="application/pdf"/>
      <param name="q" value="terry cutler"/>
      <param name="p" value="44"/>
      <param name="n" value="1"/>
      <param name="h" value="1"/>
    </urlParams>
    <responseTime>0.022</responseTime>
  </channel>
</rss>

I modified the TNH source code so that I could watch how various variables change as the value of p approaches and passes the threshold described above; the results are reproduced below:

p.start=0 maxHits=3 p.hitsPerSite=1 end=1 length=1 result.hits.length=3 p.hitsPerPage=1 
result.numRawHits=284 totalResults=284 

p.start=40 maxHits=43 p.hitsPerSite=1 end=41 length=41 result.hits.length=43 p.hitsPerPage=1 
result.numRawHits=284 totalResults=284 

p.start=42 maxHits=45 p.hitsPerSite=1 end=43 length=43 result.hits.length=44 p.hitsPerPage=1 
result.numRawHits=284 totalResults=284 

p.start=43 maxHits=46 p.hitsPerSite=1 end=44 length=44 result.hits.length=44 p.hitsPerPage=1 
result.numRawHits=284 totalResults=284 

p.start=44 maxHits=47 p.hitsPerSite=1 end=44 length=44 result.hits.length=44 p.hitsPerPage=1 
result.numRawHits=284 totalResults=44 

I have tried to understand the relationship between these variables in order to spot what might be causing the bug, and I have identified the following line of code (OpenSearchServlet.java:163) as the culprit, but I don't know exactly how to fix the problem:

long totalResults = 
    result.hits.length < (p.start+p.hitsPerPage) ? result.hits.length : result.numRawHits;

Any help would be much appreciated.

Thanks,

Mark Pearson
Web Archiving Engineer
National Library of Australia
Canberra ACT 2600 AUSTRALIA

aaronbinns commented 11 years ago

First, thank you Mark for such a thorough and well-written bug report. It looks like you have a pretty firm handle on the observed behavior, but what may come as a surprise to you is that this is....wait for it...."not a bug, but a feature".

What you observe is due to the way we implement "site collapsing". Site collapsing is where we only show the top 1 search result from any one website (i.e. domain name). The idea is to give search results that are similar to what people see with Google/Yahoo!/Bing where there is diversity of sites in the results. For example, if you do a Google search on "Facebook", you don't see the top 10 results all from facebook.com. No, the first (best) result is facebook.com but the rest are from other sites talking about Facebook. We approximate this behavior by limiting the number of results shown from any one website. By default, we only show the top 1 result from any site, but that can be changed via the 'hitsPerSite' parameter.

If you disable the "site collapsing" feature by setting hitsPerSite to 0, you will see two things:

  1. Most likely you will start seeing a bunch of results from the same website (i.e. domain name)
  2. All the results can be paged through to the number of "raw" hits -- in your case 284.
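
For example, disabling collapsing on the second (p=44) query above would look like this, assuming the h URL parameter is what feeds hitsPerSite, as the urlParams block in the responses suggests:

http://.../search/opensearch?q=terry+cutler&n=1&p=44&h=0&t=application%2Fpdf

With h=0, totalResults should stay at the raw count of 284 no matter how far you page.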

The way I implemented site collapsing in TNH is to do it "as necessary". The total number of "raw" hits is exactly that: the number of results that matched the query, without doing any site collapsing. That "raw" number is the maximum number of search results we could show. The number we actually show is going to be less than that, depending on how many different sites there are in the search results.

For example, imagine you have a search index of 1,000,000 web pages, and suppose that those pages came from 10,000 different websites. If we perform a search for "foo" and get 1000 "raw" hits, we don't know how many different websites those results come from until we iterate through them all and group them by website. One way to do that might be to set up a Java Map where the website name (i.e. domain name) is the key and the value is an ordered list of all the results from that site, from highest/best to lowest/worst. Once you have built that rather large data structure, you would then know how many results there are after collapsing (or "grouping") the results by domain. The main problem with this approach is that it is costly -- in terms of time. Remember, this process of grouping results together would happen for each and every query, so every second spent building this giant Map and grouping the results is a second that the user is sitting there waiting for the results to appear.
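
To make the idea concrete, a rough sketch of that full-grouping pass might look like the following. This is illustrative only, not the actual TNH code; "Hit" and its getSite() accessor are hypothetical stand-ins for whatever per-result record TNH uses:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only -- not the TNH code. "Hit" stands in for whatever
// per-result record carries the site (domain) name; rawHits is assumed to be
// already sorted best-to-worst by score.
static int countCollapsedResults(List<Hit> rawHits) {
    Map<String, List<Hit>> bySite = new HashMap<String, List<Hit>>();
    for (Hit hit : rawHits) {
        List<Hit> group = bySite.get(hit.getSite());
        if (group == null) {
            group = new ArrayList<Hit>();
            bySite.put(hit.getSite(), group);
        }
        group.add(hit);   // within each site, hits stay in best-to-worst order
    }
    // Only after grouping every single raw hit do we know the collapsed total.
    return bySite.size();   // 44 in Mark's example, versus 284 raw hits
}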

So, rather than implement this total "grouping" of results by site, TNH implements an "only collapse what you need" approach. This means that in TNH, as we iterate through the results, we retain only the top N results /and/ we collapse them by site. In the end, we only keep the top N collapsed results. This approach is much faster than total grouping, but the side-effect is that we don't know how many groups there are (or would be if we grouped all the results by site). The only time we know the total number of groups is if the number of results to show exceeds the number of groups. For example, if we request the top 10 results for query "foo", and that query only yielded 8 collapsed results, then we know that the total number of collapsed results is 8. No more, no less.
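
A minimal sketch of that "only collapse what you need" idea, again illustrative rather than TNH's actual CollapsingCollector, with the same hypothetical Hit record as above:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only -- not TNH's CollapsingCollector. Walk the score-ordered
// raw hits, keep at most hitsPerSite results per site, and stop as soon as we have
// enough collapsed hits to satisfy the requested page (start + hitsPerPage).
static List<Hit> collapseOnlyWhatYouNeed(List<Hit> rawHits, int hitsPerSite, int needed) {
    Map<String, Integer> perSite = new HashMap<String, Integer>();
    List<Hit> kept = new ArrayList<Hit>();
    for (Hit hit : rawHits) {
        Integer seen = perSite.get(hit.getSite());
        int count = (seen == null) ? 0 : seen.intValue();
        if (hitsPerSite > 0 && count >= hitsPerSite) {
            continue;                 // this site has already used up its quota
        }                             // (hitsPerSite == 0 means no collapsing at all)
        perSite.put(hit.getSite(), count + 1);
        kept.add(hit);
        if (kept.size() >= needed) {
            break;                    // page is full -- stop early, so the true
        }                             // collapsed total is never computed
    }
    return kept;
}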

That is why the totalResults suddenly changes from 284 to 44. Until we actually perform the site collapsing on all the results, we don't know how many there will be. So, we report the total number of "raw" results until we "hit the end". And when we "hit the end" we know there are no more results and therefore the number of collapsed results we have so far is all of them.
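
Read against that explanation, the line from OpenSearchServlet.java that Mark flagged above is making exactly this distinction (comments added here for illustration):

// If the collapsed hits collected don't even fill the requested page
// (p.start + p.hitsPerPage), the collector must have run out of raw hits,
// so the collapsed count is the true total. Otherwise all we can safely
// report is the raw, uncollapsed count.
long totalResults =
    result.hits.length < (p.start+p.hitsPerPage) ? result.hits.length : result.numRawHits;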

The end-user experience can be a little confusing, but it's not unlike what people see if they actually try to page through the results in Google, etc. Run a Google search on "library" and Google tells me that there are "about 1,860,000,000" results. Can I see the 1,859,999,999th result? No. After I page through 10 or so pages of results, Google suddenly tells me that I'm at the end and if I want more results/info I should change my query.

In Lucene 3.x (don't remember which minor release) a grouping module was added, which was then incorporated into Solr. AFAIK, this module performs the "full grouping" I described above. Last I heard, the run-time overhead/cost was something like 2-3x that of running the plain-old ungrouped query. There are limitations in Solr's use of the grouping module in that you cannot group results across multiple Solr cores, nor in a distributed (multi-server) environment.
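
For reference, a sketch of what that "full grouping" looks like with the Lucene grouping module. This uses the later GroupingSearch convenience class rather than the original 3.x first/second-pass collectors, assumes a "site" field indexed so it can be grouped on, and the exact API details may vary by Lucene version:

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;

// Sketch only: group hits by a "site" field and ask Lucene for the group count.
static void showGroupedCounts(IndexSearcher searcher) throws IOException {
    GroupingSearch grouping = new GroupingSearch("site");
    grouping.setAllGroups(true);        // compute the total number of groups up front
    grouping.setGroupDocsLimit(1);      // keep the top 1 hit per site, like hitsPerSite=1

    Query query = new TermQuery(new Term("content", "cutler"));    // stand-in query
    TopGroups<?> groups = grouping.search(searcher, query, 0, 10); // first page of 10 groups

    int rawHits = groups.totalHitCount;           // analogous to TNH's numRawHits
    Integer siteCount = groups.totalGroupCount;   // the collapsed total, known immediately
    System.out.println(rawHits + " raw hits across " + siteCount + " sites");
}

The trade-off is that every query pays for the grouping pass up front, which is where the 2-3x overhead mentioned above comes from.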

The other popular Lucene-based project, elasticsearch, does not use the Lucene grouping module....yet. There's an open, and heavily +1'd, issue to add grouping to elasticsearch.

mpearson-nla commented 11 years ago

Thanks for the detailed explanation, Aaron. I understand the rationale behind the development of the CollapsingCollector class and the other features you have added to TNH since, and I'm pleased to say that your explanation above at least reassures me that I hadn't got completely the wrong end of the stick!

The thing I couldn't figure out about the changing value of totalResults is that I was trying to rely on it when generating the paging links at the bottom of my full-text search results page, but obviously that is a no-no. As you rightly say, this is a little confusing for end-users, but I guess you just work around it.

In fact I just did an obscure search on Google and saw the total number of results drop from 22,600 to 261 as I clicked through the pages of results! I had a vague idea that the behaviour I'd observed in TNH might have something to do with not being able to predict how many actual results are available until they are exhausted.

Thanks for your quick response anyway.

aaronbinns commented 11 years ago

I'm glad that all made sense. And yeah, regarding generation of paging links...there's not much you can do other than generate all the links and then, if the user clicks on one that "goes past the end", only generate links up to the now-known end of the collapsed results.

A fairly crappy example of this can be seen here: http://webharvest.gov/collections/congress111th/search?q=tuvalu&i=congress111th&p=750

Click the next link and you'll notice that, now that you're at the end, the paging links stop at the current page...there's no "next".
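
A bare-bones sketch of that approach (a hypothetical helper, not code from TNH or webharvest.gov): rebuild the page links on every request from whatever totalResults the server reported, so they shrink automatically once the collapsed total becomes known.

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turn the reported totalResults into a list of page numbers.
// While TNH is still reporting the raw count this optimistically generates all the
// links; once a request goes past the end, totalResults drops to the collapsed
// count and the same arithmetic stops at the now-known last page.
static List<Integer> pageLinks(long totalResults, int hitsPerPage) {
    int lastPage = (int) ((totalResults + hitsPerPage - 1) / hitsPerPage);  // ceiling division
    List<Integer> pages = new ArrayList<Integer>();
    for (int page = 1; page <= lastPage; page++) {
        pages.add(page);
    }
    return pages;
}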