Utility to output total term frequency and df from a lucene index [LUCENE-2393]

asfimport commented 14 years ago

This is a pair of command line utilities that provide information on the total number of occurrences of a term in a Lucene index. The first takes a field name, term, and index directory and outputs the document frequency for the term and the total number of occurrences of the term in the index (i.e. the sum of the tf of the term for each document). The second reads the index to determine the top N most frequent terms (by document frequency) and then outputs a list of those terms along with the document frequency and the total number of occurrences of the term. Both utilities are useful for estimating the size of the term's entry in the *prx files and consequent Disk I/O demands.
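As a rough illustration of the two statistics described above, here is a minimal sketch that does not use the Lucene API. It assumes a term's postings are available as a plain map from document id to within-document frequency; the class name TermCountSketch and that map are hypothetical stand-ins, not part of the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for one term's postings: docId -> within-document frequency (tf). */
final class TermCountSketch {

    /** Document frequency: the number of documents that contain the term. */
    static int docFreq(Map<Integer, Integer> postings) {
        return postings.size();
    }

    /** Total term frequency: the sum of tf over every document that contains the term. */
    static long totalTermFreq(Map<Integer, Integer> postings) {
        long total = 0;
        for (int tf : postings.values()) {
            total += tf;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> postings = new LinkedHashMap<>();
        postings.put(0, 3); // the term occurs 3 times in doc 0
        postings.put(4, 1);
        postings.put(7, 5);
        System.out.println("docFreq=" + docFreq(postings)
            + " totalTermFreq=" + totalTermFreq(postings));
    }
}
```

A term with a modest docFreq can still have a large totalTermFreq if it repeats often within each document, which is why the two numbers are reported separately.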

Migrated from LUCENE-2393 by Tom Burton-West, resolved Jun 24 2010 Attachments: ASF.LICENSE.NOT.GRANTED--LUCENE-2393.patch (versions: 3), LUCENE-2393.patch (versions: 4), LUCENE-2393-3x.patch, LUCENE-2393-3xbranch.patch

asfimport commented 14 years ago

Tom Burton-West (migrated from JIRA)

Patch against recent trunk. Can someone please suggest an appropriate existing unit test to use as a model for creating a unit test for this? Would it be appropriate to include a small index file for testing or is it better to programmatically create the index file?

asfimport commented 14 years ago

Tom Burton-West (migrated from JIRA)

For an example of how this utility can be used please see: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1

asfimport commented 14 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

I think creating a small index with a couple of docs would be the way to go.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Programmatically indexing those docs is fine – most tests make a MockRAMDir, index a few docs into it, and test against that.

This tool looks useful, thanks Tom!

Note that with flex scoring (#3467) we are planning on storing this statistic (sum of tf for the term across all docs) in the terms dict, for fields that enable statistics. So when that lands, this tool can pull from that, or regenerate it if the field didn't store stats.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Perhaps this should be combined with the high freq terms tool ... could make a ton of these little guys, so prob best to consolidate them.

asfimport commented 14 years ago

Tom Burton-West (migrated from JIRA)

New patch includes a (pre-flex) version of HighFreqTerms that finds the top N terms with the highest docFreq, looks up the total term frequency for each, and outputs the list of terms sorted by highest term frequency (which approximates the largest entries in the *prx files). I'm not sure how to combine the GetTermInfo program with either version of HighFreqTerms in a way that leads to sane command line arguments and argument processing. I suppose that HighFreqTerms could have a flag that turns on or off the inclusion of total term frequency.
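The first step described here, picking the top N terms by docFreq, can be sketched with a bounded min-heap. This is only an illustration under assumptions: the term-to-docFreq map stands in for walking the index's term dictionary, and the class and method names (TopTermsSketch, topByDocFreq) are hypothetical, not taken from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopTermsSketch {

    /** Returns the n terms with the highest docFreq, highest first. */
    static List<String> topByDocFreq(Map<String, Integer> docFreqs, int n) {
        List<String> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap on docFreq: the head is always the weakest candidate kept so far.
        Comparator<Map.Entry<String, Integer>> byDocFreq =
            (a, b) -> Integer.compare(a.getValue(), b.getValue());
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byDocFreq);
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // evict the term with the smallest docFreq
            }
        }
        // Polling yields ascending docFreq, so reverse to get highest first.
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }
}
```

Keeping only N entries in the heap means memory stays proportional to N rather than to the number of distinct terms, which matters when the term dictionary is large.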

asfimport commented 14 years ago

Tom Burton-West (migrated from JIRA)

Updated the HighFreqTermsWithTF to use flex API.

I don't understand the flex API well enough yet to determine if I should have used DocsEnum.read/DocsEnum.getBulkResult() to do a bulk read instead of DocsEnum.nextDoc() and DocsEnum.freq().

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good Tom – thanks for cutting over to flex. You could in fact use the bulk read API here; it'd be faster. But performance isn't a big deal here :)

Maybe you should require a field instead of defaulting to "ocr"?

Why does GetTermInfo.getTermInfo take a String[] fields (it's not used I think)?

Probably we should cutover to BytesRef here too, eg TermInfoWithTotalTF?

Maybe you could share the code between HighFreqTermsWithTF.getTermFreqOrdered & GetTermInfo.getTermInfo? (They both loop, summing up the .freq() of each doc to get the total term freq).

Small typo in javadoc thier -> their.

asfimport commented 14 years ago

Tom Burton-West (migrated from JIRA)

Revised patch updated everything to flex. Replaces all references to Term with BytesRef and field.
GetTermInfo now requires a field instead of defaulting to "ocr".
Removed the unused String[] fields argument.
GetTermInfo now uses the shared code HighFreqTermsWithTF.getTotalTF() to get the total tf.
Removed the GetTermInfo dependency on TermInfoWithTotalTF[] and inlined it into HighFreqTermsWithTF.

Still don't understand the bulk read API, but given that I have indexes with *frq files of 60GB I'd like to use it. Is there some documentation, code, or a test case I might look at?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Still don't understand the bulk read API, but given that I have indexes with *frq files of 60GB I'd like to use it. Is there some documentation, code, or a test case I might look at?

I just committed some small improvements to the javadocs for this – can you look & see if it's understandable now?

Also, have a look at oal.search.TermScorer – it consumes the bulk API.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks for the updated patch Tom... feedback:

Maybe do away with the "hack to allow tokens with whitespace"? One should use quotes with their shell for this? (And eg the hack doesn't work with tokens that have 2 spaces).
Can you rename things like total_tf --> totalTF (consistent w/ Lucene's standard code style)?
Maybe rename TermInfoWithTotalTF -> TermStats? (It also has .docFreq)
Maybe rename TermInfoWithTotalTF.termFreq -> .totalTermFreq?
Maybe rename .getTermFreqOrdered -> .sortByTotalTermFreq?
You don't really need a priority queue for the getTermFreqOrdered case? Ie, instead, just fill in the .totalTermFreq and then do a normal sort (make a Comparator<TermStats> that sorts by the .totalTermFreq)
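The last suggestion could look roughly like the sketch below. The names TermStats, .docFreq, .totalTermFreq and sortByTotalTermFreq follow the renames proposed above; the constructor, the String term (the patch uses BytesRef) and the wrapper class are assumptions made to keep the sketch free of Lucene dependencies.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Assumed field layout; the committed class may differ. */
final class TermStats {
    final String field;
    final String term; // the patch uses BytesRef; String keeps this sketch self-contained
    final int docFreq;
    long totalTermFreq; // filled in after the top terms have been selected

    TermStats(String field, String term, int docFreq, long totalTermFreq) {
        this.field = field;
        this.term = term;
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
    }
}

final class SortByTotalTermFreqSketch {

    /** Orders TermStats by totalTermFreq, largest first. */
    static final Comparator<TermStats> BY_TOTAL_TF_DESC =
        (a, b) -> Long.compare(b.totalTermFreq, a.totalTermFreq);

    /** Returns a sorted copy; the input array is left untouched. */
    static TermStats[] sortByTotalTermFreq(TermStats[] stats) {
        TermStats[] sorted = stats.clone();
        Arrays.sort(sorted, BY_TOTAL_TF_DESC);
        return sorted;
    }
}
```

Since the array already holds only the selected top terms, a plain sort is simpler than a second priority queue and costs little at that size.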

asfimport commented 14 years ago

Tom Burton-West (migrated from JIRA)

Added unit tests. Made changes outlined by Mike. Still working on BulkRead.

asfimport commented 14 years ago

Tom Burton-West (migrated from JIRA)

Patch that includes unit tests and changes outlined in Mike's comment

asfimport commented 14 years ago

Tom Burton-West (migrated from JIRA)

Updated to use BulkResult api.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good Tom!

I cleaned things up a bit – eg, you don't need to use the class members when interacting w/ the bulk DocsEnum API.

I think it's ready to go in!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we should just replace the current HighFreqTerms with the HighFreqTermsWithTF?

asfimport commented 14 years ago

Tom Burton-West (migrated from JIRA)

Hi Mike,

Thanks for all your help.

If we replace the current HighFreqTerms with the HighFreqTermsWithTF should there be a command line switch so that you could ask for the default behavior of the current HighFreqTerms? Or perhaps the default should be the current behavior and the switch should turn on the additional step of gathering and reporting on the totalTF for the terms.

I haven't bench-marked it but I'm wondering if getting the totalTF could take a significant additional amount of time for large indexes. When I ask for the top 10,000 terms using HighFreqTermsWithTF for our 500,000 document indexes it takes about 40 minutes to an hour. I'm guessing that most of that time is taken in the first step of getting the top 10,000 terms by docFreq, but still it seems that reading the data and calculating the totalTF for 10,000 terms might be a significant enough fraction of the total time that the option to skip that step might be useful.

Tom

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Tom, I agree, we should make it optional to compute the totalTF, and probably default it to off? Can you tweak the latest patch to do this?

asfimport commented 14 years ago

Tom Burton-West (migrated from JIRA)

I tweaked the latest patch to mimic the current HighFreqTerms unless you give it a -t argument. However, while testing the argument parsing I found a bug I suspect I inserted into the patch a few versions ago. Am in the process of writing a unit test to exercise the bug and then will fix bug and post both tests and code.

asfimport commented 14 years ago

Tom Burton-West (migrated from JIRA)

Rewrote argument processing so the default behavior is that of HighFreqTerms. The field and number of terms are now both optional, with the default being all fields and 100 terms (same default as current HighFreqTerms). If a -t flag is used, the totalTermFreq stats will be read, calculated, and displayed.
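A hand-rolled parser for these options might look like the following sketch. It is not the code in the patch: the argument order (index directory first, then any of -t, a term count, and a field name), the class name HighFreqArgsSketch and the usage string are all assumptions. Only the defaults (all fields, 100 terms, totalTermFreq off) come from the description above.

```java
final class HighFreqArgsSketch {
    static final int DEFAULT_NUM_TERMS = 100;

    final String indexDir;
    final boolean includeTotalTF; // true only when -t is given
    final int numTerms;
    final String field; // null means "all fields"

    private HighFreqArgsSketch(String indexDir, boolean includeTotalTF, int numTerms, String field) {
        this.indexDir = indexDir;
        this.includeTotalTF = includeTotalTF;
        this.numTerms = numTerms;
        this.field = field;
    }

    /** Assumed usage: <index dir> [-t] [number_terms] [field] */
    static HighFreqArgsSketch parse(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException("usage: <index dir> [-t] [number_terms] [field]");
        }
        boolean includeTotalTF = false;
        int numTerms = DEFAULT_NUM_TERMS;
        String field = null;
        for (int i = 1; i < args.length; i++) {
            String arg = args[i];
            if (arg.equals("-t")) {
                includeTotalTF = true;
            } else {
                try {
                    numTerms = Integer.parseInt(arg); // a number is taken as the term count
                } catch (NumberFormatException e) {
                    field = arg; // anything else is taken as the field name
                }
            }
        }
        return new HighFreqArgsSketch(args[0], includeTotalTF, numTerms, field);
    }
}
```

One limitation of this approach is that a field whose name is purely numeric would be misread as the term count, which is the kind of ambiguity a dedicated option-parsing library avoids.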

The bug surfaced when not specifying a field. Added test data with multiple fields and tests to check that correct results are returned with and without a field being specified. Fixed bug and new tests pass.

With the increasing number of options, I started thinking about more robust command line argument processing. I'm used to languages where there is a commonly used Getopt(s) library. There appear to be several for Java with different features, different levels of active development and different licenses. Is it worth the overhead of using one, and if so which one would be the best to use?

Tom

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good Tom! I'll re-merge my small changes from the prior patch, add a CHANGES, and commit.

I don't think we need to upgrade to CL processing lib...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Tom!

asfimport commented 14 years ago

Tom Burton-West (migrated from JIRA)

Since many people will want to use branch 3.x instead of trunk, I back-ported the flex version to 3x (patched against http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene : 955141). Mike, can this be committed to branch_3x?

Tom

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Tom!

Reopening for backport to 3x....

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch, just cleans up a few minor things...

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Bulk close for 3.1
