apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

High Frequency Terms/Phrases at the Index level [LUCENE-474] #1552

Closed asfimport closed 11 years ago

asfimport commented 18 years ago

We should be able to find all the high-frequency terms/phrases (where frequency is the search criterion/benchmark).


Migrated from LUCENE-474 by Suri Babu B, resolved Mar 03 2013 Attachments: colloc.zip, collocations.zip

asfimport commented 18 years ago

Pasha Bizhan (migrated from JIRA)

Look for the HighFreqTerms package in the contrib area: http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/contrib/miscellaneous/src/java/org/apache/lucene/misc/HighFreqTerms.java?rev=164963&view=log

asfimport commented 18 years ago

Suri Babu B (migrated from JIRA)

HighFreqTerms.java, available in the misc package, is about terms that have a high document frequency. My actual requirement is different:

I have a set of indexed documents, and I need to find the high-frequency terms as well as phrases at the index level, not the document level.

I am able to find the high-frequency terms by iterating through the termDocs.

But how can I find the high-frequency phrases that occur in the index?

asfimport commented 18 years ago

Pasha Bizhan (migrated from JIRA)

I understand what high-frequency terms are. But what are high-frequency phrases? Could you please explain your index structure?

asfimport commented 18 years ago

Suri Babu B (migrated from JIRA)

High-frequency phrases are like high-frequency terms, but they consist of multiple terms that are repeated together in the index.

Let's say:
the X document has the phrase "Session Bean" 12 times
the Y document has the phrase "Session Bean" 2 times
the Y document has the phrase "Bean" 3 times
the Z document has the phrase "Bean" 5 times

so I should get an output like the one below:

Phrase/Term     Frequency
Session Bean    14
Bean            8

asfimport commented 18 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Using JIRA for discussion? Why, when you can use the java-user@lucene mailing list for that? You can figure out common/frequent phrases using the existing Lucene API by keeping track of terms and their positions. The naive way may be slow and memory-intensive.
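To make the "terms and their positions" idea concrete, here is a minimal sketch of the naive approach in plain Java. It deliberately does not use the Lucene API: it assumes each document's tokens have already been read out in position order (for example from term vectors), and the class and method names (`NaivePhraseCounter`, `countNGrams`) are hypothetical, not part of Lucene.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NaivePhraseCounter {

    /**
     * Counts every run of n adjacent tokens across all documents.
     * Each inner list holds one document's tokens in position order.
     * Assumes n >= 1. Memory grows with the number of distinct phrases,
     * which is why this naive version can be slow and memory-hungry.
     */
    public static Map<String, Integer> countNGrams(List<List<String>> docs, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> tokens : docs) {
            // Slide a window of n tokens over the document.
            for (int i = 0; i + n <= tokens.size(); i++) {
                String phrase = String.join(" ", tokens.subList(i, i + n));
                counts.merge(phrase, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two tiny, already-tokenized documents (illustrative data only).
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("session", "bean", "session", "bean"),
                Arrays.asList("bean"));
        System.out.println(countNGrams(docs, 2)); // two-term phrases
        System.out.println(countNGrams(docs, 1)); // single terms
    }
}
```

Sorting the resulting map by value would give a table of index-level phrase frequencies like the one requested above; on a real index the number of distinct phrases would probably need to be capped or pruned.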

asfimport commented 18 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

Here's some code that I've used before to find phrases in an index - see CollocationFinder.java. If your index has term vector support enabled you can run it to mine the collocated terms. This is typically a long operation that you don't want to do too often. The CollocationIndexer can be used to store the mined collocations in an index.

Possible uses for collocations are:

Haven't done too much with this code, but I've added it here because it sounds like it could be relevant.

Cheers Mark

asfimport commented 18 years ago

Suri Babu B (migrated from JIRA)

Hi Mark,

I tried running your classes but could not see any output, because a ClassCastException was thrown at the following line:

        //get TermPositions for matching doc
        TermPositionVector tpv = (TermPositionVector) reader.getTermFreqVector(docId, fieldName);

While indexing, I added the contents field as below:

Field.Text("contents", fileInfo.getReader(),true); // isStoreTermVector to true

I also found some mismatches between the Field class that I have and the Field class that you refer to in the CollocationIndexer class.

I am using Lucene 1.4.3, and I also noticed that 1.4.3 does not have an implementation of TermPositionVector.

Please let me know whether I am using an old version or whether I need to apply some patches in my environment.

Thanks Suri

asfimport commented 18 years ago

Mark Harwood (@markharwood) (migrated from JIRA)

It looks like you will need a later version. Try checking out the latest code from Subversion.

Mark

asfimport commented 16 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Hi Mark,

I looked at this zip, and it seems useful, but are you intending to donate it? If so, can we get a patch?

asfimport commented 16 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Mark, Can we:

asfimport commented 14 years ago

Ivan Provalov (migrated from JIRA)

I saw some activity on term collocations in the Lucene user forum recently and decided to make a few changes to the colloc.zip package which Mark worked on. I used it before and it worked well for my project.

I ended up doing some fixes and refactoring and adding a couple of unit tests, as well as a new class which will search for the collocated terms if provided with a term. This version works with Lucene 3.0.2. Also, I changed package names, added the license verbiage, and added Maven and Ant files for contrib packaging.

If Mark is OK with these changes, it could be published as a contrib.

asfimport commented 14 years ago

Ivan Provalov (migrated from JIRA)

Added scoring to the CollocationsSearcher, which now returns a LinkedHashMap of collocated terms and their scores relative to the query term. Did some minor refactoring and changed the test.
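For readers without the attachment, here is a rough sketch of why a LinkedHashMap suits this kind of result: it preserves insertion order, so entries inserted from highest to lowest score come back in ranked order. The scoring used here (co-occurrence count divided by the query term's frequency) is only an assumption for illustration, not necessarily what CollocationsSearcher computes, and the names `CollocationRanking` and `rank` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CollocationRanking {

    /**
     * Turns raw co-occurrence counts into a score-ordered map.
     * ASSUMED scoring: coOccurrenceCount / queryTermFreq.
     * Ties are broken alphabetically so the order is deterministic.
     */
    public static LinkedHashMap<String, Float> rank(Map<String, Integer> coCounts,
                                                    int queryTermFreq) {
        if (queryTermFreq <= 0) {
            throw new IllegalArgumentException("queryTermFreq must be positive");
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(coCounts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // descending
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // LinkedHashMap keeps the order in which entries are inserted.
        LinkedHashMap<String, Float> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            ranked.put(e.getKey(), (float) e.getValue() / queryTermFreq);
        }
        return ranked;
    }

    public static void main(String[] args) {
        // Illustrative counts of terms seen next to the query term "session".
        Map<String, Integer> coCounts = new HashMap<>();
        coCounts.put("bean", 14);
        coCounts.put("facade", 5);
        coCounts.put("entity", 5);
        System.out.println(rank(coCounts, 20));
    }
}
```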

asfimport commented 11 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

It's been about 2-1/2 years since anyone touched this, and I suspect that much of the underlying terms data is now available so I'll close this. We can re-open if there's interest. SPRING_CLEANING_2013