daneran / duke

Automatically exported from code.google.com/p/duke

Can we add support for TFIDF matching? #27

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
TF-IDF matching is widely regarded as one of the most effective string matching 
approaches, but it requires a source of term frequency information, i.e. how often 
each token occurs across the data being matched. Can we come up with a way to 
supply this?
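
To make the request concrete, here is a minimal sketch of what a TF-IDF comparison could look like once document frequencies are available. It is not Duke's or SecondString's implementation: the class name `TfIdfSketch`, the whitespace tokenizer, and the smoothed IDF formula `log((N + 1) / (df + 1)) + 1` are all assumptions made for illustration. The term frequencies are gathered in memory from the values passed to `addDocument`; in practice they would come from whatever index holds the records.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Duke: TF-IDF weighted cosine similarity.
final class TfIdfSketch {
    // Number of registered values that contain each token.
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int numDocs = 0;

    // Registers one value so that its distinct tokens count toward document frequency.
    void addDocument(String value) {
        numDocs++;
        for (String token : new HashSet<>(tokenize(value))) {
            docFreq.merge(token, 1, Integer::sum);
        }
    }

    // Assumed tokenizer: lower-case and split on whitespace.
    static List<String> tokenize(String s) {
        String trimmed = s.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Builds a unit-length vector of token weights for one value.
    private Map<String, Double> weights(String value) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokenize(value)) {
            tf.merge(token, 1, Integer::sum);
        }
        Map<String, Double> w = new HashMap<>();
        double sumSq = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            // Assumed smoothing: keeps the weight positive for unseen tokens
            // and for tokens that occur in every document.
            double idf = Math.log((numDocs + 1.0) / (df + 1.0)) + 1.0;
            double weight = Math.log(e.getValue() + 1.0) * idf;
            w.put(e.getKey(), weight);
            sumSq += weight * weight;
        }
        double norm = Math.sqrt(sumSq);
        if (norm > 0.0) {
            w.replaceAll((token, weight) -> weight / norm);
        }
        return w;
    }

    // Cosine similarity in [0, 1]: dot product of the two unit-length vectors.
    double similarity(String a, String b) {
        Map<String, Double> wa = weights(a);
        Map<String, Double> wb = weights(b);
        double dot = 0.0;
        for (Map.Entry<String, Double> e : wa.entrySet()) {
            Double other = wb.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        return dot;
    }
}
```

With a corpus in which every value ends in "corporation", the shared token "corporation" contributes little, so "acme corporation" scores higher against "acme inc" than against "globex corporation". That is the behaviour we want from the comparator, and it shows why the term frequency source is the hard part.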

Original issue reported on code.google.com by lar...@gmail.com on 25 Aug 2011 at 10:09

GoogleCodeExporter commented 8 years ago

Original comment by lar...@gmail.com on 4 Nov 2011 at 10:15

GoogleCodeExporter commented 8 years ago
Finally figured out where to find the document count for a term (the number of 
documents containing it) in Lucene 4.0: 
http://blog.mikemccandless.com/2012/03/new-index-statistics-in-lucene-40.html

Original comment by lar...@gmail.com on 15 Mar 2012 at 9:31

GoogleCodeExporter commented 8 years ago
Can't find a paper that explains this properly, but here is source code: 
http://secondstring.cvs.sourceforge.net/viewvc/secondstring/secondstring/src/com/wcohen/ss/TFIDF.java?revision=1.7&view=markup

Original comment by lar...@gmail.com on 18 Mar 2012 at 7:53