Extract the most relevant terms for every document (section content)

dbpedia / jsonpedia-extractor

Fine-grained massive extraction of Wikipedia content. GSoC 2014 Project


Extract the most relevant terms for every document (section content) #4

Closed michelemostarda closed 10 years ago

michelemostarda commented 10 years ago

An inverted index already stores information about term frequency and relevance (see tf/idf vectors). The Lucene programmatic API makes this information easy to retrieve; try to extract the same from the Elasticsearch API. Otherwise we will have to create a parallel Lucene index used only to retrieve it.
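
For reference, here is a minimal plain-Java sketch of the per-document ranking this issue is after. It does not touch the Lucene or Elasticsearch API. `TfIdfSketch`, `tokenize` and `topTerms` are hypothetical names, and the score used (raw term count times ln(N/df)) is only one common tf-idf variant, not necessarily the formula Lucene applies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: ranks the terms of one document by tf-idf, without Lucene. */
final class TfIdfSketch {

    /** Lower-cases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns up to k terms of docs.get(docIndex), highest score first; ties break alphabetically. */
    static List<String> topTerms(List<String> docs, int docIndex, int k) {
        // Document frequency: in how many documents each term occurs.
        Map<String, Integer> df = new HashMap<>();
        List<List<String>> tokenized = new ArrayList<>();
        for (String doc : docs) {
            List<String> tokens = tokenize(doc);
            tokenized.add(tokens);
            for (String term : new HashSet<>(tokens)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        // Term frequency inside the selected document.
        Map<String, Integer> tf = new HashMap<>();
        for (String term : tokenized.get(docIndex)) {
            tf.merge(term, 1, Integer::sum);
        }

        // Assumed scoring variant: tf * ln(N / df).
        int n = docs.size();
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) n / df.get(e.getKey()));
            score.put(e.getKey(), e.getValue() * idf);
        }

        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }
}
```

Calling `TfIdfSketch.topTerms(sections, i, 5)` on a list of section texts would give the five highest-scoring terms of section `i`. A real index would read the term and document frequencies it already stores instead of recomputing them on every call.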

michelemostarda commented 10 years ago

Check whether Elasticsearch exposes anything through JMX: http://en.wikipedia.org/wiki/Java_Management_Extensions
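
A quick way to see what a JVM exposes through JMX is to list the MBeans registered on its platform MBean server. The sketch below uses only the standard `java.lang.management` and `javax.management` APIs. `JmxProbeSketch` and `mbeanNames` are hypothetical names. It inspects the JVM it runs in, so it says nothing yet about whether Elasticsearch registers any beans or what they would contain.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Hypothetical helper: lists MBeans registered in the current JVM. */
final class JmxProbeSketch {

    /** Returns the sorted canonical names of all local MBeans whose domain contains domainFilter. */
    static List<String> mbeanNames(String domainFilter) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        TreeSet<String> names = new TreeSet<>();
        // Passing null for both arguments selects every registered MBean.
        for (ObjectName name : server.queryNames(null, null)) {
            if (name.getDomain().contains(domainFilter)) {
                names.add(name.getCanonicalName());
            }
        }
        return new ArrayList<>(names);
    }
}
```

Run inside the same JVM as an embedded node, `JmxProbeSketch.mbeanNames("elasticsearch")` would show whether any bean with a matching domain is registered. The filter string is a guess, not a documented domain name.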

gigaroby commented 10 years ago

http://stackoverflow.com/questions/9189179/extract-tf-idf-vectors-with-lucene