cristinae / WikiTailor

Your à-la-carte in-domain corpora extraction tool from Wikipedia

ESA vectors computation based on queries #17

Open albarron opened 7 years ago

albarron commented 7 years ago

To compute ESA representations, we query an index to obtain the scores. This causes problems because the generated boolean query is limited to 1024 terms, a limit we frequently reach with Wikipedia articles.

Besides that, the similarities are not computed well because the query uses a boolean rather than a weighted representation. We should switch to Apache Lucene's MoreLikeThis approach for computing the similarity between two documents.
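
To make the idea concrete, here is a minimal sketch of the term selection that a MoreLikeThis-style query relies on: instead of turning every token of the article into a boolean clause, only the highest-weighted terms are kept, each with its weight. This is not WikiTailor code and it does not call Lucene. The class name `WeightedQuerySketch`, the method names, and the `ln(N/df) + 1` idf formula are assumptions made for illustration only, and Lucene's own scoring differs in its details.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of WikiTailor or Lucene.
final class WeightedQuerySketch {

    /** Counts how often each term occurs in an already tokenized document. */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /**
     * Returns at most maxTerms (term, weight) pairs, highest weight first.
     * The weight is tf * (ln(numDocs / df) + 1), an assumed tf-idf variant.
     */
    static List<Map.Entry<String, Double>> topTerms(Map<String, Integer> tf,
                                                    Map<String, Integer> docFreq,
                                                    int numDocs,
                                                    int maxTerms) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            if (df == 0) {
                continue; // term is absent from the index, so it cannot match anything
            }
            double idf = Math.log((double) numDocs / df) + 1.0;
            double weight = e.getValue() * idf;
            scored.add(new AbstractMap.SimpleImmutableEntry<String, Double>(e.getKey(), weight));
        }
        // Sort by descending weight; break ties alphabetically for a stable result.
        Comparator<Map.Entry<String, Double>> byWeightDesc =
                Map.Entry.<String, Double>comparingByValue().reversed();
        scored.sort(byWeightDesc.thenComparing(Map.Entry.<String, Double>comparingByKey()));
        return new ArrayList<>(scored.subList(0, Math.min(maxTerms, scored.size())));
    }
}
```

With `maxTerms` set well below 1024, the resulting query can never hit the clause limit, and the weights carry the information that a purely boolean query discards.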