commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

Improve partial indexing/matching of URLs #17

Open sylvinus opened 8 years ago

sylvinus commented 8 years ago

Not sure whether this should be done before indexing or entirely in Elasticsearch, but improving the tokenization of URLs to allow better partial matches would help in cases like commonsearch/cosr-results#5.

As a first step, could the presence of separated terms ("Le Monde") in the title have an influence on the tokenization of the URLs? ("lemonde.fr" => "le monde fr" in addition to "lemonde fr")

sylvinus commented 8 years ago

First step is done: we now split the words in the domain name using a vocabulary extracted from the title and summary.
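To illustrate the kind of vocabulary-based splitting described here, below is a minimal sketch. It is not the actual cosr-back code: the function names (`build_vocabulary`, `split_with_vocabulary`, `tokenize_domain`) are hypothetical, and the choice of keeping the segmentation with the fewest words is an assumption made for the example.

```python
import re


def build_vocabulary(*texts):
    """Collect the lowercase words found in the given texts (e.g. title, summary)."""
    words = set()
    for text in texts:
        words.update(re.findall(r"\w+", text.lower()))
    return words


def split_with_vocabulary(label, vocabulary):
    """Split `label` into vocabulary words covering it entirely.

    Returns the segmentation with the fewest words, or None if the label
    cannot be fully covered by the vocabulary.
    """
    # best[i] holds the best segmentation found so far for label[:i].
    best = [None] * (len(label) + 1)
    best[0] = []
    for end in range(1, len(label) + 1):
        for start in range(end):
            word = label[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(label)]


def tokenize_domain(domain, title, summary=""):
    """Tokenize a domain name, splitting each label with the page's own words."""
    vocabulary = build_vocabulary(title, summary)
    tokens = []
    for label in domain.lower().split("."):
        split = split_with_vocabulary(label, vocabulary)
        # Fall back to the unsplit label when the vocabulary does not cover it.
        tokens.extend(split if split else [label])
    return tokens


print(tokenize_domain("lemonde.fr", "Le Monde"))  # ['le', 'monde', 'fr']
```

In practice the split tokens would presumably be indexed in addition to the unsplit ones ("lemonde fr"), so that both forms of the query still match.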

The next step would be to use a much larger vocabulary (include the body? use norvig.com/ngrams/count_1w.txt?) and to improve the splitting itself, probably with statistical methods.
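One possible statistical approach, sketched below, is unigram-based word segmentation: pick the split whose words have the highest combined probability under a word-frequency table. This is only an illustration, not something decided in this issue. `TOY_COUNTS` is a made-up stand-in for a real frequency list such as count_1w.txt, and the unknown-word penalty and `MAX_WORD_LENGTH` are assumptions.

```python
import math
from functools import lru_cache

# Hypothetical toy counts; a real setup would load a large frequency list instead.
TOY_COUNTS = {
    "le": 500, "monde": 300, "common": 400, "search": 600,
    "comm": 5, "on": 900, "sea": 50, "arch": 40,
}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LENGTH = 20  # assumed upper bound on the length of a single word


def word_logprob(word):
    """Log10 probability of a word, with a length-based penalty for unknown words."""
    count = TOY_COUNTS.get(word)
    if count:
        return math.log10(count / TOTAL)
    # Unknown words get a probability that shrinks tenfold per extra character.
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


def segment(text):
    """Return the most probable split of `text` under the unigram model."""

    @lru_cache(maxsize=None)
    def best(rest):
        if not rest:
            return (0.0, ())
        candidates = []
        for i in range(1, min(len(rest), MAX_WORD_LENGTH) + 1):
            first, remaining = rest[:i], rest[i:]
            score, words = best(remaining)
            candidates.append((word_logprob(first) + score, (first,) + words))
        return max(candidates)

    return list(best(text)[1])


print(segment("commonsearch"))  # ['common', 'search']
print(segment("lemonde"))       # ['le', 'monde']
```

Unlike a pure vocabulary lookup, this always returns some split, and frequent words win over rare ones ("common search" rather than "comm on search").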