barzerman / barzer

barzer engine code
MIT License

Stop words fuck up ngrams #589

Closed: 0xd34df00d closed this issue 11 years ago

0xd34df00d commented 11 years ago

Example query: http://eu.barzer.net/query/json?key=aRLsIvszISAReCoS6ktgviZxN0YlRpbs6DKH7vro&zurch=yes&flag=d&query=%D0%BC%D0%BE%D0%B6%D0%BD%D0%BE%20%D0%BB%D0%B8%20%D0%BF%D0%BB%D0%B0%D1%82%D0%B8%D1%82%D1%8C%20%D0%B7%D0%B0%20%D0%BF%D0%BE%D0%BA%D1%83%D0%BF%D0%BA%D0%B8%20%D0%BA%D0%B0%D1%82%D1%80%D0%BE%D0%B9%20%D0%B2%D0%B0%D1%88%D0%B5%D0%B3%D0%BE%20%D0%B1%D0%B0%D0%BD%D0%BA%D0%B0?

The correct document gets a score of only 1.5, which is quite low.

Stop words like ли and по get matched inside entities in the corpus but don't get matched in the query, so the corresponding ngrams never get formed.

I've added loading of a space-separated stop word list and removal of those stop words during the feature conversion phase, so entities won't be affected. @barzerman please take a look and give me a go-ahead :)
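
To illustrate the idea, here is a minimal Python sketch. It is not the actual barzer code: the function names (`load_stopwords`, `ngrams`, `features`), the bigram size, and the sample tokens are all assumptions made for the example. It only shows how a stray stop word in the query breaks ngram overlap, and how dropping stop words at feature conversion restores it.

```python
def load_stopwords(text):
    """Parse a space-separated stop word list into a set (hypothetical helper)."""
    return set(text.split())


def ngrams(tokens, n=2):
    """Return all contiguous n-token tuples from the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(tokens, stopwords, n=2):
    """Drop stop words first, then build ngrams from what is left."""
    kept = [t for t in tokens if t not in stopwords]
    return ngrams(kept, n)


# Illustrative data only: a tiny stop word list and made-up token sequences.
STOPWORDS = load_stopwords("ли по")
doc_tokens = ["можно", "платить", "картой"]
query_tokens = ["можно", "ли", "платить", "картой"]

# Without filtering, "ли" splits the query bigrams, so only one is shared.
shared_raw = set(ngrams(doc_tokens)) & set(ngrams(query_tokens))

# With filtering, both sides produce the same bigrams.
shared_filtered = set(features(doc_tokens, STOPWORDS)) & set(features(query_tokens, STOPWORDS))

print(len(shared_raw), len(shared_filtered))
```

Filtering happens only when features are built, so in this sketch the original token lists (and anything matched against them earlier, such as entities) stay untouched.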

0xd34df00d commented 11 years ago

Will merge soon.