MihaiValentin / lunr-languages

A collection of language stemmers and stopwords for the Lunr JavaScript library

trimmer #12

Closed arve0 closed 4 years ago

arve0 commented 9 years ago

When doing

idx.use(lunr.de)

lunr.trimmer is removed from the pipeline, allowing words that include punctuation and the like to enter the index. E.g., both "word." and "word" will be indexed as separate terms.

Adding lunr.trimmer back to the pipeline manually is not really a good solution either, as lunr.trimmer uses \W to match non-word characters, which misclassifies accented letters (Unicode-aware regular expressions are only supported as of ES6).
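For illustration, a Unicode-aware trimmer could match letters and digits explicitly instead of relying on \W. This is only a sketch assuming an ES2018+ runtime with Unicode property escapes; the function name is hypothetical and not part of lunr or lunr-languages:

```javascript
// Sketch: trim leading/trailing characters that are not letters or
// digits, without misclassifying accented letters the way \W does.
// Hypothetical helper, not part of lunr.
function unicodeTrimmer(token) {
  return token
    .replace(/^[^\p{L}\p{N}]+/u, '')   // strip leading punctuation etc.
    .replace(/[^\p{L}\p{N}]+$/u, '');  // strip trailing punctuation etc.
}

console.log(unicodeTrimmer('"wörd."')); // -> wörd
```

Unlike the stock trimmer, this leaves "wörd" intact instead of cutting it at the "ö".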

A solution could be to normalize characters, e.g. æøå -> aoa, as done here: https://github.com/cvan/lunr-unicode-normalizer/blob/master/lunr.unicodeNormalizer.js

Thoughts?
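A minimal sketch of that normalization idea, assuming a runtime with String.prototype.normalize (ES6): NFD decomposition covers most accents (é, å, ...), and a small hand-written map covers letters NFD cannot decompose. The map below is an assumed, incomplete example, not an exhaustive folding table:

```javascript
// Sketch: decompose accented characters, drop the combining marks,
// then map letters NFD cannot decompose (æ, ø, ß, ...).
// foldMap is an illustrative, incomplete example.
var foldMap = { 'æ': 'ae', 'ø': 'o', 'ß': 'ss' };

function normalizeToken(token) {
  return token
    .normalize('NFD')                                   // é -> e + U+0301
    .replace(/[\u0300-\u036f]/g, '')                    // drop combining marks
    .replace(/[æøß]/g, function (c) { return foldMap[c]; });
}

console.log(normalizeToken('jubilación')); // -> jubilacion
console.log(normalizeToken('æøå'));        // -> aeoa
```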

matiasgarciaisaia commented 7 years ago

I've made a sample page for this. Using Spanish, if you search for jubilación or jubilacion (a misspelled version of the first one), lunr gives different results - something that shouldn't really happen, lunr being a full-text search engine.

We've discussed this a little bit in manastech/middleman-search#23 (that's where the example comes from), and I think this should be solved by lunr-languages rather than users having to load lunr.unicodeNormalizer themselves.

Whether lunr-languages should load lunr.unicodeNormalizer or do something different, I'm not sure. But if I'm enabling Spanish full-text search, I definitely want accented words to yield the exact same results as the non-accented version of the word.

I can totally try to fix lunr-languages if you give me some pointers about how to do it. It's just that I'm not sure where/how I should do it.

I'm pretty sure @eemi wants to know about this issue.

drzraf commented 7 years ago

About accent handling, see https://github.com/fortnightlabs/snowball-js/issues/2

drzraf commented 7 years ago

and back to snowballstem/snowball#55

saawsan commented 5 years ago

Hi, any news on this issue?

I'm currently working on an offline & multi-language search client with pouchdb-quick-search and I face the same limitations.

> But if I'm enabling Spanish full-text search, I definitely want accented words to yield the exact same results as the non-accented version of the word.

I completely agree with @matiasgarciaisaia. Ignoring all diacritical marks (à, ñ, ç, é, ...) would greatly improve the relevance of the results.

Right now, the only workaround I can think of would be to strip all diacritical marks before indexing the data.
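As a rough sketch of that workaround in plain JavaScript (independent of lunr; the fold and search names here are illustrative), the key point is to apply the same stripping at both index time and query time so that accented and non-accented forms collapse to the same term:

```javascript
// Sketch of the workaround: strip combining marks from both the
// documents and the query, so "jubilación" and "jubilacion" match.
function fold(text) {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '') // drop combining marks
    .toLowerCase();
}

var docs = ['La jubilación anticipada', 'Otro documento'];
var index = docs.map(fold); // fold at index time...

function search(query) {
  var q = fold(query);      // ...and again at query time
  return index.filter(function (d) { return d.indexOf(q) !== -1; });
}

console.log(search('jubilacion').length); // -> 1
console.log(search('jubilación').length); // -> 1
```

If only one side is folded, the two spellings still diverge, which is exactly the behavior seen in the Spanish sample page above.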