Closed arve0 closed 4 years ago
I've made a sample page for this. Using Spanish, if you search for jubilación or jubilacion (a misspelled version of the first one), lunr gives different results, which shouldn't really happen, since lunr is a full-text search engine.
We've discussed this a little in manastech/middleman-search#23 (that's where the example comes from), and I think this should be solved by lunr-languages rather than by the user having to load lunr.unicodeNormalizer themselves.
Whether lunr-languages should load lunr.unicodeNormalizer or do something different, I'm not sure. But if I'm enabling Spanish full-text search, I definitely want accented words to yield the exact same results as the non-accented version of the word.
I can totally try to fix lunr-languages if you give me some pointers on how to do it. It's just that I'm not sure where or how I should do it.
I'm pretty sure @eemi wants to know about this issue.
About handling accents, see https://github.com/fortnightlabs/snowball-js/issues/2
and back to snowballstem/snowball#55
Hi, any news about that issue?
I'm currently working on an offline & multi-language search client with pouchdb-quick-search and I face the same limitations.
But if I'm enabling Spanish full-text search, I definitely want accented words to yield the exact same results as the non-accented version of the word.
I completely agree with @matiasgarciaisaia. Ignoring all diacritical marks (à, ñ, ç, é, ...) would greatly improve the relevance of the results.
Right now, the only workaround I can think of is to strip all the diacritical marks before indexing the data.
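The workaround above could look roughly like the following. This is only a sketch: stripDiacritics is a hypothetical helper, not something lunr or lunr-languages provides, and it relies on String.prototype.normalize (ES6), so older environments would need a polyfill. The same folding has to be applied to the query, otherwise an accented query will not match the folded index.

```javascript
// Hypothetical helper (not part of lunr or lunr-languages): decompose each
// character (NFD), then drop the combining diacritical marks U+0300..U+036F.
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Fold the documents before they are handed to the indexer...
const docs = [{ id: 1, title: 'Jubilación anticipada' }];
const foldedDocs = docs.map((doc) => ({ ...doc, title: stripDiacritics(doc.title) }));

// ...and fold the query the same way before searching.
const foldedQuery = stripDiacritics('jubilación');

console.log(foldedDocs[0].title, '|', foldedQuery);
```

Note that this only covers letters that decompose into a base letter plus a combining mark (à, ñ, ç, é). Letters such as æ and ø have no such decomposition and would need an explicit mapping table.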
When doing
lunr.trimmer is removed from the pipeline, so words that include punctuation and the like enter the index. E.g., both "word." and "word" will enter the index.
Adding lunr.trimmer to the pipeline manually is not really a good solution, as lunr.trimmer uses \W to match non-word characters (Unicode-aware regexps are only supported as of ES6).
A solution could be to normalize characters like æøå -> aoa, as done here: https://github.com/cvan/lunr-unicode-normalizer/blob/master/lunr.unicodeNormalizer.js
Thoughts?
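To illustrate the \W problem and the proposed normalization, here is a small sketch. asciiTrim only mimics a trimmer that strips leading and trailing \W characters (an assumption about how lunr.trimmer behaves, based on the description above), and foldNordic with its FOLD table is a hypothetical helper following the æøå -> aoa example, not the linked normalizer's actual code.

```javascript
// Mimics an ASCII-only trimmer: in JavaScript, \W matches anything outside
// [A-Za-z0-9_], so letters such as ø and å count as "non-word" characters.
function asciiTrim(token) {
  return token.replace(/^\W+/, '').replace(/\W+$/, '');
}

// Hypothetical mapping table, following the æøå -> aoa example.
const FOLD = { 'æ': 'a', 'ø': 'o', 'å': 'a', 'Æ': 'A', 'Ø': 'O', 'Å': 'A' };

// Compose first (NFC) so that "a" + combining ring is seen as a single "å".
function foldNordic(token) {
  return token.normalize('NFC').replace(/[æøåÆØÅ]/g, (ch) => FOLD[ch]);
}

// Trimming alone eats the letters at the token boundaries.
console.log(asciiTrim('øl.'), asciiTrim('blå.'));

// Folding first lets the ASCII-only trimmer do its job.
console.log(asciiTrim(foldNordic('øl.')), asciiTrim(foldNordic('blå.')));
```

The order matters in this sketch: the folding step has to run before the trimmer, otherwise the boundary letters are already gone by the time they could be normalized.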