MihaiValentin / lunr-languages

A collection of language stemmers and stopwords for the Lunr JavaScript library

lunr-languages/lunr.fr.js fails to find common words like "équipement" #71

Open DavidBruant opened 3 years ago

DavidBruant commented 3 years ago

Test case:

import lunr from "lunr"
import stemmerSupport from 'lunr-languages/lunr.stemmer.support.js'
import lunrfr from 'lunr-languages/lunr.fr.js'

stemmerSupport(lunr)
lunrfr(lunr)

const docs = [
    {
        text : "équipement, barrage",
        id: '1'
    },
    {
        text : "rivière",
        id: '2'
    }
]

const index = lunr(function () {
    this.field('text')
    this.ref('id')

    for(const doc of docs){
        this.add(doc)
    }
})

console.log('résultats pour "équipement"', index.search('équipement'))
console.log('résultats pour "barrage"', index.search('barrage'))
console.log('résultats pour "rivière"', index.search('rivière'))

All three console.log calls should return a result, but the first one (the search for "équipement") returns none.

DavidBruant commented 3 years ago

I haven't taken the time to confirm it, but I believe this is related to https://github.com/MihaiValentin/lunr-languages/issues/68

DavidBruant commented 3 years ago

The workaround I have found is to remove all accents from both the text I index and the string I search for, using this function:

function removeAccents(str){
    return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

It's inconvenient, but it works until lunr-languages/lunr.fr.js is fixed.

dhdaines commented 1 week ago

The language support plugins in general don't fold accented characters to their unaccented forms, which might be by design. You can do the folding separately, either with https://www.npmjs.com/package/lunr-folding (quick but possibly buggy) or by adding your own pipeline function using unidecode (more complete): https://github.com/dhdaines/lunr.py/blob/fix_skip_docs/docs/languages.md#folding-to-ascii
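
For illustration, here is a minimal sketch of the "own pipeline function" approach in JavaScript. It is not taken from lunr-folding or from the linked lunr.py document. It reuses the NFD-based accent stripping from the workaround above instead of unidecode, so it only removes combining diacritics and does not transliterate other characters. The names foldAccents, foldToken, and registerFolding are hypothetical. The sketch assumes lunr 2.x, where pipeline functions receive lunr.Token objects with an update(fn) method, and it assumes registerFolding is called inside the lunr(function () { ... }) builder callback after this.use(lunr.fr), since a language plugin may reset the pipeline.

```javascript
// Strip combining diacritical marks after canonical decomposition (NFD).
// Same technique as removeAccents above; this is not unidecode.
function foldAccents(str) {
    return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Pipeline function (hypothetical name). Assumes a lunr 2.x token,
// whose update(fn) replaces the token string with fn(string).
function foldToken(token) {
    return token.update(foldAccents);
}

// Hypothetical helper: append the folding step to both the indexing
// pipeline and the search pipeline, so that indexed terms and query
// terms are folded the same way.
function registerFolding(lunr, builder) {
    // Registering the function lets a serialized index be loaded again.
    lunr.Pipeline.registerFunction(foldToken, "foldAccents");
    builder.pipeline.add(foldToken);
    builder.searchPipeline.add(foldToken);
}
```

With this in place, the builder callback would call this.use(lunr.fr) followed by registerFolding(lunr, this) before adding documents. Appending the step at the end means folding runs after the French stemmer in both pipelines; whether that ordering gives the best matches for a given corpus is something to check against real data.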