MihaiValentin / lunr-languages

A collection of languages stemmers and stopwords for Lunr Javascript library

Include digits and update unicode regex generation #115

Open dhdaines opened 1 month ago

dhdaines commented 1 month ago

The unicode-8.0.0 package has been deprecated for a while. The README also recommends using regenerate to build regexes, which is much nicer than the way we were doing it before.

But also, a persistent annoyance with lunr-languages was that digits were missing from wordCharacters in all the Latin- and Cyrillic-based languages, even though they are present in the default wordCharacters (Indic-Arabic numerals are also present for Arabic, Hindi, etc.). So this adds them back, thus fixing #66 and maybe some other bugs.

The problem of the trimmer not being run in the search pipeline persists, but that's a lunr.js bug :) At least now things like "HAL9000" will get indexed.
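To illustrate why the missing digits matter, here is a minimal sketch of a trimmer built from a wordCharacters string. It works on plain strings instead of lunr tokens, and `makeTrimmer` and the two character classes are hypothetical simplifications, not the code or the generated classes from this PR.

```javascript
// Hypothetical helper: strips leading and trailing characters that are
// not in the given character class. Works on plain strings, unlike the
// real trimmer, which operates on lunr tokens.
function makeTrimmer(wordCharacters) {
  const startRegex = new RegExp("^[^" + wordCharacters + "]+");
  const endRegex = new RegExp("[^" + wordCharacters + "]+$");
  return function (text) {
    return text.replace(startRegex, "").replace(endRegex, "");
  };
}

// Assumed character classes, for illustration only.
const trimLettersOnly = makeTrimmer("A-Za-z");
const trimWithDigits = makeTrimmer("A-Za-z0-9");

console.log(trimLettersOnly("HAL9000")); // trailing digits are trimmed away
console.log(trimWithDigits("HAL9000"));  // digits survive trimming
```

With the letters-only class, a purely numeric token is trimmed to an empty string, so it would never reach the index.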

dhdaines commented 1 month ago

For some reason we weren't including combining diacritics, yet we were depending on them in the test, so those get added too.
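As a rough sketch of why combining diacritics need to be in the character class: when text is in decomposed (NFD) form, an accent is a separate code point, and a class without the Combining Diacritical Marks block (U+0300 to U+036F) trims it off the end of a word. The helper `trimTrailing` and both classes below are assumptions for illustration, not the ones this PR generates.

```javascript
// Assumed character classes, for illustration only.
const lettersOnly = "A-Za-z";
const lettersAndMarks = "A-Za-z\\u0300-\\u036F"; // adds U+0300..U+036F

// Hypothetical helper: removes trailing characters outside the class.
function trimTrailing(text, wordCharacters) {
  return text.replace(new RegExp("[^" + wordCharacters + "]+$"), "");
}

// "café" in decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
const decomposed = "caf\u00E9".normalize("NFD");

console.log(trimTrailing(decomposed, lettersOnly).length);     // accent trimmed off
console.log(trimTrailing(decomposed, lettersAndMarks).length); // accent kept
```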