CloudCannon / pagefind

Static low-bandwidth search at scale
https://pagefind.app
MIT License

Words with accented letters #38

Closed (brycewray closed this 2 years ago)

brycewray commented 2 years ago

It appears that words with accented letters aren’t indexed. I searched for “Régis” as both “Régis” and “Regis” — neither returned “Régis.” Same for “Bjørn” — neither “Bjørn” nor “Bjorn” would return “Bjørn.” Just in case it was simply ignoring the accented letters, I also tried “Rgis” and “Bjrn” (respectively), and neither worked.

bglw commented 2 years ago

Oh my, I just had a dig and found the issue.

Everything is fine on the indexing side; Régis is indexed correctly. To normalize content, Pagefind replaces special characters using a [^\\w] regex.

The issue is that on the indexing side we use Rust's regex engine, which is good and counts é as a "word character". In the browser, though, we normalize the search term via the JS regex engine, which is bad: it does not count é as a word character and thus removes it before searching. So it always ends up searching for Rgis.
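For anyone wanting to reproduce the mismatch, here is a minimal sketch of the browser-side behaviour. The function names are hypothetical and this is not Pagefind's actual code; it only assumes the `[^\w]` pattern mentioned above. In JavaScript, `\w` covers only `[A-Za-z0-9_]`, so a Unicode property class is shown as one possible way to keep accented letters.

```typescript
// Hypothetical helper: mimics normalizing a search term with [^\w] in JS.
// JavaScript's \w is ASCII-only, so "é" and "ø" are treated as non-word
// characters and removed.
function normalizeAscii(term: string): string {
  return term.replace(/[^\w]/g, "");
}

// Hypothetical alternative: keep any Unicode letter, combining mark, number,
// or underscore. Built with the RegExp constructor so the "u" flag and \p{...}
// classes are resolved at runtime.
const NON_WORD_UNICODE = new RegExp("[^\\p{L}\\p{M}\\p{N}_]", "gu");

function normalizeUnicode(term: string): string {
  return term.replace(NON_WORD_UNICODE, "");
}

console.log(normalizeAscii("R\u00e9gis"));    // "Rgis" (the é is dropped)
console.log(normalizeUnicode("R\u00e9gis!")); // "Régis" (only the "!" is dropped)
```

This matches the symptom in the report: the index contains Régis, but the query that reaches it has already been reduced to Rgis.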

I'll write up some tests and aim to get a release out later today that addresses this.

bglw commented 2 years ago

I'm going to be addressing this more concretely in the coming weeks while implementing fully-fledged multilingual support. For now I've added a quick patch that should cover most cases by only normalizing common keyboard punctuation on the browser end of things. That should go out as a 0.5.3 release in about 20 minutes.
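A rough sketch of what "only normalizing common keyboard punctuation" could look like in the browser. The exact character list used in 0.5.3 isn't given in this thread, so the set below is an assumption, and the names are hypothetical. The underscore is left alone here because the original `[^\w]` pattern also kept it.

```typescript
// Assumed set of ASCII keyboard punctuation to strip; NOT the actual list
// from the Pagefind patch. Underscore is intentionally excluded.
const KEYBOARD_PUNCTUATION = /[!"#$%&'()*+,\-.\/:;<=>?@[\\\]^`{|}~]/g;

// Hypothetical helper: removes only the listed punctuation and leaves every
// other character, including accented letters, untouched.
function stripKeyboardPunctuation(term: string): string {
  return term.replace(KEYBOARD_PUNCTUATION, "");
}

console.log(stripKeyboardPunctuation("R\u00e9gis!"));   // "Régis"
console.log(stripKeyboardPunctuation("(Bj\u00f8rn)"));  // "Bjørn"
```

The trade-off with a deny-list like this is that non-keyboard punctuation (curly quotes, for example) passes through unchanged, which is presumably why it only covers "most cases".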

bglw commented 2 years ago

Released!