Improve handling of punctuation in segmented languages

CloudCannon / pagefind

Static low-bandwidth search at scale

https://pagefind.app

MIT License


Improve handling of punctuation in segmented languages #352

Open bglw opened 1 year ago

bglw commented 1 year ago

In whitespace-delimited languages, when Pagefind encounters a-b it is indexed as ab. In languages that go through segmentation, the same text may first be segmented into 'a', '-', 'b', which indexes as the words a and b with a word in between. This makes it hard to search for: on the client side, a-b will still search for ab, and an exact search for "a b" or "a - b" won't match because of the ignored word indexed between a and b.
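To make the mismatch concrete, here is a toy model of the two indexing paths described above. It is not Pagefind's actual code: the function names and the use of `None` for an ignored position are assumptions made purely for illustration.

```python
import string

PUNCT = set(string.punctuation)

def index_whitespace_delimited(text):
    """Toy model: split on whitespace, then drop punctuation inside each word."""
    words = []
    for raw in text.split():
        word = "".join(ch for ch in raw if ch not in PUNCT)
        if word:
            words.append(word)
    return words

def index_segmented(segments):
    """Toy model: every segment occupies a position in the index.
    Punctuation-only segments keep their position but are not searchable,
    which is represented here as None (an assumption of this sketch)."""
    return [None if all(ch in PUNCT for ch in seg) else seg for seg in segments]

print(index_whitespace_delimited("a-b"))   # ['ab']
print(index_segmented(["a", "-", "b"]))    # ['a', None, 'b']
```

In the second result, a and b are two positions apart, so neither a search for ab nor an exact search for "a b" lines up with what was indexed.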

We don't have the segmentation available on the client due to bandwidth constraints, so a different solution will need to be found. One easy(ish) option would be for the client to search for a-b as ab and a b.
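A minimal sketch of that client-side option, assuming a hypothetical `expand_query` helper (not part of Pagefind's API): it produces both the joined form and the space-separated form of a punctuated term, without needing a segmenter.

```python
import re

# Any run of characters that is neither a word character nor whitespace.
# Note that in Python's \w the underscore counts as a word character.
PUNCTUATION = re.compile(r"[^\w\s]+")

def expand_query(term):
    """Return the search variants for a term: punctuation removed ("ab")
    and punctuation replaced by a space ("a b"), without duplicates."""
    joined = PUNCTUATION.sub("", term)
    split = " ".join(PUNCTUATION.sub(" ", term).split())
    variants = []
    for variant in (joined, split):
        if variant and variant not in variants:
            variants.append(variant)
    return variants

print(expand_query("a-b"))       # ['ab', 'a b']
print(expand_query("alphabet"))  # ['alphabet']
```

The client would then run each variant and merge the results. The split variant only helps an exact search if the ignored punctuation no longer takes up a position between a and b in the index.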

bglw commented 1 year ago

Another option would be to swap in different punctuation logic for segmented languages, so that in these cases a-b is always 'a', '-', 'b'.
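As a rough sketch of what "always 'a', '-', 'b'" could look like, the hypothetical `split_punctuation` function below emits every punctuation character as its own token. Because it is a plain regex, the same rule could in principle run on the client without shipping a segmenter. It only models the punctuation step; real segmentation of the word runs is out of scope here.

```python
import re

# A run of word characters, or a single non-word, non-space character.
TOKEN = re.compile(r"\w+|[^\w\s]")

def split_punctuation(text):
    """Split text so that each punctuation character is always its own token."""
    return TOKEN.findall(text)

print(split_punctuation("a-b"))    # ['a', '-', 'b']
print(split_punctuation("a - b"))  # ['a', '-', 'b']
```

With this rule, a-b and a - b produce the same token sequence at index time and at query time, so the two sides can no longer disagree about where the punctuation sits.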

hjonin commented 10 months ago

Hi, I think the same problems occur with the symbol ' that can be placed in front of a word in French (it's the contraction of the article le, which becomes l'). For example, a search for alphabet should also return results in which l'alphabet can be found. Thanks in advance and congratulations on a very good job!

bglw commented 10 months ago

Hey @hjonin 👋

The French case is slightly different (in a good way) — this current issue only applies to non-whitespace-delimited languages like Chinese.

For indexing l'alphabet, that will be resolved by #225, which will be released in a stable version next week 😄

The current v1.0.0-beta.2 release includes the behavior, though it currently doesn't split on the ' symbol. But you bring up a great point re: French, so I'll expand the logic to cover it (and probably just all punctuation for now, and we can configure it down later if need be).
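For illustration only, splitting a word on all punctuation and indexing the parts alongside the whole might look like the sketch below. The `index_terms` helper and the exact set of terms it emits are assumptions, not the behavior implemented in #225.

```python
import re

def index_terms(word):
    """Return the punctuation-stripped whole word, followed by each part
    obtained by splitting on punctuation (underscore included)."""
    parts = [p for p in re.split(r"[\W_]+", word.lower()) if p]
    terms = ["".join(parts)] if parts else []
    if len(parts) > 1:
        # Also index each part so a search for one part finds the compound.
        terms.extend(p for p in parts if p not in terms)
    return terms

print(index_terms("l'alphabet"))      # ['lalphabet', 'l', 'alphabet']
print(index_terms("html_attribute"))  # ['htmlattribute', 'html', 'attribute']
```

Under these assumptions a search for alphabet matches l'alphabet, and a search for attribute matches html_attribute.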

Here's an example of the new word indexing finding the attribute in html_attribute:

hjonin commented 10 months ago

Hi @bglw

Thank you very much for your answer! Can't wait to use the new version!