btrask / stronglink

A searchable, syncable, content-addressable notetaking system

Stemming plugins #28

Open btrask opened 9 years ago

btrask commented 9 years ago

Our search indexer currently uses the Porter stemming algorithm from SQLite FTS3. We've already tweaked it to ignore underscores, but it still has several other limitations, mainly with languages other than English and with certain search terms (such as proper names that end in "s", and a few other words).
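To make the proper-name limitation concrete, here is a minimal sketch using the stock `porter` tokenizer through Python's `sqlite3` module. It assumes the local SQLite build has FTS3 enabled, and it uses SQLite's unmodified tokenizer rather than our tweaked copy; the table name `notes` and the helper `search` are made up for illustration.

```python
import sqlite3

# Assumption: the SQLite library behind Python's sqlite3 was built with FTS3.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE notes USING fts3(body, tokenize=porter)")
conn.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [
        ("Meeting with James about the index",),
        ("Running the indexer nightly",),
    ],
)


def search(term):
    """Return the bodies of all notes that match a single FTS query term."""
    rows = conn.execute("SELECT body FROM notes WHERE notes MATCH ?", (term,))
    return [body for (body,) in rows]


# Helpful conflation: "run" and "Running" share a stem.
running_hits = search("run")

# Unhelpful conflation: the trailing "s" of "James" is stripped like a plural,
# so a query for the non-word "jame" still matches the proper name.
name_hits = search("jame")

print(running_hits)
print(name_hits)
```

The same rule that makes "run" find "Running" also treats the final "s" of a name as a suffix, which is the kind of false match described above.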

The ideal solution would be automatically detecting the language of each word and stemming according to that language's grammar rules, but I don't know of such an algorithm that is publicly (and freely) available.

I think the practical approach is to let the user choose a custom stemmer for each repository. By default we could try to include the best stemmer for each natural language.

That still isn't ideal for bilingual users, of course.

I think SQLite already has some other stemmers available, so if we stick to its tokenizer interface we can support them quite easily.
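As a rough sketch of what a per-repository choice could look like, the snippet below picks a tokenizer by name when the index table is created. It is written in Python against an in-memory database purely for illustration; `open_index`, `count_matches` and `ALLOWED_TOKENIZERS` are hypothetical names, not existing code. Only the built-in `simple` and `porter` tokenizers are used here. The FTS3 documentation also describes `unicode61` and `icu`, but whether those are available depends on how SQLite was built.

```python
import sqlite3

# Hypothetical whitelist for a per-repository setting. Both names are
# tokenizers that every FTS3-enabled SQLite build registers.
ALLOWED_TOKENIZERS = {"simple", "porter"}


def open_index(tokenizer="porter"):
    """Create an in-memory FTS3 index that uses the named tokenizer."""
    if tokenizer not in ALLOWED_TOKENIZERS:
        raise ValueError(f"unsupported tokenizer: {tokenizer!r}")
    conn = sqlite3.connect(":memory:")
    # The name was validated above, so interpolating it into the DDL is safe.
    conn.execute(
        f"CREATE VIRTUAL TABLE notes USING fts3(body, tokenize={tokenizer})"
    )
    return conn


def count_matches(conn, term):
    """Count the indexed notes that match a single FTS query term."""
    row = conn.execute(
        "SELECT count(*) FROM notes WHERE notes MATCH ?", (term,)
    ).fetchone()
    return row[0]


stemmed_repo = open_index("porter")  # English-style stemming
exact_repo = open_index("simple")    # case folding only, no stemming
for repo in (stemmed_repo, exact_repo):
    repo.execute("INSERT INTO notes (body) VALUES (?)", ("Linked notes",))

porter_count = count_matches(stemmed_repo, "note")  # "notes" stems to "note"
simple_count = count_matches(exact_repo, "note")    # "notes" != "note"
print(porter_count, simple_count)
```

Because the tokenizer is fixed when the table is created, changing a repository's setting would presumably mean rebuilding its index.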

btrask commented 9 years ago

See also https://sqlite.org/fts3.html#tokenizer