PathwayCommons / hyper-recent

Hyper-recent article feed
MIT License

Terms with an apostrophe (') return a large number of results #16

Closed jvwong closed 1 year ago

jvwong commented 1 year ago

Searching with a term that contains an apostrophe returns many results, and the reason is unclear. Setting the MiniSearch prefix option to true makes the problem worse: even more results, in some cases all of them, are returned.

Reproduce this:

maxkfranz commented 1 year ago

What happens if you strip out the apostrophe before searching, e.g. Gehrig's => Gehrigs or Gehrig's => Gehrig?
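A minimal sketch of that idea, assuming the query is normalized before it is handed to the search index. The helper name `stripPossessive` is hypothetical, and it implements the Gehrig's => Gehrig variant, not the Gehrig's => Gehrigs one:

```javascript
// Hypothetical helper: normalize a query string before searching.
// Handles both the ASCII apostrophe (') and the typographic one (’).
function stripPossessive(query) {
  return query
    // Drop a possessive 's at the end of a word: "Gehrig's" -> "Gehrig"
    .replace(/['’]s\b/gi, '')
    // Remove any remaining apostrophes: "O'Brien" -> "OBrien"
    .replace(/['’]/g, '');
}

console.log(stripPossessive("Lou Gehrig's disease"));
```

This only changes the query side. Indexed documents that contain "Gehrig's" would still be split by the default tokenizer, so whether the results improve would need to be checked against the real index.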

From MiniSearch:

By default, documents are tokenized by splitting on Unicode space or punctuation characters. The tokenization logic can be easily changed by passing a custom tokenizer function as the tokenize option:
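As an illustration of what such a custom tokenizer might look like, here is a sketch that keeps apostrophes inside words instead of splitting on them. The function name and the regular expression are assumptions for illustration, not MiniSearch's own implementation:

```javascript
// Hypothetical tokenizer: split on whitespace and punctuation, but treat
// apostrophes (ASCII ' and typographic ’) as part of a word.
function tokenizeKeepingApostrophes(text, _fieldName) {
  // Runs of Unicode letters, digits, and apostrophes form candidate tokens
  const matches = text.match(/[\p{L}\p{N}'’]+/gu) || [];
  return matches
    // Trim apostrophes used as quotes at either end: "'quoted'" -> "quoted"
    .map((token) => token.replace(/^['’]+|['’]+$/g, ''))
    // Discard tokens that were nothing but apostrophes
    .filter((token) => token.length > 0);
}

console.log(tokenizeKeepingApostrophes("Gehrig's disease, ALS"));
```

A function like this could presumably be passed as the tokenize option when constructing the index. With it, "Gehrig's" stays a single term, so a stray "s" token is never produced for prefix matching to expand.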

Technically, an apostrophe can be punctuation and the prefix option may match a lot of things to the letter 's' -- any word that starts with 's' (noisy).

Our use of prefix is a bit of a hack, since it seems intended for the autocomplete use case rather than for search results. They give the example of as-you-type feedback, where 'moto' matches 'motorcycle' in an autocomplete UI. For proper stemming, we'd probably need to use MiniSearch's term-processing function (the processTerm option) with stemming logic (e.g. from Natural).

The use of prefix for our use case may be a bit better if we could specify a minimum word length for applying prefixing.
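One way this might be done: as far as I can tell from the MiniSearch docs, the prefix search option also accepts a function of the term, so a length check could be supplied in place of true. A sketch, with a made-up threshold and function name:

```javascript
// Hypothetical threshold: terms shorter than this are matched exactly only.
const MIN_PREFIX_LENGTH = 3;

// Predicate meant to be used as the `prefix` search option:
// returns true when prefix matching should be applied to the term.
function prefixIfLongEnough(term) {
  return term.length >= MIN_PREFIX_LENGTH;
}

// Assumed usage (requires a populated MiniSearch instance):
//   miniSearch.search("Gehrig's", { prefix: prefixIfLongEnough });

console.log(prefixIfLongEnough('s'), prefixIfLongEnough('Gehrig'));
```

With a check like this, the lone "s" token produced from "Gehrig's" would no longer be expanded to every word starting with "s", though it would still match a literal "s" term in the index.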

jvwong commented 1 year ago

Technically, an apostrophe can be punctuation and the prefix option may match a lot of things to the letter 's' -- any word that starts with 's' (noisy).

Yes, it appears the tokenizer is splitting on the apostrophe:

> let MiniSearch = require('minisearch');
> MiniSearch.getDefault('tokenize')("Gehrigh's")
[ 'Gehrigh', 's' ]