elixir-haystack / haystack

Simple, extendable full-text search engine written in Elixir
MIT License

Custom stop words #5

Open jaimeiniesta opened 1 year ago

jaimeiniesta commented 1 year ago

Is it possible to customize the stop words used, so I can provide a list other than the default one, or disable stop words entirely?

Context: I'm setting up Haystack for the search in https://rocketvalidator.com/html-validation - currently it just uses a simple search by substring but I want to use Haystack instead. So far it's going great!

During the integration, I found that the results were not as expected in many searches, and it looks like this was because most of the titles include characters like double quotes:

https://rocketvalidator.com/html-validation/a-link-element-must-not-appear-as-a-descendant-of-a-body-element-unless-the-link-element-has-an-itemprop-attribute-or-has-a-rel-attribute-whose-value-contains-dns-prefetch-modulepreload-pingback-preconnect-prefetch-preload-prerender-or-stylesheet

So when I searched for something containing double quotes, these guides would appear first as they scored higher because they have many double quotes.

I guess this could be solved by adding the double quotes (and other characters like parentheses, brackets, < and >, etc.) to the stop words. My workaround was to clean up the strings, both during the load and the search:

defp cleanup(str) do
  str
  |> String.replace(["“", "”", "<", ">", "(", ")", ".", ",", ";", ":"], "")
  |> String.trim()
end
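
For illustration, here is a minimal sketch of applying that cleanup on both sides, so indexed titles and search queries are normalized the same way. The module name and the sample titles are made up for this example; the character list is the one from the workaround above.

```elixir
defmodule SearchCleanup do
  # Hypothetical module name; the character list mirrors the workaround above.
  @noise ["“", "”", "<", ">", "(", ")", ".", ",", ";", ":"]

  # Remove the noisy characters, then trim surrounding whitespace.
  def cleanup(str) do
    str
    |> String.replace(@noise, "")
    |> String.trim()
  end
end

# Made-up sample titles, cleaned at load time.
titles = [
  "The “alt” attribute must not be empty.",
  "Element <a> (anchor): bad value"
]

cleaned = Enum.map(titles, &SearchCleanup.cleanup/1)

# The query goes through the same function at search time.
query = SearchCleanup.cleanup(" “alt” attribute ")
```

Running both the documents and the query through one function keeps the two sides from drifting apart when the character list changes.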

After that, I found that a search for "must not appear" (https://rocketvalidator.com/html-validation?search=must+not+appear) returned no results using Haystack, and that's because all three words are stop words.

Finally, for non-English content it would be great to be able to customize the stop words.

philipbrown commented 1 year ago

Hey @jaimeiniesta,

Yeah, you can pass a custom list of transformer modules when adding a field: https://github.com/elixir-haystack/haystack/blob/55e8b1f7ae83b67998a6437ea4a25314849c9cf2/lib/haystack/index/field.ex#L46

So you could either pass your own implementation of stop words, or remove it completely. And you can do that on a per-field basis.
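
As a rough sketch of what a custom stop words implementation might look like: the module below is hypothetical, and it assumes a transformer is a module exposing transform/1 over a list of tokens whose text lives under the :v key. Neither assumption is confirmed in this thread, so check the linked field.ex and the bundled transformers before relying on it. The Spanish word list is only an example.

```elixir
defmodule CustomStopWords do
  # Hypothetical module. Assumes tokens are maps or structs with the
  # token text under :v, and that transform/1 is the expected callback.
  @stop_words MapSet.new(~w(el la los las de y))

  # Drop every token whose text appears in the custom stop-word list.
  def transform(tokens) do
    Enum.reject(tokens, fn token -> MapSet.member?(@stop_words, token.v) end)
  end
end

# Plain maps stand in for real tokens here.
tokens = Enum.map(~w(la validez de must not appear), fn word -> %{v: word} end)

kept =
  tokens
  |> CustomStopWords.transform()
  |> Enum.map(fn token -> token.v end)

# Assumed wiring, based on the comment above (the option name is not verified):
#   Haystack.Index.Field.new("title", transformers: [CustomStopWords])
# Passing an empty list would presumably skip stop words altogether:
#   Haystack.Index.Field.new("title", transformers: [])
```

With this list, "must", "not" and "appear" are no longer filtered out, so a query made only of those words would still have tokens to match.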

Again this needs to be added to the documentation 😅

jaimeiniesta commented 1 year ago

Ah, that's cool then. I'll wait for that documentation. Thanks! 😎