Sefaria / Sefaria-Project

New Interfaces for Jewish Texts
https://www.sefaria.org

Word stem searching #262

Closed rg-net closed 6 years ago

rg-net commented 7 years ago

It seems there is currently no way to do word-stem searching. For example, if I search for שמחה, it will not pick up ושמחת or any other variant. This sometimes leads to pretty pathetic search results, especially when I'm seeking to understand the concept of a word within Torah.

Both Sphinx and Lucene (Solr, ElasticSearch, etc.) have very good stemming built in. As we continue to grow our online resources, this is probably a very important thing to add to our system so that searches can be more comprehensive in scope.

Would love to see more comprehensive searching put into the system. Thanks!

EliezerIsrael commented 7 years ago

Thanks for the reminder. This is certainly a painful limitation, currently, and we'd like to improve it.

We use ElasticSearch. My experience is that the built-in stemming is not helpful - it's designed around English, while we're mainly concerned with Hebrew and Aramaic. There is a project that aims to support Hebrew morphology in Lucene - https://github.com/synhershko/HebMorph - but it's built for modern Hebrew. In my first tests, I found that it led to dramatic over-matching in our corpus. Perhaps that was due to my own mistakes in configuration.

My sense is that the lowest hanging fruit is simply to use a regular expression to search our index for the requested word(s) with the addition of the common prefixes and suffixes. It wouldn't solve the full problem - internal letter additions and removals would not be caught - but it would give us something.
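To make the idea concrete, here is a minimal sketch in Python of what such a regular expression could look like. It is not Sefaria's implementation: the function name `affix_pattern` and the prefix and suffix lists are assumptions chosen for illustration, and the lists are deliberately incomplete.

```python
import re

# Illustrative, incomplete affix lists (assumptions, not taken from Sefaria's code).
PREFIXES = ["ו", "ה", "ב", "כ", "ל", "מ", "ש"]
SUFFIXES = ["ים", "ות", "נו", "כם", "י", "ך", "ו", "ה", "ם", "ן", "ת"]

# Hebrew letters aleph through tav, including final forms.
HEB = "\u05D0-\u05EA"


def affix_pattern(word):
    """Build a regex matching `word` with up to two prefixes and one optional suffix."""
    prefix = "(?:" + "|".join(PREFIXES) + "){0,2}"
    # Try longer suffixes first so the alternation prefers them.
    suffix = "(?:" + "|".join(sorted(SUFFIXES, key=len, reverse=True)) + ")?"
    # Lookarounds act as word boundaries for Hebrew letters.
    return re.compile(
        "(?<![" + HEB + "])" + prefix + re.escape(word) + suffix + "(?![" + HEB + "])"
    )


pattern = affix_pattern("שמחה")
print(bool(pattern.search("ושמחה")))   # prefix vav is accepted
print(bool(pattern.search("ושמחת")))   # final he became tav, so this is missed
```

The second call shows the limitation described above: ושמחת does not contain the literal string שמחה, so adding affixes around the query cannot find it.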

Certainly interested to hear if you have more insight into this issue.

rg-net commented 7 years ago

Morphology may be overkill if regex works. Internal letters could be searched. In my estimation it is better to over-match than under-match. There is certainly a lot that can be done with regex. Can I type a regex into the search line today? I'm not really sure how to write regex with Hebrew given the RTL/LTR problems. It would probably be hard to read.

Will keep thinking on this.

blockspeiser commented 7 years ago

We've recently added some functionality to search that makes some improvements here with prefixes, and with internal yuds / vavs as well. Searching שמחה now catches וְשִׂמְחָה, for example. If anyone has encountered more specific examples of morphology they wish we would account for but currently don't, this is a good place to document them.
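For readers curious how this kind of matching might work, here is a rough Python sketch. It is only an illustration under stated assumptions, not the code Sefaria actually runs: the helper names `strip_marks`, `loose_pattern` and `matches` are hypothetical, and the single-letter prefix set is an arbitrary choice. It strips vowel points and cantillation, then allows one optional yud or vav between the letters of the query.

```python
import re
import unicodedata

# Hebrew letters aleph through tav, including final forms.
HEB = "\u05D0-\u05EA"


def strip_marks(text):
    """Remove niqqud and cantillation, which are Unicode nonspacing marks (Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def loose_pattern(word):
    """Regex for `word` with one optional prefix letter and optional internal yud/vav."""
    word = strip_marks(word)
    # Keep the first and last letters; drop any yud/vav in between.
    inner = [c for c in word[1:-1] if c not in "יו"]
    letters = list(word[:1]) + inner + (list(word[-1:]) if len(word) > 1 else [])
    # Allow at most one yud or vav between each pair of remaining letters.
    body = "[יו]?".join(letters)
    return re.compile(f"(?<![{HEB}])[ובהכלמש]?{body}(?![{HEB}])")


def matches(query, text):
    """True if the loosened query is found in the text once marks are stripped."""
    return loose_pattern(query).search(strip_marks(text)) is not None


print(matches("שמחה", "וְשִׂמְחָה"))   # vocalized form with prefix vav
print(matches("אהרון", "אהרן"))      # plene query, defective spelling in text
```

A pattern this loose over-matches by design, and it still misses forms where a root letter changes, such as ושמחת for the query שמחה.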

rg-net commented 6 years ago

This adds a lot of usefulness to the library you've built. Thanks so much.