Support exact match and phrase match in API

eyaler commented 2 years ago

How does full-text search work? It seems to be some kind of fuzzy search, and I would like to see the documentation for it. If it is fuzzy, can we get an option for exact matching?

curl -X POST "https://benyehuda.org/api/v1/search" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"key\": \"MYKEY\", \"page\": 1, \"fulltext\": \"פרויקט בן־יהודה באינטרנט\"}"
total_count = 8456

curl -X POST "https://benyehuda.org/api/v1/search" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"key\": \"MYKEY\", \"page\": 1, \"fulltext\": \"פרוייקט בן־יהודה באינטרנט\"}"
total_count = 8459

curl -X POST "https://benyehuda.org/api/v1/search" -H "accept: application/json" -H "Content-Type: application/json" -d "{ \"key\": \"MYKEY\", \"page\": 1, \"fulltext\": \"פרוייייייייייקט בן־יהודה באינטרנט\"}"
total_count = 8454

damisul commented 2 years ago

First, I want to make clear that I don't know Hebrew, so it is a bit hard for me to understand the exact difference between those three queries. As far as I can tell with the help of Google Translate, they use slightly different forms of the website's name.

We use Elasticsearch 6.8 under the hood to implement full-text search.

Currently, we don't support any additional parameters, and the API uses a simple match query against the text:

"query": { "match": { "fulltext": <FULLTEXT PARAM VALUE> } }

It is documented here: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/query-dsl-match-query.html

I believe we can consider adding some options to it, such as an option to do a phrase match (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html) instead of a simple match.

The current implementation is useful when you are looking for texts by specific words (or by topic). Phrase search can be useful if you know the exact phrase and want to find the document it was taken from.

But we need to keep a balance between API flexibility and ease of use.
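
To make the difference concrete, here is a minimal sketch of the two request bodies. It is not the bybeconv implementation: the `build_search_query` helper and its `phrase` flag are hypothetical, and only the `match` and `match_phrase` query shapes come from the Elasticsearch documentation linked above. One relevant detail is that `match` combines the analyzed terms with OR by default, which would explain why a query containing one nonexistent word still returns thousands of hits.

```python
import json


def build_search_query(fulltext, phrase=False):
    """Build an Elasticsearch query body for the `fulltext` field.

    `phrase` is a hypothetical option; the current API does not expose it.
    """
    # match: the analyzed terms are OR-ed by default, so a document
    # containing any one of them is a hit.
    # match_phrase: all analyzed terms must appear adjacent and in order.
    query_type = "match_phrase" if phrase else "match"
    return {"query": {query_type: {"fulltext": fulltext}}}


if __name__ == "__main__":
    body = build_search_query("פרויקט בן־יהודה באינטרנט", phrase=True)
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

A boolean flag keeps the API surface small, which fits the concern about balancing flexibility against ease of use.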

@abartov , what do you think?

eyaler commented 2 years ago

Thanks. The first query is part of the common footer that appears in most files. The last query is the same, but with one of the letters repeated 10 times. That string does not exist in the corpus and should return zero results for an exact search, which is why I deduced this might be fuzzy search.

Now, if we do make this strict, we should normalize the searched text. Most significantly, queries that do not specify diacritics (nikud and teamim) should still retrieve texts that have them, and Hebrew punctuation marks should be normalized so that they are also retrieved by queries using the ASCII equivalents. I was somewhat involved in such efforts in the context of Firefox find-in-page and the Python unidecode library, so I would be happy to help if you choose to pursue this.
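
As a rough illustration of that normalization, here is a minimal Python sketch. The function name `normalize_hebrew` and the punctuation table are assumptions made for the example, not anything that exists in bybeconv; a real solution would more likely live in an Elasticsearch analyzer so that indexed text and queries are treated the same way.

```python
import unicodedata

# Hebrew punctuation mapped to ASCII equivalents. This table is an
# illustrative assumption and is probably not complete.
HEBREW_PUNCT_TO_ASCII = {
    0x05BE: "-",   # maqaf
    0x05F3: "'",   # geresh
    0x05F4: '"',   # gershayim
}


def normalize_hebrew(text):
    """Strip nikud and teamim and map Hebrew punctuation to ASCII."""
    # NFD splits precomposed presentation forms into letter + marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Nikud and teamim are the nonspacing marks (category Mn) in
    # U+0591..U+05C7; punctuation in that range has another category.
    stripped = "".join(
        ch for ch in decomposed
        if not ("\u0591" <= ch <= "\u05c7" and unicodedata.category(ch) == "Mn")
    )
    return stripped.translate(HEBREW_PUNCT_TO_ASCII)
```

Applying the same function to both the indexed text and the query is what makes a query without diacritics match a text that has them.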

damisul commented 2 years ago

Well, Elasticsearch does some normalization and analysis of the text during queries, but perhaps we can add some Hebrew-specific tweaks. I see that there were some experiments with non-standard analyzers in the code, but currently they are not used (at least not in the API).
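
For discussion, a Hebrew-specific tweak could take the form of a custom analyzer with character filters. The sketch below writes such index settings as a Python dict. The names `strip_hebrew_marks`, `hebrew_punct_to_ascii` and `hebrew_normalized` are made up for this example and are not the analyzers from the earlier experiments; `pattern_replace` and `mapping` are standard Elasticsearch character filter types, but this configuration is untested against the project's index.

```python
import json

# Nikud and teamim: the combining marks of the Hebrew block, leaving out
# punctuation such as maqaf (U+05BE) and sof pasuq (U+05C3).
HEBREW_MARKS_PATTERN = "[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]"

# Hypothetical index settings; all filter and analyzer names are assumptions.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "char_filter": {
                "strip_hebrew_marks": {
                    "type": "pattern_replace",
                    "pattern": HEBREW_MARKS_PATTERN,
                    "replacement": "",
                },
                "hebrew_punct_to_ascii": {
                    "type": "mapping",
                    # maqaf, geresh, gershayim -> ASCII equivalents
                    "mappings": ["\u05BE => -", "\u05F3 => '", "\u05F4 => \""],
                },
            },
            "analyzer": {
                "hebrew_normalized": {
                    "type": "custom",
                    "char_filter": ["strip_hebrew_marks", "hebrew_punct_to_ascii"],
                    "tokenizer": "standard",
                }
            },
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(INDEX_SETTINGS, ensure_ascii=False, indent=2))
```

Because character filters run at both index and query time when the same analyzer is used for both, queries with and without diacritics would reduce to the same tokens.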

abartov commented 2 years ago

Yes, this is still somewhat complex and is not prioritized for now. We do intend to add an analyzer and to support phrase searches in the API, but it won't be done in the coming few months.

abartov / bybeconv

Support exact match and phrase match in API #98