Closed HAEKADI closed 3 days ago
When querying Elasticsearch, you can specify the analyzer that should be used to tokenize and filter your query: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-match-query.html

> `analyzer`
>
> (Optional, string) [Analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis.html) used to convert the text in the query value into tokens. Defaults to the [index-time analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/specify-analyzer.html#specify-index-time-analyzer) mapped for the `<field>`. If no analyzer is mapped, the index’s default analyzer is used.

This means you can choose different analyzers to query the same indexed content, and you can also specify your own list of stop words at the index level instead of relying on the default stop word lists:

> `stopwords`
>
> (Optional, string or array of strings) Language value, such as `_arabic_` or `_thai_`. [...] Also accepts an array of stop words.
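To make the two quoted options concrete, here is a minimal sketch written as plain Python dictionaries. The analyzer name `french_names`, the filter name `custom_french_stop`, and the stop word list are hypothetical and not taken from this repository; only the `stop` filter's `stopwords` setting and the `match` query's `analyzer` parameter come from the Elasticsearch documentation quoted above.

```python
# Index settings declaring a custom stop word list at the index level.
# "custom_french_stop" and "french_names" are hypothetical names.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "custom_french_stop": {
                    "type": "stop",
                    # Either a language value such as "_french_"
                    # or an explicit array of stop words.
                    "stopwords": ["de", "du", "la", "le"],
                }
            },
            "analyzer": {
                "french_names": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "custom_french_stop"],
                }
            },
        }
    }
}


def match_with_analyzer(field, text, analyzer):
    """Build a match query that overrides the analyzer used at search time."""
    return {"query": {"match": {field: {"query": text, "analyzer": analyzer}}}}


body = match_with_analyzer("nom_personne", "BERNARD DE SAINT AFFRIQUE", "french_names")
```

The same indexed field can then be queried with a different analyzer simply by passing another analyzer name to `match_with_analyzer`.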
@MKCG I tried using the same custom analyser for the `match` query as the one used during the indexing process, but it didn't resolve the issue. It seems to me that the complexity of the analyser may not work well when the input consists only of a stop word, such as "DE".

I also tried using the same stop words from the custom analyzer via the `_analyze` endpoint to analyze the input string. However, it feels overly complicated for this particular use case.

Do you think relying on the stop word list is too simplistic and using the analyser is a better way?
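For reference, the `_analyze` approach mentioned above would look roughly like the sketch below. It only builds the request and parses a response shaped like the one Elasticsearch returns (a `tokens` array whose items carry a `token` field); the index name `my-index` and analyzer name `french_names` are hypothetical, and no HTTP call is made here.

```python
def build_analyze_request(index, analyzer, text):
    """Describe a POST /<index>/_analyze call for the given analyzer and text."""
    return {
        "method": "POST",
        "path": f"/{index}/_analyze",
        "body": {"analyzer": analyzer, "text": text},
    }


def tokens_from_analyze_response(response):
    """Extract the token strings from an _analyze response body."""
    return [item["token"] for item in response.get("tokens", [])]


request = build_analyze_request("my-index", "french_names", "BERNARD DE SAINT AFFRIQUE")

# Hand-written example of a response shape, not real output from a cluster.
sample_response = {
    "tokens": [
        {"token": "bernard", "position": 0},
        {"token": "saint", "position": 2},
        {"token": "affrique", "position": 3},
    ]
}
remaining = tokens_from_analyze_response(sample_response)
```

This costs an extra round trip to the cluster for every search, which is part of why it feels heavy for this use case.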
> I also tried using the same stop words from the custom analyzer via the `_analyze` endpoint to analyze the input string. However, it feels overly complicated for this particular use case.
>
> Do you think relying on the stop word list is too simplistic and using the analyser is a better way?
I think this totally makes sense, and it is a good reason to code it the way you did. I approve this PR.
closes https://github.com/annuaire-entreprises-data-gouv-fr/search-api/issues/374
The filters for `nom_personne` and `prenoms_personne` currently use two types of queries. The first is a `match` query, which ensures that each word in the query is present in the results. The second is a `match_phrase` query, which boosts the score for exact matches, ensuring those results are prioritized.

However, when handling names like "BERNARD DE SAINT AFFRIQUE", which contain French stop words (e.g., "DE"), the `match` query fails. This happens because the custom analyzer used during indexing removes stop words like "DE", meaning no match is found for them. While the `match_phrase` query still correctly identifies exact matches, the `must` clause of the `match` query returns no results due to the missing stop word.

To fix this issue, this PR removes stop words (like "DE") from `nom_personne` and `prenoms_personne` before adding them to the `match` query. However, the complete list of words, including stop words, is still passed to the `match_phrase` query to ensure exact matches are boosted as before.

Additionally, this change removes the need to apply the entire analyzer to the query, which unnecessarily complicates the search and sometimes returns incorrect results. By removing stop words manually, we ensure the search is simpler and more accurate.
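The approach described above can be sketched as follows. This is an illustration, not the code from this PR: the function names, the stop word subset, the `boost` value, and the `"operator": "and"` setting are all assumptions made for the example, and a real list should mirror the stop filter configured on the index.

```python
# Illustrative subset only (assumption); keep it in sync with the index analyzer.
FRENCH_STOP_WORDS = frozenset({"de", "du", "des", "la", "le", "les"})


def strip_stop_words(text, stop_words=FRENCH_STOP_WORDS):
    """Drop stop words from a whitespace-separated string, ignoring case."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


def build_person_name_query(field, text, boost=5.0):
    """Build a bool query: filtered words in `match`, full text in `match_phrase`."""
    clauses = {
        # The full input, stop words included, still boosts exact matches.
        "should": [{"match_phrase": {field: {"query": text, "boost": boost}}}]
    }
    filtered = strip_stop_words(text)
    if filtered:
        # Only the remaining words are required to be present.
        clauses["must"] = [
            {"match": {field: {"query": filtered, "operator": "and"}}}
        ]
    return {"bool": clauses}


query = build_person_name_query("nom_personne", "BERNARD DE SAINT AFFRIQUE")
```

In this sketch, an input made up only of stop words (for example "DE") produces no `must` clause at all; whether that is the desired behavior is a decision for the actual implementation.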