annuaire-entreprises-data-gouv-fr / search-api

[SEARCH]fix: remove stop words from dirigeant before match query #423

Closed HAEKADI closed 3 days ago

HAEKADI commented 3 days ago

closes https://github.com/annuaire-entreprises-data-gouv-fr/search-api/issues/374

The filters for nom_personne and prenoms_personne currently use two types of queries. The first is a match query, which ensures that each word in the query is present in the results. The second is a match_phrase query, which boosts the score for exact matches, ensuring those results are prioritized.

However, when handling names like "BERNARD DE SAINT AFFRIQUE", which contain French stop words (e.g., "DE"), the match query fails. This happens because the custom analyzer used during indexing removes stop words like "DE", meaning no match is found for them. While the match_phrase still correctly identifies exact matches, the must clause of the match query returns no results due to the missing stop word.

To fix this issue, this PR introduces a change that removes stop words (like "DE") from nom_personne and prenoms_personne before adding them to the match query. However, the complete list of words, including stop words, is still passed to the match_phrase query to ensure exact matches are boosted as before.
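The split described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the stop word list, the boost value, and the field name passed in are assumptions chosen for the example.

```python
# Hypothetical stop word list; a real one should mirror the index-time analyzer.
FRENCH_STOP_WORDS = {"de", "du", "des", "la", "le", "les", "d", "l"}


def strip_stop_words(text, stop_words=FRENCH_STOP_WORDS):
    """Return the text without stop words, comparing case-insensitively."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


def build_person_name_query(field, value):
    """Build a bool query for a person name filter.

    - should/match_phrase: receives the full input, stop words included,
      so exact matches are boosted (boost value is an assumption).
    - must/match: receives only the significant words, each of which
      has to be present (operator "and").
    """
    query = {
        "bool": {
            "should": [{"match_phrase": {field: {"query": value, "boost": 2}}}]
        }
    }
    significant = strip_stop_words(value)
    if significant:
        # Skip the must clause when the input contains only stop words.
        query["bool"]["must"] = [
            {"match": {field: {"query": significant, "operator": "and"}}}
        ]
    return query
```

With the input "BERNARD DE SAINT AFFRIQUE", the `match` clause receives "BERNARD SAINT AFFRIQUE" while the `match_phrase` clause keeps the original string. An input made only of stop words, such as "DE", produces no `must` clause at all.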

Additionally, this change removes the need to apply the entire analyzer to the query, which unnecessarily complicates the search and sometimes returns incorrect results. By removing stop words manually, we ensure the search is simpler and more accurate.

MKCG commented 3 days ago

When querying Elasticsearch, you can specify the analyzer used to tokenize and filter your query: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-match-query.html

analyzer
    (Optional, string) [Analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis.html) used to convert the text in the query value into tokens. Defaults to the [index-time analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/7.17/specify-analyzer.html#specify-index-time-analyzer) mapped for the <field>. If no analyzer is mapped, the index’s default analyzer is used.

This means you can choose to use different analyzers to query the same indexed content, and you can also specify your own list of stop words at the index level instead of relying on the default stop word lists:

stopwords
    (Optional, string or array of strings) Language value, such as _arabic_ or _thai_.  [...] Also accepts an array of stop words.

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-stop-tokenfilter.html#analysis-stop-tokenfilter-configure-parms
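As a rough sketch of the two options quoted above, the snippet below defines a `stop` token filter with an explicit word array and a `match` query that names its analyzer. The names `custom_french_stop` and `name_search_analyzer`, and the word list itself, are hypothetical and do not come from this repository's index configuration.

```python
# Index settings with a custom stop word list (all names here are illustrative).
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "custom_french_stop": {
                    "type": "stop",
                    # An explicit array instead of a language value such as _french_.
                    "stopwords": ["de", "du", "des", "la", "le", "les"],
                    "ignore_case": True,
                }
            },
            "analyzer": {
                "name_search_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "custom_french_stop"],
                }
            },
        }
    }
}


def match_with_analyzer(field, value, analyzer="name_search_analyzer"):
    """Build a match query whose text is analyzed by the named analyzer."""
    return {
        "match": {
            field: {"query": value, "operator": "and", "analyzer": analyzer}
        }
    }
```

When the `analyzer` parameter is omitted, the match query falls back to the analyzer mapped for the field, as the quoted documentation states.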

HAEKADI commented 3 days ago

@MKCG I tried using the same custom analyser for the match query as the one used during indexing, but it didn't resolve the issue. It seems to me that the analyser, given its complexity, may not work well when the input consists only of a stop word, such as "DE".

HAEKADI commented 3 days ago

I also tried using the same stop words from the custom analyzer via the _analyze endpoint to analyze the input string. However, it feels overly complicated for this particular use case.

Do you think relying on the stop word list is too simplistic and using the analyser is a better way?
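For comparison, the `_analyze` route mentioned above could look roughly like this. The helpers are hypothetical and only build the request body and read the documented response shape (`{"tokens": [{"token": ...}]}`); sending the request to the cluster is left out, and the analyzer name is whatever the index defines.

```python
def analyze_request_body(text, analyzer):
    """Body for POST /<index>/_analyze, asking a named analyzer to tokenize text."""
    return {"analyzer": analyzer, "text": text}


def tokens_from_analyze_response(response):
    """Extract the token strings from an _analyze response."""
    return [entry["token"] for entry in response.get("tokens", [])]


def match_text_or_none(response):
    """Join the surviving tokens, or return None when the analyzer dropped them all.

    A None result signals that no match clause should be added, which is the
    case to handle when the input contains only stop words.
    """
    tokens = tokens_from_analyze_response(response)
    return " ".join(tokens) if tokens else None
```

This approach costs an extra round trip per query and still needs the empty-result case handled, which is part of why a local stop word list can be the simpler choice.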

XavierJp commented 3 days ago

> I also tried using the same stop words from the custom analyzer via the _analyze endpoint to analyze the input string. However, it feels overly complicated for this particular use case.
>
> Do you think relying on the stop word list is too simplistic and using the analyser is a better way?

I think this totally makes sense, and it is a good reason to code it the way you did. I approve this PR.