annuaire-entreprises-data-gouv-fr / search-api

[Search] Pagination error #298

Open XavierJp opened 1 year ago

XavierJp commented 1 year ago

Small glitch in the pagination: https://recherche-entreprises.api.gouv.fr/search?q=renners&page=500 -> tells me there are 501 pages, but: https://recherche-entreprises.api.gouv.fr/search?q=renners&page=501 -> is empty

By contrast: https://recherche-entreprises.api.gouv.fr/search?q=ganymede&page=11 -> we do get total_pages = 11, and the 11th page is not empty
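
One way such an off-by-one page can appear, sketched under assumptions: if total_pages is computed from an approximate result count, a small overestimate is enough to add a phantom last page. The page size of 10 and both counts below are hypothetical values chosen for illustration, not figures taken from the API.

```python
import math


def total_pages(total_results: int, per_page: int = 10) -> int:
    """Number of pages needed to show total_results items (at least 1)."""
    return max(1, math.ceil(total_results / per_page))


# Hypothetical numbers: the real count fits exactly in 500 pages,
# but the estimate is 4 results too high.
exact_count = 5000
estimated_count = 5004

print(total_pages(exact_count))      # 500
print(total_pages(estimated_count))  # 501, so page 501 is advertised but empty
```

For a small result set the estimate is usually exact, which would be consistent with the 11-page query behaving correctly.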

HAEKADI commented 1 year ago

The cardinality aggregation can be resource-intensive, especially on a large dataset. For performance reasons, Elasticsearch does not count distinct values exactly: it estimates the cardinality with a probabilistic algorithm (HyperLogLog++), so the result can be approximate. To get a more accurate count, you can increase the precision_threshold parameter of the cardinality aggregation (the default is 3000), but this may also increase resource usage. https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#_counts_are_approximate
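
For reference, a minimal sketch of what a request body with a raised precision_threshold could look like. The field name "siren", the aggregation name "distinct_count" and the helper function are hypothetical; the actual query used by this project may differ.

```python
import json

# Elasticsearch documents 40000 as the maximum supported value;
# higher thresholds behave like 40000.
MAX_PRECISION_THRESHOLD = 40000


def build_count_aggregation(field: str, precision_threshold: int = 3000) -> dict:
    """Build a search body that only returns an approximate distinct count.

    `field` is a placeholder for whichever field identifies a result.
    """
    threshold = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "distinct_count": {  # hypothetical aggregation name
                "cardinality": {
                    "field": field,
                    "precision_threshold": threshold,
                }
            }
        },
    }


body = build_count_aggregation("siren", precision_threshold=10000)
print(json.dumps(body, indent=2))
```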

Tested with precision_threshold=10000 and the results are more accurate.

It is however not very clear how this affects performance. For now, it will not be implemented, pending more testing.

MKCG commented 1 year ago

We need to add monitoring for the following metrics to know whether raising the threshold is viable or not:

It will mostly depend on the number of concurrent requests, but it should not be a problem. It means that 10 KiB will be allocated for the HyperLogLog++ counters instead of 3 KiB.
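
As a rough sanity check on the memory cost, here is a sketch that assumes the rule of thumb given in the Elasticsearch documentation for this aggregation (about precision_threshold * 8 bytes per aggregation). That rule yields somewhat larger figures than the ones quoted in this comment, but they remain small. The figure of 100 concurrent requests is purely illustrative.

```python
BYTES_PER_THRESHOLD_UNIT = 8  # rule of thumb from the Elasticsearch docs


def hll_memory_bytes(precision_threshold: int, concurrent_requests: int = 1) -> int:
    """Approximate memory used by cardinality counters across concurrent requests."""
    return precision_threshold * BYTES_PER_THRESHOLD_UNIT * concurrent_requests


for threshold in (3000, 10000):
    single = hll_memory_bytes(threshold)
    busy = hll_memory_bytes(threshold, concurrent_requests=100)  # illustrative load
    print(f"threshold={threshold}: {single} bytes per request, {busy} bytes for 100 requests")
```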