Open XavierJp opened 1 year ago
The cardinality aggregation can be resource-intensive, especially on a large dataset. For performance reasons, Elasticsearch estimates cardinality with the HyperLogLog++ algorithm rather than counting exactly, so results are approximate. To get a more accurate count, you can increase the precision_threshold parameter of the cardinality aggregation (default is 3000, maximum 40000), but this also increases memory usage.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#_counts_are_approximate
Tested with precision_threshold=10000 and the results are more accurate.
However, the performance impact is not yet clear. For now, this will not be implemented, pending more testing.
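For reference, here is a minimal sketch of what a request body with a raised precision_threshold could look like. The helper name, the aggregation name `distinct_count` and the field name `siren` are assumptions made for illustration, not taken from the project's code. The 40000 cap mirrors the documented maximum, above which higher values have no additional effect.

```python
import json

MAX_PRECISION_THRESHOLD = 40000  # documented upper bound for precision_threshold


def build_cardinality_body(field: str, precision_threshold: int = 3000) -> dict:
    """Build a search body that only returns an approximate distinct count.

    `field` and the aggregation name are hypothetical; adapt them to the index.
    """
    # Values above the maximum behave like the maximum, so clamp explicitly.
    threshold = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "distinct_count": {
                "cardinality": {
                    "field": field,
                    "precision_threshold": threshold,
                }
            }
        },
    }


body = build_cardinality_body("siren", precision_threshold=10000)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed as the body of a search request by whichever client the project already uses.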
We need to add monitoring of the following metrics to determine whether increasing the threshold is viable:
It will mostly depend on the number of parallel requests, but it should not be a problem. According to the Elasticsearch documentation, a cardinality aggregation uses about precision_threshold * 8 bytes, so roughly 80 KB would be allocated for the HyperLogLog++ counters instead of roughly 24 KB.
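To get a feel for the order of magnitude, the documented approximation of about precision_threshold * 8 bytes per cardinality aggregation can be worked through as below. The number of concurrent requests is a made-up figure for illustration, and the real footprint also depends on factors not modeled here, such as the number of shards.

```python
BYTES_PER_THRESHOLD_UNIT = 8  # documented approximation: about c * 8 bytes


def estimated_hll_bytes(precision_threshold: int, concurrent_requests: int = 1) -> int:
    """Rough memory estimate for HyperLogLog++ counters, in bytes."""
    return precision_threshold * BYTES_PER_THRESHOLD_UNIT * concurrent_requests


# Hypothetical load: 100 searches running at the same time.
for threshold in (3000, 10000):
    single = estimated_hll_bytes(threshold)
    under_load = estimated_hll_bytes(threshold, concurrent_requests=100)
    print(f"threshold={threshold}: {single} bytes each, {under_load} bytes for 100 requests")
```

Even under this assumed load the total stays in the low megabytes, which is consistent with the expectation that the increase should not be a problem.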
Small glitch in the pagination: https://recherche-entreprises.api.gouv.fr/search?q=renners&page=500 -> tells me there are 501 pages, but: https://recherche-entreprises.api.gouv.fr/search?q=renners&page=501 -> is empty
On the other hand: https://recherche-entreprises.api.gouv.fr/search?q=ganymede&page=11 -> we do get total_pages = 11, and the 11th page is not empty
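One plausible explanation, sketched below, is that total_pages is derived from the approximate count: a small overestimate is enough to announce one page too many. The formula, the page size of 10 and the result counts are assumptions for illustration, not the API's confirmed behaviour or real data.

```python
import math


def total_pages(total_results: int, per_page: int) -> int:
    """Number of pages needed to show total_results items."""
    return math.ceil(total_results / per_page)


per_page = 10            # assumed page size
exact_total = 5000       # hypothetical true number of results
estimated_total = 5003   # hypothetical overestimate from an approximate count

pages_from_exact = total_pages(exact_total, per_page)
pages_from_estimate = total_pages(estimated_total, per_page)

print(pages_from_exact)     # pages that actually contain results
print(pages_from_estimate)  # pages announced when the count is overestimated
```

With a small result set the estimate is close to exact, which would fit the observation that the ganymede query reports the right number of pages.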