Here's what we see in Elasticsearch when the query happens.
Jul 22 14:07:36 ip-10-1-1-164.us-west-2.compute.internal bash[5163]: automate-backend-elasticsearch.default(O): Caused by: org.elasticsearch.search.query.QueryPhaseExecutionException: Result window is too large, from + size must be less than or equal to: [10000] but was [80000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.
Yes, this is a known limitation: Elasticsearch only returns the first 10k results. For now, it would be great to cap the pagination, or disable it after 10k results.
I read through the link provided above hoping that setting index.max_result_window to a larger number would do the trick, but everything I found points to the scroll or search_after APIs instead. With those, page-based pagination would not work; the UI would be reduced to Prev/Next buttons, and hitting Next a few hundred times is not desirable.
I'm going to confer with the UI team on this.
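For context, this is roughly what a search_after request looks like when talking to Elasticsearch directly. This is only a sketch; the host/port, index name and sort field below are placeholders, not the actual Automate mapping. Each page repeats the same sort and passes along the sort values of the last hit from the previous page, which is why you can only step forward rather than jump to an arbitrary page:

# first page (sorted; ideally the sort also includes a unique tiebreaker field)
curl -k "https://localhost:9200/node-state/_search" -H 'Content-Type: application/json' -d '
{ "size": 100, "sort": [ { "name": "asc" } ] }'

# next page: same query, plus the sort value(s) of the last hit from the previous page
curl -k "https://localhost:9200/node-state/_search" -H 'Content-Type: application/json' -d '
{ "size": 100, "sort": [ { "name": "asc" } ], "search_after": [ "node-07342" ] }'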
As Jon has said, this is a known problem with Elasticsearch. A solution we were considering is to start pulling node information from Postgres, which would not have the 10,000-document limit. Issue #494 is where we started investigating this solution.
If I get a chance I may play around with index.max_result_window, but I suspect it's set that way to avoid keeping a huge result set in memory.
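For anyone who wants to experiment, the limit is a dynamic index setting that can be raised per index through the settings API. A minimal sketch, assuming the node data lives in an index called node-state (the host/port and index name are placeholders), and with the caveat above that larger windows cost heap and sort time:

curl -k -XPUT "https://localhost:9200/node-state/_settings" -H 'Content-Type: application/json' -d '
{ "index": { "max_result_window": 100000 } }'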
For now I'm going direct to ES using the scroll API for the info I need. It also keeps a result set around, but you have to specify exactly how long you need to keep it.
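Roughly what that looks like for anyone else going direct to ES; the scroll=1m keep-alive is where you say how long the result set should be kept around, and the host/port and index name are again placeholders:

# open a scroll context that is kept alive for one minute
curl -k "https://localhost:9200/node-state/_search?scroll=1m" -H 'Content-Type: application/json' -d '
{ "size": 1000, "query": { "match_all": {} } }'

# fetch each following batch using the _scroll_id returned by the previous call
curl -k "https://localhost:9200/_search/scroll" -H 'Content-Type: application/json' -d '
{ "scroll": "1m", "scroll_id": "<_scroll_id from the previous response>" }'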
Yeah. I've been thinking this over and I really don't know how to fix it from a UX standpoint. From a UX standpoint the answer is yes: we would love the pagination to go beyond 10k, and if you randomly choose a page, the correctly sequenced items should show. But if we redesigned around a prev/next paradigm, we would have to redesign all of Automate so the change is reflected across Client Runs, Compliance, etc.
Some philosophical questions have also come up about how we use Automate and whether pagination is even the right thing. For instance: why would we want to go to page 9,878 to see a row item? What is the use case for that, or is there another way to use this data? Those questions may also lead to a redesign of how we display the data.
So TL;DR, I hate to say it, but from a UX standpoint this is going on the back burner until either the tech changes or we come up with new designs that solve the bigger picture of why anyone needs to navigate to page 9,000+.
This work has been deprioritized, so closing.
Environment: Automate-installed Chef Server with OpenSearch
You have to do two things here to work around the issue without using the scroll or similar follow-up APIs (which Chef Server clients do not use at this time: knife search, tidy, count, status, and possibly others):
Describe the bug
When an A2 server has more than 10,000 nodes reporting in, you can only see the first 10,000 nodes in the client runs tab.
Moving to a page beyond 10,000 nodes (e.g. page 798) appears to work until you examine the nodes that are displayed; it turns out the UI sticks on the last page that was successfully retrieved.
In the JS console you can see JavaScript errors like this:-
polyfills.67cc802c4e03653dab28.js:1 GET https://tcate.test/api/v0/cfgmgmt/nodes?pagination.page=798&pagination.size=100&sorting.field=name&sorting.order=ASC 500
and an A2 log entry like this:-
Jul 20 07:34:41 ip-10-1-1-179.us-west-2.compute.internal hab[22795]: automate-load-balancer.default(O): - [20/Jul/2019:07:34:41 +0000] "GET /api/v0/cfgmgmt/nodes?pagination.page=798&pagination.size=100&sorting.field=name&sorting.order=ASC HTTP/2.0" 500 "0.015" 250 "https://tcate.test/client-runs?page=798" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36" "10.1.1.179:2000" "500" "0.014" 121
The UI queries the /cfgmgmt/nodes endpoint, which in turn queries Elasticsearch with a from/size request. Elasticsearch in turn dies with {"error":"elastic: Error 500 (Internal Server Error): all shards failed [type=search_phase_execution_exception]","message":"elastic: Error 500 (Internal Server Error): all shards failed [type=search_phase_execution_exception]","code":13,"details":[]}
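The limit can be seen in isolation by sending an equivalent query straight to Elasticsearch. With a page size of 100, page 798 works out to an offset of roughly 79,700, far past the 10,000 window. A sketch, with the host/port, index name and sort field as placeholders rather than the real Automate mapping:

curl -k "https://localhost:9200/node-state/_search" -H 'Content-Type: application/json' -d '
{ "from": 79700, "size": 100, "sort": [ { "name": "asc" } ] }'
# fails with search_phase_execution_exception because from + size exceeds index.max_result_window (default 10,000)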
I believe the root cause is that Elasticsearch has a default of 10,000 for index.max_result_window, as described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
To Reproduce
Fill an A2 server with 80,000 nodes.
Then either browse to a page containing nodes beyond 10,000
OR run
curl -kL -H "api-token: $TOKEN" "https://$FQDN/api/v0/cfgmgmt/nodes?pagination.size=1000&pagination.page=11"
with a valid FQDN and TOKEN (it must be an admin token to have rights for that endpoint).
Expected behavior
I should be able to see all my nodes in the client runs tab.