Here's what we see in Elasticsearch when the query happens.
Jul 22 14:07:36 ip-10-1-1-164.us-west-2.compute.internal bash[5163]: automate-backend-elasticsearch.default(O): Caused by: org.elasticsearch.search.query.QueryPhaseExecutionException: Result window is too large, from + size must be less than or equal to: [10000] but was [80000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.
Yes, this is a known limitation: Elasticsearch only returns the first 10k results. For now, it would be great to cap the pagination, or disable it after 10k results.
I read through the link provided above hoping that setting index.max_result_window to a larger number would do the trick, but everything I found points to the scroll or search_after APIs instead. With those, page-based pagination would not work; the UI would be reduced to Prev/Next buttons, and hitting Next a few hundred times is not desirable.
I'm going to confer with the UI team on this.
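For context, this is roughly what a search_after request looks like when talking to Elasticsearch directly. This is only a sketch; the host/port, index name and sort field below are placeholders, not the actual Automate mapping. Each page repeats the same sort and passes along the sort values of the last hit from the previous page, which is why you can only step forward rather than jump to an arbitrary page:

# first page (sorted; ideally the sort also includes a unique tiebreaker field)
curl -k "https://localhost:9200/node-state/_search" -H 'Content-Type: application/json' -d '
{ "size": 100, "sort": [ { "name": "asc" } ] }'

# next page: same query, plus the sort value(s) of the last hit from the previous page
curl -k "https://localhost:9200/node-state/_search" -H 'Content-Type: application/json' -d '
{ "size": 100, "sort": [ { "name": "asc" } ], "search_after": [ "node-07342" ] }'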
As Jon has said, this is a known problem with Elasticsearch. A solution we were considering is to start pulling node information from Postgres, which would not have the 10,000-document limit. Issue #494 is where we started investigating this solution.
If I get a chance I may play around with index.max_result_window, but I suspect it's set that way to avoid keeping a huge result set in memory.
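For anyone who wants to experiment, the limit is a dynamic index setting that can be raised per index through the settings API. A minimal sketch, assuming the node data lives in an index called node-state (the host/port and index name are placeholders), and with the caveat above that larger windows cost heap and sort time:

curl -k -XPUT "https://localhost:9200/node-state/_settings" -H 'Content-Type: application/json' -d '
{ "index": { "max_result_window": 100000 } }'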
For now I'm going direct to ES using the scroll API for the info I need. It also keeps a result set around, but you have to specify exactly how long you need to keep it.
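Roughly what that looks like for anyone else going direct to ES; the scroll=1m keep-alive is where you say how long the result set should be kept around, and the host/port and index name are again placeholders:

# open a scroll context that is kept alive for one minute
curl -k "https://localhost:9200/node-state/_search?scroll=1m" -H 'Content-Type: application/json' -d '
{ "size": 1000, "query": { "match_all": {} } }'

# fetch each following batch using the _scroll_id returned by the previous call
curl -k "https://localhost:9200/_search/scroll" -H 'Content-Type: application/json' -d '
{ "scroll": "1m", "scroll_id": "<_scroll_id from the previous response>" }'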
Yeah. I've been thinking this over and I really don't know how to fix it from a UX standpoint. From a UX standpoint the answer is yes: we would love the pagination to go beyond 10k, and if you randomly choose a page, the correctly sequenced items should show. But if we redesigned around a prev/next paradigm, we would have to redesign all of Automate so the change is reflected across Client Runs, Compliance, etc.
Some philosophical questions have also come up about how we use Automate and whether pagination is even the right thing. For instance: why would we want to go to page 9,878 to see a row item? What is the use case for that, or is there another way to use this data? Those questions may also lead to a redesign of how we display the data.
So TL;DR, I hate to say it, but from a UX standpoint this is going on the back burner until either the tech changes or we come up with new designs that solve the bigger picture of why anyone needs to navigate to page 9,000+.
This work has been deprioritized, so closing.
Environment: Automate-installed Chef Server with OpenSearch
You have to do two things here to work around the issue without using the scroll or similar follow-up APIs (which Chef Server clients do not use at this time: knife search, tidy, count, status, and possibly others):
Describe the bug
When an A2 server has more than 10,000 nodes reporting in, you can only see the first 10,000 nodes in the client runs tab.
Moving to a page beyond 10,000 nodes (e.g. page 798) appears to work until you examine the nodes that are displayed; it turns out the UI sticks on the last page that was successfully retrieved.
In the JS console you can see JavaScript errors like this:-
polyfills.67cc802c4e03653dab28.js:1 GET https://tcate.test/api/v0/cfgmgmt/nodes?pagination.page=798&pagination.size=100&sorting.field=name&sorting.order=ASC 500
and an A2 log entry like this:-
Jul 20 07:34:41 ip-10-1-1-179.us-west-2.compute.internal hab[22795]: automate-load-balancer.default(O): - [20/Jul/2019:07:34:41 +0000] "GET /api/v0/cfgmgmt/nodes?pagination.page=798&pagination.size=100&sorting.field=name&sorting.order=ASC HTTP/2.0" 500 "0.015" 250 "https://tcate.test/client-runs?page=798" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36" "10.1.1.179:2000" "500" "0.014" 121
The UI queries the /cfgmgmt/nodes endpoint, which in turn queries Elasticsearch with a from/size request. Elasticsearch in turn dies with {"error":"elastic: Error 500 (Internal Server Error): all shards failed [type=search_phase_execution_exception]","message":"elastic: Error 500 (Internal Server Error): all shards failed [type=search_phase_execution_exception]","code":13,"details":[]}
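The limit can be seen in isolation by sending an equivalent query straight to Elasticsearch. With a page size of 100, page 798 works out to an offset of roughly 79,700, far past the 10,000 window. A sketch, with the host/port, index name and sort field as placeholders rather than the real Automate mapping:

curl -k "https://localhost:9200/node-state/_search" -H 'Content-Type: application/json' -d '
{ "from": 79700, "size": 100, "sort": [ { "name": "asc" } ] }'
# fails with search_phase_execution_exception because from + size exceeds index.max_result_window (default 10,000)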
I believe the root cause is that Elasticsearch has a default of 10,000 for index.max_result_window, as described here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
To Reproduce
Fill an A2 server with 80,000 nodes.
Then either browse to a page containing nodes beyond 10,000
OR run
curl -kL -H "api-token: $TOKEN" "https://$FQDN/api/v0/cfgmgmt/nodes?pagination.size=1000&pagination.page=11"
with a valid FQDN and TOKEN (it must be an admin token to have rights for that endpoint).
Expected behavior
I should be able to see all my nodes in the client runs tab.