Option to disable reduce/gather phase on client nodes

chendo commented 11 years ago

We would like an option to have client nodes only do the smart load balancing, not reduce/gathering.

We run client nodes on each app server so app servers only talk to ES via the client node on localhost, so we avoid having to use a HTTP load balancer (since it's a single point of failure), and also avoid the extra hop.

An option to be able to turn off the reduce/gather phase for a client node would let us decrease memory usage and CPU burn.

Thanks!

jprante commented 11 years ago

You can setup nodes with HTTP for the networking load blancing from outside, while the other nodes don't need to have HTTP.

Other nodes without HTTP and no data can act as pass-through nodes. If you connect a transport client to such a node, you force Elasticsearch to route all the requests and response for the transport client via the pass-through node. The pass-through node will have to reduce/gather the results for the transport client. With a number of parallel http-less and data-less nodes, you are close to what you want.

From what I understand, the reduce/gathering should be performed as close as possible to the requesting client, otherwise responses can't be delivered back straight to the requester. Otherwise, it is open to me what the advantage is when nodes must be elected to perform reduce/gathering for other nodes just to pass the result back to them.

As a long term perspective, it could be viable to introduce a chunked stream response protocol between the nodes and the requesting client. This would ease some of the memory peaks of gathered large responses.

chendo commented 11 years ago

I'm not sure if I understand how this will actually work or if it applies to our scenario. How would I force the HTTP enabled nodes to route through the pass-through/client nodes?

Our goal is to avoid SPOF with regards to ES as well as try to keep non-app server specific load off our app servers, and we don't want to either use a HTTP load balancer because it's not as smart as an cluster-aware client. We have seen memory pressure from our client nodes previously and didn't understand why until I found out they were also doing the reduction step.

I'm open to suggestions that satisfy our goals, but I can see a need for cluster-aware routing with minimal CPU/memory usage.

jprante commented 11 years ago

I'm not completely sure if my suggestions fit to your environment, but I assume you are using Java app servers. With Java app servers, you have the TransportClient available. Other app servers would rely on the HTTP REST API.

The Java app server scenario is as follows. Configure some pass-thru nodes (without HTTP and without data, they also need a parameter they not become master). In each of your Java app instance, fire up a singleton TransportClient with a list of the pass-through node IPs. As a result, they only connect to those nodes and execute remotely on those nodes the Elasticsearch actions with gather/reduce load. The TransportClient manages the failover between the pass-through nodes. At least one of them must always be up, so monitor them.

The non-Java app server scenario is a little bit more challenging, since there is no automatic failover provided by Elasticsearch. A best practice is to start HTTP-enabled and data-less Elasticsearch nodes on each non-Java app server. The non-Java app connects to the Elasticsearch node at localhost, and delegates the gather/reduce and ES failover management to it.

I assume this is the case you do not prefer, the sharing of a Java Elasticsearch node and the non-Java app node on the same machine. In case of this, you would have to assign each non-Java app server node to another corresponding Elasticsearch HTTP data-less node somewhere on the local network it can connect to, and you should monitor all those pairs for connection failures. Another option is to prepare a round-robin fashion selection or a kind of a random selection of one of the HTTP-enabled data-less Elasticsearch nodes from your non-Java app instances for yourself, from a pre-configured IP list. I know, the Perl client for example comes with such an ability.

chendo commented 11 years ago

Our stack is Ruby on Rails, so we access ES over the HTTP REST interface. We're already running client nodes locally on each app server, and we're okay with that because having a client node on another host is not something we want to have to do because it's not flexible.

Round-robin is also something we wouldn't want as that involves static IP lists, and if a node goes down, the entire stack would eventually hit the missing node, and then we have to deal with HTTP timeout and mark it as down etc, so not as intelligent as a client node.

We're okay with running a local client node on the box assuming it did the least amount of work as possible, and this is why we would like the option to disable a client node from performing the reduce step.

clintongormley commented 10 years ago

The transport client doesn't do reduce, but does do sniffing and round robin, as does the ruby client. Closing

elastic / elasticsearch

Option to disable reduce/gather phase on client nodes #2448