datasalt / splout-db

A web-latency SQL spout for Hadoop.

Connections cache between QNodes and DNodes #5

Closed ivanprado closed 11 years ago

ivanprado commented 11 years ago

Currently, each time a new request is performed against a QNode, a new connection is opened with the DNode. Although this is good because it allows for flexibility, it has a drawback when many requests are performed per second. At these rates, many connections remain in the TIME_WAIT state (http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html) until a limit is reached and errors are returned.

Having a fixed pool of connections is not possible because it would create too many connections between DNodes and QNodes: (pool_size * # DNodes). Maybe it would be possible if NIO were used... but that is not clear.

Something that would alleviate the problem is a cache of connections, so that connections can be reused for some seconds after they were created. That is, connections are not closed immediately; they are closed a few seconds after the initiating request was performed. Some characteristics of this cache are:
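As an illustration of the delayed-close idea only, here is a minimal Java sketch. It is not Splout's code: the class name `ExpiringConnectionCache`, the `opener` function, and the time-to-live parameter are all hypothetical, and the caller passes the current time explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: keeps idle connections per DNode for a short time instead of closing them. */
class ExpiringConnectionCache<C extends AutoCloseable> {

    /** An idle connection and the time it was handed back. */
    private static final class Idle<T> {
        final T conn;
        final long since;
        Idle(T conn, long since) { this.conn = conn; this.since = since; }
    }

    private final Map<String, Deque<Idle<C>>> idle = new HashMap<>();
    private final Function<String, C> opener; // assumption: opens a new connection to a DNode address
    private final long ttlMillis;             // how long an idle connection may be reused

    ExpiringConnectionCache(Function<String, C> opener, long ttlMillis) {
        this.opener = opener;
        this.ttlMillis = ttlMillis;
    }

    /** Reuses a fresh idle connection if one exists, otherwise opens a new one. */
    synchronized C borrow(String dnode, long nowMillis) {
        Deque<Idle<C>> q = idle.get(dnode);
        while (q != null && !q.isEmpty()) {
            Idle<C> e = q.pollFirst(); // most recently released first
            if (nowMillis - e.since <= ttlMillis) {
                return e.conn;
            }
            closeQuietly(e.conn); // too old: close it and keep looking
        }
        return opener.apply(dnode);
    }

    /** Hands a connection back instead of closing it right away. */
    synchronized void release(String dnode, C conn, long nowMillis) {
        idle.computeIfAbsent(dnode, k -> new ArrayDeque<>()).addFirst(new Idle<>(conn, nowMillis));
    }

    /** Closes idle connections older than the time-to-live; meant to be called periodically. */
    synchronized void evictExpired(long nowMillis) {
        for (Deque<Idle<C>> q : idle.values()) {
            // the oldest entries sit at the tail because release() adds at the head
            while (!q.isEmpty() && nowMillis - q.peekLast().since > ttlMillis) {
                closeQuietly(q.pollLast().conn);
            }
        }
    }

    private static void closeQuietly(AutoCloseable c) {
        try {
            c.close();
        } catch (Exception ignored) {
            // nothing useful can be done with a failure while closing
        }
    }
}
```

In this sketch a connection released at time t can be borrowed again until t + ttlMillis; after that it is closed on the next `borrow` or `evictExpired` call. Nothing here bounds the number of idle connections, so a real cache would also need a size limit.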

pereferrera commented 11 years ago

I don't see why having a small cache would solve the issue at all. It would alleviate it, but under high load the issue would be exactly the same. Only 10 threads could reuse connections, the rest would be creating and destroying connections.

There is always the possibility of changing the operating system's TIME_WAIT timeout, but that is not really recommended.

Probably the only real solution to this is for QNodes to have exactly ONE connection to each DNode. I think that can be accomplished by using a non-blocking Thrift client, and a non-blocking Thrift server. The connection would be held by a single thread that would poll from a blocking queue.
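To make the "single thread polling from a blocking queue" part concrete, here is a rough Java sketch. It is an assumption-laden illustration, not the Thrift-based implementation: `SingleConnectionDispatcher` is a hypothetical name, and the `connection` function stands in for whatever call would actually go over the one open connection to a DNode.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

/** Hypothetical sketch: one worker thread owns the single connection to a DNode. */
class SingleConnectionDispatcher implements AutoCloseable {

    /** A queued query plus the future its caller waits on. */
    private static final class Pending {
        final String query;
        final CompletableFuture<String> result = new CompletableFuture<>();
        Pending(String query) { this.query = query; }
    }

    private final BlockingQueue<Pending> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    /** `connection` is a stand-in for the real client call on the one open connection. */
    SingleConnectionDispatcher(Function<String, String> connection) {
        worker = new Thread(() -> {
            try {
                while (true) {
                    Pending p = queue.take(); // blocks until a request arrives
                    try {
                        p.result.complete(connection.apply(p.query));
                    } catch (RuntimeException e) {
                        p.result.completeExceptionally(e);
                    }
                }
            } catch (InterruptedException e) {
                // close() interrupts the worker; leave the loop and exit
            }
        }, "dnode-connection");
        worker.setDaemon(true);
        worker.start();
    }

    /** Called from any request thread; it never opens a new connection. */
    CompletableFuture<String> submit(String query) {
        Pending p = new Pending(query);
        queue.add(p);
        return p.result;
    }

    @Override
    public void close() {
        worker.interrupt();
    }
}
```

Note that this sketch handles one request at a time per DNode, so a slow query delays everything queued behind it. A non-blocking client that can keep several requests in flight on the same connection, as proposed above, would avoid that.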

This article explains (using a very simple example) how this can be accomplished: http://joelpm.com/2009/04/03/thrift-bidirectional-async-rpc.html

However:

I'm still investigating it.

pereferrera commented 11 years ago

I'm closing this issue for now after c501d36.