dmfay / massive-js

A data mapper for Node.js and PostgreSQL.

Connection Pool? #195

Closed rdegges closed 8 years ago

rdegges commented 8 years ago

Heyo!

Sorry if this is a stupid question (I'm a new massive user), but is there a way I can specify connection pool settings when initializing my massive db?

My use case is that I'm going to be executing a lot of concurrent queries, and want to have a connection pool of about 10 connections for each Node process to keep things fast. Is this possible?

robconery commented 8 years ago

Wow this is an interesting question! First - the driver supports connection pools but I disconnected it because I couldn't get it to work reliably. If you felt like dabbling... I'd love a PR!

Second - concurrency in Node is a fascinating thing. I don't believe you can expect concurrency just because you're using a pool - you're still relying on the event loop, and queries will still happen serially.

What I might recommend is checking out Kue from @tj - it uses Redis and can do parallel execution.

Anyway - sorry for the drab reply - I've just been through the gauntlet on this one :)

rcyrus commented 8 years ago

I had the same question, and looking through the code it looks like massivejs is using the 'correct' way to connect to the PG client, which will create a connection pool on its own.

however @robconery you say you have disabled connection pooling? So are you saying for every postgres query massivejs is making a whole new connection? Or are you saying something else?

I'd like to have a persistent connection, and DB connection pools are a pretty standard thing. Can you point me or any other reader to whatever discourse you're referring to when you say you've 'been through the gauntlet'? I'm curious as to what is going on here. This seems like a cool library, but if it can't do basic pooling then I can't use it.

xivSolutions commented 8 years ago

@rcyrus - It is my understanding that each client in the pg pool represents an open connection, and the driver maintains a pool of clients.

Earlier on, we had an issue where we were passing a "truthy" value to the done() callback which was improperly disposing of clients. See this discussion, near the end (it's long...), and this commit.

So far as I understand things, Massive maintains a client pool (representing open connections) just fine...we might want to investigate if the configs for these can be passed as configuration items in Massive, but according to the first para on the PG driver README, they are configurable.

If you want to look into the configuration for these and shoot a PR, awesome. I can also try to look into it. In any case, they should be options, but the defaults should stay if no explicit config params are passed.
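To make the done()/disposal behavior above concrete, here's a toy sketch (not node-pg's actual code, just an illustration) of the checkout/release cycle: releasing with a truthy value destroys the client instead of returning it, which is how a pool can slowly drain.

```javascript
// Toy illustration of pool checkout/release semantics. A release with a
// truthy "destroy" flag throws the client away instead of reusing it.
class ToyPool {
  constructor(size) {
    this.size = size;          // max idle clients to keep around
    this.idle = [];            // clients waiting to be reused
    this.open = 0;             // total clients currently alive
  }
  connect() {
    if (this.idle.length > 0) return this.idle.pop(); // reuse an idle client
    this.open += 1;
    return { id: this.open };  // stand-in for a real connection
  }
  release(client, destroy) {
    if (destroy || this.idle.length >= this.size) {
      this.open -= 1;          // throw the client away
    } else {
      this.idle.push(client);  // back into the pool for reuse
    }
  }
}

const pool = new ToyPool(10);
const a = pool.connect();
pool.release(a);               // done() with no args: client is reused
const b = pool.connect();
console.log(a === b);          // true - the same client came back
pool.release(b, true);         // done(truthy): client is destroyed
console.log(pool.idle.length); // 0 - nothing left to reuse
```

This is why passing a "truthy" value to done() on every query was quietly disposing of clients rather than recycling them.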

xivSolutions commented 8 years ago

@rcyrus - Looks like some decent info HERE on configuring the pool size and such. I think it's just a matter of mapping some configs onto the Massive config object, and doing so cleanly. @robconery likes sensible defaults (the 80/20 rule applies here fer sher), and becomes angry when the API gets dirty or complicated ;-)

robconery commented 8 years ago

Looks like I'm wrong - it's back in there: https://github.com/robconery/massive-js/blob/master/lib/runner.js#L40

Sorry about that. As Jon mentions - we had a problem (from yours truly) because I read an old README that ... well let's say it maxed the pool out in our testing and I thought it was broken :).

The "been through the gauntlet" thing was not about Massive - it's about a pooling discussion I'm having with Moebius (an Elixir port of Massive). Essentially it was about pooling and the need for it, given Erlang.

In addition - we had a rather lengthy discussion about "parallel" vs. "async" vs. "concurrent" when it comes to connections and pooling. A fascinating discussion - broke my brain. Pooling and concurrency have nothing to do with each other - a pool just makes things a bit faster. In fact (as was drilled into my head) - having a pool while trying to do concurrent queries involving transactions spells massive disaster in systems that aren't single-threaded like Node.

Anyway - hopefully that clears up my thoughts on this :) and I hope I didn't scare off @rdegges :).

rcyrus commented 8 years ago

Cool, thanks for the information!

rdegges commented 8 years ago

I'm back from traveling! Yey!

So -- just to make sure I'm understanding -- Massive does not (yet?) support pooling connections? Or am I missing something?

The way it should work, I believe, is like so:

This is how I'd sort of expect it to work as a developer using this thing.

The idea would then be that I know how to linearly scale my application logic in a fast way:

Does the above stuff make sense? And am I understanding that this is NOT what massive currently does? Just a little confused by the conversation above =)

robconery commented 8 years ago

No, not correct :). We use connection pooling as provided by node-pg. Let's move on to the other things now...

> While Node is waiting for a response, it will simply continue executing other instructions -- these other instructions may be other queries that are finished and can now be used via callbacks, or they may be simple instructions: add / subtract / create a variable / whatever.

Node is single-threaded, and IO happens using the Event Loop - executing one thing at a time. There is no concurrency here, no parallel execution. For that you need some kind of messaging that operates outside of Node's single thread.

So, in short - no. You will not ever have parallel execution. You can't unless you have multiple Node processes.

> If you've already got 20 queries in flight, waiting for Postgres to finish generating a response, and attempt to do ANOTHER query, then Massive will open a single additional NEW connection for that query alone, execute it, then close that additional connection once that query has been completed, while NOT affecting the other 20 connections in the database pool.

No. For all the reasons stated above. You cannot execute parallel queries in Node unless you have more than one Node process.

> This means I have a total of 10 * 4 * 10 = 400 open DB connections at any given moment. Assuming my application load is consistent and queries are identical, I'd have fairly consistent throughput without a lot of network overhead from opening / closing DB connections all the time.

Your math is correct, but I think there's a lot more to consider here. Before I get to that - Massive is a query tool; it uses node-pg (and its connection pools) to execute queries.

Connection pools have nothing to do with concurrency. It's all about speed and, unfortunately, is basically (at the end of the day) caching. Caching is hard. I would never trust a Node module to effectively pool 400 connections to my database.

If this is truly what you need, have a look at pg_bouncer. It's much better at pooling connections for high horsepower stuff.

Better yet, have a look at TJ's Kue project. This will do what you want (spreading the job load concurrently) and yes, it can use Massive in exactly the way you want without all the extra Node clusters. This is by far the better choice.

rdegges commented 8 years ago

Hmm. I don't think this makes sense -- I'm not an expert w/ Node by any means, but you can most definitely have asynchronous IO in the event loop, correct? This means that you can fire off multiple queries one-after-another, and have them all get processed by Postgres while your Node process is doing other things, until the response is retrieved and then the event loop hops back to synchronously execute your callback code, right?

So, let's say I write some code like this:

var pg = require('pg');
var conString = "postgres://username:password@localhost/database";

//this initializes a connection pool
//it will keep idle connections open for a (configurable) 30 seconds
//and set a limit of 20 (also configurable)
pg.connect(conString, function(err, client, done) {
  if(err) {
    return console.error('error fetching client from pool', err);
  }
  // here we'll execute a fast query
  client.query('SELECT $1::int AS number', ['1'], function handler1(err, result) {
    if(err) {
      return console.error('error running query', err);
    }
    console.log(result.rows[0].number);
    //output: 1
  });
  // and a slow one on the same client; queries on one client are queued
  client.query('SELECT * from some_enormous_table_with_a_zillion_rows', function handler2(err, result) {
    // do stuff with the data, then call `done()` exactly once -
    // after the last query - to release the client back to the pool
    done();
  });
});

This is based on some code from the node-pg library docs.

My understanding is that:

Because of the above, it means that at some point in time, BOTH of those queries will be processed by Postgres at the same time -- even though they were executed one-at-a-time by my single threaded Node process.

Is that correct?

robconery commented 8 years ago

Yeah - no. Node will lob the calls onto the Event Loop - this is a CPU-bound process so Node is free to do that as fast as it likes and it won't wait for a response. It's the queuing that is asynchronous.

Now, imagine that your first query, for some reason, takes longer than your second. Would you expect that result to come back first? If so - how would you pick that result up?

This is called a race condition and Node cleanly (and happily) avoids all the problems of parallel/concurrent programming by being single threaded. This means everything happens in order - first in, first out.

It's confusing - Node is indeed asynchronous but that is not the same as parallel. For that to happen, by definition, you need more than one thread/process executing.

So no. You can't do parallel processing with Node.

robconery commented 8 years ago

Seriously - have a look at TJ's Kue project. It will do what you need with a lot less pain, I promise :). It even does retries and allows you to batch things...

rdegges commented 8 years ago

Will do! Thanks for the info. This is super helpful, heh.

I appreciate you taking the time to reply ^^

robconery commented 8 years ago

Hey no prob - this is something I just went through with Elixir (and what started the confusion above); it's really, really confusing stuff (for me, anyway). Let me know if you find out something different or if I didn't grasp what you were trying to say :). This is a great topic.

tdzienniak commented 8 years ago

@robconery you are not exactly right, or I misunderstood you. Node can't do things in parallel, but libuv, the C++ library Node uses to perform asynchronous I/O, can - for example using its thread pool. So, I think it is possible to send a few queries one after another to Postgres, then the db can execute them how it likes (maybe all at the same time), and then process the results in Node. Of course, results will be processed one at a time. Interesting article 'bout the thread pool: https://www.future-processing.pl/blog/on-problems-with-threads-in-node-js/

robconery commented 8 years ago

@tdzienniak What am I not right about exactly? Node is single-threaded. Yes, libuv has a thread pool - but that still won't make Node process things in parallel. Given this single thread, how would you tell the EventLoop "make it parallel"? Even if you did make it parallel (which I'll bet there's a way to do it with multiple Node processes) - what would that gain you?

I appreciate where you're going with your response - but the entire communication chain with Node/EventLoop is asynchronous, yes, but not parallel. In any processing loop - if one element is synchronous, it's all synchronous - it has to be.

The only way to effectively parallel process a set of queries is to have parallel processes. A message queue is very good for this, and works well.

timruffles commented 8 years ago

@robconery Like so often, I think there's a confusion about semantics here. Nobody is claiming Node can run bits of JS simultaneously (parallel), but it can certainly have multiple outstanding callbacks queued (concurrent), and they'll be dealt with as they come in.

While we can't do parallel processing of queries (e.g. munging them with JS using > 1 CPU in one process), it's certainly possible and desirable to make concurrent queries (e.g. multiple outstanding queries being sent to be handled by the DB).

var db = require("./knex");

db.raw("select 1 as qid, NOW() as started_at, pg_sleep(5)")
.then(format)

db.raw("select 2 as qid, NOW() as started_at, pg_sleep(2)")
.then(format)

db.raw("select 3 as qid, NOW() as started_at, pg_sleep(3)")
.then(format)

function format(resp) {
  const r = resp.rows[0];
  console.log(`query ${r.qid} started at ${r.started_at}, returned at ${new Date}`);
}
Output:

query 2 started at Thu Mar 03 2016 14:00:18 GMT+0000 (GMT), returned at Thu Mar 03 2016 14:00:20 GMT+0000 (GMT)
query 3 started at Thu Mar 03 2016 14:00:18 GMT+0000 (GMT), returned at Thu Mar 03 2016 14:00:21 GMT+0000 (GMT)
query 1 started at Thu Mar 03 2016 14:00:18 GMT+0000 (GMT), returned at Thu Mar 03 2016 14:00:23 GMT+0000 (GMT)
tdzienniak commented 8 years ago

@timruffles that's exactly what I was writing about. Semantics are the problem here indeed.

timruffles commented 8 years ago

@robconery what issues did you hit with connection pooling? It's been perfectly reliable for me across multiple deploys, and is always important for perf

robconery commented 8 years ago

@timruffles In the beginning we had implemented pooling the way the docs said to do it (by calling done(true)), but that, for some reason, ended up with us soaking up the connection pool - which seemed odd. We dug in a bit and couldn't figure out the problem, so I figured it was a problem with the way pooling was handled.

One of our contributors figured out that the passing in of the boolean was causing the issue - which seemed a weird solution but, indeed, that was it. So we put it back in.

robconery commented 8 years ago

Oh - @timruffles that's an excellent point (and example) of queries overlapping in Postgres. I do agree semantics are an issue :). Your example illustrates nicely how operations can overlap - yet each result is queued in the order received, and each result is returned when completed in a serial way.

The net result here is, essentially, a set of concurrent queries. I will still maintain that Node doesn't do parallel execution - but I'll capitulate that your example is pretty damn close :).
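The overlap in that example doesn't even need a database to demonstrate: here's a self-contained stand-in where timers play the role of pg_sleep. Three "queries" are issued back to back on one thread; they overlap while pending and resolve in completion order, not issue order.

```javascript
// Stand-in for the pg_sleep example above: each fakeQuery "runs" for
// the given number of milliseconds, all three overlap, and the results
// arrive in the order the waits finish (2, then 3, then 1).
function fakeQuery(qid, ms) {
  return new Promise((resolve) => setTimeout(() => resolve(qid), ms));
}

const finished = [];
Promise.all([
  fakeQuery(1, 50).then((id) => finished.push(id)),
  fakeQuery(2, 10).then((id) => finished.push(id)),
  fakeQuery(3, 30).then((id) => finished.push(id)),
]).then(() => {
  console.log(finished.join(', ')); // 2, 3, 1 - completion order
});
```

One thread, three concurrent waits, zero parallel JavaScript - which is exactly the semantic distinction being drawn here.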

tamlyn commented 7 years ago

For anyone stumbling across this issue, massive inherits pg's default connection pool size which is 10. If you need to change it, just pass poolSize as an option:

const massive = require('massive')
const db = massive({poolSize: 20})