brianc / node-postgres

PostgreSQL client for node.js.
https://node-postgres.com
MIT License

pg queries not running in parallel in redshift #1621

Open achshar opened 6 years ago

achshar commented 6 years ago

Hello, I am using node-postgres to connect to my Redshift database (node 9.11 and 8.9). When I run my basic query sequentially, it only takes ~3-4 seconds per query. But when I run the queries in parallel, each one jumps to 12 seconds.

Somewhere, somehow, the parallel queries are still being run in sequence. I tried using Redshift queues (currently 3 queues with 20 concurrency each, so more than sufficient). Maybe someone here can help?

Here is the code

const express = require('express');
const {Pool} = require('pg')

const router = express.Router();
const pool = new Pool({
    user: 'asd',
    host: 'asd',
    password: 'asd',
    database: 'asd',
    port: 5439,
});

router.get('/test', async (req, res) => {

    const
        select = [],
        length = Math.floor(Math.random() * 200);

    for(let i = 0; i < length; i++)
        select.push(Math.floor(Math.random() * 1000));

    try {

        await pool.query(`select *, ${select.join(',')} from tablethattotallyexists`);
        res.send({length});
    } catch(e) {
        res.send(e);
    }
});
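One thing worth checking here: pg's Pool caps out at 10 clients by default, and each pool.query checks a client out for the duration of the query, so any queries beyond the cap wait in the pool's queue. A sketch (same placeholder credentials as above) raising the cap with the `max` option; the value 20 is an assumption, chosen to match the Redshift WLM concurrency mentioned earlier:

```javascript
const { Pool } = require('pg');

// Hypothetical credentials, as in the original snippet. `max` raises the
// pool's client limit (pg's default is 10) so more queries can be in
// flight against the server at once.
const pool = new Pool({
    user: 'asd',
    host: 'asd',
    password: 'asd',
    database: 'asd',
    port: 5439,
    max: 20, // assumption: tune to the WLM queue concurrency
});
```

If the pool is the bottleneck, queries past the tenth would serialize even though the server has capacity, which would look exactly like parallel requests taking longer than sequential ones.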

(All the extra random-number generation is there to force-disable the Redshift result cache. We were having some trouble with its session-level flag.)

And the output I get from my testing is this. Notice how the first 5 requests were sequential and took ~3 seconds each, whereas the later 5 were in parallel but took ~13 seconds each. We've tried almost everything; this is sort of a Hail Mary.

[screenshot: request timings]
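For clarity, "parallel" here means the requests (and their pool.query calls) are dispatched concurrently rather than awaited one at a time. A runnable sketch of the difference, using a hypothetical `fakeQuery` stand-in instead of a real pool.query:

```javascript
// `fakeQuery` stands in for pool.query and simply resolves after `ms` milliseconds.
const fakeQuery = (ms) => new Promise((resolve) => setTimeout(() => resolve(ms), ms));

// Awaits each query before starting the next: total time ≈ sum of the delays.
async function sequential(delays) {
    const out = [];
    for (const ms of delays) out.push(await fakeQuery(ms));
    return out;
}

// Dispatches all queries at once: total time ≈ the longest single delay.
function parallel(delays) {
    return Promise.all(delays.map(fakeQuery));
}
```

With three 50 ms delays, `sequential` takes roughly 150 ms while `parallel` takes roughly 50 ms; if the parallel version instead takes as long as the sum, something downstream (pool, parser, or server) is serializing the work.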

charmander commented 6 years ago

Try selecting only one small row that will still force the same set of rows to be retrieved – md5() of the concatenated values of the original result set, for example. That should help determine whether there’s a parsing (or even network?) bottleneck. (pg’s pure JavaScript implementation only has one thread available to it, so it can’t parse messages and results from multiple clients in parallel.)

achshar commented 6 years ago

> Try selecting only one small row that will still force the same set of rows to be retrieved – md5() of the concatenated values of the original result set, for example. That should help determine whether there’s a parsing (or even network?) bottleneck.

Can you elaborate on that a bit? I'm not sure I get it.

I'm now trying the native bindings. Maybe they will help.
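For anyone following along: pg re-exports the libpq-backed bindings under `.native` (after `npm install pg-native`), with the same Pool/Client API. A minimal sketch, reusing the placeholder credentials from the original snippet:

```javascript
// Requires `npm install pg-native`; pg exposes the bindings as `.native`.
const { native } = require('pg');

// Same placeholder connection options as in the original code.
const pool = new native.Pool({
    user: 'asd',
    host: 'asd',
    password: 'asd',
    database: 'asd',
    port: 5439,
});

pool.query('SELECT 1 AS ok')
    .then((res) => console.log(res.rows));
```

The native bindings move protocol parsing out of the JavaScript thread, so they are a reasonable thing to try when the suspicion is a single-threaded parsing bottleneck, as charmander described above.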

charmander commented 6 years ago

If your original query was:

SELECT a, b, c FROM d

then a query that will use the same data but send only a small response is:

SELECT md5(string_agg(substr(md5(concat(a, b, c)), 1, 1), '')) FROM d

There’s probably a much simpler alternative, but I don’t know what Redshift optimizes compared to Postgres.

rfink commented 5 years ago

@achshar did you ever figure this out?