brianc / node-pg-query-stream

Query results from node-postgres as a readable (object) stream
MIT License

Doesn't work for me in Redshift. #48

Open jose920405 opened 6 years ago

jose920405 commented 6 years ago

I'm not sure of the reason, or whether query streams are simply not supported by Redshift. Any ideas?

My code:

const query = new QueryStream(info.sql);
const stream = client.query(query);

stream.on('end', () => {
  // never fires for large result sets
})

stream.on('error', (err) => {
  // nothing is emitted here before the crash
})

stream.on('data', (x) => {
  // never logs anything; see below
});

If my query (info.sql) returns 1,000,000 records, it doesn't work.

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

In this case it never logs anything inside the data listener.

It works perfectly when pointing to a local Postgres DB.

Maybe related to vitaly-t's question in https://github.com/brianc/node-pg-query-stream/issues/32.

Using:

"pg-query-stream": "1.1.1", node: 8.9.1

unlucio commented 6 years ago

I'm experiencing something similar.

I'm using

➜  ~ node --version
v10.12.0

and Redshift 1.0.4515

I'm trying to read a table with ~65 billion rows using this sample code:

const { Pool } = require('pg');
const config = require('./src/lib/config');
const credentials = config.get('warehousedb');
const QueryStream = require('pg-query-stream');
const JSONStream = require('JSONStream');

const pool = new Pool({
    user: credentials.user,
    host: credentials.host,
    database: credentials.name,
    password: credentials.pass,
    port: credentials.port
});

const query = 'select * from <table_name>';
const start = Date.now();

pool.connect().then(function(client) {
    const stream = client.query(new QueryStream(query));

    stream.on('error', (err) => {
        console.error(err);
        client.release();
        pool.end();
    });

    stream.on('end', () => {
        console.log('\n ==> pulling all data in took: ', ((Date.now() - start)/1000));
        client.release();
        // end the pool only after the stream finishes; the original
        // `.then(() => pool.end())` closed the pool right after connecting
        pool.end();
    });

    stream.pipe(JSONStream.stringify()).pipe(process.stdout);
});

But all it does is sit there for a few minutes and then crash, punching a hole in the heap 🤷‍♂️

For the sake of the experiment I tried --max-old-space-size= as well, but all I got was the script running longer before still blowing up at the end.

In all this, not a single line of data is sent to process.stdout, and the memory grows and grows, just as if something were actually buffering the whole result.

I ran a test with a LIMIT of 1 billion rows (both with default heap settings and with --max-old-space-size=4096): it ran to the end, with output, but it sat there until the query had actually completed before starting to give any output.

I ran a test with a LIMIT of 10 billion rows (--max-old-space-size=4096 => ok, default heap size => death): again it ran to the end, with output, but sat there until the query had actually completed before producing any output.

In both cases the memory grew by several GB despite the original data being only 1.5 GB total (yet another signal that something underneath is stuffing its memory like I'd stuff chocolate in my mouth :P).
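
(A hypothetical way to watch that growth, not part of the original test, is to log heap usage as rows arrive; `stream` here is the QueryStream from the snippet above:)

// hypothetical instrumentation, not from the original experiment
let rows = 0;
stream.on('data', () => {
  rows += 1;
  if (rows % 100000 === 0) {
    const mb = (process.memoryUsage().heapUsed / 1024 / 1024).toFixed(1);
    console.log(`rows: ${rows}, heapUsed: ${mb} MB`);
  }
});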

I can see how @jose920405 might link the JavaScript heap out of memory to what's described in #32, but I don't think that's really related in any meaningful way: if it were, I would still expect some output before the death, as pages come in, get divided into rows, and get pushed onto the stream. From what I'm observing, things seem to be stuck somewhere before the push point. It might very well be a Redshift limit (it does have some very surprising limits), but my limited low-level experience with both PG and Redshift prevents me from making any meaningful speculation.

rubenbarreiro commented 6 years ago

Could it be that node-pg-cursor is the one storing all results in memory before sending them to node-pg-query-stream? I'll test it when I get the chance...
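
A direct test along those lines could look like this sketch, assuming pg-cursor's read(n, cb) API; the table name and connection config are placeholders:

const { Client } = require('pg');
const Cursor = require('pg-cursor');

const client = new Client(/* connection config is a placeholder */);

client.connect().then(() => {
  const cursor = client.query(new Cursor('SELECT * FROM big_table'));

  (function readNext() {
    cursor.read(100, (err, rows) => {
      if (err) throw err;
      if (rows.length === 0) {
        // no more rows: clean up and disconnect
        return cursor.close(() => client.end());
      }
      const mb = (process.memoryUsage().heapUsed / 1024 / 1024).toFixed(1);
      console.log(`read ${rows.length} rows, heapUsed: ${mb} MB`);
      // if heapUsed climbs steadily between reads, the cursor is buffering
      readNext();
    });
  })();
});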

unlucio commented 6 years ago

Honestly, I've no idea (and not much time to dig deeper right now) 🤷‍♂️ I'll be happy to follow the thread, though, if you get a chance to dig into node-pg-cursor.

unlucio commented 6 years ago

So I got a bit frustrated by this and ran a small experiment with node-pg-cursor, which indeed seems to be holding on to those results:

https://github.com/brianc/node-pg-cursor/issues/45

Looks like, as usual, AWS is the root of the problem :/

ysfdev commented 6 years ago

Hi @unlucio,

I am also looking for a streaming solution to pipe from a Redshift query. Interested to know if you have any further findings or an alternative approach for querying a large number of records without hitting memory limitations?

Thanks!

unlucio commented 6 years ago

Well, since my investigation pointed out that it was most likely an RS limitation, and no one maintaining either this project or node-pg-cursor seemed interested in the problem, I just used AWS Glue instead 🤷‍♂️ I know "use something else" answers are suboptimal, but I didn't have much time on my hands and most of it was already gone investigating the memory problem.

dreallday commented 5 years ago

I'm in the same boat with long-running Redshift queries. I can't find a way to stream them to fs without a JavaScript heap out of memory error; node's --max-old-space-size only band-aids the problem.
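
(Roughly the pattern being attempted, as a sketch: output.json and the query are placeholders, and `client` is an already-connected pg client:)

const fs = require('fs');
const JSONStream = require('JSONStream');
const QueryStream = require('pg-query-stream');

const stream = client.query(new QueryStream('SELECT * FROM big_table'));
const out = fs.createWriteStream('output.json');

stream.on('error', (err) => console.error(err));
out.on('finish', () => console.log('done writing'));

// pipe rows through a JSON serializer into the file instead of
// accumulating them in memory
stream.pipe(JSONStream.stringify()).pipe(out);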

unlucio commented 5 years ago

@dreallday my temporary "solution" (it's just a trick, but it works) for now is to bypass the DB as much as possible.

BTW, I've been experimenting a little around streaming from RS and PG and, even if the data isn't final, I have the feeling that the cursor package tends to buffer anyway (but I've yet to set aside some time to read that package's code).

ysfdev commented 5 years ago

Hi,

It looks like the issue was on the pg-cursor side. There is currently an open pg-cursor PR (https://github.com/brianc/node-pg-cursor/pull/44) which should address the streaming problems.

I was able to run some local manual tests using the cursor updates from https://github.com/jafl/node-pg-cursor.git#redshift-requires-portal-name and was able to stream a large dataset with a low memory footprint.

Hopefully https://github.com/brianc/node-pg-cursor/pull/44 gets merged into pg-cursor soon!
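
For anyone wanting to reproduce that test, a sketch of the setup (hedged: exactly how the fork replaces the bundled pg-cursor dependency depends on your package manager; the query and `client` are placeholders):

// install the patched cursor from the fork, e.g.:
//   npm install jafl/node-pg-cursor#redshift-requires-portal-name
// then stream exactly as earlier in the thread; `client` is a connected pg client
const QueryStream = require('pg-query-stream');

const stream = client.query(new QueryStream('SELECT * FROM big_table', [], {
  batchSize: 1000 // rows fetched per cursor round trip
}));

let rows = 0;
stream.on('data', () => { rows += 1; });
stream.on('end', () => console.log('streamed rows:', rows));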

christophersbarrett commented 5 years ago

Can someone test this again? The PR that @ysfdev references was merged a month ago. I'd be interested in hearing whether this works for Redshift or not.

unlucio commented 5 years ago

@christophersbarrett I saw it as well, but I haven't had time to test it yet, and right now I'm not dealing with RS at work.

unlucio commented 5 years ago

Well, I just got the opportunity to use it against an Aurora instance, and it works. Not Redshift, but still Amazon stuff, and I remember having some trouble streaming from Aurora as well.

Just some feedback. Cheers.