Open jose920405 opened 6 years ago
I'm experiencing something similar.
I'm using
➜ ~ node --version
v10.12.0
and Redshift 1.0.4515
I'm trying to read a table with ~65 billion rows with the sample code:
```js
const { Pool } = require('pg');
const config = require('./src/lib/config');
const credentials = config.get('warehousedb');
const QueryStream = require('pg-query-stream');
const JSONStream = require('JSONStream');

const pool = new Pool({
  user: credentials.user,
  host: credentials.host,
  database: credentials.name,
  password: credentials.pass,
  port: credentials.port
});

const query = 'select * from <table_name>';
const start = Date.now();

pool.connect().then(function(client) {
  const stream = client.query(new QueryStream(query));
  stream.on('end', () => {
    console.log('\n ==> pulling all data in took: ', ((Date.now() - start) / 1000));
    client.release();
    pool.end(); // end the pool only once the stream is done, not right after connect()
  });
  stream.pipe(JSONStream.stringify()).pipe(process.stdout);
});
```
But all it does is sit there for some minutes and then crash, punching a hole in the heap 🤷♂️
For the sake of the experiment I tried --max-old-space-size=
as well, but all I obtained was the script running for longer and still blowing up at the end.
In all this, not a single line of data is sent to process.stdout, and the memory grows and grows as if something were actually buffering the whole result.
I ran a test with a LIMIT of 1 billion rows (both with default heap settings and with --max-old-space-size=4096): it ran to the end, with output, but it sat there until the query had actually completed before starting to give any output.
I ran a test with a LIMIT of 10 billion rows (--max-old-space-size=4096 => ok, default heap size => death): same behavior, it ran to the end, with output, but only after the query had actually completed.
In both cases the memory grew by several GB despite the original data being only 1.5 GB total (yet another signal that something behind the scenes is stuffing its memory like I'd do with chocolate in my mouth :P).
I can see how @jose920405 might link the JavaScript heap out of memory error
to what's described in #32, but I don't think that's really related to this in any meaningful way: if that were the case I would still expect some output before the death, as pages come in, get divided into rows, and get pushed onto the stream.
From what I'm observing it seems more like things are actually somehow stuck before the push point.
It might very well be a Redshift limit (it does have some very surprising limits), but my limited low-level experience with both PG and Redshift prevents me from being able to do any meaningful speculation.
Could it be that node-pg-cursor is the one storing all results in memory before sending them to node-pg-query-stream? I'll test it when I get the chance...
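For anyone who wants to run the same test: node-pg-cursor is consumed through a `read(rowCount, callback)` loop. Sketched below against a stub cursor (no real connection, so the stub and its row counts are made up), the thing to watch on a real connection is whether each callback fires as its batch arrives or only after the whole result set is fetched:

```javascript
// Drain any cursor-like object exposing read(n, cb) in fixed-size batches,
// invoking onBatch per batch; an empty batch signals the end of the result set.
function drain(cursor, batchSize, onBatch, done) {
  cursor.read(batchSize, (err, rows) => {
    if (err) return done(err);
    if (rows.length === 0) return done(null);
    onBatch(rows);
    drain(cursor, batchSize, onBatch, done);
  });
}

// Stub cursor serving `total` fake rows, so the loop is runnable without a DB.
function stubCursor(total) {
  let served = 0;
  return {
    read(n, cb) {
      const rows = [];
      while (rows.length < n && served < total) rows.push({ id: served++ });
      process.nextTick(() => cb(null, rows));
    },
  };
}

let batches = 0;
let rowCount = 0;
drain(stubCursor(1050), 100, (rows) => {
  batches += 1;
  rowCount += rows.length;
}, (err) => {
  if (err) throw err;
  console.log(`batches: ${batches}, rows: ${rowCount}`);
});
```

On a real Redshift connection, logging a timestamp inside `onBatch` would show immediately whether batches trickle in or all land at once at the end.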
Honestly, I've no idea (and not much time to dig deeper right now) 🤷♂️
I'll be happy to follow the thread tho if you get a chance to dig into node-pg-cursor
So I got a bit frustrated by this and ran a small experiment with node-pg-cursor, which indeed seems to be holding on to those results:
https://github.com/brianc/node-pg-cursor/issues/45
Looks like, as usual, AWS is the root of the problem :/
Hi @unlucio,
I am also looking for a streaming solution to pipe from a Redshift query. Interested to know if you were able to find anything further, or an alternative approach for querying large amounts of records without hitting memory limitations?
Thanks!
well, since my investigations pointed out it was most likely an RS limitation and no one maintaining either this project or node-pg-cursor seemed interested in the problem, I just used AWS Glue instead 🤷♂️ I know "use something else" answers are suboptimal, but I didn't have much time on my hands and most of it was already gone investigating the memory problem.
I'm in the same boat, long-running Redshift queries. Can't find a way to stream them to fs without a JavaScript heap out of memory error. The node option --max-old-space-size only band-aids the problem.
@dreallday my temporary "solution" (it's just a trick but it works) for now is to bypass the DB as much as possible:
BTW I've been experimenting a little around streams, RS and PG and, even if I have no final data, I have the feeling that the cursor package tends to buffer anyway (but I've yet to set aside some time to read that package's code).
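One concrete way to bypass the DB wire protocol entirely (my own suggestion here, not necessarily the trick the commenter above had in mind) is to have Redshift dump the result to S3 with UNLOAD and then stream the files from there. The bucket, prefix, and role ARN below are placeholders; a small helper that builds the statement might look like:

```javascript
// Build a Redshift UNLOAD statement: the inner query is embedded as a SQL
// string literal, so single quotes inside it must be doubled.
// Bucket, prefix, and IAM role ARN here are placeholder values.
function buildUnload(query, s3Prefix, iamRole) {
  const escaped = query.replace(/'/g, "''");
  return `UNLOAD ('${escaped}') TO '${s3Prefix}' IAM_ROLE '${iamRole}' GZIP ALLOWOVERWRITE;`;
}

const sql = buildUnload(
  "select * from events where type = 'click'",
  's3://my-bucket/exports/events_',
  'arn:aws:iam::123456789012:role/redshift-unload'
);
console.log(sql);
```

Running that statement through any pg client makes Redshift write gzipped file parts under the prefix, which can then be streamed from S3 with constant memory instead of cursoring 65 billion rows over a single connection.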
Hi,
It looks like the issue was on the PG cursor side. There is currently an open pg-cursor PR (https://github.com/brianc/node-pg-cursor/pull/44), which should address the streaming problems.
I was able to run some local manual tests using the cursor updates from https://github.com/jafl/node-pg-cursor.git#redshift-requires-portal-name, and was able to stream a large dataset with a low memory footprint.
Hopefully the PR (https://github.com/brianc/node-pg-cursor/pull/44) gets merged into pg-cursor soon!
Can someone test this again? The PR that @ysfdev references was merged a month ago. I'd be interested in hearing whether this works for Redshift or not.
@christophersbarrett I saw it as well, but I haven't had time to test it yet and right now I'm not dealing with RS @ work.
well, I just got the opportunity to use it against an Aurora instance and it works. Not Redshift, but still Amazon stuff, and I remember having some trouble streaming from Aurora as well.
Just a bit of feedback. cheers
I'm not sure of the reason, or if query streams are simply not supported by Redshift. Any idea? My code:
If my query (info.sql) has 1,000,000 records, it doesn't work: in this case no log is ever shown inside the data listener. It works perfectly when pointing to a local DB with Postgres. Maybe related to https://github.com/brianc/node-pg-query-stream/issues/32 and vitaly-t's question there.
Using:
"pg-query-stream": "1.1.1",
node: 8.9.1