brianc / node-postgres

PostgreSQL client for node.js.
https://node-postgres.com
MIT License

Compression of data returned from query #2448

Open drmrbrewer opened 3 years ago

drmrbrewer commented 3 years ago

I am using the official docker postgres image for my postgres instance. The postgres instance resides on the main node, with access to persistent storage there, and worker nodes are added to a docker swarm as required.

I've been investigating some slow queries, and have determined that these slow queries only happen when (a) there are two or more nodes in the swarm; and (b) the instance requesting data from the postgres database is on a worker node (i.e. not on the same node as the postgres instance).

The difference in query time is stark: seconds (db and requester on different nodes) vs milliseconds (db and requester on same node). This is the case even when the nodes are in the same datacentre. I find this very surprising, but it is clear from the metrics... query time drops instantly to near-zero when the number of nodes in the swarm drops to 1, and back up to seconds when it goes up to 2. I'm using New Relic to monitor these query times. It seems that query time in this context includes the time taken to get the data from the db node to the requesting node.

Furthermore, the query time (when the requesting instance is on a different node to the db node) has improved since I started compressing VARCHAR data prior to storing it in the database, and then decompressing it on receipt. This has not resulted in any reduction in database size, because I understand that postgres will automatically compress strings anyway, but my theory is that when I compress manually myself, it is the compressed data that gets transferred between nodes rather than the decompressed data, so query time is better.

I'm not doing anything complicated with setting up node-postgres... the host is just specified as the postgres docker service name and this works fine:

const { Pool } = require('pg');
const pool = new Pool({ host: 'my-postgres-service' });
const res = await pool.query('SELECT * FROM mytable WHERE key = $1', [lookupKey]);

Am I missing some option that exists in node-postgres to compress the data resulting from a query? I'm not talking about internal compression/decompression of data in the postgres database (that is a postgres matter), but rather the mechanics of actually getting the queried data (which will already have been decompressed by postgres) to the requester (which may be on a different node).

Maybe I am approaching this completely wrong in the first place, and maybe the large query time with multiple nodes is the result of some other issue with my setup (e.g. causing a backlog of inter-node messages), and compression of query data shouldn't ever be required. I'm happy to receive some guidance here about how it should be done properly!

brianc commented 3 years ago

Hmmm so the overhead added by the postgres protocol to each message is very small in bytes. If you're reading out huge TEXT or VARCHAR columns that have massive amounts of text in there then 99.999% of the payload size is the text itself. pg doesn't have any built in compression since it wouldn't help much at all to compress the individual packets - if you want to compress/decompress the stuff going in/out of your DB you can do that in your app.
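Something like this would work as a rough sketch of the app-level approach (just an illustration - the table/column names are made up for the example, and it assumes a BYTEA column) using Node's built-in zlib:

const { promisify } = require('util');
const zlib = require('zlib');
const { Pool } = require('pg');

const gzip = promisify(zlib.gzip);
const gunzip = promisify(zlib.gunzip);

const pool = new Pool({ host: 'my-postgres-service' });

// write path: compress the text and store the result in a BYTEA column
async function saveDocument(key, text) {
  const compressed = await gzip(text);
  await pool.query('INSERT INTO mytable (key, body) VALUES ($1, $2)', [key, compressed]);
}

// read path: pg returns BYTEA values as Buffers, which we decompress back to text
async function loadDocument(key) {
  const { rows } = await pool.query('SELECT body FROM mytable WHERE key = $1', [key]);
  if (rows.length === 0) return null;
  const buf = await gunzip(rows[0].body);
  return buf.toString('utf8');
}

That only pays off for large, compressible text though - for small values the gzip header overhead can make things slightly bigger.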

That being said, it's unlikely to be a bandwidth issue... more likely something misconfigured in the network is making things slow. One thing to keep in mind is that establishing a new connection to postgres is orders of magnitude slower than keeping a connection open and sending queries over it, which is why postgres is almost always used with a connection pool in a high-load environment. I've seen connection times in GCP between GKE and Cloud SQL take upwards of half a second (sometimes) in one of my apps in prod, while the queries themselves are very quick once connected. Anyway... since you're getting good query speed on the same box as the postgres DB and bad query speed on a separate one, it's most likely some network config issue, or you aren't using connection pooling.
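If you're creating a new client (or a new pool) per request, switching to a single long-lived pool per process is the first thing I'd check - roughly:

// db.js - construct the pool once and share it across the app
const { Pool } = require('pg');

const pool = new Pool({
  host: 'my-postgres-service',
  max: 10,                       // cap on concurrent connections
  idleTimeoutMillis: 30000,      // close idle clients after 30s
  connectionTimeoutMillis: 5000, // fail fast if no connection is free
});

module.exports = { query: (text, params) => pool.query(text, params) };

and then everywhere else just require('./db') and call db.query(...) so queries reuse warm connections instead of paying the connection setup cost each time.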

drmrbrewer commented 3 years ago

Thanks for the response, @brianc. Through some further investigation I had concluded that there was indeed some network issue, so it isn't a problem with pg per se, but I wondered whether pg might nevertheless offer an option to help overcome slow network transfer by compressing the data it sends over the wire.

If you're reading out huge TEXT or VARCHAR columns that have massive amounts of text in there then 99.999% of the payload size is the text itself. pg doesn't have any built in compression since it wouldn't help much at all to compress the individual packets

So pg can't compress the (potentially huge) text data itself, so that there are fewer packets to send over the network? It seems to me that this would be a useful option.

if you want to compress/decompress the stuff going in/out of your DB you can do that in your app.

This is what I did try in the end... I compressed the text before putting it into the DB, and decompressed the text that came out. But my feeling was that this is wasteful and duplicative, because postgres does text compression automatically anyway (see my OP), and in fact I found that the database size actually increased, so perhaps the two compressions were somehow working against each other.
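(A possible refinement, sketched with the same hypothetical table/column names as in the example above: postgres will still attempt its own TOAST compression on values that are already compressed, which gains nothing, so if you compress in the app you can tell it to store that column out-of-line without trying to compress it, e.g. as a one-off:

await pool.query('ALTER TABLE mytable ALTER COLUMN body SET STORAGE EXTERNAL');

Whether that explains the size increase is a separate question, but it at least avoids compressing the same data twice.)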

In case it's of interest, I fixed the network issue... it was a problem with Docker running on Ubuntu 20.04, and downgrading to Ubuntu 18.04 fixed it. But separately from that, it still seems to me that it would make sense for pg to be able to compress the data being pulled from the database, in case there is a cost (time/money) to transferring large amounts of data over the network between the requesting node and the node hosting the database.

But maybe I'm misunderstanding what is actually possible... maybe pg itself cannot do this... it just asks for data and gets what it asks for?

Cactusbone commented 1 year ago

I'm also looking into compression from postgresql (yes, I'm now bound by network bandwidth due to the huge amount of text being transferred).

There is an sslcompression parameter that can be used (in libpq); however, the documentation states:

PostgreSQL 14 disables compression completely in the backend.

so yes, the issue is not with the node client, but with the postgresql backend :)

That being said, I'm not sure sslcompression can even be used with node-postgres; it's not in the documentation.