Closed: adriangb closed this issue 5 months ago
Thanks for the report and the helpful MRE @adriangb. Have you checked if session pooling is subject to the same problem?
Meanwhile, I'll take a look.
I see you're using a very large pool size. That means the Proxy will have to maintain a connection for each pool entry which causes increased memory usage and depending on throughput, increased CPU usage.
Instead, I'd expect to see something an order of magnitude smaller, like 10. See https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing for details.
Could you try again with a smaller pool size? Also, when running the Proxy, keep an eye on CPU and memory usage; you might be hitting resource limits.
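For illustration, a pgbouncer `[databases]` entry sized in that range might look like the following sketch (the entry name, host, and port here are placeholders, not values taken from the MRE):

```ini
[databases]
; pool_size caps server-side connections for this entry, which is also
; how many upstream connections the Proxy has to maintain
proxy = host=127.0.0.1 port=5431 dbname=postgres pool_size=10
```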
A pool size of 15 behaves the same, and besides, the Proxy should be able to handle a pool size of 100. I ran this in GKE and resources were not even close to being exhausted.
In my testing the issue persists with session pooling.
In that case, I'll take a closer look. Are the numbers that you've listed in the MRE what you see when using docker compose?
Yes, and it seems quite reproducible. The logs are also interesting.
I sat down for some quality time with your example -- it's a very helpful tool and I thank you for putting it together.
Initially I was worried that the behavior you're seeing might be caused by a misused mutex or a poorly configured buffer. Here's what I did:
First, I enabled tracing in the Proxy (with --telemetry-project=<PROJECT>) and saw that the pgbouncer+proxy path was consistently producing traces for dialing the instance in the 60ms range (with an occasional outlier above that, but still under 500ms). This eliminated establishing the connection as the problem.
Next, I changed the example to be a more apples-to-apples comparison. The Proxy is doing a lot more than just connecting to a local database: it dials the SQL Admin API to generate a certificate, reaches out to the Proxy server to establish a TLS handshake, and then copies bytes back and forth between localhost and the Proxy server (which connects to the Postgres instance over a Unix socket). So I changed the example to cover three cases (commit here):
The numbers I saw are more in line with what I'd expect (proxy is slowest, pgbouncer+direct is slightly faster, direct is fastest). The traffic must go from my machine in Colorado to us-central1.
postgres://postgres:postgres@127.0.0.1:5432/proxy: ['10.55s', '10.63s', '10.81s', '10.98s', '11.17s', '11.34s', '11.50s', '11.68s', '11.84s', '12.00s', '12.17s', '12.33s', '12.49s', '12.68s', '12.85s', '13.02s', '13.18s', '13.36s', '13.53s', '13.70s', '13.87s', '14.03s', '14.21s', '14.37s', '14.54s', '14.72s', '14.90s', '15.08s', '15.25s', '15.43s', '15.61s', '15.78s']
postgres://postgres:postgres@127.0.0.1:5432/direct: ['10.60s', '10.74s', '10.86s', '10.99s', '11.11s', '11.24s', '11.37s', '11.51s', '11.63s', '11.76s', '11.89s', '12.01s', '12.15s', '12.27s', '12.43s', '12.58s', '12.71s', '12.85s', '12.98s', '13.10s', '13.22s', '13.36s', '13.49s', '13.61s', '13.74s', '13.87s', '14.00s', '14.13s', '14.25s', '14.38s', '14.52s', '14.64s']
postgres://postgres:postgres@35.238.106.212/postgres: ['10.41s', '10.42s', '10.43s', '10.44s', '10.45s', '10.45s', '10.47s', '10.47s', '10.47s', '10.49s', '10.48s', '10.49s', '10.48s', '10.49s', '10.48s', '10.50s', '10.50s', '10.50s', '10.51s', '10.51s', '10.51s', '10.51s', '10.51s', '10.51s', '10.51s', '10.51s', '10.51s', '10.52s', '10.52s', '10.51s', '10.52s', '10.53s']
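The measurement itself can be sketched like this. This is a simplified stand-in for the MRE, not its actual code: "connecting" and the "query" are faked with short sleeps so the harness runs standalone; a real run would submit a driver's connect function (e.g. psycopg bound to each of the three DSNs above) instead.

```python
import concurrent.futures
import time

def timed_connect_and_query(connect, query):
    """Open a connection, run the query, close, and return elapsed seconds."""
    start = time.monotonic()
    conn = connect()
    try:
        query(conn)
    finally:
        conn.close()
    return time.monotonic() - start

# Stand-ins for a real driver so the sketch runs standalone:
# "connecting" costs 50 ms and the "query" sleeps 100 ms
# (a scaled-down pg_sleep(10)).
class FakeConn:
    def close(self):
        pass

def fake_connect():
    time.sleep(0.05)
    return FakeConn()

def fake_query(conn):
    time.sleep(0.1)

# 8 clients connect at once, mirroring the MRE's concurrent workers.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(timed_connect_and_query, fake_connect, fake_query)
               for _ in range(8)]
    times = sorted(f.result() for f in futures)

print([f"{t:.2f}s" for t in times])
```

When connection setup is truly concurrent, all the reported times cluster near the per-connection cost; a sequential bottleneck shows up as the staircase pattern in the pgbouncer numbers above.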
Anecdotally, I've heard that public IP can be slower than private IP. So for good measure, I ran the same experiment on a GCE VM in us-central1 (the same region where my Cloud SQL instance lives) against a PSA IP and the difference is negligible:
postgres://postgres:postgres@127.0.0.1:5432/proxy: ['10.23s', '10.22s', '10.22s', '10.24s', '10.23s', '10.25s', '10.25s', '10.25s', '10.25s', '10.25s', '10.25s', '10.24s', '10.25s', '10.25s', '10.24s', '10.24s', '10.25s', '10.25s', '10.25s', '10.25s', '10.25s', '10.25s', '10.26s', '10.26s', '10.26s', '10.25s', '10.26s', '10.25s', '10.25s', '10.25s', '10.25s', '10.25s']
postgres://postgres:postgres@127.0.0.1:5432/direct: ['10.24s', '10.24s', '10.23s', '10.23s', '10.23s', '10.23s', '10.23s', '10.23s', '10.24s', '10.24s', '10.24s', '10.24s', '10.24s', '10.24s', '10.24s', '10.24s', '10.24s', '10.24s', '10.24s', '10.25s', '10.24s', '10.24s', '10.24s', '10.25s', '10.25s', '10.25s', '10.24s', '10.25s', '10.25s', '10.25s', '10.25s', '10.24s']
postgres://postgres:postgres@10.48.208.3/postgres: ['10.08s', '10.08s', '10.08s', '10.09s', '10.09s', '10.09s', '10.10s', '10.11s', '10.11s', '10.11s', '10.14s', '10.15s', '10.17s', '10.18s', '10.18s', '10.19s', '10.20s', '10.21s', '10.21s', '10.22s', '10.22s', '10.22s', '10.22s', '10.22s', '10.23s', '10.23s', '10.23s', '10.23s', '10.23s', '10.23s', '10.24s', '10.24s']
I don't know enough about the nuances of pgbouncer to explain the overhead it introduces in the first experiment, but I do think this data shows that colocating the experiment with the instance is the biggest cause of the difference.
If you haven't already done so, I'd suggest trying the same experiment I did (pgbouncer+proxy, pgbouncer+direct connection, and just a direct connection) to confirm you see the same results. Otherwise, comparing pgbouncer+proxy against pgbouncer+local Postgres is misleading.
If you haven't already done so, I'd suggest trying this same experiment as I did (pgbouncer+proxy, pgbouncer+direct connection, and just direct connection)
That’s why I included the proxy connecting to a remote instance in the test. It was just about as fast as possible (within ±1s of the 10s per connection). So neither latency to the remote instance nor the proxy’s performance is the issue. Neither is pgbouncer itself: pgbouncer connecting to Postgres directly is fast. And anecdotally (obviously I can’t include this in an MRE), I do have pgbouncer connecting to a primary and multiple read replicas (Timescale Cloud, which is in an entirely different cloud than the compute, GKE, and CloudSQL instance) using transaction pooling, with zero issues. That same pgbouncer setup is unusable with CloudSQL Proxy.
If you look at the numbers, it feels like there must be some sort of lock inside of CloudSQL Proxy that only gets triggered when using pgbouncer, because the connections are getting created sequentially.
There are also differences in the logs, e.g.:
closing because: client disconnect while server was not ready (age=10s)
Can CloudSQL Proxy serve as a proxy for non CloudSQL Postgres? Maybe that would help narrow down the issue.
Where are you running pgbouncer relative to your databases? Are they in the same region?
My second experiment in the comment above shows that pgbouncer and Cloud SQL Proxy work together without issue, provided you're colocating your database server and pgbouncer deployment in the same region. Right now it seems that the further apart the two are, the more latency grows. I'd like to understand why that is so pronounced.
FWIW, the Proxy has zero protocol awareness, so it doesn't distinguish between a pgbouncer connection and a direct connection. It's all just bytes to the Proxy.
Can CloudSQL Proxy serve as a proxy for non CloudSQL Postgres?
It only works for Cloud SQL. However, I could probably add some instrumentation to report how long a connection is open. That might help find where the time above the 10-second sleep is going.
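As a rough sketch of that kind of instrumentation (not the Proxy's actual code, which is Go; this is just the idea in Python, with a stub standing in for a real connection):

```python
import contextlib
import time

durations = []  # collected so connection ages could be aggregated later

@contextlib.contextmanager
def timed_connection(connect):
    """Open a connection and report how long it was open once it closes."""
    opened = time.monotonic()
    conn = connect()
    try:
        yield conn
    finally:
        conn.close()
        age = time.monotonic() - opened
        durations.append(age)
        print(f"connection closed (open for {age:.2f}s)")

# Stub standing in for a real driver's connect().
class StubConn:
    def close(self):
        pass

with timed_connection(StubConn) as conn:
    time.sleep(0.1)  # a real 10s query would go here
```

If connections were being serialized somewhere, their reported open durations would grow linearly with their position in the queue rather than all matching the query time.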
Where are you running pgbouncer relative to your databases? Are they in the same region?
pgbouncer and the proxy are in GKE in GCP us-east4. The CloudSQL instance is also in GCP us-east4. The other Postgres databases are in AWS us-east1, which is the closest region.
After some more investigation, I don't think this is a bug in pgbouncer or the Cloud SQL Proxy.
First, in case I wasn't clear in my comment above, your MRE isn't an accurate comparison. To get accurate numbers you need to run it as I did in my commit linked above. That means:
Cloud SQL instance (again either public IP or private, but it should be the same as the connectivity path above)
When you do that, you'll see that the performance of 1 and 2 is comparable (1 is slightly slower, but only by ~1s at worst). This makes it clear that there's no bug in the Proxy.
Further, if you run that same experiment but with private IP in the same region as the Cloud SQL instance, you'll see almost no difference in the results. This makes it clear that any variation you're seeing is caused by network latency plus pgbouncer configuration details.
So to be clear, I'd recommend:
If you're still seeing this latency problem, I'd also recommend setting min_pool_size to some appropriate value. When I set min_pool_size to 32 (matching the incoming connections from the client), I saw the results immediately fall in line with one another. This was using public IP from my local machine across regions, i.e.,
postgres://postgres:postgres@127.0.0.1:5432/proxy: ['10.26s', '10.28s', '10.28s', '10.28s', '10.28s', '10.28s', '10.29s', '10.29s', '10.29s', '10.29s', '10.29s', '10.30s', '10.29s', '10.29s', '10.29s', '10.29s', '10.29s', '10.29s', '10.29s', '10.29s', '10.30s', '10.30s', '10.30s', '10.30s', '10.30s', '10.30s', '10.30s', '10.30s', '10.30s', '10.30s', '10.30s', '10.30s']
postgres://postgres:postgres@127.0.0.1:5432/direct: ['10.34s', '10.34s', '10.34s', '10.34s', '10.34s', '10.33s', '10.34s', '10.34s', '10.34s', '10.34s', '10.35s', '10.34s', '10.34s', '10.35s', '10.34s', '10.34s', '10.34s', '10.34s', '10.35s', '10.35s', '10.35s', '10.35s', '10.35s', '10.34s', '10.34s', '10.35s', '10.35s', '10.35s', '10.35s', '10.35s', '10.34s', '10.35s']
postgres://postgres:postgres@35.238.106.212/postgres: ['10.43s', '10.42s', '10.43s', '10.42s', '10.43s', '10.43s', '10.44s', '10.43s', '10.43s', '10.43s', '10.45s', '10.44s', '10.44s', '10.45s', '10.45s', '10.46s', '10.46s', '10.46s', '10.46s', '10.47s', '10.46s', '10.47s', '10.47s', '10.47s', '10.46s', '10.47s', '10.48s', '10.48s', '10.49s', '10.49s', '10.51s', '10.51s']
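For reference, a sketch of the relevant pgbouncer settings (illustrative values only): min_pool_size keeps that many server connections warm, so incoming clients don't wait on connection establishment.

```ini
[pgbouncer]
pool_mode = transaction
; pre-open enough server connections to cover the 32 concurrent clients
min_pool_size = 32
```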
I may be missing something, but I don't think my MRE is invalid. I'm not saying your version is invalid, but I'm fairly confident mine is valid too.
If network latency is the issue, why does using cloudsql proxy without pgbouncer work fine in my experiment? Network latency should be a function of the distance between the things that are connecting, not of how many layers of proxying they go through to connect.
In my test, pgbouncer + cloudsql proxy and cloudsql proxy alone have the same network latency to the cloudsql instance. Adding pgbouncer into the mix vs. only cloudsql proxy does not increase network latency to the cloudsql instance unless we're doing many, many more roundtrips over the same link, but then that's a bug somewhere, because that's not how the proxying of either of the two should work.
And the fact that the numbers for just cloudsql proxy are comparable to those of pgbouncer + local shows that latency to the cloudsql instance itself is not the issue.
If you're still seeing this latency problem, I'd also recommend setting min_pool_size to some appropriate value.
That's kinda cheating the test, isn't it? Humor me and assume CloudSQL Proxy is doing something wrong with a lock or something, so it can only open one connection at a time. If you set the min pool size to 32 and then wait for the pool to fill before you run the test, you are no longer measuring cloudsql proxy's performance when creating new connections concurrently: they were already created before you started the test.
Running pgbouncer in the same region as your Cloud SQL instance and using private IP.
As per above, I already run pgbouncer, cloudsql proxy, and the cloudsql instance all in the same region. The only thing I'm not doing is using a private IP. Why would using a private IP remove 10s of latency from establishing a connection?
The key point here is this:
pgbouncer + cloud sql and pgbouncer + cloud sql proxy + cloud sql produce comparable numbers. As a result, I see no evidence of a bug in the Proxy.
The only thing I'm not doing is using a private IP. Why would using a private IP remove 10s of latency from establishing a connection?
Honestly, I don't know. My best guess is that by reducing network latency, pgbouncer with the configuration you've set does a better job. Yes, it does seem surprising that we're seeing many additional seconds on top of what should just be 10s. But I've also read that transaction pooling isn't meant to be used with long-running connections. As a result, I see this as a problem of configuring pgbouncer well.
pgbouncer + cloud sql and pgbouncer + cloud sql proxy + cloud sql produce comparable numbers
Then why does pgbouncer + cloudsqlproxy not produce similar numbers to just cloudsqlproxy? If the hypothesis is shifted from "a problem combining pgbouncer with cloudsql proxy" to "a problem combining pgbouncer with cloudsql itself", then there is still a big problem; we've only ruled out the proxy. Since Google doesn't offer its own connection pooling solution and pgbouncer is the de facto industry standard (yes, I know there's application-level connection pooling, which is not the same, and other solutions like pgcat), this is basically saying that cloudsql is incompatible with connection pooling (unless you use a private IP?).
I've also read that transaction pooling isn't meant to be used with long running connections. As a result, I see this as a problem of configuring pgbouncer well.
I don't think transaction pooling is incompatible with a 10-second query. And I use this setup with other Postgres providers with zero issues.
Would it help if I added a test using pgcat? Then we can hone in on the hypothesis being "a problem between pgbouncer and cloudsql".
Then why does pgbouncer + cloudsqlproxy not produce similar numbers to just cloudsqlproxy?
That's a good question. I think we should remove the Proxy from the equation entirely, as the real issue is that pgbouncer connected to Cloud SQL does worse than connecting to Cloud SQL directly.
I see two major avenues for investigation:
To be clear, I'm pretty curious now myself to understand where the extra ~4s are coming from (as I noted above) and will be doing both 1 and 2 as well.
Tentatively, I think an outcome of this issue can be some guidance for using pgbouncer (or pgcat) well.
the real issue is pgbouncer connected to Cloud SQL does worse than connecting to Cloud SQL directly
But pgbouncer + cloudsql also does worse than pgbouncer + any other Postgres (assuming the issue is not in the proxy).
That depends on how you're connecting. For the best performance, you'll want to use private IP (PSA or PSC).
I honestly don't know what CloudSQL proxy does behind the scenes for networking or how that relates to a private IP.
But even if using a private IP makes cloudsql instances behave similarly to other Postgres, then there's a serious problem with the public IP setup and it should not be recommended.
I disagree with that conclusion. If anything Cloud SQL might require some more work in pgbouncer configuration.
I'm not sure what you mean by "some more work". I still don't have a logical explanation for all of this, nor do I have a specific setting in pgbouncer that it makes sense to tweak. If I saw a (reasonable, not tweaking the min pool size or max pool size) configuration of pgbouncer and cloudsql proxy that worked well together, I would consider this resolved. Unless we can arrive at one of those or find where the bug is, I need to use a different Postgres provider, because I'm going to run out of connections on CloudSQL proxy or be forced to implement serious workarounds (like dropping application-level pooling).
If I saw a (reasonable, not tweaking the min pool size or max pool size) configuration of pgbouncer and cloudsql proxy that worked well together I would consider this resolved.
Yes, I agree. I'd like to understand why pgbouncer + cloud sql is producing an extra ~4s of connection overhead in this particular setup.
To finally pin this down, I created a Timescale database to compare Cloud SQL and Timescale, both behind pgbouncer. Granted, I couldn't create the database in the same us-central1 region, so the numbers are only useful relative to one another, but here's what I saw.
I'm using the same connection settings from your initial example:
host=<timescale URL>
user=postgres
password=<PASSWORD>
dbname=postgres
pool_mode=transaction
max_db_connections=100
pool_size=100
['10.69s', '10.88s', '11.20s', '11.47s', '11.74s', '12.12s', '12.32s', '12.63s', '12.91s', '13.19s', '13.55s', '13.78s', '14.08s', '14.36s', '14.78s', '14.94s', '15.30s', '15.54s', '15.90s', '16.12s', '16.52s', '16.72s', '17.00s', '17.29s', '17.65s', '17.87s', '18.26s', '18.48s', '18.88s', '19.07s', '19.35s', '19.69s']
['10.47s', '10.48s', '10.48s', '10.49s', '10.49s', '10.49s', '10.49s', '10.49s', '10.49s', '10.49s', '10.49s', '10.50s', '10.50s', '10.51s', '10.51s', '10.51s', '10.51s', '10.51s', '10.51s', '10.51s', '10.51s', '10.51s', '10.51s', '10.52s', '10.53s', '10.53s', '10.53s', '10.54s', '10.54s', '10.54s', '10.55s', '10.56s']
As you can see, Timescale behind pgbouncer behaves exactly like Cloud SQL behind pgbouncer. So this is not a Cloud SQL issue.
Next, I opened up Wireshark to watch the Postgres protocol messages when connecting through pgbouncer to Cloud SQL and here's what I saw:
So there's the answer: pgbouncer is single-threaded and uses one CPU, so all initial connections are created sequentially.
Just for comparison, I tried the same experiment with pgcat (multi-threaded, using multiple CPUs) and immediately saw a difference: pgcat sends all the startup messages concurrently, just like connecting directly with psycopg.
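The effect can be sketched with a toy model (the timings are invented stand-ins for the real startup cost, which over a long link is dominated by round trips): a single-threaded pooler pays N handshakes one after another, while a multi-threaded pooler, or clients connecting directly, overlap them.

```python
import time
from concurrent.futures import ThreadPoolExecutor

HANDSHAKE = 0.05  # invented cost of one TLS + Postgres startup exchange

def open_backend(_=None):
    time.sleep(HANDSHAKE)  # stands in for the round trips to the server

# Single-threaded pooler: backends open one after another, so the
# Nth client waits behind N handshakes.
start = time.monotonic()
for _ in range(8):
    open_backend()
sequential = time.monotonic() - start

# Multi-threaded pooler (or clients connecting directly): handshakes overlap.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(open_backend, range(8)))
concurrent = time.monotonic() - start

print(f"sequential: {sequential:.2f}s  concurrent: {concurrent:.2f}s")
```

With 100ms of real round-trip latency and 32 clients, the sequential case adds several seconds of queueing on top of the query time, which matches the staircase pattern in the earlier numbers.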
So in summary:
I'd like to thank you for your persistence in asking for an answer here. This will hopefully be helpful for the many folks using pgbouncer and wondering about best practices.
@adriangb do you have any further comments? I'm having a "similar" issue, more in the realm of dropped connections, which I'm assuming must be related to a misconfigured setup.
No I don't, sorry. I ended up switching to pgcat, which seemed to behave better in this respect but has its own set of issues...
Bug Description
I tried using CloudSQL Proxy in conjunction with PgBouncer w/ transaction pooling and am running into some strange performance issues that I can't explain.
I tested:
So each seems to work fine in isolation but not when combined. I'm not sure if I'm doing something wrong or if it's a bug in one of the two, but I put together an MRE and am filing this as a bug report in hopes of getting some clarity. Sorry if this is a mistake on my end.
The MRE is here: https://github.com/adriangb/pgbouncer-cloudsqlproxy-bug