ged / ruby-pg

A PostgreSQL client library for Ruby
Other
784 stars 179 forks source link

Query hanging using the async api and different buffer size #588

Open mfmmq opened 2 weeks ago

mfmmq commented 2 weeks ago

I think I have found a bug related to https://github.com/ged/ruby-pg/issues/325. I am using thepg gem in a rails app running against YugabyteDb, a distributed database. Specific queries are hanging when executing against Yugabyte, which has a different buffer size than that of vanilla postgres.

If either the async api is disabled or the query is slightly updated (e.g. select is changed or limit is changed), it returns. This issue also seems to only occur with specific response sizes (81830 bytes < send bytes < 81920 bytes) over SSL/TLS.

I've pushed a repro here, let me know if it works on your end

git clone git@github.com:mfmmq/ruby-pg-async.git && cd ruby-pg-async && make

larskanis commented 2 weeks ago

Thank you @mfmmq, that should be a great help! I can reproduce the issue with your docker composition. A smaller number at generate_series succeeds, but the 1279 count blocks infinitely. I'll investigate the issue, but it will probably take some days.

Do you know if it blocks also without TLS?

mfmmq commented 2 weeks ago

Do you know if it blocks also without TLS?

Thanks for getting back to me ! No, it doesn't block with SSL. Have pushed an disabled SSL version to branch no-ssl if you want to test it

git pull && git checkout no-ssl && docker compose rm && docker compose up --build

Our vendor has pushed the following change to a fork which stops the query from hanging -- thought I would mention in case it's helpful

if (PQsslInUse(conn)) {
            for (int i = 0; i < 15; i++) {
                if ( PQconsumeInput(conn) == 0 ) {
                    pgconn_close_socket_io(self);
                    pg_raise_conn_error(rb_eConnectionBad, self, "PQconsumeInput() #%d - %s", i, PQerrorMessage(conn));
                }
            }
        }
larskanis commented 2 weeks ago

I can reproduce this issue easily and I came to the same conclusion:

  1. It doesn't happen without SSL/TLS.
  2. Multiple calls to PQconsumeInput fixes the issue, but there is no indication how many calls are required.

Calling PQconsumeInput in a loop until is_readable/PQisBusy changes, would cause a busy waiting loop, so this is no good option.

For me it looks like a bug or design failure in libpq. I'll get in touch with the Postgres people.

larskanis commented 1 week ago

I wrote a simple patch for libpq that fixes this issue and proposed it to the Postgres hackers. They confirmed that this is a bug in libpq (and not in ruby-pg), but the patch is probably incomplete and needs more investigation and discussion. So I added the patch to the next Postgres commitfest as a placeholder:

https://commitfest.postgresql.org/50/5251/