eclipse-vertx / vertx-sql-client

High performance reactive SQL Client written in Java
Apache License 2.0

Memory leak in Aurora Postgres since 3.9.1 #1435

Closed: al-kudryavtsev closed 1 month ago

al-kudryavtsev commented 2 months ago

Version

3.9.1, 4.5.7

Context

Our app had been on vertx-pg-client 3.9.0 for a while, until we recently migrated to version 4.5.7.

Shortly after the upgrade we noticed our AWS Aurora PG cluster running out of freeable memory and rebooting. The effect is directly correlated with load on the service: during peak times the DB can run out of memory (60 GB) in 45 minutes. We employed the maxLifetime parameter for the pool (sketch below), which helps but is not ideal: we prewarm the pool on deployment, so all the connections expire at about the same time, which causes latency spikes.
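For illustration, a minimal sketch of this kind of setup against the 4.5.x API (endpoint, credentials, and sizes are placeholders; our real configuration is in the gist linked below):

import io.vertx.core.Vertx
import io.vertx.pgclient.PgBuilder
import io.vertx.pgclient.PgConnectOptions
import io.vertx.sqlclient.Pool
import io.vertx.sqlclient.PoolOptions

fun buildPool(vertx: Vertx): Pool {
    val connectOptions = PgConnectOptions()
        .setHost("aurora-writer.example.com") // placeholder endpoint
        .setPort(5432)
        .setDatabase("app")                   // placeholder database
        .setUser("app")                       // placeholder credentials
        .setPassword("...")
    val poolOptions = PoolOptions()
        .setMaxSize(32)                       // placeholder size
        // Recycle each connection after 30 minutes (milliseconds by default).
        // This bounds server-side memory growth, at the cost of reconnect spikes.
        .setMaxLifetime(30 * 60 * 1000)
    return PgBuilder.pool()
        .with(poolOptions)
        .connectingTo(connectOptions)
        .using(vertx)
        .build()
}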

On 3.9.0 there was no issue with DB memory consumption; we rolled back the upgrade to confirm this. On 3.9.1 we confirmed the issue is present. This thread has a report of the same issue on 3.9.13 as well as on the 4.x branch.

Do you have a reproducer?

I don't have a reproducer; this might be AWS/Aurora-specific. I can test potential solutions if that helps.

Extra

Pretty much all our queries are preparedQuery + execute or executeBatch, roughly as sketched below. We also use a standalone prepare with a cursor for paginated reads. Pool configuration: https://gist.github.com/al-kudryavtsev/3e6eeb3cfd200afc66df5a12932bba25
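Roughly, the two query shapes look like this (table and column names are made up):

import io.vertx.kotlin.coroutines.coAwait
import io.vertx.sqlclient.Pool
import io.vertx.sqlclient.Row
import io.vertx.sqlclient.RowSet
import io.vertx.sqlclient.Tuple

// Single-shot prepared query: the client prepares, executes, and releases in one go.
suspend fun findUser(pool: Pool, id: Long): RowSet<Row> =
    pool.preparedQuery("SELECT id, name FROM users WHERE id = \$1")
        .execute(Tuple.of(id))
        .coAwait()

// Batched inserts through a single prepared statement.
suspend fun insertUsers(pool: Pool, names: List<String>): RowSet<Row> =
    pool.preparedQuery("INSERT INTO users (name) VALUES (\$1)")
        .executeBatch(names.map { Tuple.of(it) })
        .coAwait()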

tsegismont commented 1 month ago

Have you tried using DISCARD ALL as suggested in the group discussion?

It would help to have some details about what is consuming memory; can you get that from your DBA?

In the 3.9.1 release notes, there is a link to https://github.com/eclipse-vertx/vertx-sql-client/issues/577. If you use prepare, can you make sure the statement is closed properly?

al-kudryavtsev commented 1 month ago

Thanks for looking into this!

We haven't tried DISCARD ALL; it might have a performance impact, since it flushes cached query plans whenever a connection is returned to the pool. We might try it if the other options don't work (sketch below).
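For reference, running it manually would look something like this sketch, assuming we hold the SqlConnection just before handing it back to the pool:

import io.vertx.kotlin.coroutines.coAwait
import io.vertx.sqlclient.SqlConnection

// DISCARD ALL resets the whole session: prepared statements, cached plans,
// temp tables, session settings. That full reset is exactly why it may cost latency.
suspend fun resetSession(conn: SqlConnection) {
    conn.query("DISCARD ALL").execute().coAwait()
}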

If you use prepare, can you make sure the statement is closed properly?

Indeed, the statement wasn't closed. We were doing prepare(sql).coAwait().cursor(args) and then reading the cursor until it was exhausted, but we never closed the result of prepare(sql). I changed this to:

// `prepare` here is SqlConnection.prepare; the connection is the implicit receiver.
flow<Row> {
    val preparedStatement = prepare(sql).coAwait()
    try {
        val cursor = preparedStatement.cursor(args)
        do {
            cursor.read(pageSize).coAwait().forEach { emit(it) }
        } while (cursor.hasMore())
    } finally {
        // Await the close even if the collector cancels the flow mid-stream.
        withContext(NonCancellable) {
            preparedStatement.close().coAwait()
        }
    }
}

It looks promising so far; we'll see if we can keep connections open indefinitely. Based on the docs, I'm assuming the cursor doesn't have to be closed, since we read it to the end.
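A call site then just collects the flow. This assumes the block above is wrapped in an extension function; the function name, query, and process handler here are hypothetical:

// Hypothetical wrapper around the block above:
// fun SqlConnection.queryFlow(sql: String, args: Tuple, pageSize: Int): Flow<Row>
connection.queryFlow("SELECT id, payload FROM events", Tuple.tuple(), pageSize = 500)
    .collect { row -> process(row) } // rows arrive page by page, never all in memory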

tsegismont commented 1 month ago

Good news, keep us posted please.

al-kudryavtsev commented 1 month ago

Fully removed max lifetime today - DB memory looks good. Thanks for your help!