apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
5.86k stars 2.06k forks source link

Does JDBC connector uses any retry mechanism? #7173

Open lorransr opened 1 year ago

lorransr commented 1 year ago

Query engine

Trino

Question

As mentioned in this issue, we have an aurora serverless database, and we use it as our iceberg catalog. The serverless db scales up and down during the day and can reach to 0 instances, when it happens, all connections are lost. The expected behavior for the jdbc connection in this case would be to try to reconnect to the database, since it will be the trigger for starting the db instances.

Should it be possible to implement this connection retry feature? Or is it expected, the non retry behavior?

amogh-jahagirdar commented 1 year ago

I think it should be feasible, I took a look at the code to see how the JdbcCatalog works. By default we retry SQLNonTransientException https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/jdbc/JdbcClientPool.java#L46 and the exception that gets thrown in your case is org.postgresql.util.PSQLException so we don't retry it.

Few things:

1.) SQLNonTransientConnectionException seems to be the wrong thing to even retry. The docs say the following (and it makes sense, it's non-transient, we should only retry transient failures)

/**
 * The subclass of {@link SQLException} thrown for the SQLState
 * class value '<i>08</i>', or under vendor-specified conditions.  This
 * indicates that the connection operation that failed will not succeed if
 * the operation is retried without the cause of the failure being corrected.
 * <p>
 * Please consult your driver vendor documentation for the vendor-specified
 * conditions for which this <code>Exception</code> may be thrown.
 * @since 1.6
 */

2.) we probably want a better mechanism for specifying exception types to retry to ClientPoolImpl rather than just a single class.

I'll let others comment, cc @jackye1995 @rdblue

jackye1995 commented 1 year ago

The serverless db scales up and down during the day and can reach to 0 instances, when it happens, all connections are lost.

As Amogh suggested the code can definitely be updated to accommodate more exception cases for retry. I guess the question is if there is anything more than just org.postgresql.util.PSQLException, like an error code, that can be used to detect the issue? Using error message like "This connection has been closed" seems to be too specific.

lorransr commented 1 year ago

It seems to be related to this code:

https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/util/PSQLState.java#L29

Since it is the one used here:

https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgConnection.java#L991

jackye1995 commented 1 year ago

Cool, in that case I think we can add a code change to check that error code and retry. @lorransr unfortunately I am a bit short of capacity, but I can help with review, do you have time to make the code change?

lorransr commented 1 year ago

Unfortunnely I don't 😢 And also not sure how

netanelm-upstream commented 1 year ago

We are encountering the same problem as well.

lorransr commented 1 year ago

I end up solving by ditching the serverless approach and provisioning a common database on aws

amogh-jahagirdar commented 1 year ago

I should have some time to try and get a fix in for this before next release, I'm assigning to myself. @lorransr @netanelm-upstream

anfy2002us commented 8 months ago

I see the PR didn't make it to a release. Do you need any help with it?

cccs-eric commented 8 months ago

I see the PR didn't make it to a release. Do you need any help with it?

As I stated in the linked PR, I think those codes should be configurable. We don't have much cycles to spare on our side right now, but if you can help, that would be great. We can test your fix easily.

kkd927 commented 3 months ago

Is there any progress on this issue? I have the same issue.