FSharp.Data.Npgsql retry functionality and related enhancements.

symboliq commented 3 years ago

This PR contains the following enhancements courtesy of FutureLab Software Company -

Add optional type parameters to the NgpsqlConnection<> type provider to configure automatic retrying of queries in the event of NpgsqlException errors, up to N retries, with X milliseconds' delay between retries
Add the same type parameters to the NpgsqlConnection.CreateCommand<> type provider to allow users to enable or disable automatic retrying for individual queries
Retries should not abort open ambient transactions, if any. If possible: retries should check if there are open ambient transactions, and if they have been aborted, they should skip any further retries and fail immediately.
If all retries fail, an AggregateException should be thrown, wrapping all the individual errors encountered during the retries.
When a query is retried, the NpgsqlConnection<> type provider should emit an event that clients can subscribe to in order to log the retry attempt
Add an optional boolean type parameter to the NgpsqlConnection<> type provider. If enabled, any exceptions will be caught and the returned type 'T will be wrapped as a Result<'T, exn>

kerams commented 3 years ago

Why would you want to have this baked in the library rather than using something like Polly on your end? True, you have to wrap every command in a policy, but the configuration options are so much more flexible.

symboliq commented 3 years ago

Here's the explanation I received upon receiving this task -

Connection resiliency.

We use managed Azure PostgreSQL and sometimes, due to either network connection or performance / misconfiguration problems (connection pool exhausted, etc.), our SQL queries run into various kind of transient errors, often SocketExceptions.

All but a few of our queries are idempotent (we use UUID primary keys and ON CONFLICT clauses), so we would like to be able to set a global option to retry those queries in the event of transient errors, instead of having to wait for the entire job to be retried.

Fortunately, Npgsql already distinguishes between errors in the query execution (PostgresException), errors in the type mapping (usually InvalidOperationException) and transient connection errors, which fall under NpgsqlException:

Wrap connection exceptions with NpgsqlException · Issue #2495 · npgsql/npgsql (github.com)

EFCore, for example, uses an optional EnableRetryOnFailure flag that automatically retries queries in the event of transient exceptions. Our old product used raw ADO.NET and was also able to implement automatic retries at the lower level

Getting sporadic SocketExceptions/ExtendedSocketExceptions / Connection Refused using Docker · Issue #619 · npgsql/efcore.pg (github.com) Connection Resiliency - EF Core | Microsoft Docs

Unfortunately, the library FSharp.Data.Npgsql, being a type provider, generates different CreateCommand types for each query, with different AsyncExecute() methods, so as far as I can tell it's not possible to create e.g. an extension member AsyncExecuteWithRetry() that applies to all queries.

Right now, we have a generic Async.retry wrapper that we manually sprinkle on critical queries, but that's not optimal since connections errors can happen anywhere.

If further explanation would be helpful, I'll ping the gentlemen who gave me the requirements (he has a link to this PR, tho it's late where he is so I don't think he's seen it yet).

piaste commented 3 years ago

Why would you want to have this baked in the library rather than using something like Polly on your end? True, you have to wrap every command in a policy, but the configuration options are so much more flexible.

This was indeed our initial approach to the problem. We found it to be suboptimal and frustrating pretty quickly because it was not possible to automatically ensure usage of the wrapper, especially for non-query commands where the result is thrown away.

While looking for alternative solutions, I noticed that having a built-in retry policy for transient errors was a pretty standard feature among .NET database access libraries, e.g.:

We decided that contributing the feature to the library would be less work, more reliable, and less/better code than wrapping several hundred database connections + forever keeping an eye out for unsafe queries in code reviews. It would also be a non-breaking change that could be useful to other users of the library, should you choose to accept the PR.

Granted, you make a strong point that manual wrapping with a dedicated library offers more flexibility.

However, at the low level of individual database queries, I think it's OK to stick to only handling simple transient errors (as the above libraries do). Strategies such as delayed retrying, backpressure, caching etc. IMO benefit from being aware of the higher level context (e.g. sending user notifications, coordinating between multiple clients, etc.). In our case we use Hangfire to define such strategies and monitor them via dashboard, because, for example, some ETL jobs are more stressful than others and should not be queued carelessly, but it's not individual database queries that risk exhausting the DB's resources.

piaste commented 2 years ago

@kerams, since your fork is now effectively the master branch, what do you think of this PR? If I updated it to merge non-breakingly with the current master, would you be willing to merge it?

We've been using it in prod for several months now without issue, but I'd like to keep up with future library updates without maintaining a fork if possible.

kerams commented 2 years ago

your fork is now effectively the master branch

Not quite. It's true that my fork (or, to be precise, the cloned repository) was merged here in its entirety a couple of months ago, but we have sinced diverged a bit; there's even a separate Nuget package. I think I would not mind having this feature over there (if you were to switch), but after a cursory look at the PR, I do have a couple of issues with the implementation.

piaste commented 2 years ago

I'd be OK with switching. There's a feature I'd been working on (NodaTime integration) that might be easier as a breaking change in a private fork, but it's not critical by any means and I'd rather get rid of the fork instead.

Let me know which issues you would like to see addressed, thanks!

daz10000 commented 2 years ago

Hey - sorry we haven't been very chatty for a while here. Philosophically, I'd like to take in more contributions. The world has too many slightly undermaintained repos and I believe in 'less, better, software' - a few well-maintained things. If you want to address any thoughts @kerams has on improvements, I think we'd be happy to take the change (though we don't use that functionality). @kerams - I'd also like to say thanks for all the improvements and we'd be open to any suggested improvements you have (we obviously liked your fork enough to take the improvements back). We tend to have time in waves to improve this - we did a bunch of things for the net6 push recently. My only ask for contributions is to try to have decent test examples and where possible extend the doc file so there's at least one example of the feature for future generations.

That all said, I'm happy for the code to be useful in any way, including forks. If we have to put in a bit more work to help concentrate the work in fewer places though, that's a good investment for us (IMO)

demetrixbio / FSharp.Data.Npgsql

FSharp.Data.Npgsql retry functionality and related enhancements. #102