JasperFx / marten

.NET Transactional Document DB and Event Store on PostgreSQL
https://martendb.io
MIT License
2.85k stars 450 forks source link

Polly retries not working when getting connection failures. #3375

Closed Erwinvandervalk closed 1 month ago

Erwinvandervalk commented 2 months ago

Hi guys,

I've experienced some issues with using MartenDB in Azure (postgresdb flexible server). When a maintenance event happens, I'm getting a lot of transient errors in the logs that seem like they are not being picked up by the retry policy.

I've tried to simulate this issue below in a minimal repro. The code just starts to poll martendb in the background. Then, if I pause / stop my database (it's running locally in docker, so can pause / stop it from docker desktop), then the retry policy doesn't seem to be used. I can clearly see that the query is not retried.

// See https://aka.ms/new-console-template for more information

using System.ComponentModel;
using Marten;
using Marten.Exceptions;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using Npgsql;
using Polly;
using Polly.Retry;
using Serilog;
using Weasel.Core;

var host = CreateHostBuilder(args).Build();
await host.RunAsync();

static IHostBuilder CreateHostBuilder(string[] args) =>
    Host.CreateDefaultBuilder(args)
        .ConfigureServices((builder, services) =>
        {

            services.AddSingleton<IHostedService, TimedHostedService>();
            // This is the absolute, simplest way to integrate Marten into your
// .NET application with Marten's default configuration
            services.AddMarten(options =>
            {
                // Establish the connection string to your Marten database
                options.Connection("User ID=postgres;Password=postgres;Host=localhost;Port=5432;Database=postgres;Timeout=4;Command Timeout=4");

                // Specify that we want to use STJ as our serializer
                options.UseSystemTextJsonForSerialization();

                options.Schema.For<TestDocument>();
                options.ConfigurePolly(builder =>
                {
                    builder
                        .AddRetry(new RetryStrategyOptions
                        {
                            ShouldHandle = new PredicateBuilder()
                                .Handle<NpgsqlException>()
                                .Handle<MartenCommandException>(),
                            Delay = TimeSpan.FromMilliseconds(100),
                            MaxRetryAttempts = 3,
                            UseJitter = true,
                            OnRetry = args =>
                            {
                                Console.WriteLine($"Retry Attempt Number : {args.AttemptNumber} after {args.Duration.TotalMilliseconds} ms.");
                                return default;
                            }
                        })
                        .Build();
                });
                options.AutoCreateSchemaObjects = AutoCreate.All;
            }).UseLightweightSessions();
        })
;

public class TestDocument
{
    public Guid Id { get; set; }
    public string Name { get; set; }
}
public class TimedHostedService(ILogger<TimedHostedService> logger, ISessionFactory sessionFactory)
    : IHostedService, IDisposable
{
    private int executionCount = 0;
    private readonly ILogger<TimedHostedService> _logger = logger;
    private Task _task;
    private CancellationTokenSource _cancellationTokenSource = new();

    public Task StartAsync(CancellationToken stoppingToken)
    {
        Console.WriteLine("Timed Hosted Service running.");

        _task = Task.Run(async () =>
        {
            while (!_cancellationTokenSource.Token.IsCancellationRequested)
            {
                await DoWork(_cancellationTokenSource.Token);
                await Task.Delay(1000, _cancellationTokenSource.Token);
            }
        });

        return Task.CompletedTask;
    }

    private async Task DoWork(CancellationToken ct)
    {
        Console.WriteLine("do work");
        var count = Interlocked.Increment(ref executionCount);

        try
        {
            var items = await sessionFactory.OpenSession().Query<TestDocument>()
                .ToListAsync(ct);

            Console.WriteLine($"Found {items.Count} items");
        }
        catch (Exception e)
        {
            Console.WriteLine(e.Message);
        }

        Console.WriteLine(
            $"Timed Hosted Service is working. Count: {count}");
    }

    public async Task StopAsync(CancellationToken stoppingToken)
    {
        Console.WriteLine("Timed Hosted Service is stopping.");
        if (!_cancellationTokenSource.IsCancellationRequested)
        {
            _cancellationTokenSource.Cancel();
        }

        await _task;
    }

    public void Dispose()
    {
        if (!_cancellationTokenSource.IsCancellationRequested)
        {
            _cancellationTokenSource.Cancel();
        }
    }
}

What have I tried (without success):

  1. Use a querysession, lightweightsession, dirtytracked session
  2. debug the marten sourcecode. I can see that QuerySession.ExecuteReaderAsync is called using the resillience pipeline. However, it looks like it's not catching the exception.
  3. Change the pipeline to catch all exceptions.
  4. Wrap the ToListAsync() call in a resillience pipeline myself (this does catch / retry the error)

Please let me know if you need more information or if there is something else I can do / try.

jeremydmiller commented 2 months ago

@Erwinvandervalk What's the actual exception that gets thrown in your case? If anything, this is an incorrect configuration of Polly. I'm not seeing anything that's particularly actionable right now

Erwinvandervalk commented 2 months ago

Hi Jeremy, Thanks for your reply.

Here's the exception that I get. This is the same exception that I get in Azure, but as far as I understand it's the default exception that happens if a timeout occurs.

Marten.Exceptions.MartenCommandException: Marten Command Failure:$

Postgresql timed out while trying to read data. This may be caused by trying to read locked rows
$
$
Exception while reading from stream
 ---> Npgsql.NpgsqlException (0x80004005): Exception while reading from stream
 ---> System.TimeoutException: Timeout during reading attempt
   at Npgsql.Internal.NpgsqlReadBuffer.<Ensure>g__EnsureLong|55_0(NpgsqlReadBuffer buffer, Int32 count, Boolean async, Boolean readingNotifications)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
   at Npgsql.Internal.NpgsqlConnector.ReadMessageLong(Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at Npgsql.Internal.NpgsqlConnector.ReadMessageLong(Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
   at Marten.Internal.Sessions.AutoClosingLifetime.ExecuteReaderAsync(NpgsqlBatch batch, CancellationToken token)
   at Marten.Internal.Sessions.AutoClosingLifetime.ExecuteReaderAsync(NpgsqlBatch batch, CancellationToken token)
   at Marten.Linq.MartenLinqQueryProvider.ExecuteHandlerAsync[T](IQueryHandler`1 handler, CancellationToken token)
   --- End of inner exception stack trace ---
   at JasperFx.Core.Exceptions.ExceptionTransformExtensions.TransformAndThrow(IEnumerable`1 transforms, Exception ex)
   at JasperFx.Core.Exceptions.ExceptionTransforms.TransformAndThrow(Exception ex)
   at Marten.Exceptions.MartenExceptionTransformer.WrapAndThrow(Exception exception)
   at Marten.Linq.MartenLinqQueryProvider.ExecuteHandlerAsync[T](IQueryHandler`1 handler, CancellationToken token)
   at Marten.Linq.MartenLinqQueryable`1.ToListAsync[TResult](CancellationToken token)
   at TimedHostedService.DoWork(CancellationToken ct) in C:\Users\erwin\RiderProjects\MartenDbRetry\MartenDbRetryConsole\Program.cs:line 133

Sorry, but to me it does look like the ResilliancePipeline isn't kicking in. I've tried to configure it to catch all exceptions and still it's not retrying. Also, if I use exactly the same polly configuration and wrap the ToListAsync() with it, then the retry policy does kick in so to me it looks like the policy is correct.

Now it's a bit strange.. in the past I've tested the retry policy by doing a table wide lock and then I did see the retry policy kick in. Perhaps the difference here is that the connection itself isn't responding? (I think this is also literally what's happening in azure, after a planned maintenance event. PgBouncer still has a number of connections that are no longer valid and only after trying them will they be removed).

Anyway, if there is anything you'd like me to do to help figure out what's wrong, please let me know. Otherwise I'll proceed with a workaround which is to wrap the calls to martendb in my own retry policy.

Thanks!

jeremydmiller commented 2 months ago

@Erwinvandervalk Looked at this again, traced through the code, and it's definitely being called through the ResiliencePipeline just like I expected it to be. I tried your reproduction, and I can see where it's happily going into the resilience pipeline, throwing an exception we expect to be caught, then doing nothing.

I think it's time to talk to the Polly team about this one.

Erwinvandervalk commented 1 month ago

I did some more tracing. Initially I was as confused as you were about why the retry wasn't being hit, but I think I've found something. I've created a PR with what I think is a fix, though I haven't been able to try it out myself.

https://github.com/JasperFx/marten/pull/3384

jeremydmiller commented 1 month ago

Finally closed by #3384