JasperFx / wolverine

Supercharged .NET server side development!
https://wolverinefx.net

Outbox'ed messages not dispatched after app-crash #565

Closed: mhartwig22 closed this 4 months ago

mhartwig22 commented 12 months ago

Hi,

I set up a simple SQS-backed outbox app (based on EF Core and persistence to SQL Server). When a simple controller method is triggered, I use the injected IDbContextOutbox to PublishAsync(...) 1,000 messages. As expected, upon calling SaveChangesAndFlushMessagesAsync(), they get persisted in the db and delivery to SQS starts.
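
The publishing path is roughly the sketch below (simplified from the actual controller, which is posted in full further down the thread):

    [HttpGet]
    public async Task<ActionResult> Produce([FromServices] IDbContextOutbox<SampleDbContext> outbox)
    {
        for (int i = 0; i < 1000; ++i)
        {
            await outbox.PublishAsync(new ListMessage { MessageValue = "testValue" });
        }

        // Persists the envelopes to [wolverine_outgoing_envelopes] in the same
        // transaction as the DbContext changes, then starts pushing them to SQS
        await outbox.SaveChangesAndFlushMessagesAsync();
        return Ok();
    }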

Crashing the app halfway through the process leaves 955 of the messages in the SQS queue and 232 items in the [wolverine_outgoing_envelopes] table. Upon restarting the app, the outstanding messages are not delivered (I waited for over an hour).

Am I missing some configuration? I noticed the old node_id is still registered in the [wolverine_node_assignments] table. I can reproduce this every single time.

mhartwig22 commented 12 months ago

Update: The messages have now been pushed to SQS. While I didn't take note of the timing exactly, it definitely took well over an hour to do so. Can this be tweaked? Should this be tweaked?

jeremydmiller commented 12 months ago

@mhartwig22 It's really hard to know what happened without being able to see your app logs. If there was a delay like that, I'd guess that there was an issue in leadership election for some reason. Are you running multiple nodes, or just one?

mhartwig22 commented 12 months ago

Just one node - under the debugger, @jeremydmiller. I was suspecting the leader election as well. I couldn't see anything in the logs at all. (This is a standalone example project - it doesn't do anything but this test.) I'm glad it eventually published the messages, but the delay is far longer than I wanted it to be. It's also interesting that, when starting the app after a crash, it's happily dispatching newly sent messages - just not the ones which are old/waiting in the outgoing table (which is also problematic, as we definitely end up with severe out-of-order messages).

mhartwig22 commented 12 months ago

And here's another interesting case... I know, I end up stress testing Wolverine :)

Publishing 5000 messages via:

    for (int i = 0; i < 5000; ++i)
        await _ctx.PublishAsync(new ListMessage
        {
            MessageValue = "testValue"
        });
    await _ctx.SaveChangesAndFlushMessagesAsync();

Ended up with 6000 (?) messages in the queue and some log output like:

Wolverine.RDBMS.DurabilityAgent: Information: Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
Wolverine.RDBMS.DurabilityAgent: Information: Recovered 100 messages from outbox for destination sqs://wolverinefxtest/  while discarding 0 expired messages

(This showed up exactly 10 times in the logs, which explains the surplus of 1,000 messages.)

Is this to be expected?

jeremydmiller commented 12 months ago

No, that's absolutely not expected. The "recoverable outgoing messages" are most likely leftovers from a process crashing without being able to shut down normally and "drain" the outgoing activity. The only exception case I can think of is if you were having circuit breakers pop off on sending. Are you seeing anything unusual-looking in the logs? It should be screaming in the logs about any kind of circuit breaker. It should also be logging errors if it were having trouble with the leadership being slow to start up the message recovery that you said was very slow before.

mhartwig22 commented 12 months ago

When consuming those 6000 messages now, I noticed a second node appearing in the db node table after a few minutes - while the process is running. It seems to happen every time the app is super busy, either pushing masses of messages or consuming them.

I guess it means that some sub-process is crashing and a new one is starting? (Not sure, I'm not familiar with the internals of WolverineFx.)

In the above example, the log is full of failures

Wolverine.Runtime.WorkerQueues.DurableReceiver: Error: Duplicate incoming envelope detected
...
Wolverine.Persistence.Durability.DuplicateIncomingEnvelopeException: Duplicate incoming envelope with message id 018ad153-27cf-4cab-90ec-69a23021d38f
...
Microsoft.Data.SqlClient.SqlException (0x80131904): Violation of PRIMARY KEY constraint 'pkey_wolverine_incoming_envelopes_id'. Cannot insert duplicate key in object 'dbo.wolverine_incoming_envelopes'. The duplicate key value is (018ad153-271c-4c28-ac7f-0569e2709bde).
The statement has been terminated.
...

...due to "duplicated messages" - which is expected - so it's hard to see other exceptions. I did search for "breaker" or "circuit" but nothing came up. I can try and run a cleaner test with tweaked log levels.

jeremydmiller commented 12 months ago

A "node" would just be one application. Wolverine itself isn't spinning up multiple IHosts. And you might see more than one node record temporarily after an application crashes, but that should be cleaned up on restart.

Are your messages small? You might try to jack up the batch size for SQS queues in Wolverine.
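
Against the publishing rule you'd presumably have for that queue, that would be something along these lines (the 100 here is only an illustrative value):

    opts.PublishMessage<ListMessage>()
        .ToSqsQueue("wolverinefxtest")
        .UseDurableOutbox()
        .MessageBatchSize(100); // bigger batches mean fewer round trips to SQS per flush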

The duplicated message above is Wolverine's idempotent message delivery kicking in. So you are definitely having something funky going on on the sending side of things.

jeremydmiller commented 12 months ago

How are you kicking this off, and what's the setup? Admittedly, Wolverine has been load and chaos tested a lot more with the Postgres storage, but the code isn't that different between the Postgres and SQL Server support. And what caused the process crash, btw?

mhartwig22 commented 12 months ago

Thanks @jeremydmiller, I would have guessed WolverineFx would have been stress tested already. I haven't ruled out a user error yet, of course (and hope that's what it is!), but this app is just a test app and so simple. The initial crash scenario I described at the top is not a crash per se - I just kill the app under the debugger to simulate a crash "mid-publishing" to see how the system behaves.
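
(For what it's worth, the same kind of crash can be forced from inside the app too, e.g. by fail-fasting while delivery is still in flight; I've just been killing the process from the debugger. Purely illustrative:)

    // Persist the envelopes first, then kill the process while delivery to SQS is
    // still in flight, leaving recoverable rows behind in the outgoing table
    await ctx.SaveChangesAndFlushMessagesAsync();
    await Task.Delay(500);  // let some, but not all, of the sends happen
    Environment.FailFast("simulated crash mid-delivery");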

Anyway, this is the setup I'm using (based on the standard VS web app template project):

            builder.Services.AddDbContextWithWolverineIntegration<SampleDbContext>(x =>
            {
                x.UseSqlServer(connectionString);
            });

            // Configure WolverineFx for a durable in- and outbox
            builder.Host.UseWolverine(opts =>
            {
                opts.Discovery.IncludeAssembly(typeof(Program).Assembly);
                opts.UseEntityFrameworkCoreTransactions();

                opts.PersistMessagesWithSqlServer(connectionString);
                opts.Durability.KeepAfterMessageHandling = TimeSpan.FromHours(1);

                opts.UseAmazonSqsTransport(c =>
                    {
                        c.RegionEndpoint = RegionEndpoint.EUWest1;
                    })
                    .Credentials(new BasicAWSCredentials("xxx", "xxx"))
                    ;

                opts.PublishMessage<CustomController.ListMessage>()
                    .ToSqsQueue("wolverinefxtest")
                    .UseDurableOutbox()
                    .MessageBatchSize(5)
                    ;
                if (InConsumerMode)
                {
                    opts.ListenToSqsQueue("wolverinefxtest"
                            , opt => { })
                        .UseDurableInbox()
                        .ListenerCount(10)
                        ;
                    opts.OnException<Exception>()
                        .RetryWithCooldown(30.Seconds());
                }
            });

The database is MS SQL Server Express 15.0.2. The app runs either in producer or consumer mode. Again, this is all just for testing and understanding WolverineFx.

The producing code is just a GET endpoint in a controller, which I trigger by browsing to it:

    public class ListMessage
    {
        public string EventId { get; set; } = Guid.NewGuid().ToString();
        public string MessageValue { get; set; }
    }

    [HttpGet]
    public async Task<ActionResult> Produce([FromServices] IDbContextOutbox<SampleDbContext> ctx)
    {
        if (!InConsumerMode)
        {
            for (int i = 0; i < 5000; ++i)
                await ctx.PublishAsync(new ListMessage
                {
                    MessageValue = "testValue"
                });
            await ctx.SaveChangesAndFlushMessagesAsync();
        }
        return Ok();
    }

The consuming handler looks like this:

    public class MessageHandler
    {
        public async Task ConsumeAsync(CustomerController.ListMessage message
            , IMessageContext context
            , SampleDbContext dbContext
            )
        {
            dbContext.Add(message); // let's keep track of what we receive from the queue
            await dbContext.SaveChangesAsync();
        }
    }

It's a super simple test setup, and the problems do not occur with "occasional messaging"; but when I hammer the system as above, it fails/misbehaves every time (very slow leader election, many additional messages pushed, etc.). Let me know if I misconfigured anything or if I can do anything to help track the issue down. I love WolverineFx and am keen to use it within our services, so I'm more than happy to lend a hand.

jeremydmiller commented 12 months ago

It has been stress tested, but not in the exact way and configuration that you're using it in. That gives me some concrete reproduction steps, and thank you for that.

If the leader election is slow here, I might be pointing at the Sql Server Express model a bit. And you could easily improve the performance by upping the message batch size.
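
If you're only ever going to run a single node for this kind of test, it's also worth checking whether the Wolverine version you're on exposes the solo durability mode, which skips the leadership election machinery altogether. Assuming it's available to you, that would look roughly like:

    builder.Host.UseWolverine(opts =>
    {
        // Assumes a Wolverine version exposing DurabilityMode; tells the runtime
        // there is only one node, so no leader election / agent reassignment dance
        opts.Durability.Mode = DurabilityMode.Solo;

        // ...rest of the configuration as before
    });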

But also, are you really going to be publishing 5000 messages from a single request? It might be much more realistic to stress test this with more concurrent ASP.NET Core requests to simulate load. I think there's a possibility that SQL Server specifically might not like that many rows being inserted at one time, but that's also something to play with.

mhartwig22 commented 12 months ago

Hey @jeremydmiller, thank you. No, of course not, I wouldn't publish messages like this (nor in that amount)! This is not what one would do :) I literally knocked up this test app to stress test the library and gauge its robustness. I do believe it highlights some issue, though.

Btw, I just ran the "producer" mode again, producing 5000 messages. The setup is exactly as in the post above, and I also deleted the database for fresh tables. Still, the issue reoccurred and I ended up with 6159 messages in the queue. This time, however, I got some concise logs, which may (or may not) be insightful for you. There is not a single error or failure - but there are messages about "node" ownership.


info: Wolverine.Runtime.WolverineRuntime[0]
      Initializing the Wolverine AmazonSqsTransport
info: Wolverine.Transports.ListeningAgent[0]
      Started message listening at dbcontrol://34be424f-e427-47d4-a82a-383fbf0b7a04/
info: Wolverine.Runtime.Agents.NodeAgentController[0]
      Starting agents for Node 34be424f-e427-47d4-a82a-383fbf0b7a04 with assigned node id 1
info: Wolverine.Runtime.Agents.NodeAgentController[0]
      Found no other existing nodes, deciding to assume leadership in node 34be424f-e427-47d4-a82a-383fbf0b7a04
info: Wolverine.Runtime.Agents.NodeAgentController[0]
      Node 34be424f-e427-47d4-a82a-383fbf0b7a04 successfully assumed leadership
info: Microsoft.Hosting.Lifetime[14]
      Now listening on: https://localhost:7019
info: Microsoft.Hosting.Lifetime[14]
      Now listening on: http://localhost:5092
info: Microsoft.Hosting.Lifetime[0]
      Application started. Press Ctrl+C to shut down.
info: Microsoft.Hosting.Lifetime[0]
      Hosting environment: Development
info: Microsoft.Hosting.Lifetime[0]
      Content root path: C:\GitRoots\EventDrivenArchExample\EventDrivenArchExample\
info: Wolverine.Runtime.Agents.NodeAgentController[0]
      Successfully started agent wolverinedb://default/
info: Wolverine.Runtime.WolverineRuntime[0]
      Successfully started agent wolverinedb://default/ on node 34be424f-e427-47d4-a82a-383fbf0b7a04
info: Wolverine.RDBMS.DurabilityAgent[0]
      Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
info: Wolverine.RDBMS.DurabilityAgent[0]
      Recovered 100 messages from outbox for destination sqs://wolverinefxtest/ while discarding 0 expired messages
info: Wolverine.RDBMS.DurabilityAgent[0]
      Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
info: Wolverine.RDBMS.DurabilityAgent[0]
      Recovered 100 messages from outbox for destination sqs://wolverinefxtest/ while discarding 0 expired messages
info: Wolverine.RDBMS.DurabilityAgent[0]
      Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
info: Wolverine.RDBMS.DurabilityAgent[0]
      Recovered 100 messages from outbox for destination sqs://wolverinefxtest/ while discarding 0 expired messages
info: Wolverine.RDBMS.DurabilityAgent[0]
      Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
info: Wolverine.RDBMS.DurabilityAgent[0]
      Recovered 100 messages from outbox for destination sqs://wolverinefxtest/ while discarding 0 expired messages
info: Wolverine.RDBMS.DurabilityAgent[0]
      Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
info: Wolverine.RDBMS.DurabilityAgent[0]
      Recovered 100 messages from outbox for destination sqs://wolverinefxtest/ while discarding 0 expired messages
info: Wolverine.RDBMS.DurabilityAgent[0]
      Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
info: Wolverine.RDBMS.DurabilityAgent[0]
      Recovered 100 messages from outbox for destination sqs://wolverinefxtest/ while discarding 0 expired messages
info: Wolverine.RDBMS.DurabilityAgent[0]
      Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
info: Wolverine.RDBMS.DurabilityAgent[0]
      Recovered 100 messages from outbox for destination sqs://wolverinefxtest/ while discarding 0 expired messages
info: Wolverine.Runtime.WolverineRuntime[0]
      Releasing node ownership in the inbox/outbox from dormant node 0
info: Wolverine.RDBMS.DurabilityAgent[0]
      Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
info: Wolverine.RDBMS.DurabilityAgent[0]
      Recovered 100 messages from outbox for destination sqs://wolverinefxtest/ while discarding 0 expired messages
info: Wolverine.RDBMS.DurabilityAgent[0]
      Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
info: Wolverine.RDBMS.DurabilityAgent[0]
      Recovered 100 messages from outbox for destination sqs://wolverinefxtest/ while discarding 0 expired messages
info: Wolverine.RDBMS.DurabilityAgent[0]
      Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
info: Wolverine.RDBMS.DurabilityAgent[0]
      Recovered 100 messages from outbox for destination sqs://wolverinefxtest/ while discarding 0 expired messages
info: Wolverine.RDBMS.DurabilityAgent[0]
      Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
info: Wolverine.RDBMS.DurabilityAgent[0]
      Recovered 100 messages from outbox for destination sqs://wolverinefxtest/ while discarding 0 expired messages
info: Wolverine.RDBMS.DurabilityAgent[0]
      Found recoverable outgoing messages in the outbox for sqs://wolverinefxtest/
info: Wolverine.RDBMS.DurabilityAgent[0]
      Recovered 59 messages from outbox for destination sqs://wolverinefxtest/ while discarding 0 expired messages

mhartwig22 commented 12 months ago

Btw, the above is without any crash simulation. It's literally just running the app and producing 5k messages.

jeremydmiller commented 12 months ago

Like I said, it gives me something concrete to play with. I do think it would be a more realistic test to use many more simultaneous ASP.NET Core requests (that's how I've load tested Wolverine with RabbitMQ and PostgreSQL-backed persistence), but with fewer messages spawned from each request. While what you're doing is showing some concerning results, it's also not terribly likely in real-life usage.
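
Something like the sketch below is closer to how I'd drive it - lots of concurrent callers, each publishing a handful of messages per request (the endpoint route and the counts here are just placeholders):

    // Client-side load driver: 50 concurrent callers, 20 requests each
    using var client = new HttpClient { BaseAddress = new Uri("https://localhost:7019") };

    var callers = Enumerable.Range(0, 50).Select(async _ =>
    {
        for (var i = 0; i < 20; i++)
        {
            var response = await client.GetAsync("/produce"); // placeholder route
            response.EnsureSuccessStatusCode();
        }
    });

    await Task.WhenAll(callers);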

Your takeaway here is that I will absolutely take what you're doing and see if I can reproduce it. I'm pointing the finger here at the SQL Server mechanics more than at any particular concern about the SQS mechanics.

mhartwig22 commented 12 months ago

Thanks @jeremydmiller, and agreed, it's not a real-life scenario, but - as you say - the results are a little unsettling. It would be interesting to know what you find. Very much appreciate it!

jeremydmiller commented 12 months ago

"the results are a little unsettling"

Maybe just a little bit of perspective here. You're using a very unlikely scenario with a non-production database engine on a local development box.

jeremydmiller commented 4 months ago

No recent activity and no reports of this issue in the wild; I think this was all contrived.