Particular / NServiceBus.Persistence.Sql

Native SQL Persistence for NServiceBus
https://docs.particular.net/persistence/sql/
Other
36 stars 27 forks source link

make the outbox batch size configuable. #1209

Open JakkeFiejest opened 1 year ago

JakkeFiejest commented 1 year ago

Describe the suggested improvement

At the moment the batch size when the outbox is cleaned up, is set to 10000. Changing this value is not explosed in the public configuration.

This however causes issues when using Azure SQL-servers in a DTU model. These DTU models are a super nice fit in for altering commands and we can easiliy grow the application and database (more load = more dtu's and more logical message processors) It also allows for a good implementation to have one datasource per service.

But all those benifits gow out the window as soon as the outbox starts cleaning up and needs to clean more then 10000 messgaes in one cycle. This is because cleaning that amount of messages takes up all DTU's when creating the log to rollbak the query. And it can have bad results causing even a complete crash of the messaging pipeline when we have to much consecutive fails.

At the moment we only have two workarrounds: massivly overscaling the database or disabling the cleanup and moving it to a dedicated process/sql job. which are both lesser options.

It would be better if the batch size could be exposed via the endpoint configuration, escpecially because the value can be set via the internal constructor. This whay the outbox can be also configured/tweaked with the expected load of the service.

Additional Context

No response

DavidBoike commented 1 year ago

Thanks @JakkeFiejest, this is a good idea. I actually peeked at the code and found that the query is already parameterized, we just have not exposed an API for setting that value.

I'm not sure when we'll get around to doing it, as it would probably need to be grouped with other improvements to persistence stuff.

As a workaround, I actually would encourage you to disable Outbox cleanup entirely within the endpoint and implement the cleanup as a SQL Agent task to run on a schedule. The query is pretty simple and aside from being tweakable to the batch size that works for you, it also has the advantage of having one canonical place to run. When an endpoint is scaled out, the individual endpoint instances don't have any way to coordinate and so they end up competing to clean the outbox, which is arguably even worse for the DTU model.

JakkeFiejest commented 1 year ago

Will this be tackled soon? small side remark, on azure database with a dtu model, you cannot schedule tasks (we found a way around it with a cron-job)

ttran-ipkg commented 1 year ago

For reference, the batch size for the outbox cleanup was changed from 10_000 to 4_000 (along with other lock related improvements) in:

ramonsmits commented 11 months ago

FYI: https://github.com/Particular/NServiceBus.Persistence.Sql/releases/tag/7.0.3 was released. Does that bring any benefits?

Using larger values will be more likely to result in deadlocks. Even lower values would result in each cleanup interval to even require more sequential cleanup queries.

@JakkeFiejest Even if we expose the batchsize making it smaller will only increase the number of queries for each cleanup interval. Could you please clarify to what value you would want to set it and what behavior requirements you would have when using that value?

JakkeFiejest commented 11 months ago

Hi,

At the moment we run the cleanup in a different application (because the Azure DTU model does not support scheduling tasks) But there we see benefits from the rowlock (we also use it in the cleaup) and also the benifit from better using the query plan executer. However, the big reason why the setting the cleanup interval size could be useful, is because the DTU model allows you to scale the database to your throughput, etc... however deleting 4000 records from a heap table will screw up the dtu model, because the transaction log that you have, is to small. So basically you want to set your delete size to match what the dtu's allow. (but the changes do improve the issues with the application crashing due to outbox delete timeouts, because not all apps in our microservices landscape have the outbox cleanup out of process, only the business critical ones)

KInd regards, Jan

Op do 16 nov 2023 om 17:40 schreef Ramon Smits @.***>:

FYI: https://github.com/Particular/NServiceBus.Persistence.Sql/releases/tag/7.0.3 was released. Does that bring any benefits?

Using larger values will be more likely to result in deadlocks. Even lower values would result in each cleanup interval to even require more sequential cleanup queries.

@JakkeFiejest https://github.com/JakkeFiejest Even if we expose the batchsize making it smaller will only increase the number of queries for each cleanup interval. Could you please clarify to what value you would want to set it and what behavior requirements you would have when using that value?

— Reply to this email directly, view it on GitHub https://github.com/Particular/NServiceBus.Persistence.Sql/issues/1209#issuecomment-1814820238, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJGG3KDNA6EMJVLPDJ5G2B3YEY6V7AVCNFSM6AAAAAAYHOUTWGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJUHAZDAMRTHA . You are receiving this because you were mentioned.Message ID: @.***>

ramonsmits commented 11 months ago

So basically you want to set your delete size to match what the dtu's allow.

@JakkeFiejest So you would want to increase the size? The problem with increasing is that this likely will result in lock escalation.

You likely also suggest to then not only increase the batch size but also the cleanup interval duration so that is runs less?

Do you have example size for different DTU's or have a link to documentation on this?