Particular / NServiceBus.Persistence.Sql

Native SQL Persistence for NServiceBus
https://docs.particular.net/persistence/sql/
Other
37 stars 27 forks source link

Handling of control message fails when using SQL outbox with pessimistic concurrency control #1194

Open pardahlman opened 1 year ago

pardahlman commented 1 year ago

Describe the bug

Description

When a SQL outbox is configured to use pessimistic concurrency control the handling of the control message fails with duplicate key violation, but succeeds after retry.

Expected behavior

No unexpected exceptions to be thrown by NServiceBus. In theory, this can be problematic as the exception is logged and can be ingested in log aggregators and potentially be used in alerting/diagnostics.

Actual behavior

Exception thrown.

Versions

NServiceBus 8.0.3 NServiceBus.Persistence.Sql.TransactionalSession 7.0.1 NServiceBus.TransactionalSession 2.0.2

Steps to reproduce

Clone the repo app and follow instructions for Reproduce pessimistic concurrency issue.

Relevant log output

Immediate Retry is going to retry message '4bd21c06-fd66-4632-9de2-700c462eb5a0' because of an exception:
System.Exception: Error while opening outbox transaction
 ---> System.Exception: Failed to ExecuteNonQuery. CommandText:

insert into "public"."Sender_OutboxData"
(
    "MessageId",
    "Operations",
    "PersistenceVersion"
)
values
(
    @MessageId,
    '[]',
    @PersistenceVersion
)
 ---> Npgsql.PostgresException (0x80004005): 23505: duplicate key value violates unique constraint "Sender_OutboxData_pkey"

Additional Information

danielmarbach commented 1 year ago

@pardahlman I added a pull request that verifies the pessimistic locking scenarios and they all pass.

https://github.com/Particular/NServiceBus.Persistence.Sql/pull/1195

When I checkout your repro sample I can also reproduce it with PostGres but when I switch to SQL Server it works. Might be related to the way pessimistic locking is implemented in the PostGres dialect. Are you planning to use PostGres or do you have it already in use?

oskar commented 1 year ago

That's interesting! Yes, we do use both SQL server and Postgres in production in combination with UsePessimisticConcurrencyControl and we are slowly migrating towards more and more Postgres databases.

But we don't yet use Transactional Session in production.

danielmarbach commented 1 year ago

@pardahlman @oskar Sorry for the delayed response. This issue required me to do some brain gymnastics ;)

SQL Persistence offers pessimistic concurrency control on the outbox level. The pessimistic concurrency control is implemented within the outbox begin transaction by proactively inserting an empty outbox record for a given message id. With that in place concurrent execution for the same message ID would lead to a row lock on the outbox record for the given message ID until the first transaction has committed and then throw a unique key violation exception which causes the message to retry and then the subsequent retry will see the outbox record on outbox get causing it to be deduplicated.

The name pessimistic is a bit misleading because you still get optimistic retries due to concurrency conflicts in some cases. With SQL server it seems that when the outbox record is inserted in the begin transaction the key range for the record gets locked which causes concurrent outbox gets to be blocked until the transaction commits.

With postgres the story is different. Both concurrent gets will succeed and return that no outbox record has been found. Then the first begin transaction will insert the record which causes the second concurrent begin transaction to fail on inserting the outbox record which leads to retrying of the message. While the end result is that "there might be retries" the observed behavior is "pessimistic" from the perspective of the handler execution. The problem is that in the postgres case that the immediate retries will always happen (due to the mentioned non-existing lock on the key range). The exception gets logged in the pipeline with the immediate retry ultimately spamming the log.

In the regular endpoints this behavior is rarely ever visible. Not in the transactional session case though. In the transactional session we dispatch the outbox control message early and then begin the transaction on the transactional session side. When the message is handled in the receiving endpoint the outbox begin transaction will immediately fail and go through a retry without the transactional session delay behavior ever having a chance to interfere. This causes the log to be spammed for every transactional session control message with the exception you see.

We highlight the potential exception that can happen in our documentation as well. See

Once the first thread commits the transaction, the second thread is interrupted with an exception as it is not allowed to insert the outbox. As a result, the message handlers are executed only once.

https://docs.particular.net/persistence/sql/outbox#concurrency-control-pessimistic-concurrency-control

So while I think we could say: Works as design and documented I do agree with the part that mentions

In theory, this can be problematic as the exception is logged and can be ingested in log aggregators and potentially be used in alerting/diagnostics.

and I think we could do better. Yet it is hard to say when we are going to prioritize this issue.

danielmarbach commented 1 year ago

Hi @pardahlman. We've reviewed your issue and added it to our backlog. It's eligible to be picked up for a future release, but we can't provide a firm timeline.

pardahlman commented 1 year ago

Thanks for the detailed response, its much appreciated 🙏 What you say makes sense and I agree that the behavior is according to the documentation.

Here's a brief overview of the situation in our applications: In order to ensure that NServiceBus is configured the same across all applications we have a thin API on top of the official ones that configures certain aspects of the endpoint. There is one extension method used to configure Postgres:

endpoint.WithPostgresOutbox(connectionString)

This method configures the outbox to use Pessimistic concurrence control (amongst other things). We want to use this method when using Transactional session. It was when we explored this that we discovered the behavior reported here.

Would you agree that using Pessimistic concurrence control has little value when using Transactional session, as there is no risk for message duplication or side-effects outside of the transaction (comparing with a normal handler)?

One option would be to disable pessimistic concurrence as we configure Transactional session. If I read the code correctly, it should be possible to do by setting Persistence.Sql.Outbox.PessimisticMode to false. Is there a reason why there the API to configure optimistic/pessimistic concurrency mode is one way (e.g. it can be enabled but not disabled)

pardahlman commented 1 year ago

@danielmarbach I'm not sure if you saw my question above (I forgot to ping you). It would be helpful if you'd find the time to provide some guidance on the topic.

danielmarbach commented 1 year ago

@pardahlman excuse me for the delayed response. There were a number of public holidays plus a conference happening that has blocked me from paying the necessary attention to be able to type a reply :(

Would you agree that using Pessimistic concurrence control has little value when using Transactional session, as there is no risk for message duplication or side-effects outside of the transaction (comparing with a normal handler)?

I would agree that currently there is little value. Pessimistic concurrency control on the outbox level was introduced for SQLP and NHibernate to protect the handler execution against double execution in case a message lock is lost, for example. For the transactional session control message handling, there is little value for this behavior since we never execute the rest of the message handling pipeline for those. That being said, I guess the problem is that it is a global flag for the endpoint and the endpoint might be processing other messages where you might benefit for pessimistic concurrency. Currently, we do not provide the possibility to have finer grained control or send the transactional session messages to a dedicated handling endpoint that runs outside the hosting environment that uses the transactional session (for example send-only endpoints).

And once we support controlling the session id to enable retries, I guess pessimistic concurrency support might be beneficial again potentially since the session id is no longer uniquely generated per request by in the hands of the users

One option would be to disable pessimistic concurrence as we configure Transactional session

:+1:

Is there a reason why there the API to configure optimistic/pessimistic concurrency mode is one way (e.g. it can be enabled but not disabled)

That is a good question. I wasn't involved in the API design as far as I recall. I think many of our configuration APIs assume you set it up in one way or the other but don't flip the behavior. I guess in your case you could simply expose an optional bool value that enables pessimistic concurrency by default or doesn't when the flag is switched.

pardahlman commented 1 year ago

Thanks for getting back on this. It sounds like a lot of exciting features are coming to Transactional session. We currently have separate endpoints for ASP.NET Core APIs and normal message handling, which makes disabling pessimistic concurrency a viable option.