Particular / NServiceBus

Build, version, and monitor better microservices with the most powerful service platform for .NET
https://particular.net/nservicebus/
Other
2.07k stars 647 forks source link

Should core enforce behavior on the transport when onError throws? #4568

Open bording opened 7 years ago

bording commented 7 years ago

As brought up in the comments to #4561 starting here, there is a difference in how the various transports handle the scenario of the onError func throwing an exception.

Currently, MSMQ and SQL raise a critical error, but RabbitMQ and ASQ don't (not sure about what ASB does, but I'm guessing it doesn't either).

CC: @Particular/azure-maintainers @Particular/rabbitmq-transport-maintainers @Particular/sqlserver-transport-maintainers

SeanFeldman commented 7 years ago

ASB will raise a critical error and it's up to the endpoint host to decide what to do.

tmasternak commented 7 years ago

As already mentioned in the original PR SQL will trigger circuit breaker and if the problem persists raise critical error.

BTW, should the question here be more: "Should Core enforce behavior on the transport when onError throws"? (to that my answer would be 'no')

adamralph commented 7 years ago

I'd regard onError as something that the core should guarantee to never fail unless something catastrophic has occurred. It follows from that the transport should not swallow the exception, but let it propagate right out.

bording commented 7 years ago

As already mentioned in the original PR SQL will trigger circuit breaker and if the problem persists raise critical error.

@tmasternak Looking at the code it appears that SQL skips the circuit breaker and directly raises the critical error, just like MSMQ: https://github.com/Particular/NServiceBus.SqlServer/blob/develop/src/NServiceBus.SqlServer/Receiving/ReceiveStrategy.cs#L79

bording commented 7 years ago

ASB will raise a critical error and it's up to the endpoint host to decide what to do.

@SeanFeldman I took a look at the ASB code, and it appears that it actually does arm the circuitbreaker, and not raise a critical error directly:

https://github.com/Particular/NServiceBus.AzureServiceBus/blob/e88d11c562e786fe188bc0ef163a8642c6fb305e/src/Transport/Seam/MessagePump.cs#L71

https://github.com/Particular/NServiceBus.AzureServiceBus/blob/e88d11c562e786fe188bc0ef163a8642c6fb305e/src/Transport/Receiving/MessageReceiverNotifier.cs#L256

SeanFeldman commented 7 years ago

@bording we arm the circuit breaker upon unrecoverable errors. When CB is armed and triggered, the registered trigger action is to raise a critical error

bording commented 7 years ago

@SeanFeldman Right, the point I'm trying to make is that you arm the circuitbreaker. You aren't raising the critical error directly like SQL and MSMQ are.

SeanFeldman commented 7 years ago

Ah, yes. That is correct @bording .

bording commented 7 years ago

I'd regard onError as something that the core should guarantee to never fail unless something catastrophic has occurred. It follows from that the transport should not swallow the exception, but let it propagate right out.

@adamralph From what I understand the scenarios that could throw are:

  1. The message needs a delayed retry and sending that message (either to the timeout manager or natively) fails
  2. The message needs to be sent to the error queue, and sending that message fails
  3. The user has configured a custom retry policy and something in their code throws

Case 1 and 2 could be transient for most of the transports since there is a connection involved, so trying the message again could successfully send it if the connection has been reestablished.

Case 3 is the hard one, because it could either be transient or it's bad code and will always throw.

bording commented 7 years ago

@tmasternak I like your suggestion for a clearer title.

Ultimately, I wanted to show that currently we have lots of different behavior here. Is this something that is OK, or should there be consistent behavior that is enforced via a TransportTest?

adamralph commented 7 years ago

I'm leaning toward:

consistent behavior that is enforced via a TransportTest

tmasternak commented 7 years ago

@tmasternak Looking at the code it appears that SQL skips the circuit breaker and directly raises the critical error, just like MSMQ: https://github.com/Particular/NServiceBus.SqlServer/blob/develop/src/NServiceBus.SqlServer/Receiving/ReceiveStrategy.cs#L79

You are correct, my bad.

Just to add to the discussion. AFIAK the intention of CriticalError was to enable transport to rais unrecoverable errors to the Core. If my understanding is correct then if and what a transport regards as a unrecoverable error depends on concrete transport and probably should not be enforced. E.g. in Sql we made a decision that persistent problems with connection to the broker (db instance) should result in endpoint shutdown. I am not sure this can be said about all transports that we have.

timbussmann commented 6 years ago

@bording given the changes in v7 around the critical error behavior, it does seem like raising a critical error seems like a reasonable action? Can we create a new issue with some actionable tasks and close this discussion or is there further need for discussion?

bording commented 6 years ago

@timbussmann It's not immediately apparent to me how the v7 changes make any difference here.

It's been a while, but looking back over this issue, the main thing that stands out to me is deciding if onError throwing should be considering a catastrophic, stop the world failure or not.

Based on https://github.com/Particular/NServiceBus/issues/4568#issuecomment-288853679, I'm not sure that treating any and all failures here as unrecoverable is the correct approach.

Of course, given that the default v7 behavior for a critical error action is to do nothing, I suppose we need to take another look at the behavior of the transports in that light. Perhaps they aren't so different then?

timbussmann commented 6 years ago

throwing should be considering a catastrophic, stop the world failure or not.

I meant that with the v7 changes it has become more clear that we should never stop/shutdown on our behalf.

am I missing something?

andreasohlund commented 6 years ago

Talking to @timbussmann we think it would make sense to enforce this via a test and have RabbitMQ+ASQ raise a critical error since that would give the users a chance to react to this situation? (We assume that RabbitMq+ASQ currently just rolls the message back to the queue)

Note: In v7 the default critical error behavior is to just retry indefinitely