Particular / NServiceBus

Build, version, and monitor better microservices with the most powerful service platform for .NET
https://particular.net/nservicebus/
Other
2.08k stars 648 forks source link

Enabling FLR/SLR conflicts with Transactions disabled #2750

Closed ramonsmits closed 7 years ago

ramonsmits commented 9 years ago

Context

When you disable transactions then implicitly disable FLR/SLR. The expectation of the user could be that FLR/SLR are still operational.

If either FLR or SLR is enabled and DisableTransactions() is called we should at least write a warning to the log but maybe even better to fail initializing the bus.

We could also try to do a best effort in regards in enabling FLR/SLR. No data corruption can occur when handlers are idempotent and FLR would retry. The same applies for SLR. Although we cannot guarantee that SLR will be ran due to missing transactions.

Behavior in V6

Currently when running with TransortTransacitonMode.None the endpoint will disable both FLR and SLR. It will happen after putting warn entries into log. The errors will be handled in best-effort manner.

andreasohlund commented 9 years ago

If we fix #2752 would this one still be relevant?

In that case I belive we can solve this by documenting that no tx means that they get no retries

SzymonPobiega commented 9 years ago

We can have a best-effort SLR but we can't ever have FLR. On the other hand, with V6 refactorings we can imagine a behavior that would do in memory FLR by just invoking the downstream pipeline a couple of times if the first one failed. But not sure how bit impact would that feature have.

ramonsmits commented 8 years ago

This is still relevant. If you disable transactions then you implicitly disable certain features without even knowing. We cannot consider the user to read the documentation to know that those features are disabled.

If a user would do:

bus.Configuration.Transactions.Disable()

then the user also must explicitly disable other features:

busConfiguration.DisableFeature<FirstLevelRetries>()
busConfiguration.DisableFeature<SecondaryLevelRetries>()
busConfiguration.DisableFeature<ForwardToErrorQueue>()

What I ask is NOT to implicitly disable features but fail during initialization when there are conflicting features enabled.

@SzymonPobiega Yes, due to the way FLR works (don't know for v6..) we cannot rollback the transport transaction so the message is lost. We could have an alternative implementation that would just send the message back to the queue. This has similar behavior except that the message will not be retried immediately when the queue has a backlog. Yes, when the transport would be unavailable then the message is really lost.

ramonsmits commented 8 years ago

@colin-higgins Did I miss something regarding what I talking about with you about the conflicting features?

colin-higgins commented 8 years ago

@ramonsmits That looks like it sums up the conversation nicely. Really it comes down to making the disabling of transactions a lot more scary so that users think twice about doing it.

busConfiguration.SCARY_STUFF.DISABLE_ALL_EXTERNAL_SAFETY_MECHANISMS(); :ghost:

As far as the transport being unavailable, if we wanted to get fancy, we could also have a delayed retry loop in memory for putting it back on the queue if we decided having the alternate retry mechanism was a good idea. That would lessen the chance that the transport being unavailable causes message loss.

andreasohlund commented 8 years ago

This seems like another reason to merge FLR + SLR into one feature since as you correctly points out @ramonsmits we could provide SLR even with transactions off.

That said getting the consistency right would be tricky since we might produce "ghost" messages since there is no transaction to allow rollback to the queue in order to make sure that SLR is the only transport operation issued. Ie if you fail between two bus operations the first one would go out and then we request SLR => ghost message

ramonsmits commented 8 years ago

@andreasohlund If 'ghost' messages are an issue thus a concern for a customer then they should enable the outbox to mitigate this effect and even then they could still have more than once processing although the change of happening that is very low.

ramonsmits commented 8 years ago

Yes @colin-higgins indeed, that was missing. When we implicitly are going to disable reliability features then the called method should be very descriptive about its implications to be scary.

andreasohlund commented 8 years ago

If 'ghost' messages are an issue thus a concern for a customer then they should enable the outbox to mitigate this effect and even then they could still have more than once processing although the change of happening that is very low.

I don't see how the outbox would help since

  1. The ghost message would be a single message, eg. a "OrderPlaced" event that never happened
  2. If there are duplicate ghost messages they would have different MessageIds so the outbox would not be able to deduplicate
ramonsmits commented 8 years ago

@andreasohlund The OrderPlaced would only be send when the no exceptions occur. If an exception occurs the outbox will not be completed so no messages will be send.

I think you make the assumption that when we disable transactions that customers could not have any transactions. Disable transactions only means disable transport transactions. You could even have ambient transactions without having transport transactions.

Ghost message would not be there as the outbox envelope is correlated to the incoming message id right? Even if other message ids are generated these would not be send.

It could be that after storing the outbox envelope an error occurs, the message would still be processed for FLR/SLR but then this message would be ignored by the outbox deduplication.

The only way this would not work if a message could be processed by multiple threads/instances in parallel when transactions are disabled.

andreasohlund commented 8 years ago

The OrderPlaced would only be send when the no exceptions occur. If an exception occurs the outbox will not be completed so no messages will be send.

Sorry, I see now, yes in that case no ghost messages would be sent.

But what would happen if the call to request the SLR fails? (then we would have to put inmemory retries in place right)

Also what happens if power goes/endpoint crashes?

So its kind of "AtLeastOnceAsLongAsNothingBadHappensToTheQueuingSystemOrTheEndpoint" ?

Is there value for us providing such a mode of operations?

ramonsmits commented 8 years ago

@andreasohlund Correct, I see this as a best effort. If any one gets shot than the message is lost.

I think there is value in such a model. For example, gaming or high volatile info due to high volumes where message value gets lost in seconds. This probably is a good tradeof to increase performance and ease of scaling vs reliability.

ramonsmits commented 8 years ago

@andreasohlund To be clear, I don't mind the current behavior but I do think we must be making the user aware of the fact that FLR/SLR is disabled due to disabling transactions or even that they must confirm this via the API to make it explicit that they have conflicting enabled features.

indualagarsamy commented 8 years ago

@ramonsmits @andreasohlund @SzymonPobiega : In V6, when you disable transactions, the current behavior is that, if SLR is enabled, it will go through SLR. But not the configured FLR. Is that the behavior we are after?

indualagarsamy commented 8 years ago

@andreasohlund - Also, I see an acceptance test for the FLR behavior, but not for SLR https://github.com/Particular/NServiceBus/blob/develop/src/NServiceBus.AcceptanceTests/Recoverability/Retries/When_doing_flr_with_default_settings.cs

indualagarsamy commented 8 years ago

Here's my take on this: Given the new behavior in v6. With transactions disabled, messages will no longer be lost in case of an exception in the handler, they will be forwarded to the error queue.

andreasohlund commented 8 years ago

Retries (split into FLR / SLR) is a little convoluted. Retries should be either on / off. If its on, you should be able to control successive retries and delayed retries as one thing. (Is there a separate issue for this?)

Agreed, this is somewhat related to #3032 where the plan is to merge the FLR/SLR/Error into a single recoverability behavior where we can easily handle this. cc @MarcinHoppe @tmasternak

When you explicitly disable transactions and you have a conflicting (retries on) configuration set, I agree with @ramonsmits it can be confusing to the user as to what the behavior should be. Is there any better way to clarify that without stooping to documentation? Log warnings during both start and when forwarding to the error queue? Any other better ways?

Agreed as well, don't like any of the options but I can't come up with anything better. Perhaps we can detect if the user has explicit config for the FLR/SLR and then throw at startup? If user has no explicit config we can just log a INFO message?

Having FLR/SLR + transactions off seems like a dangerous unsafe combination to me, especially if it results in ghost messages and inconsistent data in the downstream database. My inclination is if you disable transactions, we try the message once, if it fails it goes to the error queue.

I like it. I don't see enough value with retries since the message are by definition not super valuable since the user turned the TX off. Just sending to error seems like a fair trade off.

indualagarsamy commented 8 years ago

@andreasohlund - I think we throw in both the following cases, if transactions are disabled:

The other option is introduce a new API, call it Transactions.DisabledAndMessageTriedExactlyOnce or some such thing. In which case, choosing that, we can automatically disable SLR and won't need to throw an exception.

Thoughts?

indualagarsamy commented 8 years ago

Having gone through the rabbit hole of transactions, I think I understand this issue better:

From a MSMQ perspective: if I want DTC, I should use TransportTransactionMode.TransactionScope (which is default) if I don't want DTC, I should use TransportTransactionMode.ReceiveOnly (this means, if something is wrong inside the handler, the message will still be rolled back to the queue) If I don't want transactions at the transport level, where I expect the message to be just processed once, then I have to use TransportTransactionMode.None.

My Proposal:

  1. Change the TransportTransactionMode from TransportTransactionMode.None to TransportTransactionMode.ReceiveOnceOnlyWithoutRetries
  2. If the user has set this explicitly, throw an exception if SLR is enabled, either by default or if the user has a CustomSLRPolicy to say that these modes conflict.

Talking through this with @bording on what this means to RabbitMQ, currently, the TransportTransactionMode.None is the exact same thing as TransportTransactionMode.ReceiveOnly except the FLR/SLR behavior. Changing the enum will also make sense for RabbitMQ. @bording please chime if that's not the case.

Thoughts?

SzymonPobiega commented 8 years ago

:-1: for ansportTransactionMode.ReceiveOnceOnlyWithoutRetries because ReceiveOnceOnlyWithoutRetries is not a transport concern. The TransportTransactionMode was meant to cover the transport side of things only.

We are still missing the user-focused reliability mode settings that covers the transport, the retries, the persistences and the outgoing operations. We agreed that persistences are not ready yet for such a top-down approach and paused that issue for a while. Is this captured anywhere @andreasohlund ?

andreasohlund commented 8 years ago

While you are correct from a user perspective @indualagarsamy I agree with @SzymonPobiega . Your name makes perfect sense at the "endpoint level". But to be able to provide such promises we need to get the storages involved as well.

My current line of thinking is:

v6: Clean up the transport level transactions (done) v7?: Introduce the concept of "storage transaction mode" vX?: Combine the above to be able to give the users "endpoint instance guarantees"

Note1:

My only doubt here is that some of the transports can actually provide native retries but since all of them can't do it I think we have to consider the retries a concept above the transports and native support will then be seen as an "optimization" of the "retries" feature.

Note2:

Arguably since the retries won't require us to know what type of transactions the storage supports we could provide some kind of endpoint level api for controlling just that part. But isn't that pretty similar to providing a config.Retries().Enable|Disable|SetCustomPolicy api

I realize I should raise separate issues for the persistence transaction mode and endpoint transaction mode. Stay tuned.

@Particular/nservicebus-maintainers thoughts?

timbussmann commented 8 years ago

:-1: for ansportTransactionMode.ReceiveOnceOnlyWithoutRetries because ReceiveOnceOnlyWithoutRetries is not a transport concern.

I fully agree.

When you disable transactions then implicitly disable FLR/SLR. The expectation of the user could be that FLR/SLR are still operational.

didn't follow the whole discussion, but when "disabling transactions" SLR is still enabled by default.

indualagarsamy commented 8 years ago

yep. Agree with the retries not being part of the transport. Going to setup a call.

MarcinHoppe commented 8 years ago

Please bear in mind that #3032 and #3031 indeed merge FLR/SLR/Faults into a single behaviour but the API surface stays the same. I think we need a separate issue to flesh this out.

andreasohlund commented 8 years ago

Please bear in mind that #3032 and #3031 indeed merge FLR/SLR/Faults into a single behaviour but the API surface stays the same. I think we need a separate issue to flesh this out.

Lets see how this plays out, I don't see the big deal since we can just flip to a `NoRetriesPolicy if retries are off but "move to error queue" is on.

In short: I still have a hunch that a single behavior is what we want since that code it tightly coupled anyway

MarcinHoppe commented 8 years ago

I agree that a single behaviour makes sense. I was only thinking that perhaps a more focused API to configure retries might be beneficial here. E.g. moving FLR config from the transport to the retry section might make sense.

We can also leave it as a is and see if the configuration API is fir for purpose when we have the merged recoverability behaviour up and running.

andreasohlund commented 8 years ago

I was only thinking that perhaps a more focused API to configure retries might be beneficial here.

Completely agree, we should have a single ".Retries" api, why burden the users with the concepts of FLR/SLR etc. They just want to do retries...

danielmarbach commented 8 years ago

We discussed this in the past. Retries is what we want to do and not bother the users with FLR and SLR terminology.

MarcinHoppe commented 8 years ago

I created #3466 to flesh out the API (assigned to myself but feel free to chime in). Terminology change is a far reaching concern that spans the doco and beyond. Not sure if we should move away from FLR/SLR terminology or just provide a coherent configuration API for them?

indualagarsamy commented 8 years ago

In a discussion with @andreasohlund @tmasternak @SzymonPobiega @ramonsmits @weralabaj and @janovesk (please feel free to add others who I have missed)

The proposed behavior is as follows: If the user explicitly turns transactions off, i.e. set the transaction mode to None, then since retries are on by default, and since this mode conflicts with the at most once delivery, then an exception will be thrown asking the user to do one of the following actions:

For example, the following code will throw an exception.

endpointConfiguration.UseTransport<MsmqTransport>()
 .Transactions(TransportTransactionMode.None);

To fix the exception, user will have to

endpointConfiguration.EnableFeature<FirstLevelRetries>();
endpointConfiguration.EnableFeature<SecondLevelRetries>();

or

endpointConfiguration.DisableFeature<FirstLevelRetries>();
endpointConfiguration.DisableFeature<SecondLevelRetries>();

If we combine the Retries into one (instead of SLR and FLR), it would be either enable retries or disable retries.

@andreasohlund - please feel free to add more details. Also, how can I tell if a feature is explicitly enabled vs enabled by default?

andreasohlund commented 8 years ago

Also, how can I tell if a feature is explicitly enabled vs enabled by default?

How about you add the check in a new "retries" feature. If we make that feature internal we can check if the users has called any of the explicit enable/disable methods?

danielmarbach commented 8 years ago

@indualagarsamy I think this proposal improves the story around Transactions and Retries a lot

On Wed, Feb 17, 2016 at 10:45 AM, Andreas Öhlund notifications@github.com wrote:

Also, how can I tell if a feature is explicitly enabled vs enabled by default?

How about you add the check in a new "retries" feature. If we make that feature internal we can check if the users has called any of the explicit enable/disable methods?

— Reply to this email directly or view it on GitHub https://github.com/Particular/NServiceBus/issues/2750#issuecomment-185122726 .

Don't miss the Async/Await Webinar Series • Learn more http://go.particular.net/l/27442/2016-02-15/681bjl

timbussmann commented 8 years ago

We agreed that when Transactions mode is set to None, it means the user probably wants at most once receive capability

I'm not sure this assumption is true. We tend to associate message processing semantics with transaction configuration but I'm not sure this is obvious to users. If a user want's at most once, he should disable retries. That's how he gets at most once on any transaction configuration.

indualagarsamy commented 8 years ago

@timbussmann Maybe my comment wasn't clear. That was the intent in the past when Transaction mode was set to None in MSMQ.

andreasohlund commented 8 years ago

Are you working on this one @indualagarsamy ?

petersgiles commented 8 years ago

I'd like to help out with this @indualagarsamy if you see me online please walk me through what you want to do

indualagarsamy commented 8 years ago

@andreasohlund - I am planning to look at it tomorrow. I was trying to wrap up changes to doco on IAdvancedSatellites and IManageMessageFailures. I can work with @petersgiles .

MarcinHoppe commented 8 years ago

@indualagarsamy @andreasohlund @petersgiles There are several items tagged Core v6 in the area of FLR/SLR/Faults. I'd love to pick your brains if you think it would make sense to form a larger TF to tackle these?

timbussmann commented 8 years ago

since ramon raised #3476, can we close this one?

andreasohlund commented 8 years ago

@ramonsmits closing, please reopen if you feel that #3476 doesn't cover this

ramonsmits commented 8 years ago

@andreasohlund @timbussmann

This is not related to #3476, that is about an alternative for FLR.

if I'm not mistaking, this issue is about implicit/explicit enabling of features and preventing conflicting configuration settings and validation.

andreasohlund commented 8 years ago

@ramonsmits you're right, reopening

andreasohlund commented 8 years ago

@MarcinHoppe @weralabaj @tmasternak this was the issue I mentioned on the SLR call today.

ramonsmits commented 8 years ago

@tmasternak mentioned here https://github.com/Particular/docs.particular.net/pull/1365#issuecomment-203608730 that in V6 FLR is still auto disabled when disabling transactions.

I really hope this will be addressed.

Conflicting features/configuration should prevent the endpoint from starting. We should never automatically disable/enable features and especially not disabling reliability features.

andreasohlund commented 8 years ago

added consider for v6 label

tmasternak commented 8 years ago

@andreasohlund If we want to have throw if FLR is not explicitly disabled I would extend the scope of https://github.com/Particular/NServiceBus/issues/3032. It's going to be 1h task.

andreasohlund commented 8 years ago

If we want to have throw if FLR is not explicitly disabled I would extend the scope of #3032. It's going to be 1h task.

Sounds pragmatic enough to me, :+1:

andreasohlund commented 8 years ago

After talking to @Particular/tf-v6-launch-plat-dev we decided to not include this in the scope for v6 since it's the current behavior and it can wait for the next major

timbussmann commented 8 years ago

I think I already saw certain saw some implementation in @MarcinHoppe PR. Are we going to remove that again then? /cc @tmasternak

MarcinHoppe commented 8 years ago

@timbussmann I think you saw this for SLR?