Particular / NServiceBus

Build, version, and monitor better microservices with the most powerful service platform for .NET
https://particular.net/nservicebus/
Other
2.09k stars 648 forks source link

Possible message loss with TimeToBeReceived on transactional MSMQ endpoints #2886

Closed janovesk closed 8 years ago

janovesk commented 9 years ago

Taskforce: @janovesk @szymonpobiega

Who's affected

This bug affects you if you have a transactional endpoint that uses the MSMQ transport and have any messages types with a non-default TimeToBeReceived.

Symptoms

Message sent in a transaction do not get the configured TimeToBeReceived(TTBR). They are assigned a seemingly random TTBR if any. All the messages in a transaction gets the same TTBR, even though they are configured to have different TTBRs.

Background

Users can set a TimeToBeReceived for any messages type. The TimeToBeReceived indicates a period of time where this message needs to be kept alive. Afterwards, it can be removed. This can be valuable when you have a high volume of messages combined with low business value in delayed processing. Messages with TimeToBeReceived is treated like any other messages sent. The TTBR is simply passed on to the underlying transport, which is responsible for implementing the timeout.

Root cause

Unfortunately, MSMQ does not support having differing TTBR when you send multiple messages in the same transaction. All the messages in the transaction get the TTBR of the first message sent. This is clearly not what end users wanted, having defined different TTBR for different message types. The failure is silent with no warnings given to the user.

The following statement from MSDN confirms the underlying issue with MSMQ, TTBR and transactions:

When several messages are sent in a single transaction, Message Queuing uses the TimeToBeReceived property of the first message.

Affected versions

Version 4 and 5 of NServiceBus

Plan of attack

Customer must be using both MSMQ, run with transactions and have message types with TTBR set. Running transactional MSMQ is very common; the latter presumed to be less so. We do have confirmed customer reports on this, but I suspect this is also happening to customers without them knowing. It can be hard to notice that messages are removed prematurely or late from queues. Given the lack of complaints, this actual number of affected customers can't be very high. That said, when it occurs, it is effectively message loss. Therefore, I say impact medium. Size is only a couple edits to put an exception in place. It should be fairly quick, I'd say size small.

Customer communication

We have discovered a bug in the TimeToBeReceived feature. It lets you specify a TimeToBeReceived for your messages. If you use this in combination with the MSMQ transport and have transactions turned on, your messages might get the incorrect TimeToBeReceived set. The messages might get removed from the receiving queue prematurely or build up over time.

Unfortunately, we have no other choice than to disable this piece of functionality when running the MSMQ transport because of underlying issues in MSMQ. If we detect that your have an endpoint with MSMQ, transactions and TTBR set on any message we will throw an exception at runtime to remove any possibility of message loss.

This means you have to either

andreasohlund commented 9 years ago

@janovesk remember how we turned off the DTC tx and had the same issue on your box?

Is this a bug in 5.2.1 and we actually didn't manage to use native TX's. Or can it be that native tx's also have this issue?

janovesk commented 9 years ago

Repro with this

using System;
using System.Messaging;
using System.Transactions;

namespace MSMQTTBRRepro
{
    class Program
    {
        static void Main(string[] args)
        {
            var path = GetFullPath(Environment.MachineName, "test1");

            //internal transaction, single - gives correct ttbr for both messages
            Send(path, MessageQueueTransactionType.Single);

            //internal transaction, shared - gives wrong ttbr for m2
            using (var transaction = new MessageQueueTransaction())
            {
                transaction.Begin();
                Send(path, transaction);
                transaction.Commit();
            }

            //send using transaction scope/DTC - gives wrong ttbr for m2
            using (var scope = new TransactionScope(TransactionScopeOption.Required))
            {
                Send(path, MessageQueueTransactionType.Automatic);
                scope.Complete();
            }
        }

        private static void Send(string path, MessageQueueTransaction transaction)
        {
            using (var q = new MessageQueue(path, false, true, QueueAccessMode.Send))
            {
                var m1 = CreateMessageWithTimeToBeReceived(TimeSpan.FromHours(1));
                var m2 = CreateMessageWithTimeToBeReceived(TimeSpan.FromHours(2));

                q.Send(m1, transaction);
                q.Send(m2, transaction);
            }
        }

        private static void Send(string path, MessageQueueTransactionType transactionType)
        {
            using (var q = new MessageQueue(path, false, true, QueueAccessMode.Send))
            {
                var m1 = CreateMessageWithTimeToBeReceived(TimeSpan.FromHours(1));
                var m2 = CreateMessageWithTimeToBeReceived(TimeSpan.FromHours(2));

                q.Send(m1, transactionType);
                q.Send(m2, transactionType);
            }
        }

        public static string GetFullPath(string machine, string queue)
        {
            return "FormatName:" +  "DIRECT=OS:" + machine + "\\private$\\" + queue;
        }

        public static Message CreateMessageWithTimeToBeReceived(TimeSpan timeToBeReceived)
        {
            return new Message {TimeToBeReceived = timeToBeReceived};
        }
    }
}
janovesk commented 9 years ago

Worse than I thought. This is broken with BOTH MessageQueueTransaction and DTC :(

janovesk commented 9 years ago

This is a message loss scenario. It's not just affecting auditing.

Any message handler running under a transaction that sends two messages will get the TTBR from the FIRST message on both the messages. Let's say the first message is a low-value message with short TTBR, like a stock ticker update event with a 10-minute TTBR. The second message is an important message with a long/infinite TTBR that you don't want to lose. Say a "Charge customer $500 000 for stock purchase" command with an infinite TTBR. 10 minutes later that stock purchase is gone!

andreasohlund commented 9 years ago

Given that the TTBR feature

should we consider obsoleting it from the core and let each transport provide support for this individually should they be able to provide it reliably? (this is possible as of v6 given then better extensibility options we have for our public API)

On Fri, Sep 11, 2015 at 9:32 PM, Jan Ove Skogheim notifications@github.com wrote:

This is a potential message loss scenario. It's not just affecting auditing.

Any message handler running under a transaction that sends two messages will get the TTBR from the FIRST message on both the messages. Let's say the first message is a low-value message with short TTBR, like a stock ticker update event with a 10-minute TTBR. The second message is an important message with a long/infinite TTBR that you don't want to lose. Say a "Charge customer $500 000 for stock purchase." command with an infinite TTBR. 10 minutes later that stock purchase is gone!

— Reply to this email directly or view it on GitHub https://github.com/Particular/NServiceBus/issues/2886#issuecomment-139642275 .

danielmarbach commented 9 years ago

@janovesk should we also include this http://docs.particular.net/nservicebus/messaging/discard-old-messages in the discussion?

danielmarbach commented 9 years ago

Since this is essentially message loss I think we should assign a task force to it and get it going. Thoughts @udidahan?

udidahan commented 9 years ago

The issue of TTBR being broken sounds like a separate issue from this one.

andreasohlund commented 9 years ago

Agreed, @janovesk do we create another one or reword this one?

janovesk commented 9 years ago

I will create a separate issue for general discussion of MSMQ and TTBR, @andreasohlund.

janovesk commented 9 years ago

Added whos affected, sympthoms, root cause, plan of attack and impact assesment.

andreasohlund commented 9 years ago

Since we now have ServiceControl as part of our offering removing this all together across the board seems like the best option to me.

Another example where this is broken is the SqlTransport where the TTBR implementation relies on the consuming endpoint to discard the message when it's received. So in this scenario adding this to the audit messages is almost pointless since this feature was built for users that's not consuming their audit queue.

@Particular/azure Pretty sure that the above is true for ASQ, not sure about ASB, please confirm?

@udidahan I propose we approve this one with and include removal of the feature all together in the POA

danielmarbach commented 9 years ago

@andreasohlund we also need to take into consideration what this means for customers which don't have licenses including SI/SP. With this change we would force customers to upgrade their licenses

SeanFeldman commented 9 years ago

@andreasohlund ASB and ASQ support TTBR. With ASQ it's tricky since it's no more than maximum defined by service == 7 days. If users ask for more than that, we throw.

andreasohlund commented 9 years ago

With this change we would force customers to upgrade their licenses

They could write their own job/script that prunes the audit q? (we can provide guidance for this)

janovesk commented 9 years ago

TimeToReachQueue suffers from the same problem, but we don't allow per-message override for it.

janovesk commented 9 years ago

This is not a new bug, but I think we previously failed to understand the ramifications.

janovesk commented 9 years ago

Going to need some wordsmithing for this and https://github.com/Particular/NServiceBus/issues/2941. Up for it @DavidBoike? image

janovesk commented 9 years ago

After discussing different implementation options with @SzymonPobiega we probably will need to go with a run-time check for this. A startup check can potentially produce false-positives when you reference a message assembly with TTBR-attributed messages. Referencing the assembly is not the same as actually sending messages from it.

The only time we now the type sent, and if it has TTBR, is when Bus.Send/Bus.SendLocal/Bus.Reply/Bus.Publish is called.

janovesk commented 8 years ago

Can you take a look at the fix for V5 @SzymonPobiega?

DavidBoike commented 8 years ago

Sorry I'm a little late to the party, but I may be confused.

So we're essentially killing TTBR on MSMQ? Almost everybody on MSMQ is using transactional MSMQ, and we're not allowing ANY time to be received?

TTBR still works properly if you only use a single TTBR, and not a combination of normal messages and TTBR messages, right? So why not only throw in the cases where the outgoing transport operations contains a mix?

I've personally written handlers that send a single TTBR message and do nothing else, and would have been very upset if that capability was stripped away.

janovesk commented 8 years ago

That is a pretty heavy check to do runtime @DavidBoike.

The problem is also that we support override on audit and forwarding, so it's not only the messages you send in the handler that can affect the outcome.

andreasohlund commented 8 years ago

So why not only throw in the cases where the outgoing transport operations contains a mix?

Since we have audit on by default wouldn't this be true for pretty much all the messages?

DavidBoike commented 8 years ago

HashSet<int?> of TTBR values tracked by transport in the receive context wouldn't do the trick?

(Comment from @andreasohlund arrives)

Ahhh, yes. So audit message would be no TTBR, single message would have one, even in simplest use case.

The only full solution I can come up with then is a EndpointName.Outgoing, all messages with no TTBR value but could have one in a header, and then have a satellite responsible for actual dispatch on a message-by-message basis. But yuck.

janovesk commented 8 years ago

Also very NOT nice is the fact that we send the audit AFTER the handlers in v4, but before them in v5. So either you overrite the TTBR of the audit or the audit TTBR overrites your messages TTBR. :(

udidahan commented 8 years ago

As this qualifies as a bug, it doesn't require my approval. It also looks like this is already in progress. If so, please label it as such.

ramonsmits commented 8 years ago

@DavidBoike @janovesk Just to understand, this only affects messages with a TTBR set correct? So if you first send a message with a TTBR and then a message without then this will not have a TTBR set?

When several messages are sent in a single transaction, Message Queuing uses the TimeToBeReceived property of the first message.

When I read that MSDN quote then I read it as all messages share the TTBR of the first message. Meaning, that if no TTBR is set on the first message all other message won't get a TTBR either.

ramonsmits commented 8 years ago

@janovesk As discussed during the platform sync a suggestion would be give guidance on how to get the same effect:

This would not prevent the 'storage' problem but if the customer is not using service control than this is pretty easy to fix by just listening to audit, then forward it to audit.log with a TTBR set

janovesk commented 8 years ago

All the message in a transaction get's the TTBR of the first message sent in it, @ramonsmits.

There is no such thing as not having a TTBR set for a message. That just means the message gets the default TTBR, which is INFINITE.

The problem is still there. Any other messages sent in a transaction after a message with "no TTBR set" will have their TTBR overridden by the first message's INFINITE TTBR.

ramonsmits commented 8 years ago

Thanks for the INFINITE explanation @janovesk

bording commented 8 years ago

@janovesk What's the status on this?

janovesk commented 8 years ago

Hi, @bording! Status is that fix is code complete for hotfix-5.0.8, hotfix-5.1.6, hotfix-5.2.10. I'll port it the latest changes to hotfix-4.7.9 later tonight combined with milestoned issues.

Hopefully releasable tomorrow.

janovesk commented 8 years ago

Code is ready, customer communications is ready. V4 and V5 releases go out today.

mikeminutillo commented 8 years ago

@janovesk @andreasohlund There is a subtle issue here with error forwarding.

An endpoint:

The same happens with audit forwarding but as that occurs for every message it will be immediately apparent. Errors are rarer.

chandmk commented 8 years ago

We use this feature to control some non critical business messages. Original email failed because mail service was down. And Ops didn't get a chance to process the message in error queue. Now the message might be time sensitive. Ex: No point of sending an email after a discount period ended.

May be it is a naive question, why can't NSB provide a feature in the box that checks if the message has expired or not and mark it as processed (by not forwarding the message to the handler)? This allows the admin to reprocess the messages in error queue without any cognitive overload?