
Pub-sub topology limits performance for systems with many event types #888

Open mikeminutillo opened 9 months ago

mikeminutillo commented 9 months ago

Azure Service Bus transport publishes all events to a single topic (by default), and each subscribing endpoint adds a subscription to it. Each subscription carries filters so that only the messages the endpoint can handle are routed to its input queue. These filters are implemented as SQL filters using LIKE statements against the EnclosedMessageTypes header. This is done to support scenarios (consumer-driven contracts, event polymorphism) where the header contains multiple message types.
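For illustration, the per-event-type rule looks roughly like the sketch below. The topic, subscription, and event type names are hypothetical, and the exact rule shape may differ between transport versions:

```csharp
using Azure.Messaging.ServiceBus.Administration;

var admin = new ServiceBusAdministrationClient("<connection-string>");

// One rule per handled event type. LIKE '%...%' matches the type anywhere in
// the header, so messages enclosing multiple types (polymorphism, consumer
// driven contracts) still match, at the cost of a CPU-heavy string scan.
await admin.CreateRuleAsync(
    "bundle-1",              // the shared events topic
    "Sales.OrderSubscriber", // this endpoint's subscription (hypothetical)
    new CreateRuleOptions(
        "Shipping.OrderShipped", // rule named after the event type
        new SqlRuleFilter("[NServiceBus.EnclosedMessageTypes] LIKE '%Shipping.OrderShipped%'")));
```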

Unfortunately, SQL LIKE filters are operationally expensive, consuming a lot of CPU. When CPU usage gets too high (measured across the whole Azure Service Bus namespace), the service throttles all operations. This means that under high load, a complex system with many event types slows down significantly. The exact CPU level that triggers throttling is not documented, but we have observed it happening in the 60-70% range. The slowdown then increases critical time, which keeps the namespace saturated and throttled.

Azure Service Bus reports being able to handle 2,000 SQL filters on a topic, but we have observed throttling with as few as 450. Reducing the number of filters to ~350 seems to have lifted the throttle, but this is not an easy task on a system that has grown organically over time. The length of event type names (including namespace) may be a factor, as longer names require more CPU to filter.

This means that users building complex systems on Azure Service Bus with NServiceBus are not getting the performance numbers advertised by Microsoft. When such a user opens a support case, Microsoft advises that the large number of SQL LIKE filters is inhibiting performance and that scaling out will not help.

Users are unlikely to be aware of the need to monitor CPU usage on the Azure Service Bus namespace and have very little recourse when they are throttled for the first time.

The current topology design exists to support features (consumer-driven contracts and event polymorphism) that the user may not need.

mikeminutillo commented 9 months ago

Potential solution - Use correlation filters

We can keep the topology exactly the same and switch to correlation filters on the subscriptions. Correlation filters are significantly less CPU intensive, which should increase the number of rules the system can carry before throttling kicks in.
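As a sketch of the swap (the header value format, topic, and subscription names here are assumptions), a correlation filter matches the header value exactly instead of scanning it:

```csharp
using Azure.Messaging.ServiceBus.Administration;

var admin = new ServiceBusAdministrationClient("<connection-string>");

// Correlation filters do a cheap exact match on the property value.
// The catch: the rule only matches when the header holds exactly this value,
// so a message enclosing multiple types, or a renamed type, silently stops
// matching.
var filter = new CorrelationRuleFilter();
filter.ApplicationProperties["NServiceBus.EnclosedMessageTypes"] =
    "Shipping.OrderShipped, Shipping"; // assumed header value format

await admin.CreateRuleAsync(
    "bundle-1", "Sales.OrderSubscriber",
    new CreateRuleOptions("Shipping.OrderShipped", filter));
```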

There is a spike showing this approach.

Unfortunately, this will prevent the use of several features, and there may not be an easy way to detect that they are in use. If a message contract changes (adding an interface that is identified as a message type by convention, moving to a new namespace or assembly, etc.), the subscription will silently stop working, leading to message loss. This is hard to detect because a change in the publisher can break an existing subscriber (and vice versa).

We might be able to mitigate this a little by putting each message type in a separate header with a simple value (a single character or even an empty string may be sufficient). With that in place, a correlation filter could match on the presence of a specific message-type header. This increases the number of headers, although in most cases it would be just one additional header. The spike does not demonstrate this approach, and it would break wire compatibility with older versions.
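Roughly, the publisher would stamp one marker header per enclosed type, and each subscription would match the fixed value of one marker. The per-type header naming scheme below is hypothetical, as are the topic and subscription names:

```csharp
using Azure.Messaging.ServiceBus;
using Azure.Messaging.ServiceBus.Administration;

// Publisher side: one marker header per enclosed message type (hypothetical
// naming scheme), each carrying the same trivial value.
var client = new ServiceBusClient("<connection-string>");
var sender = client.CreateSender("bundle-1");

var message = new ServiceBusMessage("<serialized event body>")
{
    ApplicationProperties =
    {
        ["NServiceBus.EnclosedMessageTypes.Shipping.OrderShipped"] = "1",
        ["NServiceBus.EnclosedMessageTypes.Shipping.IOrderEvent"] = "1"
    }
};
await sender.SendMessageAsync(message);

// Subscriber side: an exact match on one marker's fixed value behaves like a
// "header is present" check, regardless of what other types the message holds.
var admin = new ServiceBusAdministrationClient("<connection-string>");
var filter = new CorrelationRuleFilter();
filter.ApplicationProperties["NServiceBus.EnclosedMessageTypes.Shipping.OrderShipped"] = "1";
await admin.CreateRuleAsync(
    "bundle-1", "Sales.OrderSubscriber",
    new CreateRuleOptions("Shipping.OrderShipped", filter));
```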

mikeminutillo commented 9 months ago

Workaround - Replace the NSB-created filters with correlation filters

NServiceBus is designed to operate in a minimal-access mode: it does not need to create filters and will simply process whatever messages arrive in its input queue. This means that you can replace the subscription filters with correlation filters yourself. Once these correlation filters are in place, the system will operate as usual, with all of the caveats that apply to that solution.

On v2 of the transport (core v7), the transport will update the rule, so you need to disable autosubscription for the impacted event type on all subscribers before changing the filter. This also means that new subscriptions would need to be set up by hand. This cannot be done with the command line tool, as that would create a SQL LIKE filter (replacing the correlation filter). You also cannot manually subscribe to events, or the filter will be replaced.

On v3 of the transport (core v8), the transport will attempt to create the rule and quietly swallow exceptions if the rule already exists. This means that you could let the transport create the SQL LIKE rules and then swap them for correlation filters when CPU usage gets above a threshold.
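A minimal sketch of such a swap, assuming the rule and filter shapes described above (the exact header value each publisher sends must be verified before using anything like this in production):

```csharp
using Azure.Messaging.ServiceBus.Administration;

var admin = new ServiceBusAdministrationClient("<connection-string>");

// Walk the rules on a subscription and swap each SQL LIKE rule for an
// exact-match correlation filter.
await foreach (var rule in admin.GetRulesAsync("bundle-1", "Sales.OrderSubscriber"))
{
    if (rule.Filter is not SqlRuleFilter sql ||
        !sql.SqlExpression.Contains("NServiceBus.EnclosedMessageTypes"))
    {
        continue;
    }

    var correlation = new CorrelationRuleFilter();
    correlation.ApplicationProperties["NServiceBus.EnclosedMessageTypes"] =
        "<exact header value for this event type>"; // must match what publishers send

    // Note: between the delete and the create there is a brief window in which
    // matching messages are not routed to this subscription.
    await admin.DeleteRuleAsync("bundle-1", "Sales.OrderSubscriber", rule.Name);
    await admin.CreateRuleAsync("bundle-1", "Sales.OrderSubscriber",
        new CreateRuleOptions(rule.Name, correlation));
}
```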

mikeminutillo commented 9 months ago

Reported case: 82136

The user experienced throttling and opened a case with Microsoft after scaling out to 16 messaging units for 24 hours without effect. Microsoft informed the user that the 465 SQL LIKE filters in the namespace were the culprit. The user removed a number of unused endpoints and cleaned up the filters to bring the count down to 335, which lifted the throttling and allowed the backlog to clear.

The user reports having 25 endpoints and 200 event types. They are not using polymorphic dispatch or consumer-driven contracts.

danielmarbach commented 7 months ago

Potential solution - Combine Correlation Filter with Event Mapping

We have already proven with SQS that it is possible to leverage a mapping approach that associates event metadata (in the SQS/SNS case, SNS topics) with message types, to support message inheritance for those who need it.

It might be worthwhile investigating whether we can leverage a similar approach with correlation filters where inheritance support is required.
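To make that concrete, a hypothetical configuration shape for this transport might look like the sketch below; MapEvent and the event types are illustrative only, not an existing API:

```csharp
// Hypothetical API sketch; no MapEvent method exists on this transport today.
// Modeled loosely on the SQS/SNS transport's event-mapping approach: by telling
// the transport up front which concrete types satisfy an event interface, it
// could create one exact-match correlation filter per concrete type instead of
// one SQL LIKE filter per handled type.
var transport = new AzureServiceBusTransport("<connection-string>");
transport.MapEvent<IOrderEvent, OrderShipped>();   // hypothetical
transport.MapEvent<IOrderEvent, OrderCancelled>(); // hypothetical

// Hypothetical event types for illustration.
public interface IOrderEvent { }
public class OrderShipped : IOrderEvent { }
public class OrderCancelled : IOrderEvent { }
```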

danielmarbach commented 4 months ago

Reported case: 85909

The user is already running on 16 messaging units and hits throttling issues.

The user reports having 25 endpoints with 25 subscriptions and roughly 1,500 rules in total. They are not using polymorphic dispatch or consumer-driven contracts. They are, however, using an interface plus a concrete type per message, for example IMyMessage and MyMessage.

johnsimons commented 1 month ago

Reported case: 00087940

The user has 324 SQL filters (and growing). Event type names are between 40 and 160 characters long.

The user reports that it affects all their services, and since no isolation is guaranteed, they can't predict when it will happen again. The user is planning to implement correlation filters themselves.