Particular / NServiceBus

Build, version, and monitor better microservices with the most powerful service platform for .NET
https://particular.net/nservicebus/

Support AutoDeleteOnIdle #6717

Open Timovzl opened 1 year ago

Timovzl commented 1 year ago

Describe the feature.

It might be useful to support Azure Service Bus' AutoDeleteOnIdle feature. If I'm not mistaken, this deletes the queue automatically once it has not been listened to for a certain time period.

Additional Context

This feature would be especially useful for replying to some originating instance. In the cloud, supporting this tends to be tricky, because of the problem of cleaning up instance-specific queues that get orphaned when instances fail to shut down cleanly. With self-deleting queues, that problem goes away, and replying to the originator instance is once again feasible.

Timovzl commented 1 year ago

A bit of further research shows that for the Amazon SQS Transport, the same could be achieved using SQS Temporary Queues.

Timovzl commented 1 year ago

I believe RabbitMQ's auto-delete option could be used to achieve the same thing there.

The semantics for these settings offered by the various providers align pretty well. They share one concern that the requested feature would need to handle: ensuring that we get a new queue if our instance loses its connection while itself surviving.
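To make that concern concrete, here is a minimal sketch in Python. It uses an in-memory stand-in rather than a real broker, so `FakeBroker`, `InstanceReceiver`, and `QueueNotFound` are hypothetical names and not part of any transport or SDK. The point is only that a surviving instance should treat "my queue is gone" as a signal to recreate it, not as a fatal error.

```python
class QueueNotFound(Exception):
    """Raised by the fake broker when a queue does not exist (hypothetical)."""


class FakeBroker:
    """In-memory stand-in for a message broker; not a real SDK."""

    def __init__(self):
        self.queues = {}

    def create_queue(self, name):
        # Idempotent: keeps existing messages if the queue is already there.
        self.queues.setdefault(name, [])

    def delete_queue(self, name):
        # Simulates the broker auto-deleting an idle queue.
        self.queues.pop(name, None)

    def send(self, name, message):
        if name not in self.queues:
            raise QueueNotFound(name)
        self.queues[name].append(message)

    def receive(self, name):
        if name not in self.queues:
            raise QueueNotFound(name)
        pending = self.queues[name]
        return pending.pop(0) if pending else None


class InstanceReceiver:
    """Receiver for an instance-specific queue that recreates it when missing."""

    def __init__(self, broker, queue_name):
        self.broker = broker
        self.queue_name = queue_name
        self.recreated = 0
        broker.create_queue(queue_name)

    def poll(self):
        try:
            return self.broker.receive(self.queue_name)
        except QueueNotFound:
            # The queue was deleted while this instance was disconnected.
            # Messages sent to it are lost; get a fresh queue and carry on.
            self.broker.create_queue(self.queue_name)
            self.recreated += 1
            return None
```

Any replies that were in the deleted queue are gone, which is consistent with the volatility that instance-specific messages already have.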

timbussmann commented 1 year ago

> This feature would be especially useful for replying to some originating instance.

Can you share some more details about your use case?

In general, auto-deleting queues are a potential source of message loss and other hard-to-predict problems, because they can also affect senders that suddenly fail to send messages to a target queue. Scaled-out endpoints use a shared queue per logical endpoint, so as long as the logical endpoint isn't being completely decommissioned, the shared queue typically doesn't need to be removed. Deleting queues typically comes up in scenarios where endpoints are tied to short-lived clients instead, in which case it often makes more sense to build them as send-only endpoints or to use publish/subscribe mechanisms that are more focused on client-server communication, such as SignalR.

Of course, queue creation can be completely controlled by you if you disable installers, but it would require some additional deployment automation to set up the queues with the desired settings.

Timovzl commented 1 year ago

> Can you share some more details about your use case?

@timbussmann Absolutely! I have also discussed it in detail with @dvdstelt, should you wish to talk to him.

I fully understand that, normally, a single queue is shared by all instances of an endpoint and must always exist.

Broadly speaking, the use case for auto-delete queues is NServiceBus's existing feature of unique addressability (documented here and here), and all features that rely on it. This includes callbacks, RouteReplyTo, and RouteReplyToThisInstance.

Features depending on unique addressability have become tricky with the advent of containers and cloud computing. To address a specific instance, that instance needs its own queue. But when instances may come and go, those queues risk sticking around forever, because they tend to get unique (non-reused) IDs. This unresolved discussion on the forum is looking for a solution to the same problem.

Auto-delete queues make the relevant features as accessible as they once were, back when we could rely on a handful of predetermined, reusable IDs.

Let's address the issue of volatility. Irrespective of the queues used, we know that endpoint instances come and go, and that we cannot count on their IDs being reused. Therefore, any message addressed to a specific instance, by definition, might never get handled. Consequently, one would use such messages only for use cases where that scenario is acceptable. (Note that the callbacks documentation points towards this same truth: "Because callbacks won't survive restarts, use callbacks when the data returned is not business critical and data loss is acceptable.")

For an example of a use case where volatility is both expected and acceptable, consider an HTTP request handler that delegates its work via messaging. Let's take credit card reservations (AKA authorizations) as our scenario. The HTTP caller attempts to reserve money on a credit card, and receives a response with the resulting reservation or rejection. (Later, the reservation can be collected, cancelled, or be allowed to expire.) The instance handling the HTTP request has good reason for using messaging to have the internal landscape handle the request. For example, the bounded context that handles the card's balance may use sagas and other features to achieve concurrency protection. Eventually, the result needs to be communicated back to the originating instance, so that it can send the HTTP response.

Note that, by nature, the above example is volatile. If the HTTP request times out, or the caller disconnects, or the HTTP server crashes, no response will reach the caller. All of this is reflected in the business domain itself: credit card reservations expire.

Finally, to mitigate the risk of queues being lost due to transient errors (while the instance is actually still alive), I propose that a long time to live be configured. It seems perfectly acceptable for an orphaned queue to live for another 48 hours before being cleaned up. The goal is to avoid unbounded waste and to do so in a safe manner.
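The arithmetic behind that proposal can be sketched as a toy model in Python. `IdleDeletingQueue` is a hypothetical class, not a broker API, and it assumes that any receiver activity resets the idle timer; what exactly counts as activity differs per broker.

```python
from datetime import datetime, timedelta


class IdleDeletingQueue:
    """Toy model of a queue that the broker deletes after an idle period."""

    def __init__(self, created_at: datetime, auto_delete_on_idle: timedelta):
        self.last_activity = created_at
        self.auto_delete_on_idle = auto_delete_on_idle

    def is_deleted(self, now: datetime) -> bool:
        # Deleted once the idle time reaches the configured time to live.
        return now - self.last_activity >= self.auto_delete_on_idle

    def record_activity(self, now: datetime) -> None:
        # Assumption: receiver activity resets the idle timer.
        if self.is_deleted(now):
            raise LookupError("queue was already deleted")
        self.last_activity = now


start = datetime(2024, 1, 1, 12, 0)
queue = IdleDeletingQueue(start, timedelta(hours=48))

# A 30-minute network outage: the instance returns well within the 48 hours.
after_outage = start + timedelta(minutes=30)
survived_outage = not queue.is_deleted(after_outage)
queue.record_activity(after_outage)

# The instance then crashes for good; nobody listens to the queue anymore.
still_there_after_47h = not queue.is_deleted(after_outage + timedelta(hours=47))
gone_after_48h = queue.is_deleted(after_outage + timedelta(hours=48))
```

With a 48-hour time to live, a transient outage of minutes or hours never costs a live instance its queue, while an orphaned queue is bounded to two days of waste.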

In summary, features depending on unique addressability can be made cloud-friendly with auto-delete queues, and the resulting volatility is harmonious with the inherent volatility of such features.

timbussmann commented 1 year ago

Hi @Timovzl

Sorry for taking so long to get back to you. Thanks for your detailed comment, that is very helpful 👍

From what I understand, the auto-delete feature you're asking for seems to be mostly related to the instance-specific queues, not the shared queues. I do agree that instance-specific queues are a bit of a pain to work with in serverless environments, where the hosting environment doesn't provide good integration points to set up and tear down instance-specific infrastructure and where the number of active instances is constantly changing.

However, as a generic concept in NServiceBus, I feel that the implementations and behaviors of the supported transports (where the feature is supported at all) vary too much to build a consistent feature across all endpoints.

Generally speaking, I'd say the best way to avoid these issues is to avoid instance-specific queues in highly dynamic scaling environments, for the reasons you described. As you also said, callbacks are probably the most common reason to use instance-specific queues. While this is clearly easier said than done, a general recommendation is to avoid callbacks because they introduce very time-sensitive coupling into messaging, which typically doesn't work well for various reasons. I'm aware that there are reasons to use callbacks, and often this is due to constraints and limitations of the application landscape, but generally speaking I'd definitely recommend considering these points:

One approach that might help with your current needs is to implement your own installer logic that creates/modifies the necessary resources the way you want them, rather than letting NServiceBus create the resources. This can also be done directly in your host code before starting the endpoint, or using a custom IHostedService when using the Microsoft generic host. These approaches allow you to use the queue-specific SDKs to provision your infrastructure the way you need it.

Timovzl commented 1 year ago

I appreciate your thoughtful response, @timbussmann.

If the team indeed considers the feature too niche and/or hard to uniformly get right, I will close this issue and look into using a custom IHostedService, as you proposed. I might create an extension package to add the feature, for specific transports.

I will address a few of the details you mentioned.

> However, as a generic concept in NServiceBus, I feel that the implementations and behaviors of the supported transports (where the feature is supported at all) vary too much to build a consistent feature across all endpoints.

I believe that the implementations for the three "big ones" that I linked were surprisingly (or perhaps unsurprisingly) consistent across the board.

> Avoid messaging for time-sensitive dependencies.

I believe this can be reasonably covered by the documented best practice to "CONSIDER grouping message handlers by SLA". It seems to consider 1-second-SLA messages reasonable (and I'd settle for a 10-second SLA): "When SLAs are mixed, it becomes possible for a 1-second-SLA message to get stuck in line behind a batch of 60-second-SLA messages, causing the SLA of the shorter message to be breached."

> E.g. Amazon doesn't fully validate your credit card when you place an order.

I'm a fan of this feature, where the business can get away with it. Yet complexities arise especially in situations where we cannot. For example, a payment service provider (PSP) might not have a banking license and thus not be permitted to give out advances to its clients, i.e. merchants. When merchants want to issue refunds to consumers, they need to have sufficient balance with their PSP to do so. So, if we must avoid advancing money and we also opt to validate in a deferred fashion, that means we cannot send a definitive result in the response. That is the kind of missing feature that causes merchants to go to a different PSP: we'd lose clients.

Sagas with built-in concurrency-safety are incredibly useful. NServiceBus truly offers a very powerful feature in that. They can help even mid-level developers implement complex use cases like the above correctly. The only missing ingredient is the bridge between API calls and messaging, specifically getting the result back into the API request handler.

timbussmann commented 1 year ago

> If the team indeed considers the feature too niche and/or hard to uniformly get right, I will close this issue and look into using a custom IHostedService, as you proposed. I might create an extension package to add the feature, for specific transports.

To be clear, my comments aren't a prioritization decision. I primarily wanted to share some thoughts about what makes this difficult to implement, especially from a generic (NServiceBus Core) perspective. I'd definitely keep this issue open so it can be further evaluated the next time the team is looking into core or ASB-specific feature requests. I'll even open an issue in the ASB repo specifically, as a transport-specific feature might be an easier starting point than a generic feature that is expected to be supported across all our transports.

If you do implement this on your own, please do share your learnings with us!

> The only missing ingredient is the bridge between API calls and messaging, specifically getting the result back into the API request handler.

Agreed, this is indeed a common struggle when trying to bridge from the synchronous communications world (e.g. HTTP requests) to internal asynchronous communication patterns. This bridging usually works best when the overall process can be redesigned to be more asynchronous even across the synchronous API boundaries, which is arguably one of the biggest challenges and something that can't be solved with code alone.