Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.
MIT License
2.35k stars 1.99k forks source link

Do not serve expired messages to the user #26632

Closed ramya-rao-a closed 2 years ago

ramya-rao-a commented 2 years ago

As reported in #25259, there is a chance for ServiceBusReceiverClient.receiveMessages() to return expired messages to the user even when prefetch is disabled. One of the side effects of this is that the user gets such messages served one more time. This issue is logged after a discussion with @shankarsama to explore the option of the client dropping such expired messages to

Update: The approach chosen is to release messages as soon as they are received by the client if there is no active receive call from the user. This will apply only when prefetch is disabled.

The check of whether a message is expired can be made by looking at the lockedUntil property on the message.

The potential downside of this approach is that a message might get dropped in such a manner multiple times reaching the max delivery count and get pushed to the dead letter queue without giving the user a chance to process it. This invalidates the at-least once promise. Therefore, it might be beneficial to add this feature behind an option and document that it is recommended to have a higher delivery count when using this option.

Adding @shankarsama and @yvgopal to provide any additional thoughts

cc @conniey, @anuchandy, @JoshLove-msft

richardpark-msft commented 2 years ago

Is time sync between clients and the service an issue? I don't think we're super reliant on it now (aside from doing auto-lock-renewal, but we don't delete or remove the message based on that expiring).

It kind of breaks our "let the service validate" model and seems a bit error prone if there's clock drift.

richardpark-msft commented 2 years ago

Another little thought - the entire point of prefetching is that you (as the caller) are so performant that you want credits to just be continually added, etc...

If they are getting messages served up that are expiring then they are failing there, and should consider tuning their setup. If we just filter these messages out they have no signal that's occurring.

JoshLove-msft commented 2 years ago

The way it works in .NET is that messages are released if there is no receiver waiting for it when the message comes over the wire. This makes it unlikely for a customer to get served an expired message. This is specifically for the case where Prefetch is NOT enabled. If prefetch is enabled, then it is possible for customers to get expired messages.

ramya-rao-a commented 2 years ago

Thanks @JoshLove-msft!

@shankarsama, If we can support the releasing of message, then that should help avoid delivery count problem. Do you know if the Java AMQP implementation supports this?

@richardpark-msft This issue is for the case where prefetch is not enabled.

ramya-rao-a commented 2 years ago

@yvgopal just confirmed that the proton-j library does indeed support setting the message state as released

void disposition​(DeliveryState state) is the same method that is used to settle messages and it can be used to set the state as released

richardpark-msft commented 2 years ago

I still wonder why we don't just want to use drain here. It's built for this exactly purpose.

JoshLove-msft commented 2 years ago

As discussed in a recent Service Bus office hours, one issue with drain is that it can result in inefficiencies when dealing with concurrent receives. For instance, when using the processor in .NET, the default behavior is to have concurrent receive calls. If we drained credits when one receive call completes, we may end up stomping on other receive calls that are still in flight. We would also end up sending credit flows more often. As such, it may be more performant to leave the credits on the line when there could be concurrent receives. Releasing the messages also has the benefit of being very simple to implement as it is a standard AMQP disposition that we already have API support for.

richardpark-msft commented 2 years ago

A lot of good points @JoshLove-msft, as expected! :)

1 . Releasing the messages also has the benefit of being very simple to implement as it is a standard AMQP disposition that we already have API support for.

This is a good point. Drain is also a part of AMQP (it's a bit set on the flow frame, and is documented as part of the AMQP spec). Either way is equivalent over time, but drain has the benefit of avoiding excess transfers (even if you ultimately just Release() them).

  1. One issue with drain is that it can result in inefficiencies when dealing with concurrent receives.

I think this very much depends on how your client API is structured. Ultimately adding more credits on a single link isn't going to result in faster parallel receives, unless you're faster than the caller's dispatch/handling rate, similar to a prefetch scenario. So this is really more about how clients retrieve messages from your receiver client.

I think there are two styles here:

I don't know if either approach is better than the others, but I think the strategy you choose here does depend on what you consider to be idiomatic to your client.

JoshLove-msft commented 2 years ago

I think this very much depends on how your client API is structured. Ultimately adding more credits on a single link isn't going to result in faster parallel receives, unless you're faster than the caller's dispatch/handling rate, similar to a prefetch scenario. So this is really more about how clients retrieve messages from your receiver client.

Well you would be avoiding superfluous credit transfers and stomping on existing receives, so it does seem like it should be more efficient to avoid this when you are in the concurrent receive model.

Imagine that receive call A is still in flight, and receive call B drains the link. I think in most libraries, you'd end up with receive call A still waiting for the full timeout despite the fact that there are no longer any credits on the link.

richardpark-msft commented 2 years ago

Well you would be avoiding superfluous credit transfers and stomping on existing receives, so it does seem like it should be more efficient to avoid this when you are in the concurrent receive model.

Imagine that receive call A is still in flight, and receive call B drains the link. I think in most libraries, you'd end up with receive call A still waiting for the full timeout despite the fact that there are no longer any credits on the link.

Yes, this is why I'm being very explicit - there are two different models for this API. If you want to handle locking/dispatching in the client and allow parallel ReceiveMessages() calls then drain after each receive completes doesn't really make sense. (ie, we are in agreement).

I don't know which particular style Java has, but they might have choices in this matter depending on what they've already published as a guarantee.

conniey commented 2 years ago

For the synchronous receiver, we have an outer loop that "while there is work" && "while the message buffer" is not empty, keep emitting messages downstream. The inner loop, for each buffered message, we try to emit it downstream. If we can't, currently we push it back into the message buffer.

I think in this case we'd want to update this section to instead, "Release" the message and break out of that while loop. https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/servicebus/azure-messaging-servicebus/src/main/java/com/azure/messaging/servicebus/SynchronousMessageSubscriber.java#L175-L177

It's possible that the next iteration of the outer while loop, a sync work item was added, so we can't just clear the entire message buffer.

Note: We should double-check the "REQUESTED" to make sure the number of messages we request upstream aligns with the current state of our subscriber.

yunhaoling commented 2 years ago

I would echo Richard's point on "If they are getting messages served up that are expiring then they are failing there, and should consider tuning their setup. If we just filter these messages out they have no signal that's occurring."

For Python users who encounter the situation of receiving expired messages, chances are that the message handling part blocks the program (as per Python singe thread model), leading messages in the local buffer expiring. Usually, I would recommend users to receive message one by one or handle the message in a separate thread.

I do think it's important to let users know/think why messages are coming expired:

and after this thinking process, users can tweak the configs and test out the application.


However, I don't think what I mentioned above turns down the idea of dropping expired messages. I can't think of ways how users could utilize those expired message except for logging -- I assume most users would just ignore and re-receive the message.

I think providing this option would simply users' life, however, it should be based on the fact users understand their application pattern and the consequence (resend messages), yet still wants to ignore the errors.

richardpark-msft commented 2 years ago

Are we sure that clock drift is not going to be an issue here? Until this point the drift between the client and the service hasn't mattered too much (maybe some inaccuracy with lock renewal).

We'd be actively and silently not returning things back to the user if we activated this feature.

JoshLove-msft commented 2 years ago

Are we sure that clock drift is not going to be an issue here? Until this point the drift between the client and the service hasn't mattered too much (maybe some inaccuracy with lock renewal).

We'd be actively and silently not returning things back to the user if we activated this feature.

Can you clarify where the expiry time comes into play here? I think the proposal is to just not serve messages when no is listening for them - not to filter based on expiry time.

richardpark-msft commented 2 years ago

Are we sure that clock drift is not going to be an issue here? Until this point the drift between the client and the service hasn't mattered too much (maybe some inaccuracy with lock renewal). We'd be actively and silently not returning things back to the user if we activated this feature.

Can you clarify where the expiry time comes into play here? I think the proposal is to just not serve messages when no is listening for them - not to filter based on expiry time.

Possible I'm confused.

The original proposal in the issue is talking about dropping messages based on time, not if there's an active listener.

(line from the issue text): "The check of whether a message is expired can be made by looking at the lockedUntil property on the message."

@yunhaoling's latest reply (not trying to single you out Adam!):

However, I don't think what I mentioned above turns down the idea of dropping expired messages.

Indicates that we're still thinking about dropping them based on some expiration criteria. My point is that that's dangerous territory and I don't think that should be on the table.

The issue has had a few forks/threads though, so maybe the actual discussion has morphed a bit. Might be worth updating the issue summary to indicate what we're dropping and what we're keeping from it.

ramya-rao-a commented 2 years ago

Issue description has been updated. Thanks @richardpark-msft!

ramya-rao-a commented 2 years ago

This has now been released as part of version 7.6.0 of the azure-messaging-servicebus package