Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.

[Bug] too many pull requests sent to Service Bus from spring-cloud-azure-starter-servicebus-jms library #30192

Closed yiliuTo closed 1 year ago

yiliuTo commented 2 years ago

Query/Question When using the @JmsListener API with the spring-cloud-azure-starter-servicebus-jms library, customers observed an ongoing, high volume of tryReceive requests sent from the spring-cloud-azure-starter-servicebus-jms library to Azure Service Bus, even when there are no messages in the broker. In the customer's application, the request rate is 250/min.

Here, tryReceive is a request type recorded on the server side that represents pull-message requests.

Customer scenario

  1. Library version and framework:
  2. API:
    • Using @JmsListener to receive messages from Service Bus topic subscription.
  3. Configuration:
     connection-string: Endpoint=xyz;SharedAccessKeyName=test-key-name;SharedAccessKey=key;EntityPath=path
     topic-client-id: topic-client-id
     client-Id: TestConnection
     topic: topicName
     subscription: subscriptionName
     idle-timeout: 20000
     pricing-tier: standard
  4. Env The application is deployed on AKS with replica number as 2.

Why is this not a Bug or a feature Request? The @JmsListener API with spring-cloud-azure-starter-servicebus-jms by default leverages the DefaultMessageListenerContainer to pull messages from Service Bus, so having background pull requests may be normal by default.

But there are still some gaps: why can the request number reach 250/min, and can we change the volume or the pull-message behavior? We need to investigate these questions.

ghost commented 2 years ago

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

hui1110 commented 2 years ago

When using two applications to connect to the Service Bus, the requests are close to 250/min. For each additional application, the requests increase by about 120.

As shown below:

[chart: Service Bus request metrics]

hui1110 commented 1 year ago

Add the following property to the configuration:

spring.jms.listener.receive-timeout=

Timeout to use for receiving calls. Use -1 for a no-wait receive or 0 for no timeout at all. The latter is only feasible if not running within a transaction manager and is generally discouraged since it prevents clean shutdown.

When there are no messages and the timeout is 0 (or a larger value), the number of Service Bus requests is close to 0. But the number of pull messages is not the same as the number of requests.
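For illustration, a minimal example of that setting (the value 0 here is illustrative, matching the "no timeout at all" case from the doc text above):

   # Block each receive() until a message arrives instead of timing out every second.
   spring.jms.listener.receive-timeout=0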

hui1110 commented 1 year ago

Used Arthas to monitor the number of executions of the schedule method, which matches the number of requests on Service Bus.

stliu commented 1 year ago
hui1110 commented 1 year ago

Used the following environments to try to reproduce the issue.

Environment 1

Environment 2

Project used: https://github.com/Azure-Samples/azure-spring-boot-samples/tree/spring-cloud-azure_v4.4.1/servicebus/spring-cloud-azure-starter-servicebus-jms/servicebus-jms-topic

Expect

In the receive() method, the timeout defaults to 1s, and in the pull method performPullIfRequired(long timeout, boolean treatAsPullConsumer), the timeout is the validity period of the pull request. Since the loop issues roughly one receive() per second, viewing the Metrics in Service Bus, the number of requests should be around 60 per minute.

Check

The number of Successful Requests is also the number of pull requests.

[chart: Service Bus Successful Requests metric]
yiliuTo commented 1 year ago

@hui1110 I still have some questions about your reproduction information:

  1. Since there is an "Environment 1", can I expect other, different environments?
  2. Please confirm that the version of spring-cloud-azure-starter-servicebus-jms you are using is 4.3.0 or 4.4.1.
  3. Please confirm whether you deployed the project to ASA or AKS. I believe we decided to use AKS for the reproduction.
  4. Assuming you used ASA, could you provide the configuration of your ASA service?
  5. For your local test env, please provide the CPU core count and memory.
  6. The expected behavior is not clear: what does the timeout mean? Could you provide a complete description, and explain why it is related to the number of requests?
  7. The chart of Service Bus metrics shows multiple metrics; which one stands for the tryReceive call we are discussing?
  8. Assuming you are using ASA, it's OK not to provide another chart since you might not have that env right now. But please delete the uncompleted sentence "As shown below:".
yiliuTo commented 1 year ago

Cause analysis

Under the hood, @JmsListener is a polling listener built for repeated polling attempts.

The listener runs an ongoing polling loop, each iteration invoking JMS MessageConsumer.receive() to poll the local consumer for messages to consume. By default, each poll operation triggers the local consumer to send 2 pull requests to the message broker.
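The loop can be sketched as follows. This is a simplified illustration of the polling behavior, not the actual Spring source (the real class is Spring's DefaultMessageListenerContainer); the class name and listener callback here are made up for clarity:

   import javax.jms.JMSException;
   import javax.jms.Message;
   import javax.jms.MessageConsumer;

   final class PollingLoopSketch {

       // Simplified polling loop: each iteration blocks in receive(timeout), and in
       // pull mode that call makes the local consumer send pull requests to the
       // broker even when it eventually returns null.
       static void pollUntilStopped(MessageConsumer consumer, long receiveTimeoutMillis,
               java.util.function.Consumer<Message> listener) throws JMSException {
           while (!Thread.currentThread().isInterrupted()) {
               Message message = consumer.receive(receiveTimeoutMillis); // triggers the pull requests
               if (message != null) {
                   listener.accept(message); // dispatch to the @JmsListener method
               }
               // When message == null the loop immediately starts the next poll,
               // which is why an empty entity still produces a steady request stream.
           }
       }
   }

On the wire, each poll shows up as the two AMQP flow frames below: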

[1343427689:1] -> Flow{nextIncomingId=1, incomingWindow=2047, nextOutgoingId=420, outgoingWindow=2147483647, handle=1, deliveryCount=209, linkCredit=1, available=null, drain=false, echo=false, properties=null}
[1343427689:1] -> Flow{nextIncomingId=1, incomingWindow=2047, nextOutgoingId=420, outgoingWindow=2147483647, handle=1, deliveryCount=209, linkCredit=1, available=null, drain=true, echo=false, properties=null}
[1343427689:1] <- Flow{nextIncomingId=420, incomingWindow=5000, nextOutgoingId=1, outgoingWindow=2047, handle=1, deliveryCount=210, linkCredit=0, available=0, drain=true, echo=false, properties=null}

The semantics of the two flow frames (pull requests): the first frame (drain=false) grants the broker one link credit; the second (drain=true) asks the broker to either use that credit immediately by delivering a message or exhaust it and report back, which is what the broker's reply (linkCredit=0, drain=true) acknowledges.

The concrete polling process is decided by several properties, including receiveTimeout, prefetchSize and receiveLocalOnly.

The complete flow is described in the picture below:

[diagram: jms-pull decision flow]

The customer's case, which is also the default configuration, corresponds to case 2.1: each receive() call sends pull requests to Service Bus. Therefore, the client keeps sending pull requests in a loop even though there are no messages in Service Bus. The frequency depends on the CPU status, performance, and how many tasks are being handled, which influence how fast the loop runs. That's why Yonghui could reproduce a similar number of requests as the customer on Azure, but the request number changed when running on his own local machine.

yiliuTo commented 1 year ago

Solution

There are two solutions to deal with this issue:

Solution 1. Change to push consumer and local-check only.

By changing the mode to push, the consumer becomes an asynchronous-notification consumer: it doesn't pull messages from the broker, but maintains a target amount of link credit. The amount is decided by a prefetch property. As Service Bus (the sender) pushes messages, the sender's link credit decreases, and when it falls below a threshold, the client sends a request to the server to raise the sender's link credit back to the desired target amount.

To accomplish this, users can add the following configuration. First, configure the prefetch number as non-zero, which configures the consumer as non-pull. There are several prefetch properties, each controlling a different Service Bus entity; users should choose the one(s) that apply to their case:

   spring.jms.servicebus.prefetch.all=<Fallback value for prefetch option in this Service Bus namespace>
   spring.jms.servicebus.prefetch.queue-prefetch=<The number of prefetch for queue.>
   spring.jms.servicebus.prefetch.queue-browser-prefetch=<The number of prefetch for queue browser.>
   spring.jms.servicebus.prefetch.topic-prefetch=<The number of prefetch for topic.>
   spring.jms.servicebus.prefetch.durable-topic-prefetch=<The number of prefetch for durable topic.>
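For example, for the customer's topic-subscription scenario (a durable subscription, given the configured topic-client-id), a minimal illustration could be (the value 10 is illustrative, not a recommendation):

   # Illustrative: prefetch for durable topic subscriptions; size it to your processing speed.
   spring.jms.servicebus.prefetch.durable-topic-prefetch=10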

Second, configure the local-only check by adding a configuration class for the factory customizer:

   // The customizer's package below assumes Spring Cloud Azure 4.x.
   import com.azure.spring.cloud.autoconfigure.jms.ServiceBusJmsConnectionFactoryCustomizer;
   import org.springframework.context.annotation.Bean;
   import org.springframework.context.annotation.Configuration;

   @Configuration(proxyBeanMethods = false)
   public class CustomJmsConfiguration {

       @Bean
       ServiceBusJmsConnectionFactoryCustomizer customizer() {
           return factory -> {
               // Only check the local buffer on receive(timeout) calls; never pull from the broker.
               factory.setReceiveLocalOnly(true);
               // Configure setReceiveNoWaitLocalOnly(true) below instead if you have set the
               // property spring.jms.listener.receive-timeout to a negative value (no-wait receive).
               //factory.setReceiveNoWaitLocalOnly(true);
           };
       }
   }

The value of prefetch affects how fast messages are dispatched to the consumer's local buffer. Users should adjust the value according to their consuming performance and message volumes. A suitable value can speed up the consuming process, while too large a prefetch can cause locally buffered messages to become outdated and be dispatched again. For low message volumes, where each message takes a long time to process, the prefetch should be set to 1. This ensures that a consumer is only processing one message at a time.

Solution 2. Change the receive timeout to extend each pull-blocking time.

The receive timeout property decides how long the consumer blocks waiting on each pull. By extending the timeout, users who keep the pull mode can reduce the pulling frequency and, in turn, the number of pull requests. In the extreme case, users can set the strategy to wait infinitely until a message arrives, which means the consumer only pulls again after consuming a message, so when there are no messages on the server it simply blocks and waits.

To accomplish this, users can configure the property below, which is of type java.time.Duration with a default value of 1 second.

spring.jms.listener.receive-timeout=

The meaning of the value is:

  • Setting the receive-timeout to 0 means the pull blocks infinitely until a message is dispatched.
  • Setting the receive-timeout to a positive value means the pull blocks for up to that amount of time.
  • Setting the receive-timeout to a negative value means the pull is a no-wait receive: it returns a message immediately, or null if none is available.
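For example, an illustrative setting (the 5-minute value is an assumption for demonstration, not a recommendation):

   # Block each receive() for up to 5 minutes: with an empty entity this cuts the
   # pull rate from ~60/min (the 1s default) down to ~0.2/min.
   spring.jms.listener.receive-timeout=5m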

Note, a high timeout value brings a side effect: it also extends the time the main thread stays in a blocked state, so the container will be less responsive to stop() calls - the container can only stop between receive() invocations.

Besides, since the container can only send the next request after the receive-timeout elapses, if that interval is longer than 10 minutes Service Bus will close the idle link, and the listener - which by default uses a CachingConnectionFactory - will no longer be able to send or receive. So if you require a high receive-timeout, please use the JmsPoolConnectionFactory alongside.
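A minimal sketch of that pairing, assuming the starter exposes its pooled-JMS support through the pool flag below (verify both properties against your starter version):

   # Assumption: switch from the default CachingConnectionFactory to JmsPoolConnectionFactory.
   spring.jms.cache.enabled=false
   spring.jms.servicebus.pool.enabled=true
   # A long blocking receive is then tolerable even past Service Bus's 10-minute idle window,
   # since the pool rebuilds closed connections.
   spring.jms.listener.receive-timeout=0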

yiliuTo commented 1 year ago

Hints about how to read the request metrics in Azure Portal for Service Bus:

  1. Difference between sum and count: Referring to the doc, the sum aggregation means how many requests were collected in the last interval, while count presents how many measurements were captured in the interval. So the sum can be calculated by adding the results of each measurement that happened in an interval. For example, if 5 measurements with values 10, 20, 30, 20 and 20 are captured in one interval, the count is 5 and the sum is 100.

  2. What is the aggregation interval? Service Bus sends data to the metrics pipeline at 1-minute aggregation. But when you view the metrics in the portal, they are aggregated to fit the total interval you choose. For example, if you pick a 30-day duration, it may have to use daily aggregation to fit the samples in the graph; if you pick a 24-hour duration, it may pick hourly aggregation. The aggregation window changes automatically based on the duration. A quick way to figure out the interval of your current chart is to view it as a scatter chart: the interval between two adjacent records is the actual aggregation interval.

yiliuTo commented 1 year ago

Closing this issue since both the cause and the solution are clear now.