[Service Bus] Investigate performance of Service Bus API

ramya0820 commented 5 years ago

This issue is to track

[x] general investigation of Service Bus API performance and user experience in light of customer reported problems on slowness of API
[ ] identify and confirm actionable items to take for addressing gaps, if any

ramya0820 commented 5 years ago

About receive,

In receiveAndDelete mode, following are the time taken to receive 1000 messages from a topic subscription using Standard tier entity with no partitioning
- Using Service Bus SDK with maxConcurrentCalls set to, this took 135415.899 ms. On setting the config to 1000, this took 2482.895ms
- Using amqp10 with is native credit and buffer management, this took 3496.847ms
- Using rhea with default credit_window size of 1000, this took 3387.434ms
On replicating sample with .NET SDK and setting maxConcurrentCalls and prefetch on client to 1000, the same scenario took 3248ms
Based on above, it seems like performance is on par with .NET, and recommendation would be to set maxConcurrentCalls with a default value of 1000.

About send,

If user is using async await to execute the operations, recommendations can be made to use Promise.all() for maximizing the interleaving of operations (since this is the crux of concurrency in general, at OS level as well)

cc @AlexGhiondea @ramya-rao-a

AlexGhiondea commented 5 years ago

What is the impact of setting the maxConcurrentCalls to 1000?

What does that impact?

What is the current default?

ramya0820 commented 5 years ago

What is the impact of setting the maxConcurrentCalls to 1000?

On the receive operations, this would result in faster throughput.

What does that impact?

If included as the default for Track 2, using the SDK and APIs out of the box for new users would be faster (in case setting this value was missed) Since this is configurable, it shouldn't impact any other feature. Stress and performance tests suite may surface issues - CPU usage may remain or go high and would need further investigation.

What is the current default?

The current default is set to 1.

AlexGhiondea commented 5 years ago

Do we know why a default of 1 was selected?

What happens when we have the maxConcurrentCalls set to 1000 -- what are the side-effects of that?

ramya-rao-a commented 5 years ago

Using any value greater than 1 creates unpredictable behavior for the user in case they were not expecting the second message to arrive wheel the first one is still being processed. We will be re-visiting this design in the coming months based on what we learnt in Event Hubs.

What I am concerned at the moment is the supposed difference in perf in ReceiveAndDelete mode when compared to PeekLock mode which has been reported.

ramya0820 commented 5 years ago

About using maxConcurrentCalls value of 100 and comparing the peekLock and receiveAndDelete modes, following are the results on it -

In peekLock mode (autoComplete set to true), receiving of 1000 messages with maxConcurrentCalls set to 100 took 8385.942 ms
In receiveAndDelete mode, receiving of 1000 messages with maxConcurrentCalls set to 100 took 5136.766 ms
However, using peekLock with autoComplete not set to true and explicitly completing message takes 14012.349 ms

Sample used and tweaked to switch between modes is as below

import { ServiceBusMessage } from "@azure/service-bus";

const { ServiceBusClient, ReceiveMode } = require("@azure/service-bus");

const connectionString =
  "";

const topic = "performance-test-topic";
const subscription = "performance-test-1";

const sleep = (waitTimeInMs: number) => new Promise((resolve) => setTimeout(resolve, waitTimeInMs));

async function main() {
  let messageCounter = 0;

  const ns = ServiceBusClient.createFromConnectionString(connectionString);

  const subscriptionClient = ns.createSubscriptionClient(topic, subscription);

  const receiver = subscriptionClient.createReceiver(ReceiveMode.receiveAndDelete);

  const onMessageHandler = async (brokeredMessage: ServiceBusMessage) => {
    messageCounter++;
    console.log(`Received message: ${messageCounter}`);
    if (messageCounter === 1000) {
      console.timeEnd("receive");
    }
    // await brokeredMessage.complete();
  };
  const onErrorHandler = (err: Error) => {
    console.log("Error occurred: ", err);
  };

  try {
    console.time("receive");
    receiver.registerMessageHandler(onMessageHandler, onErrorHandler, {
      maxConcurrentCalls: 100
    });

    await sleep(500000); // arbitrary delay

    await receiver.close();
    await subscriptionClient.close();
  } finally {
    await ns.close();
  }
}

main().catch((err) => {
  console.log("Error occurred: ", err);
});

ramya-rao-a commented 5 years ago

The difference between auto complete enabled and disabled in peek lock mode is understandable. We add credit for the next message right after user code is done executing.

When auto-complete is disabled, the user code pays the tax for the time taken to complete the message. When auto-complete is enabled, we the library complete the message after the credit has been added.

We do manual credit management because we wanted to control the number of messages in flight at any given point in time instead of flooding the user with messages.

ramya0820 commented 5 years ago

Agree, although it's unclear why receiveAndDelete was found to be slower, as this seems to be fastest among the 3 scenarios.

ramya0820 commented 5 years ago

Following are results of experiments carried out in local and on VMs to receive 1000 messages:

autoComplete disabled): 1212.628ms | 1504.236ms | 1128.053ms | 1260.830ms
peekLock mode (autoComplete enabled): 711.803ms | 715.551ms | 721.626ms
receiveAndDelete mode: 1713.555ms | 1144.770ms | 960.334ms | 899.197ms | 844.466ms | 789.568ms | 933.328ms | 930.445ms

The results do not reflect any 10x difference. The originally perceived difference by customer seems to be based on default value of maxConcurrentCalls and related defaults in rhea. We will keep the current defaults and investigate this further based on actionable items coming out of Event Hubs and Service Bus Track 2 work.

Closing as the performance related improvements and specific actionable items are being tracked in Performance Tests epic and individual customer issues.

Azure / azure-sdk-for-js

[Service Bus] Investigate performance of Service Bus API #4704