aws / aws-sdk-js

AWS SDK for JavaScript in the browser and Node.js
https://aws.amazon.com/developer/language/javascript/
Apache License 2.0

[SQS performance] High CPU usage #4471

Closed dungvv-qsoft closed 2 months ago

dungvv-qsoft commented 1 year ago

Describe the bug

We are facing an issue when we turn on SQS on our server: it uses 100% of the available CPU, compared with around 60% when the queue is turned off (with the same traffic).

We ran a profiler:

On the local server, to be able to generate a flame graph

With the queue enabled, the initial spike in resource usage is very high (image)

In the test without the queue, the initial spike is much lower than with the queue turned on (image)

Evidence: .cpuprofile and .text report files for both cases (queue on/off during the stress test): https://github.com/dungvv-qsoft/cpu-profile

On our production server

Queue on (CPU/RAM usage graph, from 3:20 to 3:30): maximum average CPU usage is almost 100% (image)

Queue off (CPU/RAM usage graph, from 4:15 to 4:30): maximum average CPU usage is around 60% (image)

Environment

Node version: v16.20.1
AWS-SDK version: v2.1106.0
SQS-Consumer version: v5.7.0

Snippet to initialize SQS consumer

export const sqsProvider = {
  provide: SQS_PROVIDER,
  useFactory: (): SQS => {
    const sqs = new SQS(getOptionsForAWSService(EAWSService.SQS));
    return sqs;
  },
};

....

export const getOptionsForAWSService = (service: EAWSService) => {
  return {
    credentials: new ChainableTemporaryCredentials({
      params: {
        RoleArn: config.get<string>(`aws.${service}.roleArn`),
        RoleSessionName: `zaapi-chat-${service}-${uuidV4()}`,
      },
      masterCredentials: new Credentials({
        accessKeyId: config.get('aws.accessKeyId'),
        secretAccessKey: config.get('aws.secretAccessKey'),
      }),
    }),
    region: config.get<string>('aws.region'),
    httpOptions: {
      agent: new https.Agent({
        keepAlive: true,
        maxSockets: config.get<number>('aws.maxSockets'),
      }),
    },
  };
};
constructor(
    @Inject(SQS_PROVIDER)
    private readonly sqs: SQS,
    private readonly sagaConsumer: SagaConsumer,
  ) {
    const queueConfigMap: Record<EChatChannel | string, string> = {
      [EChatChannel.LINE]: 'aws.sqs.queues.lineWebhookQueueFifo.url',
      [EChatChannel.FACEBOOK]: 'aws.sqs.queues.fbWebhookQueueFifo.url',
      [EChatChannel.INSTAGRAM]: 'aws.sqs.queues.instaWebhookQueueFifo.url',
      [EChatChannel.SHOPEE]: 'aws.sqs.queues.shopeeWebhookQueueFifo.url',
      [EChatChannel.LAZADA]: 'aws.sqs.queues.lazadaWebhookQueueFifo.url',
      [INACTIVE_STORE_SQS_INSTANCE_NAME]:
        'aws.sqs.queues.inactiveStoreWebhookQueue.url',
    };
    Object.entries(queueConfigMap).forEach(([channel, configKey]) => {
      this.queueUrls.set(channel, config.get<string>(configKey));
    });

    const baseSqsConfig = {
      handleMessageBatch: async (messages: SQS.Message[]) => {
        const jobGroups = _.groupBy(
          messages,
          (msg) => msg.MessageAttributes?.['Name'].StringValue || '',
        );
        await Promise.all(
          Object.entries(jobGroups).map(async ([jobName, events]) => {
            await this.sagaConsumer.consume(
              {
                name: WEBHOOK_EVENTS_RECEIVER_SAGA_NAME,
                jobName,
              },
              {
                events: events.map((event) => tryParseStringToJson(event.Body)),
              },
            );
          }),
        );
      },
      messageAttributeNames: ['Name'],
      batchSize: 10,
    };
    if (WebhookReceiverSqsDuplex.webhookReceiverQueueEnabled) {
      this.queueUrls.forEach((queueUrl, name) => {
        const sqsConsumer = this.createSqsConsumer(queueUrl, baseSqsConfig);
        this.sqsConsumers.set(name, sqsConsumer);
      });
    }
  }

  private createSqsConsumer(queueUrl: string, baseConfig: any): SQSConsumer {
    const sqsConsumer = SQSConsumer.create({
      queueUrl,
      sqs: this.sqs,
      ...baseConfig,
      pollingWaitTimeMs: config.get<number>('aws.sqs.pollingWaitTimeMs'),
    });

    sqsConsumer.on('error', (error) => {
      errorLog(error, { queueUrl }, `Error on Webhook Receiver SQS`);
    });

    return sqsConsumer;
  }

  onApplicationBootstrap() {
    this.sqsConsumers.forEach((sqsConsumer) => sqsConsumer.start());
  }

  beforeApplicationShutdown() {
    this.sqsConsumers.forEach((sqsConsumer) => sqsConsumer.stop());
  }

Expected Behavior

CPU usage should not be much higher than when the queue is turned off (with the same traffic).

Current Behavior

When we turn on SQS on our server, it uses 100% of the available CPU, compared with around 60% when the queue is turned off (with the same traffic).

Reproduction Steps

Run a stress test on the API that uses the sqs-consumer library: CPU usage increases significantly (it is not that high if we turn off the queue).

Possible Solution

No response

Additional Information/Context

No response

SDK version used

v2.1106.0

Environment details (OS name and version, etc.)

Fargate

aBurmeseDev commented 2 months ago

Hi there - apologies for the long silence here. I'm not sure if this issue is still ongoing for you but I'd like to share my thoughts on it.

The high CPU usage when the SQS queue is enabled could be due to several reasons:

First, the polling mechanism used by the SQS consumer library could be a contributing factor. Frequent polling for new messages can be computationally intensive, especially if the polling interval is set to a low value or if multiple queues are polled simultaneously. Raising the pollingWaitTimeMs setting or reducing the number of concurrent queue polls might help mitigate this.
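To get a feel for how the polling interval and the number of consumers interact, here is a rough back-of-the-envelope model. It is only a sketch: the latency and handling times are made-up assumptions, not measurements from this issue, and `pollsPerSecond` is a hypothetical helper, not part of sqs-consumer. The count of six consumers mirrors the six queue entries in the snippet above.

```typescript
// Rough model of how often N independent consumers issue a receive call.
// All timings below are illustrative assumptions, not measured values.
interface PollModel {
  consumers: number;         // consumers polling concurrently
  receiveLatencyMs: number;  // assumed duration of one receive call
  handleMs: number;          // assumed time spent handling one batch
  pollingWaitTimeMs: number; // pause before the next poll
}

function pollsPerSecond(m: PollModel): number {
  const cycleMs = m.receiveLatencyMs + m.handleMs + m.pollingWaitTimeMs;
  if (cycleMs <= 0) {
    throw new RangeError("cycle time must be positive");
  }
  // Each consumer completes 1000 / cycleMs cycles per second.
  return (m.consumers * 1000) / cycleMs;
}

const busy: PollModel = {
  consumers: 6,
  receiveLatencyMs: 20,
  handleMs: 30,
  pollingWaitTimeMs: 0,
};

console.log(pollsPerSecond(busy));                               // 120
console.log(pollsPerSecond({ ...busy, pollingWaitTimeMs: 50 })); // 60
```

Under these assumed numbers, a 50 ms pause halves the request rate. Real behavior also depends on long polling and on how full the queues are, which this model ignores.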

Additionally, the batch size parameter batchSize determines the maximum number of messages retrieved from the queue in a single batch. A larger batch size could potentially lead to higher CPU usage due to processing a greater number of messages concurrently. Experimenting with a lower batchSize value may help identify if this parameter is a significant factor in the increased CPU usage.
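In the handler from the original report, each batch is grouped by its Name attribute and the groups are consumed concurrently via Promise.all, so the number of concurrent consume calls per batch equals the number of distinct names and is capped by batchSize. The sketch below reproduces just that grouping step so the effect of a given batch can be inspected in isolation. `Msg` is a minimal stand-in for the SDK's message type, and `groupByJobName` is a hypothetical helper that replaces the lodash call.

```typescript
// Minimal stand-in for the message fields the handler reads.
interface Msg {
  Body?: string;
  MessageAttributes?: Record<string, { StringValue?: string }>;
}

// Group a batch by its 'Name' attribute; messages without one fall into ''.
function groupByJobName(batch: Msg[]): Map<string, Msg[]> {
  const groups = new Map<string, Msg[]>();
  for (const msg of batch) {
    const name = msg.MessageAttributes?.["Name"]?.StringValue || "";
    const group = groups.get(name);
    if (group) {
      group.push(msg);
    } else {
      groups.set(name, [msg]);
    }
  }
  return groups;
}

// Illustrative batch; the attribute values are made up.
const batch: Msg[] = [
  { Body: "{}", MessageAttributes: { Name: { StringValue: "line" } } },
  { Body: "{}", MessageAttributes: { Name: { StringValue: "line" } } },
  { Body: "{}", MessageAttributes: { Name: { StringValue: "facebook" } } },
  { Body: "{}" }, // no attributes at all
];

const groups = groupByJobName(batch);
console.log(groups.size); // 3 groups, so 3 concurrent consume calls for this batch
```

Logging the group count per batch under load would be one cheap way to check whether lowering batchSize actually reduces concurrency in practice.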

The way you configure the SDK could also impact CPU usage. For example, if you're creating a new SDK instance for each request, it can be inefficient and lead to higher CPU usage. Consider creating a shared instance of the SDK and reusing it across requests.
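As an illustration of the shared-instance idea, here is a minimal sketch of a create-once wrapper. `FakeClient` is a hypothetical stand-in for an SDK client, `once` is an illustrative helper, and the region string is a placeholder. The snippet in the report already appears to share a single client through SQS_PROVIDER, so this mainly shows what to check for elsewhere in the codebase.

```typescript
// Hypothetical stand-in for an SDK client; it counts how often it is built.
class FakeClient {
  static constructed = 0;
  readonly region: string;

  constructor(region: string) {
    this.region = region;
    FakeClient.constructed += 1;
  }
}

// Wrap a factory so it runs at most once; every caller gets the same instance.
function once<T>(factory: () => T): () => T {
  let created = false;
  let instance: T | undefined;
  return () => {
    if (!created) {
      instance = factory();
      created = true;
    }
    return instance as T;
  };
}

// Placeholder region, chosen only for the example.
const getClient = once(() => new FakeClient("us-east-1"));

const first = getClient();
const second = getClient();
console.log(first === second);       // true
console.log(FakeClient.constructed); // 1
```

A dependency-injection provider with singleton scope achieves the same effect. The point is only that client construction happens once rather than per request.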

Hope it helps, John