Azure / azure-sdk-for-js

This repository is for active development of the Azure SDK for JavaScript (NodeJS & Browser). For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/javascript/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-js.
MIT License
2.09k stars 1.21k forks source link

Out-of-memory & 134 Error while update the new SDK Version #26591

Closed vikas-mundra closed 1 year ago

vikas-mundra commented 1 year ago

Describe the bug Event Hub Configurastion:- 32 partition Throughput Units 2

As I have Migrate from 5.8.0 SDK to 5.11.1 I have added const subscribeOptions: SubscribeOptions = { maxBatchSize: +config.get('eventHubSubscribeOptions.MAX_BATCH_SIZE'), // 20 maxWaitTimeInSeconds: +config.get('eventHubSubscribeOptions.MAX_WAIT_TIME_IN_SEC'), // 1 prefetchCount: +config.get('eventHubSubscribeOptions.PREFETCH_COUNT') // 60 }; My POD configurastion 1 core CPU 6 GB Memory 5 GB MAX_OLD_SPACE_SIZE

I got out of memory error after deploying the new sdk also time getting 134 SIGABORT.

Let me know if I have added something wrong or some issue in the new SDK

To Reproduce Steps to reproduce the behavior:

  1. update the new sdk from 5.8.0 to 5.11.1 and pass the following config const subscribeOptions: SubscribeOptions = { maxBatchSize: +config.get('eventHubSubscribeOptions.MAX_BATCH_SIZE'), // 20 maxWaitTimeInSeconds: +config.get('eventHubSubscribeOptions.MAX_WAIT_TIME_IN_SEC'), // 1 prefetchCount: +config.get('eventHubSubscribeOptions.PREFETCH_COUNT') // 60 }; Kubernetes Pods restart automatically with oomemory and 134 error

Expected behavior A clear and concise description of what you expected to happen.\

Pods not crashed

Screenshots If applicable, add screenshots to help explain your problem.

image

See the above image the read will increase but not able to process data in pods due to out-of-memory and 134 error

jeremymeng commented 1 year ago

@deyaaeldeen @HarshaNalluru

deyaaeldeen commented 1 year ago

Thanks for the report! We're running tests with the exact values you shared and will keep you posted if we find anything. In the meantime, did you check whether the processEvents callback is leaking?

github-actions[bot] commented 1 year ago

Hi @vikas-mundra. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

vikas-mundra commented 1 year ago

Hi, The issue was we didn't get time to check which method gave an error. Scenario:- Our Event hub has a long queue. For we need to clear the queue we have updated sdk from 5.8.0 to 5.11.1 and with the following configuration 1 Core CPU with 6 GB ram and 5 GB max old space size with 4 Pod deployed in PRD environment SDK configuration is as follows:-

  1. maxBatchSize : (100 / 50 / 20 / 10 / 5)
    1. maxWaitTimeInSeconds (1)
  2. prefetchCount: ( 300 / 150/ 60/ 30/ 15)
  3. CHECKPOINT_FREQUENCY: - (300 / 200 / 100 / 50 / 30)

Observation:- Once we decreased the prefetchCount my pod will work for 10 min then the same issue happened out of memory (FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory) and had 134 errors and my pod was repetitively restarted.

Now if we use the same configuration ( 1 Core CPU with 4 GB ram and 3 GB max old space size with 4 Pod deployed in PRD environment) with 5.8.0 SDK and the following configuration of SDK is:-

  1. maxBatchSize : (600)
    1. maxWaitTimeInSeconds (1)
  2. CHECKPOINT_FREQUENCY (30/ 50/ 100/ 250)

It's working fine and no pod was restarted till now for the last 4 days.

deyaaeldeen commented 1 year ago

@vikas-mundra Thanks for sharing! Could you share more how the processEvents callback is defined? We'll try to repro on our side and any insights into how you consume the events will help a lot.

@harshanalluru I understand you started a stress test with the original values shared above, how is that going?

vikas-mundra commented 1 year ago

processEvents: async (events, context) => { try { app_logger_1.AppLogger.info(Received events array length is: ${events === null || events === void 0 ? void 0 : events.length}, context: ${this.getContextString(context)}, EventHubConsumerService.name); for (const event of events) { await eventHubConfiguration.eventHubEventHandler(event, context); if (eventHubConfiguration.blobCheckpointStore) { await this.updateEventCheckpoint(context, event, eventHubConfiguration); } } if (events.length > 0) { this.trackEventhubError[eventHubConfiguration.name].errorCount = 0; this.trackEventhubError[eventHubConfiguration.name].restartCount = 0; } } catch (error) { app_logger_1.AppLogger.error('processEvents: Failed', EventHubConsumerService.name); app_logger_1.AppLogger.error(error); } }, processError: async (error, context) => { try { app_logger_1.AppLogger.error(Error in listening on eventHub, context: ${this.getContextString(context)}, error: ${error}}, EventHubConsumerService.name); await this.onEventHubError(error, context, eventHubConfiguration); } catch (error) { app_logger_1.AppLogger.error('processError: Failed', EventHubConsumerService.name); app_logger_1.AppLogger.error(error); } }, processInitialize: async (context) => { try { app_logger_1.AppLogger.info(Start listening on eventHub: ${context.eventHubName} and partitionId:${context.partitionId}, context: ${this.getContextString(context)}, EventHubConsumerService.name); if (eventHubConfiguration.eventHubInitializerCallback) { await eventHubConfiguration.eventHubInitializerCallback(context); } } catch (error) { app_logger_1.AppLogger.error('processInitialize: Failed', EventHubConsumerService.name); app_logger_1.AppLogger.error(error); } }, processClose: async (reason, context) => { try { app_logger_1.AppLogger.info(Stopped listening on eventHub: ${context.eventHubName} and\ partitionId: ${context.partitionId} with reason: ${reason}, context: ${this.getContextString(context)}, EventHubConsumerService.name); if (eventHubConfiguration.eventHubCloseCallback) { await eventHubConfiguration.eventHubCloseCallback(reason, context); } } catch (error) { app_logger_1.AppLogger.error('processClose: Failed', EventHubConsumerService.name); app_logger_1.AppLogger.error(error); } }

deyaaeldeen commented 1 year ago

@vikas-mundra Thanks for sharing! Any insights into what await eventHubConfiguration.eventHubEventHandler(event, context); does? Also, could you open a support ticket with logs so we can take a closer look?

vikas-mundra commented 1 year ago

below are the function code of that

`this.logMsgOnEventReceived(eventData); const deviceId = eventData?.systemProperties?.['iothub-connection-device-id'] as unknown as string; const messageSource = eventData?.systemProperties?.[config.get('eventHub.MESSAGE_SOURCE_KEY') ] as unknown as string; const message = eventData?.body; const hubName = eventData?.properties?.hubName as unknown as string; const isEventGrid = ((eventData?.properties?.['aeg-event-type'] as unknown as string) === 'Notification'); const moduleId = eventData?.systemProperties?.['iothub-connection-module-id']; const messageType = (eventData?.properties?.drs?.type || messageSource?.toLowerCase()) as string; const eventProperties = eventData?.properties;

    try {
        if (this.messageTypes[messageType]) {
            const eventHandler: EventHandler = this.moduleRef.get(
                // eslint-disable-next-line @typescript-eslint/no-unsafe-argument
                this.messageTypes[messageType]);
            await eventHandler?.handleMessage(message, eventData?.properties);
        } else if (messageSource === config.get<string>('eventHub.MESSAGE_SOURCE_VALUE_TWIN') &&
            deviceId && message && hubName && !moduleId) {
            await this.twinChangeEventHandler.handleDeviceTwinEvent(
                deviceId,
                hubName,
                message as unknown as IoTDeviceTwin,
                new Date(eventData?.systemProperties?.['iothub-enqueuedtime'] as unknown as Date)
            );
        } else if (isEventGrid) {
            await this.deviceLifeCycleEventHandler.handleDeviceLifeCycleEvent(message, eventProperties as unknown as EventProperties);
        } else {
            this.appLoggerService.info(
                `Unhandled/Invalid event with given parameters: ${JSON.stringify({
                    systemProperties: eventData?.systemProperties,
                    eventProperties
                })}`,
                EventHubService.name
            );
        }
        // increment message processed success metric
        this.appMetricsService.increment(
            `${context.eventHubName}.${context.consumerGroup}.${context.partitionId}.message.processing.success.count`,
            1,
            (statsDError) => {
                if (statsDError) {
                    this.appLoggerService.error(EventHubService.name, 'Unable to send statsD metric');
                    this.appLoggerService.error(EventHubService.name, statsDError);
                }
            }
        );
    } catch (error) {
        this.appLoggerService.error(
            `Failed to process event for deviceId: ${deviceId}`,
            EventHubService.name
        );
        // increment message processed failed metric
        this.appMetricsService.increment(
            `${context.eventHubName}.${context.consumerGroup}.${context.partitionId}.message.processing.failed.count`,
            1,
            (statsDError) => {
                if (statsDError) {
                    this.appLoggerService.error(EventHubService.name, 'Unable to send statsD metric');
                    this.appLoggerService.error(EventHubService.name, statsDError);
                }
            }
        );
        this.appLoggerService.error(error);
    }`
vikas-mundra commented 1 year ago

Aldo created the support request and the support id was 2308010050000437 also added the logs file in that only

HarshaNalluru commented 1 year ago

I understand you started a stress test with the original values shared above, how is that going?

I triggered the stress test last week. So far, we have not seen any memory increment. image

The test that ran for a week terminated last night perhaps due to resource deletion. (Looking into it now)

I've updated the test to closely replicate @vikas-mundra's setup/config and triggered another run in an attempt to replicate. Let me know if the test here needs to be updated @deyaaeldeen @vikas-mundra https://github.com/Azure/azure-sdk-for-js/pull/26599

Let's see if the new run can help us with the replication :)

vikas-mundra commented 1 year ago

Hi,

currently, we are using the following version of the node that I see the difference in your chart root@data-ingestor-devstable-846d6966c6-hgfql:/# node --version v14.15.4

root@data-ingestor-devstable-846d6966c6-hgfql:/# npm -version 6.14.10

We also deployed the same code in our dev environment and we still face the same issue.

image

See on above image it kept restarting the pod around 3K + times

image

Now again it restarted with out-of-memory error

HarshaNalluru commented 1 year ago

currently, we are using the following version of the node that I see the difference in your chart

The previous run was with node 18, the new test run is with node 14, if you check the PR https://github.com/Azure/azure-sdk-for-js/pull/26599. Will keep you posted with the updates from the test run :)

vikas-mundra commented 1 year ago

We have the following version of the node.

root@data-ingestor-devstable-846d6966c6-hgfql:/# node --version v14.15.4

root@data-ingestor-devstable-846d6966c6-hgfql:/# npm -version 6.14.10

can you try on the above mention of node version

HarshaNalluru commented 1 year ago

Hey @vikas-mundra,

Updates from the recent stress test run.. our test has been running just fine so far, we don't see the memory leak. image Ignore the node18 in the image, I missed updating the label, the test is indeed using node14 :) Updated the label now for future runs.

https://github.com/Azure/azure-sdk-for-js/pull/26599 - This is the stress test and the config closely replicating yours. Please take a look at the draft PR and let us know if you feel we're doing something drastically different.


From the code you shared above in the comments, I don't see anything out of the ordinary from the eventHubConfiguration.eventHubEventHandler method. I suspect there could be some other code in your application that is responsible for the memory leak and is affecting the event-hubs client eventually.

Also, since you have mentioned, the pod restarted 3K times upon seeing the error, I am starting to think of the other callbacks that might be masking the problem. Could you also share what you have at eventHubConfiguration.eventHubInitializerCallback and eventHubConfiguration.eventHubCloseCallback?

It would also be great if you could provide the logs at the support ticket. @deyaaeldeen mentioned offline that the support team did not get the logs.

github-actions[bot] commented 1 year ago

Hi @vikas-mundra. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

github-actions[bot] commented 1 year ago

Hi @vikas-mundra, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!