Closed vikas-mundra closed 1 year ago
@deyaaeldeen @HarshaNalluru
Thanks for the report! We're running tests with the exact values you shared and will keep you posted if we find anything. In the meantime, did you check whether the `processEvents` callback is leaking?
Hi @vikas-mundra. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.
Hi, the issue was that we didn't get time to check which method gave the error. Scenario: our Event Hub has a long queue. To clear the queue, we updated the SDK from 5.8.0 to 5.11.1 and deployed with 1 CPU core, 6 GB RAM, and a 5 GB max old space size, across 4 pods in the PRD environment. The SDK configuration is as follows:
- maxBatchSize: 100 / 50 / 20 / 10 / 5
- maxWaitTimeInSeconds: 1
- prefetchCount: 300 / 150 / 60 / 30 / 15
- CHECKPOINT_FREQUENCY: 300 / 200 / 100 / 50 / 30
Observation: once we decreased the prefetchCount, a pod would run for about 10 minutes and then hit the same out-of-memory issue (FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory), exit with code 134, and restart repeatedly.
Now, with SDK 5.8.0 and a smaller footprint (1 CPU core, 4 GB RAM, 3 GB max old space size, 4 pods deployed in the PRD environment), the following SDK configuration:
- maxBatchSize: 600
- maxWaitTimeInSeconds: 1
- CHECKPOINT_FREQUENCY: 30 / 50 / 100 / 250
works fine, and no pod has restarted in the last 4 days.
@vikas-mundra Thanks for sharing! Could you share more about how the `processEvents` callback is defined? We'll try to repro on our side, and any insight into how you consume the events will help a lot.
@harshanalluru I understand you started a stress test with the original values shared above; how is that going?
```js
processEvents: async (events, context) => {
  try {
    app_logger_1.AppLogger.info(`Received events array length is: ${events === null || events === void 0 ? void 0 : events.length}, context: ${this.getContextString(context)}`, EventHubConsumerService.name);
    for (const event of events) {
      await eventHubConfiguration.eventHubEventHandler(event, context);
      if (eventHubConfiguration.blobCheckpointStore) {
        await this.updateEventCheckpoint(context, event, eventHubConfiguration);
      }
    }
    if (events.length > 0) {
      this.trackEventhubError[eventHubConfiguration.name].errorCount = 0;
      this.trackEventhubError[eventHubConfiguration.name].restartCount = 0;
    }
  } catch (error) {
    app_logger_1.AppLogger.error('processEvents: Failed', EventHubConsumerService.name);
    app_logger_1.AppLogger.error(error);
  }
},
processError: async (error, context) => {
  try {
    app_logger_1.AppLogger.error(`Error in listening on eventHub, context: ${this.getContextString(context)}, error: ${error}`, EventHubConsumerService.name);
    await this.onEventHubError(error, context, eventHubConfiguration);
  } catch (error) {
    app_logger_1.AppLogger.error('processError: Failed', EventHubConsumerService.name);
    app_logger_1.AppLogger.error(error);
  }
},
processInitialize: async (context) => {
  try {
    app_logger_1.AppLogger.info(`Start listening on eventHub: ${context.eventHubName} and partitionId: ${context.partitionId}, context: ${this.getContextString(context)}`, EventHubConsumerService.name);
    if (eventHubConfiguration.eventHubInitializerCallback) {
      await eventHubConfiguration.eventHubInitializerCallback(context);
    }
  } catch (error) {
    app_logger_1.AppLogger.error('processInitialize: Failed', EventHubConsumerService.name);
    app_logger_1.AppLogger.error(error);
  }
},
processClose: async (reason, context) => {
  try {
    app_logger_1.AppLogger.info(`Stopped listening on eventHub: ${context.eventHubName} and partitionId: ${context.partitionId} with reason: ${reason}, context: ${this.getContextString(context)}`, EventHubConsumerService.name);
    if (eventHubConfiguration.eventHubCloseCallback) {
      await eventHubConfiguration.eventHubCloseCallback(reason, context);
    }
  } catch (error) {
    app_logger_1.AppLogger.error('processClose: Failed', EventHubConsumerService.name);
    app_logger_1.AppLogger.error(error);
  }
}
```
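The `processEvents` loop above checkpoints after every event when a `blobCheckpointStore` is configured, while the reported config exposes a `CHECKPOINT_FREQUENCY` knob. As a minimal sketch of what checkpointing every Nth event (plus the last event of each batch, so progress is never lost across batches) could look like — all names here (`planCheckpoints`, `EventLike`) are illustrative and not from the SDK or the reporter's code:

```typescript
// Illustrative only: decide which events in a batch to checkpoint,
// checkpointing every `frequency`-th event and always the batch's last event.
type EventLike = { sequenceNumber: number };

function planCheckpoints(events: EventLike[], frequency: number): number[] {
  const sequenceNumbers: number[] = [];
  events.forEach((event, i) => {
    const isNth = (i + 1) % frequency === 0; // every frequency-th event
    const isLast = i === events.length - 1;  // always persist batch progress
    if (isNth || isLast) {
      sequenceNumbers.push(event.sequenceNumber);
    }
  });
  return sequenceNumbers;
}
```

For a 5-event batch with a frequency of 2, this would checkpoint events 2, 4, and 5 instead of all five, trading replay-on-restart for fewer blob writes per batch.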
@vikas-mundra Thanks for sharing! Any insights into what `await eventHubConfiguration.eventHubEventHandler(event, context);` does? Also, could you open a support ticket with logs so we can take a closer look?
Below is the code of that function:
```ts
this.logMsgOnEventReceived(eventData);
const deviceId = eventData?.systemProperties?.['iothub-connection-device-id'] as unknown as string;
const messageSource = eventData?.systemProperties?.[config.get // NOTE: line truncated in the original paste
try {
  if (this.messageTypes[messageType]) {
    const eventHandler: EventHandler = this.moduleRef.get(
      // eslint-disable-next-line @typescript-eslint/no-unsafe-argument
      this.messageTypes[messageType]);
    await eventHandler?.handleMessage(message, eventData?.properties);
  } else if (messageSource === config.get<string>('eventHub.MESSAGE_SOURCE_VALUE_TWIN') &&
    deviceId && message && hubName && !moduleId) {
    await this.twinChangeEventHandler.handleDeviceTwinEvent(
      deviceId,
      hubName,
      message as unknown as IoTDeviceTwin,
      new Date(eventData?.systemProperties?.['iothub-enqueuedtime'] as unknown as Date)
    );
  } else if (isEventGrid) {
    await this.deviceLifeCycleEventHandler.handleDeviceLifeCycleEvent(message, eventProperties as unknown as EventProperties);
  } else {
    this.appLoggerService.info(
      `Unhandled/Invalid event with given parameters: ${JSON.stringify({
        systemProperties: eventData?.systemProperties,
        eventProperties
      })}`,
      EventHubService.name
    );
  }
  // increment message processed success metric
  this.appMetricsService.increment(
    `${context.eventHubName}.${context.consumerGroup}.${context.partitionId}.message.processing.success.count`,
    1,
    (statsDError) => {
      if (statsDError) {
        this.appLoggerService.error(EventHubService.name, 'Unable to send statsD metric');
        this.appLoggerService.error(EventHubService.name, statsDError);
      }
    }
  );
} catch (error) {
  this.appLoggerService.error(
    `Failed to process event for deviceId: ${deviceId}`,
    EventHubService.name
  );
  // increment message processed failed metric
  this.appMetricsService.increment(
    `${context.eventHubName}.${context.consumerGroup}.${context.partitionId}.message.processing.failed.count`,
    1,
    (statsDError) => {
      if (statsDError) {
        this.appLoggerService.error(EventHubService.name, 'Unable to send statsD metric');
        this.appLoggerService.error(EventHubService.name, statsDError);
      }
    }
  );
  this.appLoggerService.error(error);
}
```
Also, I created the support request; the support ID was 2308010050000437, and I added the logs file there as well.
I understand you started a stress test with the original values shared above; how is that going?
I triggered the stress test last week. So far, we have not seen any memory increment.
The test that ran for a week terminated last night perhaps due to resource deletion. (Looking into it now)
I've updated the test to closely replicate @vikas-mundra's setup/config and triggered another run in an attempt to replicate. Let me know if the test here needs to be updated @deyaaeldeen @vikas-mundra https://github.com/Azure/azure-sdk-for-js/pull/26599
Let's see if the new run can help us with the replication :)
Hi,
Currently, we are using the following versions of Node and npm, which differ from your chart:

```
root@data-ingestor-devstable-846d6966c6-hgfql:/# node --version
v14.15.4
root@data-ingestor-devstable-846d6966c6-hgfql:/# npm -version
6.14.10
```
We also deployed the same code in our dev environment, and we still face the same issue.
See in the above image: it kept restarting the pod around 3K+ times.
Now it has restarted again with an out-of-memory error.
> currently, we are using the following version of the node that I see the difference in your chart

The previous run was with Node 18; the new test run is with Node 14, as you can check in the PR https://github.com/Azure/azure-sdk-for-js/pull/26599. Will keep you posted with updates from the test run :)
We have the following versions of Node and npm:

```
root@data-ingestor-devstable-846d6966c6-hgfql:/# node --version
v14.15.4
root@data-ingestor-devstable-846d6966c6-hgfql:/# npm -version
6.14.10
```

Can you try with the above-mentioned Node version?
Hey @vikas-mundra,
Updates from the recent stress test run: our test has been running just fine so far, and we don't see the memory leak. Ignore the node18 in the image; I missed updating the label. The test is indeed using Node 14 :) I've updated the label now for future runs.
https://github.com/Azure/azure-sdk-for-js/pull/26599 - This is the stress test and the config closely replicating yours. Please take a look at the draft PR and let us know if you feel we're doing something drastically different.
From the code you shared above in the comments, I don't see anything out of the ordinary in the `eventHubConfiguration.eventHubEventHandler` method.
I suspect there could be some other code in your application that is responsible for the memory leak and is affecting the event-hubs client eventually.
Also, since you mentioned the pod restarted 3K times upon seeing the error, I am starting to wonder whether the other callbacks might be masking the problem.
Could you also share what you have in `eventHubConfiguration.eventHubInitializerCallback` and `eventHubConfiguration.eventHubCloseCallback`?
It would also be great if you could provide the logs on the support ticket; @deyaaeldeen mentioned offline that the support team did not receive the logs.
Hi @vikas-mundra, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!
Describe the bug
Event Hub configuration: 32 partitions, 2 throughput units.
After migrating from SDK 5.8.0 to 5.11.1, I added:

```ts
const subscribeOptions: SubscribeOptions = {
  maxBatchSize: +config.get('eventHubSubscribeOptions.MAX_BATCH_SIZE'), // 20
  maxWaitTimeInSeconds: +config.get('eventHubSubscribeOptions.MAX_WAIT_TIME_IN_SEC'), // 1
  prefetchCount: +config.get('eventHubSubscribeOptions.PREFETCH_COUNT') // 60
};
```
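As a hedged sketch of what that config parsing amounts to — assuming, as the `+config.get(...)` coercions imply, that the raw values arrive as strings; the helper name and defaults (which mirror the commented values) are mine, not from the reporter's code:

```typescript
// Illustrative: coerce string config values into the numeric fields
// expected by the SDK's SubscribeOptions.
interface SubscribeOptionsShape {
  maxBatchSize: number;
  maxWaitTimeInSeconds: number;
  prefetchCount: number;
}

function buildSubscribeOptions(
  raw: Record<string, string | undefined>
): SubscribeOptionsShape {
  return {
    maxBatchSize: Number(raw["MAX_BATCH_SIZE"] ?? "20"),
    maxWaitTimeInSeconds: Number(raw["MAX_WAIT_TIME_IN_SEC"] ?? "1"),
    prefetchCount: Number(raw["PREFETCH_COUNT"] ?? "60"),
  };
}
```

The resulting object would then be passed as the second argument to `EventHubConsumerClient.subscribe`, alongside the handler callbacks shown earlier in the thread.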
My pod configuration:
- 1 CPU core
- 6 GB memory
- 5 GB MAX_OLD_SPACE_SIZE
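For context, the `MAX_OLD_SPACE_SIZE` value maps to Node's `--max-old-space-size` flag, whose unit is megabytes. A 5 GB cap would be set roughly like this (the entrypoint path is a placeholder, not from the report):

```shell
# V8 old-generation heap cap, in MB (~5 GB here); app path is a placeholder.
node --max-old-space-size=5120 dist/main.js
```

Note the cap applies per process, so four pods each get their own 5 GB limit independent of the pod's 6 GB memory request.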
I got an out-of-memory error after deploying the new SDK, and at times exit code 134 (SIGABRT).
Let me know if I have configured something wrong or if there is an issue in the new SDK.
To Reproduce
See the configuration above.
Expected behavior
Pods should not crash.
Screenshots
See the above image: reads increase, but the pods are unable to process data due to the out-of-memory errors and exit code 134.