Azure / azure-sdk-for-java

This repository is for active development of the Azure SDK for Java. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/java/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-java.

[BUG] Azure Event hub | One partition suddenly stops receiving messages #15164

Closed yuhaii closed 4 years ago

yuhaii commented 4 years ago

Describe the bug We use the SDK versions below to receive messages from Event Hubs.

com.azure:azure-messaging-eventhubs:5.1.1
com.azure:azure-messaging-eventhubs-checkpointstore-blob:1.1.1

But one partition, #3, suddenly stopped receiving messages at 9/11 1:22 UTC. We can see that its checkpoint stopped updating.

image

The outgoing message count dropped accordingly.

image

It recovered at 9/11 5:02 UTC. We can see that the partition #3 checkpoint resumed updating at that time.

image

We checked the sender side and confirmed that messages continued to be sent to Event Hub partition #3 from 9/11 1:22 to 5:02 UTC. However, the logs in our customer code confirm that the receive callback function processContext was not called for partition #3 during this time range.

public EventProcessorClient eventProcessorClientBuilder(
    @Autowired CheckpointStore checkpointStore,
    @Autowired EventHubRecordProcessor eventHubRecordProcessor) {

  return new EventProcessorClientBuilder()
      .connectionString(connectionstring, eventHubName)
      .consumerGroup(consumerGroupName)
      .processEventBatch(
          eventHubRecordProcessor::processContext, batchSize, Duration.ofSeconds(maxWaitTime))
      .processPartitionClose(eventHubRecordProcessor::closeContext)
      .processPartitionInitialization(eventHubRecordProcessor::initContext)
      .processError(eventHubRecordProcessor::errorContext)
      .checkpointStore(checkpointStore)
      .buildEventProcessorClient();
}

public void processContext(EventBatchContext eventContext) {

  List<EventData> eventDataList = eventContext.getEvents();
  int msgListSize = eventDataList.size();
  if (msgListSize == 0) {
    return;
  }
  LOGGER.info("Batch received of size: {}", msgListSize);
  String partitionId = eventContext.getPartitionContext().getPartitionId();
  EventData lastEvent = eventDataList.get(msgListSize - 1);
  long sequenceNo = lastEvent.getSequenceNumber();

  // Log the request
  LOGGER.trace(
      "EventHubRecordProcessor-onEvents msg-list-size, {}, partition-id, {}",
      msgListSize,
      partitionId);

  try {

    // ---> the logs show that this line was not reached for partition #3
    LOGGER.info("[EVENT-HUB] before batch pid {} sequence {}", partitionId, sequenceNo);

    // We checkpoint only if all processing completes without error
    eventDataBatchProcessor.processBatch(eventDataList);

    LOGGER.info("[EVENT-HUB] after batch pid {} sequence {}", partitionId, sequenceNo);

    // Checkpoint at most once per checkpointIntervalMillis for each partition
    long now = System.currentTimeMillis();
    if (now > checkpointTimeMap.getOrDefault(partitionId, 0L)) {
      doCheckpointing(eventContext, partitionId, sequenceNo);
      checkpointTimeMap.put(partitionId, now + checkpointIntervalMillis);
    }

  } catch (Exception e) {
    LoggerUtil.logErrorMessage(
        LOGGER,
        "[DATA-DROPPED-CTS-WRITER] iot-cts-writer-eph-failed attempting retry with Exception : [{}]",
        e);
  }

}
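
For readers following the checkpoint logic in processContext, the per-partition interval check can be isolated as a minimal sketch. The class name CheckpointGate and its method are hypothetical and not part of the attached code or the SDK; the clock value is passed in so the behavior can be checked without waiting. Unlike the original, this sketch schedules the next slot before the caller performs the checkpoint.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the checkpointTimeMap logic in processContext. */
final class CheckpointGate {
    // Earliest time (epoch millis) at which each partition may checkpoint again.
    private final Map<String, Long> nextCheckpointAt = new ConcurrentHashMap<>();
    private final long intervalMillis;

    CheckpointGate(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Returns true when the partition is due and schedules its next slot. */
    boolean shouldCheckpoint(String partitionId, long nowMillis) {
        long due = nextCheckpointAt.getOrDefault(partitionId, 0L);
        if (nowMillis > due) {
            nextCheckpointAt.put(partitionId, nowMillis + intervalMillis);
            return true;
        }
        return false;
    }
}
```

Because the gate is only consulted from inside the batch callback, a partition whose callback is never invoked never reaches it, which matches the stalled checkpoint seen in the screenshots.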

image

We tried updating the SDK to the following latest beta versions, but the issue still exists.

https://mvnrepository.com/artifact/com.azure/azure-messaging-eventhubs/5.2.0-beta.2
https://mvnrepository.com/artifact/com.azure/azure-messaging-eventhubs-checkpointstore-blob/1.2.0-beta.2

Exception or Stack Trace No exception. While messages were being sent to the hub, the receiver callback processContext was not called for the affected partition. The affected partition is random; in the latest reproduction it was partition #3.

To Reproduce Steps to reproduce the behavior: run the attached code for 2-3 days and the issue will reproduce.

Code Snippet I attached the code snippet for reference.

Expected behavior The callback should be called normally for all partitions.

Screenshots See the screenshots in the description.

Setup (please complete the following information):

yuhaii commented 4 years ago

code.zip I attached the code for reference

srnagar commented 4 years ago

We recently released newer versions of the Event Hubs libraries that contain a fix for this issue. Could you please try updating the version and see if you still have this issue?

azure-messaging-eventhubs - 5.2.0
azure-messaging-eventhubs-checkpointstore-blob - 1.2.0

This issue is related to #13785

yuhaii commented 4 years ago

Thanks, @srnagar! We will try this SDK version.

yuhaii commented 4 years ago

Hello @srnagar, before we migrate to the new version, can you confirm whether the issue is fixed? Switching between versions consumes a lot of resource bandwidth and carries a significant additional cost for testing the new version against our quality and load-performance requirements.

I would like to request confirmation on this from the relevant product team, so that we can move ahead with confidence and avoid any unnecessary migration.

vinceve commented 4 years ago

@yuhaii we had similar issues in our setup and for now they seem to be resolved. So the update worked for us.

yuhaii commented 4 years ago

Got it. Thanks for your confirmation, @vinceve.

srnagar commented 4 years ago

Thanks for the confirmation @vinceve! Closing this issue.

vinceve commented 4 years ago

@srnagar last night it stopped working for us. I guess the bug is still present.

Screenshot 2020-10-03 at 08 43 21

The blue line is incoming messages, and the orange line is outgoing messages after a reboot.

I will send you the logs.

pbsf commented 4 years ago

@vinceve, any updates on that? The same issue just occurred in one of my consumers. We are using version 5.2.0.

yuhaii commented 3 years ago

Happy new year, @srnagar. This issue reproduced again on 5.2.0.

We observed that checkpointing for the partition was stuck for a couple of days until we reset it manually. Please find the attached screenshot of the metric.

image

With the old SDK, we could break the lease on that checkpoint file to mitigate the issue. But with the new SDK, the checkpoint file is already unleased, so we have to restart the application. This is our production application. Is there a good workaround if you can't fix this issue immediately? We don't want to restart the production application every time this issue happens.

Could you please help double-check this issue? Thanks in advance.
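
As an illustration only, and not a fix or an official workaround, a stalled partition like the one described above could be detected from inside the application by recording when each partition last delivered a batch. The class PartitionStallDetector and its methods are hypothetical names, not SDK APIs. The sketch assumes recordBatch is called from the batch callback (for example, processContext) and that something else, such as a scheduled task, polls stalledPartitions.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: tracks when each partition last delivered a batch. */
final class PartitionStallDetector {
    private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();
    private final long stallThresholdMillis;

    PartitionStallDetector(Duration stallThreshold) {
        this.stallThresholdMillis = stallThreshold.toMillis();
    }

    /** Call from the batch callback whenever a partition delivers events. */
    void recordBatch(String partitionId, long nowMillis) {
        lastSeenMillis.put(partitionId, nowMillis);
    }

    /** Returns, in sorted order, partitions silent for longer than the threshold. */
    List<String> stalledPartitions(long nowMillis) {
        List<String> stalled = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeenMillis.entrySet()) {
            if (nowMillis - entry.getValue() > stallThresholdMillis) {
                stalled.add(entry.getKey());
            }
        }
        Collections.sort(stalled);
        return stalled;
    }
}
```

Two limitations of this sketch: a partition that has never delivered a batch is not tracked at all, and a partition that is silent because no one is sending to it looks the same as a stalled one, so the result would need to be compared against sender-side metrics before acting on it.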

srnagar commented 3 years ago

@yuhaii as discussed offline, please use version 5.3.1 as it contains a fix for this issue.

yuhaii commented 3 years ago

Understood, we will try v5.3.1. Thanks for your confirmation, @srnagar!

yuhaii commented 3 years ago

Hello @srnagar, good day. Our customer reported that they did load testing with the latest Event Hubs SDK version, and we are still facing a checkpoint-related issue.

compile group: 'com.azure', name: 'azure-messaging-eventhubs', version: '5.4.0'
compile group: 'com.azure', name: 'azure-messaging-eventhubs-checkpointstore-blob', version: '1.4.0'

The checkpoint and ownership blobs are not getting updated. Please find the details below for reference: image

Could you please help us double-check this issue? Thank you.

srnagar commented 3 years ago

@yuhaii could you please share the logs from when this issue happened? This is a different case from the one where partitions stopped receiving events. Here the ownership is not being updated, and we need logs to investigate further.

lovababu commented 3 years ago

We started seeing the same issue with the Java SDK (azure-eventhubs-eph v2.1.0). Has this been addressed in the EPH library too? @srnagar