Azure / azure-sdk-for-net


[QUERY] Event Hub EnqueueTime latency #28534

Closed piotrzbyszynski closed 2 years ago

piotrzbyszynski commented 2 years ago

Library name and version

Azure.Messaging.EventHubs 5.6.2

Query/Question

Hi,

I have the following setup:

I am sending messages in consecutive batches (let's say 1,000 batches, one after another) using the SendAsync method. Each batch consists of 3,592 messages. I also log the batch index, the message-creation time, the SendAsync completion time, and the enqueue time of each message.

Now, SendAsync is super fast; it takes only milliseconds to execute. What is strange is the enqueue time. It seems that the messages in each batch are sent to 4 different partitions, and the enqueue time for each partition is very different from the others. For example, 898 messages sent to one partition are enqueued immediately, while the 898 messages sent to each of the other partitions are enqueued after (respectively) 34 seconds, 1 min 7 sec, and 1 min 40 sec. This happens with every single batch.
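For reference, the publishing loop looks roughly like this (a minimal sketch, not our actual code; the connection string, hub name, and payload are placeholders):

```csharp
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Producer;

// Sketch of the publishing loop described above; connection details,
// payloads, and the batch count are placeholders.
await using var producer = new EventHubProducerClient(
    "<connection-string>", "<event-hub-name>");

for (var batchIndex = 0; batchIndex < 1000; batchIndex++)
{
    using EventDataBatch batch = await producer.CreateBatchAsync();

    // Fill the batch until TryAdd reports that it is full.
    while (batch.TryAdd(new EventData(BitConverter.GetBytes(batchIndex))))
    {
    }

    await producer.SendAsync(batch);
    Console.WriteLine($"Batch {batchIndex} sent at {DateTimeOffset.UtcNow:O}");
}
```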

Here are some examples (logs ordered by Batch Index and Enqueue time):

| Batch Index | Message created | After SendAsync | EnqueueTime | Messages Count |
|---|---|---|---|---|
| 0 | 09:18:26.6965485 | 09:18:27.0697804 | 00:00:00.0000000 | 898 |
| 0 | 09:18:26.6965485 | 09:18:27.0697804 | 00:00:34.7122196 | 898 |
| 0 | 09:18:26.6965485 | 09:18:27.0697804 | 00:01:07.3632196 | 898 |
| 0 | 09:18:26.6965485 | 09:18:27.0697804 | 00:01:40.0412196 | 898 |
| 1 | 09:18:27.0704663 | 09:18:27.4179192 | 00:00:00.0000000 | 898 |
| 1 | 09:18:27.0704663 | 09:18:27.4179192 | 00:00:34.3940808 | 898 |
| 1 | 09:18:27.0704663 | 09:18:27.4179192 | 00:01:07.0530808 | 898 |
| 1 | 09:18:27.0704663 | 09:18:27.4179192 | 00:01:39.7370808 | 898 |

And so on. My question is: how can this be explained?

Any help would be highly appreciated


jsquire commented 2 years ago

Hi @piotrzb-tr. Thank you for reaching out. You don't mention how you're reading the enqueue time; since this information is not available when an event is published, I'm going to assume that you're inspecting it when consuming events. Given that events in your batch are being routed to multiple partitions, I'm also going to assume that you're not specifying a partition key or identifier, allowing Event Hubs to assign a partition via round-robin.
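For illustration, the enqueue time would typically be observed on the consumer side along these lines (a sketch; the consumer group and connection details are placeholders):

```csharp
using Azure.Messaging.EventHubs.Consumer;

// Sketch of reading the service-stamped enqueue time while consuming;
// connection details are placeholders.
await using var consumer = new EventHubConsumerClient(
    EventHubConsumerClient.DefaultConsumerGroupName,
    "<connection-string>",
    "<event-hub-name>");

await foreach (PartitionEvent partitionEvent in consumer.ReadEventsAsync())
{
    // EnqueuedTime is set by the service when the partition commits the event.
    Console.WriteLine(
        $"Partition {partitionEvent.Partition.PartitionId} " +
        $"enqueued at {partitionEvent.Data.EnqueuedTime:O}");
}
```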

The SendAsync call from the client communicates with the Event Hubs gateway service when no specific partition was requested during batch creation. When the gateway acknowledges receipt of the batch, the SendAsync call completes; delivery of the batch's events to partitions is now the responsibility of the gateway. At this point, the events are safely held in a durable state on the service and are guaranteed to be delivered, but they are not yet available to read. The gateway does its work and transfers events to the assigned partitions, safely negotiating a transfer of ownership for each. It is not until the partition commits the event that the enqueue time is set; this is the point when the event becomes available to consume.

The length of time needed for communication between the gateway and partitions can vary; each partition may be on a separate host machine and may share that host with other Event Hubs namespaces. I do not know for certain whether the enqueue time is set via a global clock or according to the local time on the partition host, but I suspect the latter. Assuming that I'm correct, some amount of clock skew would also be possible.

ghost commented 2 years ago

Hi @piotrzb-tr. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text “/unresolve” to remove the “issue-addressed” label and continue the conversation.

piotrzbyszynski commented 2 years ago

Hi @jsquire

Thanks a lot for your answer; it sheds a lot of light on the problem.

As for our setup, your assumptions are absolutely correct. We get the enqueue time on the consumer side, and we do not specify a partition id. We did run some tests specifying partition keys, but as I recall it had no effect on the enqueue time.

As for clock skew, I expect the clocks can be skewed by a matter of milliseconds rather than minutes, so I hope that is not the factor here.

Also, what is strange here is that the enqueue-time latency is very consistent (0, ~34 sec, ~1 min 7 sec, and so on) but not partition specific. Sometimes partition 1 has 0 latency, sometimes it has ~1 min 7 sec; it is all over the place with no apparent pattern.

@jsquire, is there any way to improve the enqueue-time latency here (as it hugely influences system-wide latency)? Do you think that assigning each batch to a specific partition using round-robin in the producer code would help in any way? Or should we assume that Event Hubs gives us huge throughput on the producer side, but some system-wide latency is inevitable?

/unresolve

jsquire commented 2 years ago

I'll start with a small disclaimer: I can speak authoritatively only about the client library. Most of the behavior that we're discussing at this point is part of the Event Hubs service. I'm happy to share my understanding, but I'm not privy to the implementation details and may not be fully aware of changes that were made recently.

From the client library perspective, your best bet for reducing latency is to perform partition assignment within the application and publish batches to a specific partition. This bypasses the gateway, reducing contention and load. It also avoids the need for the gateway to negotiate transfer of ownership; the partition may still have work to do after ACK, but it should be less. This also tends to increase the ability to parallelize publishing activities.
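As a sketch, targeting a specific partition at batch creation looks like this (the partition id, payload, and connection details are placeholders):

```csharp
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Producer;

// Sketch of pinning a batch to one partition so the gateway hand-off is
// avoided; the partition id and connection details are placeholders.
await using var producer = new EventHubProducerClient(
    "<connection-string>", "<event-hub-name>");

var options = new CreateBatchOptions { PartitionId = "0" };
using EventDataBatch batch = await producer.CreateBatchAsync(options);

// Fill the batch until TryAdd reports that it is full.
while (batch.TryAdd(new EventData(new byte[] { 0x1 })))
{
}

await producer.SendAsync(batch);
```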

Looking at non-client things, the biggest consideration is that a namespace for a standard SKU is sharing resources. Its partitions may be on different hosts which may have different demands on them. Increasing the SKU level is very likely to help with latency, with the obvious trade-off being additional cost. Premium provides better isolation and predictability; the dedicated tier isolates you completely on your own host. More information can be found here.

That all said, to my knowledge, there is no defined SLA for end-to-end latency within Event Hubs. The service endeavors to provide the fastest throughput possible within its constraints. I would expect some system-wide latency; even in the best case, the service has to guarantee durability with redundant writes, often across availability zones.

piotrzbyszynski commented 2 years ago

@jsquire Thanks again for your help.

As for the SKU, that should not be a problem, as we have a dedicated Event Hubs instance.

We tried the suggested solution and pushed all messages to a specific partition using the CreateBatchOptions.PartitionId property. It worked when we ran the producer locally: enqueue latency was around 0.5 sec, and the batch was not divided into chunks; each batch index had 898 messages.

Unfortunately, after deploying the solution to AKS, the exact same problem reappeared. The batch message count was 5 * 898, and posting the batch to the partition was divided into 5 stages, so to speak, each with around 30 sec of enqueue latency from one another. The only difference was that each stage was published to the same partition. Also, the traffic generated on AKS was much larger than that generated locally.

@jsquire, any idea how this behavior can be explained? Does the amount of traffic play a role here? I would have assumed that using a specified partition would decrease throughput, increase SendAsync execution time, and decrease enqueue latency, but that was not the case.

At this point, as I understand it, Event Hubs does not provide an SLA for system-wide latency and is designed primarily for ingesting huge amounts of data (throughput), not for low latency (please correct me if I am wrong). I just would like to be sure that I understand its behavior.

jsquire commented 2 years ago

I'm not sure that I understand what a batch index is nor the behavior that you're describing in AKS. Are you saying that you're seeing duplication and messages from the batch moving between partitions when sent with a partition id? Or, are you saying that the end-to-end latency for being able to read messages appears to demonstrate that the service is committing messages from the batch in chunks?

The latter seems reasonable given what I know of the service. The former would indicate a problem.

> At this point as I understand, Event Hub does not provide SLA for system wide latency, and is designed primarily for ingesting huge amount of data (throughput) not for low latency (please correct me if I am wrong).

That is my understanding. That said, it's important to mention that the Azure SDK team owns the client libraries, but we do not own the service. I can only offer my thoughts; I shouldn't be considered the authoritative voice for the service's design or behavior. If that's a conversation that you'd like to pursue, your best starting point would be to open an Azure support request so that the right folks from the service team can assist.

piotrzbyszynski commented 2 years ago

@jsquire So we basically create a batch of messages using EventHubProducerClient.CreateBatchAsync, adding as many messages to the batch as possible using the TryAdd method. We assign each batch a distinct index and pass this index to each message inside the batch for debugging purposes. We also log the SendAsync completion time and each message's enqueue time, and we compute producer latency as EnqueueTime - SendAsync completion time.
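The correlation can be sketched as follows (the "batchIndex" property name and the lookup structure are illustrative, not our exact code):

```csharp
using System;
using System.Collections.Generic;
using Azure.Messaging.EventHubs;

// Sketch of the measurement described: the batch index travels with each
// event as an application property so the consumer can look up the logged
// SendAsync completion time for that batch.
static TimeSpan ProducerLatency(
    EventData eventData,
    IReadOnlyDictionary<int, DateTimeOffset> sendCompletedByBatch)
{
    var batchIndex = (int)eventData.Properties["batchIndex"];

    // EnqueuedTime is stamped by the service when the partition commits
    // the event; the difference is the producer-side latency we log.
    return eventData.EnqueuedTime - sendCompletedByBatch[batchIndex];
}
```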

And now, what we see on a producer deployed to AKS vs. run locally: locally, enqueue latency is around 0.5 sec and batches are not divided into chunks; on AKS, the same batches are committed in stages roughly 30 sec apart, even when pinned to a single partition.

Now, what we are trying to understand is whether this is simply how Event Hubs behaves under load, or whether there is anything we can do to reduce the latency.

@jsquire I hope my explanation is clear now. And of course, thank you for your help so far. Please let me know if you are able to suggest any other solution (if not, I will open an Azure support ticket). Also, stating that this is just the way Event Hubs works would also be of great help for us ;) But to be frank, 2.5 min of latency is a little bit extreme IMO ;)

jsquire commented 2 years ago

Hi @piotrzb-tr. Thanks for the additional context. I think I understand the end-to-end scenario that you're describing now. Unfortunately, there's nothing else that I can suggest from the client perspective; you're already following the patterns that we'd recommend for maximizing throughput.

I can't speak to norms on latency, especially with a dedicated tier. Unfortunately, you'll need the support ticket there so that you can be connected with the service team. I'm going to mark this as addressed from the client side. Please feel free to unresolve if you'd like to continue the discussion.


ghost commented 2 years ago

Hi @piotrzb-tr, since you haven’t asked that we “/unresolve” the issue, we’ll close this out. If you believe further discussion is needed, please add a comment “/unresolve” to reopen the issue.