Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.
MIT License
5.25k stars 4.6k forks source link

[BUG] EventHubListener causes message lost in shutdown #41784

Closed yfujiwara-sansan closed 7 months ago

yfujiwara-sansan commented 7 months ago

Library name and version

Microsoft.Azure.WebJobs.Extensions.EventHubs 6.0.2 and 5.5.0

Describe the bug

EventHubListener.PartitionProcessor progresses the checkpoint even when the application is shutting down (for example, configuration change, scaling, etc.). It causes message lost. Note that I found that this issue is occurred occasionally.

I and my colleague guess that this issue should be fixed #36432, but reintroduced with #38067 as following out investigation results.

Our investigation results

In this situation, while _functionExecutionToken.IsCancellationRequested became true, linkedCts.IsCancellationRequested had not become true. So, the checkpoint was progressed even if the application execution had been cancelled.

By LinkedCancellationTokenSource source code, the following facts were found:

By watching the callstack in the checkpointing, following sequence was occurred:

  1. WebHost calls listener's StopAsync()
  2. The listener calls CancellationTokenSource.Cancel() (source of _functionExecutionToken)
  3. The application cancellation is occurred as continuation of async / await, then awaits in function runtimes are finished as part of registered callback execution. Note that this callback should be occurred before setting linkedCts.IsCancellationRequested to true as described above. So, the checkpoint is progressed because linkedCts.IsCancellationRequested has not been true yet.

Expected behavior

The checkpoint is never progressed when application process is shutting down ( _functionExecutionToken is cancelled).

Actual behavior

The checkpoint is progressed occasionally.

Reproduction Steps

  1. Use event hub trigger with following in local.
  2. After Recieve method started, press Ctrl + C to shutdown process.
  3. The checkpoint should be progressed. You can investigate linkedCts and _functionExecutionToken states with break point in the checkpointing.
[FunctionName("Receive")]
[ExponentialBackoffRetry(5, "00:00:10", "00:10:00")]
public async Task Receive(
    [EventHubTrigger("%EventHubName%", Connection = "EventHubConnectionString")] EventData[] events,
    CancellationToken cancellationToken)
{
    await Task.Delay(TimeSpan.FromMinutes(3), cancellationToken);
}

Environment

jsquire commented 7 months ago

@JoshLove-msft: Would you please advise on whether this is related to the ask on the Functions team to expose whether an execution would be retried if the host is not shutting down? If so, please transfer to the Functions host repo.

jsquire commented 7 months ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

JoshLove-msft commented 7 months ago

@yfujiwara-sansan, I couldn't reproduce having _functionExecutionToken be canceled with linkedCts still not being cancelled. That said, it looks like you are correct that cancellation is not atomic when it comes to linked token sources, so it is possible that one of the sources can be canceled while the linked source is still not canceled. I will make a fix to explicitly check each of the token sources when making the checkpointing decision.

yfujiwara-sansan commented 7 months ago

@JoshLove-msft Thank you for your work! I guess that the repro code sometimes fails to repro due to scheduling of continuations.

By the way, as far as I read #38067 again, it looks that lInkedCts now can be passed to TryExecute bacause it is not lInked to the token which is cancelled In drain mode. I think there is a better option to pass linkedCts to TryExecute, because the app should be cancelled when the load balancer detects ownership lost.

JoshLove-msft commented 7 months ago

I think there is a better option to pass linkedCts to TryExecute, because the app should be cancelled when the load balancer detects ownership lost.

The semantics of the token passed to the function are that it is signaled only when shutting down in a way that is not guaranteed to allow the function to complete execution. In terms of whether it makes sense to also cancel this token when partition ownership is lost, I would have to defer to @jsquire on that.

yfujiwara-sansan commented 7 months ago

I understand the semantics of the token passeed to the app. The cancellation behavior for ownership lost should be discussed as a different issue because it is intentional behavior.

Thank you for your answer! I and my colleagues are waiting for fixking the race condition.

jsquire commented 7 months ago

I think there is a better option to pass linkedCts to TryExecute, because the app should be cancelled when the load balancer detects ownership lost.

The semantics of the token passed to the function are that it is signaled only when shutting down in a way that is not guaranteed to allow the function to complete execution. In terms of whether it makes sense to also cancel this token when partition ownership is lost, I would have to defer to @jsquire on that.

@JoshLove-msft : The token that the processor passes to OnProcessingEventBatch will get canceled when partition ownership is lost. I would assume that we're flowing that token into the Executor so that the Function is notified. Looking over the implementation, it seems that we're not passing that along, for some reason. Any idea why?

JoshLove-msft commented 7 months ago

I don't think there was a good reason - it may have just been an oversight. Updated the PR to pass linkedCts token to TryExecute.

yfujiwara-sansan commented 7 months ago

@JoshLove-msft

Thank you for fix, but let me confirm just in case.

Is it intentional to fix only single dispatch path? It looks following arguments should be fixed, too:

JoshLove-msft commented 7 months ago

You are right again. Apologies for the oversight. I will add these updates in.

JoshLove-msft commented 7 months ago

I think this instance also needs updating - https://github.com/JoshLove-msft/azure-sdk-for-net/blob/4a51f3567f5da030a964494807aa25e1aba888aa/sdk/eventhub/Microsoft.Azure.WebJobs.Extensions.EventHubs/src/Listeners/EventHubListener.PartitionProcessor.cs#L289