Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0

[Proposal] Adding control-queue monitor with dedicated orchestrator to each control-queue. #1058

Open pasaini-microsoft opened 3 months ago

pasaini-microsoft commented 3 months ago

Motivation:

Issue: Unpredictable time to detect (TTD) when an orchestration is stuck in a control-queue.

Issue: Manual effort is needed to identify the impacted worker and to carry out mitigation steps.

With both of these systems in place, we are able to run DTF with no orchestrations stuck and without any manual steps for detection and mitigation. To bring this capability to all DTF users, we propose the change below.

Proposals:

Adding control-queue monitor with dedicated orchestrator to each control-queue.

This change consists of 2 main portions:

  1. Addition of an orchestration with exactly one instance for each control-queue.

    • This orchestrator (aka ControlQueueHeartbeatTaskOrchestrator) is very simple and quick to run.
    • It just waits for a configurable interval, logs a heartbeat message, and continues as new.
    • It validates that its context is still correct in case the partition count changes (it assumes the partition count doesn't change; if it does, the orchestrators may be closed off).
    • It lets the user provide a delegate (callback) to run with each heartbeat.
    • This portion reads the partition count and creates instance ids such that one is allotted to each control-queue.
  2. Addition of monitoring for these orchestrations, based on their last processed time.

    • With the instance id for each control-queue known, it checks the last updated time of each orchestration instance and compares it against a threshold.
    • If an orchestration is found to be stuck, an appropriate message is logged.
    • This portion also finds the point-in-time owner of each control-queue, to pinpoint the impacted taskhubworker owning the control-queue.
    • It also lets the user provide a delegate (callback) to run when it detects any anomaly, such as a stuck orchestration, a failure to fetch the owner, a failure to fetch the orchestration instance, or a timeout in any of these.
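
For illustration only, the two portions above can be sketched roughly as follows. This is a language-neutral sketch in Python, not the DTF implementation: every function name here is hypothetical, and it assumes that an instance id is routed to a control-queue by a 32-bit FNV-1a hash of the id modulo the partition count. If the real routing differs, only partition_for would need to change.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

FNV_OFFSET_BASIS = 0x811C9DC5
FNV_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash."""
    h = FNV_OFFSET_BASIS
    for b in data:
        h ^= b
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def partition_for(instance_id: str, partition_count: int) -> int:
    # ASSUMPTION: the control-queue index is hash(instance id) % partition count.
    return fnv1a_32(instance_id.encode("utf-8")) % partition_count


def heartbeat_instance_ids(partition_count: int,
                           prefix: str = "cq-heartbeat",
                           max_attempts: int = 10000) -> Dict[int, str]:
    """Portion 1: find one instance id that lands on each control-queue.

    Tries candidate ids until every partition index has exactly one id.
    """
    ids: Dict[int, str] = {}
    for suffix in range(max_attempts):
        candidate = f"{prefix}-{suffix}"
        # Keep only the first candidate found for each partition.
        ids.setdefault(partition_for(candidate, partition_count), candidate)
        if len(ids) == partition_count:
            return ids
    raise RuntimeError("could not cover every partition within max_attempts")


def find_stuck_queues(last_updated: Dict[int, datetime],
                      now: datetime,
                      threshold: timedelta) -> List[int]:
    """Portion 2: report partitions whose heartbeat instance is stale.

    last_updated maps a partition index to the last updated time of the
    heartbeat orchestration instance assigned to that control-queue.
    """
    return sorted(p for p, ts in last_updated.items() if now - ts > threshold)


if __name__ == "__main__":
    ids = heartbeat_instance_ids(4)
    now = datetime.now(timezone.utc)
    # Pretend partition 2 last reported 30 minutes ago; the rest 1 minute ago.
    seen = {p: now - timedelta(minutes=30 if p == 2 else 1) for p in ids}
    print(ids)
    print(find_stuck_queues(seen, now, timedelta(minutes=10)))
```

In this sketch the monitor only needs the partition count and the last updated times, so it can run anywhere that can read orchestration state and does not depend on the worker that owns the queue being alive.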

davidmrdavid commented 2 months ago

Hi @pasaini-microsoft: Thanks for opening this PR. As we discussed internally, I'll loop in the other DTFx/DF maintainers for discussion regarding the scope of the PR, as well as the design. I'm personally most excited about the ability to create an instanceID targeting a specific partition, so at the very least I'd like to get that in.

It would help greatly if you could update the PR description to contain not just a list of the changes, but also a small motivation and background behind this change. For example, you can explain that you're using this utility in your own app to help you detect stuck orchestrations, and so on. Thanks!

jviau commented 2 months ago

I love the idea of monitoring partition queue processing, but I don't think via an orchestration is the right way to do this. This is a health check after all, and there is a health check ecosystem in .NET. We should evaluate leveraging that. Metrics (#785) may also be a good avenue for this feature.

https://www.nuget.org/packages/Microsoft.Extensions.Diagnostics.HealthChecks/9.0.0-preview.2.24128.4

pasaini-microsoft commented 2 months ago

Thanks jviau. The reason for using an orchestration was to keep the detection as independent from the control-queue processing system as possible. This helps detect a stuck control-queue whatever the cause, including when the taskhubworker is simply absent or when a control-queue is not owned by any worker. This basically helps avoid false negatives.