Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0

Allow user to specify MaxHistoryEvents and constrain history loading to include the latest generation only #1119

Open davidmrdavid opened 1 week ago

davidmrdavid commented 1 week ago

This PR makes two small (diff-wise) but significant changes to the Azure Storage provider.

New MaxHistoryEvents configuration

First - it introduces a new MaxHistoryEvents configuration, which lets the user specify the maximum number of history events an orchestration may accumulate before it is automatically terminated. Many incidents, and sometimes outages, result from orchestrators with excessively large histories, which create application instability. This configuration should give users peace of mind: even if they accidentally generate a large history, the framework will protect itself and terminate the offending orchestrations. This is an opt-in feature.
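
Here's a minimal sketch of how a user might opt in. This assumes the setting is exposed on `AzureStorageOrchestrationServiceSettings`; the exact property name and placement may differ in the final API:

```csharp
using DurableTask.AzureStorage;
using DurableTask.Core;

var settings = new AzureStorageOrchestrationServiceSettings
{
    StorageConnectionString = "<storage-connection-string>",
    TaskHubName = "MyTaskHub",

    // Opt in: terminate any orchestration whose history grows beyond this
    // many events. The property name follows this PR and may change.
    MaxHistoryEvents = 10_000,
};

var service = new AzureStorageOrchestrationService(settings);
var client = new TaskHubClient(service);
```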

ContinueAsNew bug fix

It also fixes a longstanding bug that was only recently discovered: in many cases, an orchestrator that was "continued as new" will still load fragments of its old history (those that were not already overwritten by the new history). This is because we don't always know the latest executionID of an orchestrator (which is what allows us to distinguish between generations), so we loaded more data than necessary. The end result: if an orchestrator instanceID ever generated a history of 10k records, it did not matter that later generations only generated 1k records; we would always load 10k records (the 1k of the current generation, plus 9k stale records that had not yet been overwritten).

The fix for this bug is simple: we determine the latest executionID by first reading only the sentinel row of the History table, which reports it. Then, using that executionID as an extra filter, we read the history as normal. This extra table read is a new cost to our users, but it is necessary for correctness. I'm hoping the cost is low enough to be 'ok'.
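
To illustrate the two-step read (this is a sketch of the idea using `Azure.Data.Tables`, not the provider's actual code; the `"sentinel"` RowKey and the `ExecutionId` column name should be treated as assumptions here):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Azure.Data.Tables;

async Task<List<TableEntity>> LoadLatestGenerationAsync(
    TableClient historyTable, string instanceId)
{
    // Step 1: read only the sentinel row, which reports the latest executionID.
    TableEntity sentinel = (await historyTable.GetEntityAsync<TableEntity>(
        partitionKey: instanceId, rowKey: "sentinel")).Value;
    string latestExecutionId = sentinel.GetString("ExecutionId");

    // Step 2: read the history as normal, with the executionID as an extra
    // filter so that stale rows from earlier generations are excluded.
    string filter = $"PartitionKey eq '{instanceId}' " +
                    $"and ExecutionId eq '{latestExecutionId}'";

    var events = new List<TableEntity>();
    await foreach (TableEntity row in historyTable.QueryAsync<TableEntity>(filter))
    {
        events.Add(row);
    }
    return events;
}
```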

I did try to make this work without the extra table read: using a paginated query, I read the history table "in parts" and stopped as soon as the executionID of the results changed, which tells us we're now reading stale history. However, stopping early means we never read the sentinel row (the sentinel row is the last row of the table), which DTFx needs for its CheckpointCompletedTimestamp, used to detect out-of-order messages. In the end, I found it simpler and safer to make the extra request, but this is up for debate.

AnatoliB commented 1 week ago

How do we expect the user to choose a good MaxHistoryEvents value? How would they know that their value is not too high (so it doesn't matter) and not too low (so it terminates orchestrations prematurely)? Can we provide any guidance?

An alternative idea to consider: measure time required to read the history, and let the user configure the maximum time.

davidmrdavid commented 1 week ago

@AnatoliB - I also considered measuring time instead, but I think that's a much more unstable statistic: the time to read from Azure Storage varies with network availability, IO resource constraints, throttling from Azure Storage, etc. As a signal, it can swing widely between measurements (one load of a given history might be slow, but fast again a bit later).

So I think measuring the history size provides more predictable behavior, both for users and for ourselves. It also corresponds more closely to the terminology in our best practices doc: keep history sizes small, not "fast".

As for guidance - I think that'll be a matter of documentation. For instance, the IntelliSense could suggest a default (in my experience, anything in the low thousands is fine, ideally below 1k, but apps are generally healthy enough up to ~10k events). I'd be happy to add that here, and in the DF-level PR (which will need to re-export this feature).

AnatoliB commented 1 week ago

> As for guidance - I think that'll be a matter of documentation. For instance, the IntelliSense could suggest a default (in my experience, anything in the low thousands is fine, ideally below 1k, but apps are generally healthy enough up to ~10k events). I'd be happy to add that here, and in the DF-level PR (which will need to re-export this feature).

@davidmrdavid I'm sure we can recommend a number, but my concern is about being confident that using this number is better than having no limit at all. By suggesting any specific number, we can easily be off by orders of magnitude. If the number is significantly higher than what a specific configuration can handle (considering all the factors: I/O performance, memory, etc.), the limit will not matter. If it is significantly lower, it will unnecessarily hurt a potentially healthy app. So, why would the customer want to set any specific number we suggest? What value are they getting from it? Where would their peace of mind come from?

As a customer, I would want to set this limit only if it is connected to a metric that is truly meaningful to me: e.g. time, memory consumption, etc. Based on my use case, I can often claim that, for example, spending more than 5 minutes on loading history feels way too long. We can address timing fluctuations by measuring averages, medians, or percentiles instead of acting on any single measurement.

I'm not strictly against introducing a limit in the proposed form, but I do worry that it's easy to measure but difficult to meaningfully apply. At the very least, we need to provide enough guidance for the user to understand when 1k is better than 10k or 100k, and not just suggest a seemingly arbitrary number.

davidmrdavid commented 6 days ago

I see your point and agree that history size is much less tangible to customers than time. Let me comment on a few of your points to provide more context on how I view this.

> We can address timing fluctuations by measuring averages, medians, or percentiles instead of acting on any single measurement.

Yes, that's true. However, to measure averages/medians/etc., I need somewhere to save multiple history-timing samples per instanceID. Today, we don't have storage for that: we only have the History and Instances tables for long-term storage. In theory, we could augment either table to store this sort of data, but that would come at an extra customer cost (more columns in some table), as well as being a more intrusive change. In general, if we had a good place to store these measurements (so we could calculate averages/medians and other aggregate statistics), I'd be more open to making this configuration work in terms of time.

> At the very least, we need to provide enough guidance for the user to understand when 1k is better than 10k or 100k, and not just suggest a seemingly arbitrary number.

Yeah, I see your point. One possible reason why 1k is particularly good: Azure Storage table queries return at most 1k rows per request, so any history larger than 1k rows requires multiple round trips to load into memory. Perhaps that's a principled reason to give 1k rows special significance? I still believe that ultimately users need to handle this on a case-by-case basis, but Azure Storage already treats 1k rows as special, so perhaps we can rely on that precedent.
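
To make that concrete (an illustrative sketch, with assumed table and column names): counting pages shows that a history of up to 1k rows is a single round trip, and every additional 1k rows costs another:

```csharp
using System.Threading.Tasks;
using Azure.Data.Tables;

// Count the round trips needed to read one instance's history.
async Task<int> CountHistoryPagesAsync(TableClient historyTable, string instanceId)
{
    int pageCount = 0;
    await foreach (var page in historyTable
        .QueryAsync<TableEntity>($"PartitionKey eq '{instanceId}'")
        .AsPages(pageSizeHint: 1000)) // 1,000 is also the service-side maximum
    {
        pageCount++; // each page is one request to Azure Storage
    }
    return pageCount;
}
```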

> So, why would the customer want to set any specific number we suggest?

One way to help them choose a number is to write a detector that uses telemetry to determine which history sizes take too long to load, and have it dynamically suggest a number based on the size of those histories and how long they took to load. We already have a "slow histories" detector, so we could easily augment it to provide personalized guidance.

jviau commented 6 days ago

I think we could avoid the extra query for the execution ID if we create a special ExecuteQueryAsync. Azure Storage has pagination, but we are buffering it all into a list. With a special implementation, we could start the query, grab the first execution ID that appears, then stop loading more pages when the execution ID changes.
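
Something like this, perhaps (a rough sketch of the idea; the helper name is hypothetical and the `ExecutionId` column name is an assumption):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Azure.Data.Tables;

async Task<List<TableEntity>> ExecuteQueryLatestGenerationAsync(
    TableClient historyTable, string instanceId)
{
    var results = new List<TableEntity>();
    string firstExecutionId = null;

    await foreach (TableEntity row in historyTable.QueryAsync<TableEntity>(
        $"PartitionKey eq '{instanceId}'"))
    {
        string executionId = row.GetString("ExecutionId");
        firstExecutionId ??= executionId; // latch onto the first execution ID seen

        // A different execution ID means we've crossed into stale history:
        // stop instead of buffering the rest of the table into the list.
        if (executionId != firstExecutionId)
        {
            break;
        }
        results.Add(row);
    }

    // Caveat from the discussion above: stopping early means the sentinel row
    // (the last row) is never read, so CheckpointCompletedTimestamp would have
    // to come from somewhere else.
    return results;
}
```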