Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0

Replace DurableTask.AzureStorage partition management with more cost effective approach for V2 storage accounts #555

Open ConnorMcMahon opened 3 years ago

ConnorMcMahon commented 3 years ago

Our current blob-lease-per-partition approach (or two blobs per partition in the non-legacy approach) means that blob costs scale multiplicatively with both the number of partitions and the number of TaskHubWorkers. Even a relatively idle app with the default 4 partitions costs ~.15 cents per day on a V1 storage account, and with V2 storage accounts increasing transaction costs by an order of magnitude, V2 storage accounts become cost prohibitive for many customers.

Since some networking features are tied to V2 storage accounts only, it would be great to have a more cost-effective partition management strategy.

Ideally this partition management algorithm would have the following properties:

davidmrdavid commented 3 years ago

This seems really important and relatively urgent, so I wanted to get the ball rolling with some discussion. Do we already have any ideas for what this new, more cost-effective approach might look like?

ConnorMcMahon commented 3 years ago

Problem

Currently, our two partition management strategies involve many blob transactions (mainly ReadBlob, ReadBlobMetadata, and RenewLease) to ensure that each orchestration is handled by only one worker at a time. This is exacerbated by the new partition management technique, which is more reliable but uses two blobs per partition. The problem with the current algorithm is that it scales by a factor of the following:

Proposal 1 (Coordinator pattern)

There is a single blob lease (separate from the app lease) per application; the owner of this lease is the "coordinator". Every follower must regularly write a "heartbeat" timestamp to a common source. The coordinator uses this information to determine how many workers are alive and how to load balance partitions across them. It has exclusive write access to a source that maps target partitions to the alive workers. All workers regularly poll this partition mapping to learn which queues they should be polling.

Table Design

WorkerStatusTable

  1. WorkerName - Unique name of worker
  2. LastHeartbeat - Timestamp of last heartbeat ping from worker

PartitionMappingTable

  1. PartitionId - Unique PartitionId
  2. CurrentLeaseOwner - Worker name of current lease holder
  3. TargetLeaseOwner - Worker name of ideal lease holder

Algorithm

  1. Every 30(?) seconds, each worker updates the LastHeartbeat value in its WorkerStatusTable row.
  2. Every 30(?) seconds, the coordinator checks which workers are still alive and load balances partitions across them by setting the TargetLeaseOwner column in the PartitionMappingTable. Only the coordinator edits this column.
  3. Every 30(?) seconds, workers poll the PartitionMappingTable (a rough sketch of this worker loop follows the list).
     a. If a worker is the current lease owner but no longer the target lease owner, it stops processing the control queue. Once it has finished processing all in-flight orchestration executions, it sets CurrentLeaseOwner to the target lease owner.
     b. If a worker is not the current lease owner but is the target lease owner, it starts polling this table more aggressively (every 5 seconds?). Once it is both the CurrentLeaseOwner and the TargetLeaseOwner, it can begin processing the control queue.
  4. Every 30 seconds (or maybe just when performing the load balancing?), the coordinator performs various cleanup operations. Mainly this means purging the WorkerStatusTable of workers we are fairly certain are dead (LastHeartbeat is older than the ping frequency plus a buffer). When it detects a dead worker, the coordinator also determines whether that worker was the CurrentLeaseOwner of any leases and, if so, reassigns CurrentLeaseOwner to the TargetLeaseOwner.
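
A minimal sketch of the worker side of this algorithm (steps 1 and 3), assuming the Azure.Data.Tables SDK. The table and column names follow the proposal above; the keying choices, timings, and the control-queue drain helper are illustrative assumptions, not part of the original proposal:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Data.Tables;

public class WorkerLoopSketch
{
    private readonly TableClient workerStatus;
    private readonly TableClient partitionMapping;
    private readonly string workerName;

    public WorkerLoopSketch(string connectionString, string workerName)
    {
        this.workerStatus = new TableClient(connectionString, "WorkerStatus");
        this.partitionMapping = new TableClient(connectionString, "PartitionMapping");
        this.workerName = workerName;
    }

    public async Task RunAsync(CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            // Step 1: heartbeat. Upsert this worker's row with a fresh timestamp.
            var heartbeat = new TableEntity(partitionKey: "", rowKey: this.workerName)
            {
                ["LastHeartbeat"] = DateTimeOffset.UtcNow,
            };
            await this.workerStatus.UpsertEntityAsync(heartbeat);

            // Step 3: poll the partition mapping to see which partitions this
            // worker should be draining or picking up.
            await foreach (TableEntity partition in this.partitionMapping.QueryAsync<TableEntity>())
            {
                string current = partition.GetString("CurrentLeaseOwner");
                string target = partition.GetString("TargetLeaseOwner");

                if (current == this.workerName && target != this.workerName)
                {
                    // Step 3a: the lease is moving away. Stop polling the control
                    // queue, drain in-flight orchestrations, then hand the lease over.
                    // await DrainControlQueueAsync(partition.RowKey); // hypothetical helper
                    partition["CurrentLeaseOwner"] = target;
                    await this.partitionMapping.UpdateEntityAsync(partition, partition.ETag);
                }
                else if (current != this.workerName && target == this.workerName)
                {
                    // Step 3b: the lease is moving to this worker. Poll more
                    // aggressively until CurrentLeaseOwner matches, then start
                    // processing the control queue (not shown).
                }
            }

            await Task.Delay(TimeSpan.FromSeconds(30), token);
        }
    }
}
```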

Open questions:

  1. How frequently do keep-alive / partition polling / load balancing operations take place?
  2. How does the coordinator handle workers that are “possibly dead” from a load balancing perspective?
  3. Can we reduce storage transactions by somehow combining WorkerStatus and PartitionMapping table without introducing more complexity?

Pros:

  1. Fixed number of operations per partition.
  2. Table operations are cheap on V2 storage accounts ($.00036 per 10,000 operations unless the tables are encrypted, versus $.0019 per 10,000 read/lease operations and $.0228 per 10,000 write operations for V2 blobs), and the costs are the same on V1 storage accounts if the tables are not encrypted.

Cons:

  1. We are essentially writing our own leasing mechanism rather than using the built-in one, so it is inherently more complex and error prone.
  2. Table operations can become far more expensive if the tables are encrypted.

Estimated Costs (very rough!)

Load balance:

  1. 1 table read per load balance (1 read by the coordinator).
  2. Up to 1 + p writes per load balance, where p is the partition count (1 batch write by the coordinator + up to p writes by old lease owners to release their leases to the new owners).
  3. Up to ~10p reads per load balance (assuming 10 extra polls by a target lease owner before it owns the lease).

Coordinator lease:

  1. Lease renewal by the coordinator (every 30 seconds?).
  2. 1 lease acquisition attempt from each other worker (every minute?).

Heartbeats: 1 table write per heartbeat.

Polling: 1 table read per poll.

Total cost analysis

Assuming each worker does a heartbeat every 30 minutes and a partition poll every 30 seconds, and the coordinator load balances every 30 seconds, with coordinator-lease acquisition attempts every minute and coordinator-lease renewal every 30 seconds, the number of operations per minute = 4 + 3x + C, where x is the number of workers and C is the number of operations caused by partition moves (at a maximum of 2 + 22p).

Per day, that results in 2880 + 4320x + C operations, where C depends on the total number of partition movements in the day (maxing out at 2880 + 31680*p).

That means, in partition management terms, each additional worker only costs ~$0.00016 per day ($.0007 with V2 blobs). Note that on Consumption and Premium plans C may be relatively nontrivial, but each partition costs a maximum of $.001 per day. Fixed costs (mainly coordinator lease renewals) will be on the order of $.001 to $.003 per day.
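
To make the per-day formula concrete (this arithmetic is an editor's illustration using the rates quoted above, not part of the original comment): with x = 3 workers and no partition movement (C = 0), that is 2880 + 4320 * 3 = 15,840 table operations per day, which at $.00036 per 10,000 operations is roughly $0.0006 per day; the per-worker share, 4320 * $.00036 / 10,000, is where the ~$0.00016 figure comes from.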

davidmrdavid commented 3 years ago

I like this design, and it seems to lower costs sufficiently. I'd be worried about trying to optimize it further at the risk of adding complexity, as you noted. That said, I'll provide a potential optimization below, in case you think it's worth pursuing.

Just a question: You mentioned that costs would increase if these tables were encrypted but, since these are DF-managed tables, aren't we in control of whether they'd be encrypted? If so, why is this a concern?

And regarding your open questions:

How frequent do keep alive / partition polling / load balancing operations take place?

My intuition is that having each of these operations take place at the same frequency intervals is likely to introduce race conditions, so I'd prefer to avoid that.

Just to throw some numbers out there, here's what I have in mind.

Operations done by Coordinator

Operations done by Workers

How does the coordinator handle workers that are “possibly dead” from a load balancing perspective?

Perhaps this is overly aggressive, but I think any worker without a heartbeat in the last minute (i.e., missed in two coordinator reads) can be deemed "dead".
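
A minimal sketch of that liveness check, assuming the Azure.Data.Tables SDK for reading the worker row; the one-minute threshold is the value suggested above and would presumably be configurable:

```csharp
using System;
using Azure.Data.Tables;

static class WorkerLiveness
{
    // A worker is considered dead if its last heartbeat is older than the threshold
    // (one minute here, i.e., missed by two 30-second coordinator reads).
    public static bool IsDead(TableEntity workerRow, TimeSpan threshold)
    {
        DateTimeOffset? lastHeartbeat = workerRow.GetDateTimeOffset("LastHeartbeat");
        return lastHeartbeat == null ||
               DateTimeOffset.UtcNow - lastHeartbeat.Value > threshold;
    }
}

// e.g. WorkerLiveness.IsDead(row, TimeSpan.FromMinutes(1))
```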

Can we reduce storage transactions by somehow combining WorkerStatus and PartitionMapping table without introducing more complexity?

An alternative (optimized?) schema could be a single table with the following columns:

  1. WorkerName (GUID)
  2. LastHeartbeat (Date)
  3. LeasedPartition (GUID)
  4. LeaseInheritor (GUID)

In this case, LeaseInheritor corresponds to the TargetLeaseOwner column above.

So, consider the example where we have two workers, Bob and Alice, and one partition named A. One of these two workers is the coordinator; it doesn't matter which.

If Bob owns A and is also its "target owner" according to the Coordinator, then the table would look like this.

| WorkerName | LastHeartbeat | LeasedPartition | LeaseInheritor |
| --- | --- | --- | --- |
| Bob | ... | A | null |
| Alice | ... | null | null |

However, if A is load balanced to Alice, then the coordinator would first update the table as follows:

| WorkerName | LastHeartbeat | LeasedPartition | LeaseInheritor |
| --- | --- | --- | --- |
| Bob | ... | A | Alice |
| Alice | ... | null | null |

Then, once Bob is done with its in-flight orchestrations for A, Bob would update the table as follows:

| WorkerName | LastHeartbeat | LeasedPartition | LeaseInheritor |
| --- | --- | --- | --- |
| Bob | ... | null | null |
| Alice | ... | A | null |

And that's it! For data integrity, I'd say

First, that non-coordinator workers can only write to this table so long as:

Second, coordinators can make any changes they want.

Let me know what you think!

gorillapower commented 2 years ago

@ConnorMcMahon @davidmrdavid Wondering if there are any updates on this item?

davidmrdavid commented 2 years ago

Hi @gorillapower, I believe this hasn't been prioritized yet. I'll add it as a discussion item for our next team meeting. Also, Connor is no longer working on the Durable Functions project, so just a heads up that they will most likely not be responding to pings on GitHub about this project :)

cgillum commented 2 years ago

I'd like to propose an alternate design, which I think might be simpler. It uses table storage as Connor recommended but involves only one table. Let's call it {TaskHub}Partitions. It has the following columns:

| Name | Type | Description |
| --- | --- | --- |
| RowKey | String | The name of the control queue |
| PartitionKey | String | Empty string |
| CurrentOwner | String | The name of the worker that owns this partition |
| NextOwner | String | The name of the worker that is stealing the partition, or null if nobody is trying to steal it |
| OwnedSince | DateTime | The date/time at which the partition was originally acquired |
| LastRenewal | DateTime | The date/time at which the partition was last renewed |
| ExpiresAt | DateTime | The date/time at which the partition lease expires |

There is 1 row for each partition, so if there are 4 partitions (the default) then there will be 4 rows. When the task hub is created, these rows will be created with empty values, except for the RowKey which will have the partition names.

In a background loop, each worker does the following at intervals of 5 seconds (configurable); a rough sketch of one iteration follows the list:

  1. Read the entire contents of the table (e.g., the 4 rows) and make a note of the eTag value.
  2. If any partition has a null CurrentOwner, the worker takes it (a worker can take all of them if available).
  3. Otherwise, if any partition has an expired lease (using the ExpiresAt column), the worker takes it (can take all if available).
  4. Otherwise, if there's an imbalance in assignments (we'll define this later), the worker can add itself as the NextOwner on one or more rows to gracefully "steal" those leases.
  5. If the worker owns any "stolen" leases, it removes itself by setting CurrentOwner to null.
  6. If the worker owns any leases that haven't been stolen, it updates the ExpiresAt time to renew the lease.
  7. If any changes are made, the updates are saved back to the table as a batch, using the eTag to prevent two writes from happening concurrently.
  8. If any write fails, delay for a random number of seconds in addition to the normal renewal time to reduce the chance of additional conflicts.

I've left out some details for brevity, like updating OwnedSince and LastRenewal, but those should also be updated at the appropriate times. I've also left out details around specific time intervals for lease renewal, expiration, etc. Presumably we can reuse the configuration we already have.

Some of the details could be optimized. For example, it might not be necessary to query the full table each time. We could potentially just query a subset based on some of the conditions mentioned above. However, I suggest we start by just querying all of them since the number is limited (no more than 16) and this likely keeps the code simple, mitigating the risk of query bugs.

I haven't done the cost analysis yet, but during steady state, the number of table operations per interval would be some factor of M * N where M is the total number of workers and N is the number of partitions. Writes would occur for each lease renewal and each assignment.

The thing I like about this design is that it's dramatically simpler than what we have now. It's also easier to debug since you can easily read the state of partition assignments in a single table. Of course, it's also much cheaper since we don't have to do any blob scanning.

gorillapower commented 2 years ago

Great! This looks much simpler, and I suspect it will be much cheaper. Is there a concern about the workers racing to update the table and hitting eTag failures? I guess once a worker has a lease, it will hold on to it for a while (until it is stolen or the app scales in), so maybe it's less of a concern.

Makes me think, would this mean we could go beyond the 32 partitions we have available today?

olitomlinson commented 2 years ago

Really like the simplicity of this new load balancing algo!

gorillapower commented 2 years ago

@cgillum do you know if this will be part of any of the team's upcoming work?

cgillum commented 1 year ago

Update: we're starting on this work now! We currently estimate it taking 3-4 months to complete, but always take estimates with a huge grain of salt.

Tealons commented 1 year ago

While investigating the relatively high blob operation costs of our Azure Functions on the Consumption plan, we found this issue and hope we can benefit from it soon. Could you share a short update on the ETA and the migration path needed for our Azure Functions? This would help me plan our work and update my stakeholders.

cgillum commented 1 year ago

@Tealons ETA is still roughly the same - we're targeting June-July for this being released. The PR tracking this work is here. We're currently planning to have this be a transparent update - i.e., all apps with the Azure Storage backend will have this new partition manager by default.

/FYI @nytiannn

nytian commented 1 year ago

[heart] Naiyuan Tian reacted to the message above.



Tealons commented 1 year ago

@Tealons ETA is still roughly the same - we're targeting June-July for this being released. The PR tracking this work is here. We're currently planning to have this be a transparent update - i.e., all apps with the Azure Storage backend will have this new partition manager by default.

/FYI @nytiannn

Thanks for the update! Looking forward to the results in production!

Tealons commented 1 year ago

Any update about the rollout for consumption based functions? I still see very high storage costs for my functions in comparison to the computational costs.

nytian commented 1 year ago

@Tealons Hi, I am sorry for this. Are you using the Partition Manager V3 with the consumption plan?

For now, we recommend applying this new feature to dedicated/premium plans. For the Consumption plan, the feature will be enabled once we update the scale controller repo; the ETA for that is soon.

Tealons commented 1 year ago

@nytian No. Would it be possible to test it via the Partition Manager V3 setting? Otherwise we will wait until the scale controller has been updated; I'm just hoping this will be soon. Currently the costs of our storage accounts are 2500 times higher than the compute costs of our consumption functions. It completely defeats the purpose of consumption-based functions :(

cgillum commented 1 year ago

Currently the costs of our storage accounts are 2500 times more than our compute costs of the consumption functions.

@Tealons can you tell us a bit more about why your storage bill is so high? Is there a breakdown that tells you where most of the costs are coming from? The expectation is that most of the savings are in the form of reduced background blob storage transactions.

Also, are you using a legacy V1 general purpose storage account or a V2 general purpose storage account? V2 storage accounts are significantly more expensive when using Durable Functions, which is why we strongly recommend using the legacy V1 accounts (for now) until the new table partition manager is GA.

Tealons commented 1 year ago

@cgillum I discussed with my team which example we should post here. The 2500 times example is on the extreme end of our functions, so we decided to post a more common scenario here. Please note that although the shown costs are not very high, we have a single tenant architecture per customer, so there are many more instances that have this problem, making the costs a significant part of our Azure spend.

In this example we have LRS enabled on the blob storage. We also have GRS redundancy with some customers, which makes the storage costs about 20% higher. Because of the GRS option we use V2 storage accounts.

Our function app is mainly used for background jobs. We selected the Functions Consumption plan because customer usage patterns vary widely; combined with the single-tenant architecture, the Consumption plan offered the ideal cost/performance ratio for us.

In the example below, the function executes a small task every 5 minutes and some longer/larger tasks every hour. As you can see, the compute costs are very low compared to the storage costs. Upon investigation, we found that the Durable Functions blob operations are the main cause of the blob costs.

[screenshot: cost breakdown showing the storage account costs far exceeding the function's compute costs]

gorillapower commented 1 year ago

That was our conclusion too: we found that at idle there were a lot of queue operations. After migrating to a V1 storage account, our costs decreased by more than 50%.

Tealons commented 1 year ago

We did some more in-depth analysis and see the same scenario that @karol-gro reported in his following comment: https://github.com/Azure/azure-functions-durable-extension/issues/1651#issuecomment-898435338

Creating a function app with an OrchestrationTrigger will give you a $10 bill regardless of the activity of the function app.

Was any cost benchmarking done for the new partition manager? I would expect that an Azure Function with less than a cent of compute costs would also incur at most a few cents in costs for the associated storage account.

cgillum commented 1 year ago

@Tealons yes, we did benchmarking and observed a minor cost decrease for storage V1 and a much bigger cost decrease for storage V2. What kind of storage accounts are you using in your measurement? Also, how many partitions do you have? And can you confirm whether you’re using the new V3 partition manager? What’s the breakdown of your $10 storage bill? Also, what SKU is your app running on (Consumption, Elastic Premium, an App Service Plan)?

Tealons commented 1 year ago

@cgillum: Please see the image in my previous comment for the breakdown of the $10 storage bill (the image does not cover the full month yet). We use V2 storage accounts, Consumption plan functions, and the default partition count of 4.

davidmrdavid commented 1 year ago

Could this cost be coming from the consumption scale controller monitoring the app, @cgillum?