Open ConnorMcMahon opened 3 years ago
This seems really important and relatively urgent, so I wanted to get the ball rolling with some discussion. Do we already have any ideas for what this new, more cost-effective approach might look like?
Currently, the two partition management strategies we have involve many blob transactions (mainly ReadBlob, ReadBlobMetadata, and RenewLease) in order to ensure each orchestration is handled by only one worker at a time. This is exacerbated by the new partition management technique, which is more reliable but involves two blobs for each partition. The problem with the current algorithm is that its cost scales multiplicatively with both the number of partitions and the number of workers.
There is a single blob lease (separate from the app lease) per application, the owner of which is the “coordinator”. Each follower must regularly write a “heartbeat” timestamp to a common source. The coordinator then uses this information to determine how many workers are alive and how to load-balance partitions across them. It has exclusive write access to a source that maps partitions to the alive workers. All workers regularly poll this partition mapping to learn which queues they should be polling.
Table Design
WorkerStatusTable
PartitionMappingTable
Algorithm
Open questions:
Pros:
Cons:
Estimated Costs (very rough!)
Load balance:
1 table read per load balance. (1 read by coordinator)
Up to 1 + p writes per load balance, where p is the partition count. (1 batch write by coordinator + up to p writes by each old lease owner to release lease to new owner)
Up to ~10p reads per load balance (assuming 10 extra polls by target lease owner before they own the lease)
Coordinator Lease
Lease renewal by coordinator (every 30 seconds?)
1 lease acquisition attempt from each other worker (every minute?)
Heartbeats
1 table write per heartbeat
Polling
1 table read per poll
Total Cost Analysis
Assuming each worker does a heartbeat every 30 minutes and a partition poll every 30 seconds, and the coordinator load-balances every 30 seconds, attempts lease acquisition every minute, and renews its lease every 30 seconds, the number of operations per minute is:
4 + 3x + C, where x is the number of workers and C is the number of operations caused by partition moves (at a maximum of 2 + 22p)
Per day, that results in:
2880 + 4320x + C, where C depends on the total number of partition movements in the day (maxing out at 2880 + 31680p)
That means in partition management terms, each additional worker only costs ~$0.00016 per day (~$0.0007 for V2 blobs). Note that on the Consumption and Premium plans, C may be relatively nontrivial, but each partition costs a maximum of $0.001 per day. Fixed costs (mainly coordinator lease renewals) will be on the order of $0.001 to $0.003 a day.
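As a rough sanity check of the per-worker figure, the arithmetic works out if table transactions cost about $0.00036 per 10,000 operations (an assumed V1 table rate, not a figure from this thread):

```python
# Sanity check of the per-worker cost estimate above.
# Assumption: table transactions cost ~$0.00036 per 10,000 operations (V1 rate).
TABLE_OP_PRICE_PER_10K = 0.00036

ops_per_worker_per_day = 4320  # the 4320x term from the per-day formula
cost_per_worker_per_day = ops_per_worker_per_day / 10_000 * TABLE_OP_PRICE_PER_10K

print(f"{cost_per_worker_per_day:.5f}")  # ~0.00016 dollars/day per worker
```

This lands within rounding of the ~$0.00016/day estimate quoted above.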
I like this design, and it seems to lower costs sufficiently. I'd be worried about trying to optimize it further at the risk of adding complexity, as you noted. That said, I'll provide a potential optimization below, in case you think it's worth pursuing.
Just a question: You mentioned that costs would increase if these tables were encrypted but, since these are DF-managed tables, aren't we in control of whether they'd be encrypted? If so, why is this a concern?
And regarding your open questions:
How frequently do keep-alive / partition polling / load balancing operations take place?
My intuition is that having each of these operations take place at the same intervals is likely to introduce race conditions, so I'd prefer to avoid that.
Just to throw some numbers out there, here's what I have in mind.
Operations done by Coordinator
Operations done by Workers
How does the coordinator handle workers that are “possibly dead” from a load balancing perspective?
Perhaps this is overly aggressive, but I think any worker without a heartbeat log in the last minute (i.e., within 2 coordinator reads) can be deemed "dead".
Can we reduce storage transactions by somehow combining WorkerStatus and PartitionMapping table without introducing more complexity?
An alternative (optimized?) schema could be a single table with the following columns:
1. WorkerName (GUID)
2. LastHeartbeat (Date)
3. LeasedPartition (GUID)
4. LeaseInheritor (GUID)
In this case, `LeaseInheritor` corresponds to "targetLeaseOwner".
So, consider the example where we have two workers, `Bob` and `Alice`, and one partition named `A`. One of these two workers is the coordinator; it doesn't matter which of them. If `Bob` owns `A` and is also its "target owner" according to the coordinator, then the table would look like this:
| WorkerName | LastHeartbeat | LeasedPartition | LeaseInheritor |
|---|---|---|---|
| Bob | ... | A | null |
| Alice | ... | null | null |
However, if `A` is load balanced to `Alice`, then the coordinator would first update the table as follows:
| WorkerName | LastHeartbeat | LeasedPartition | LeaseInheritor |
|---|---|---|---|
| Bob | ... | A | Alice |
| Alice | ... | null | null |
Then, once `A`'s in-flight orchestrations are done, `Bob` would update the table as follows:
| WorkerName | LastHeartbeat | LeasedPartition | LeaseInheritor |
|---|---|---|---|
| Bob | ... | null | null |
| Alice | ... | A | null |
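This handoff can be modeled as a small state transition. Here's an illustrative Python sketch (the in-memory dict standing in for the table and the function names are hypothetical, not a real API):

```python
# Illustrative model of the single-table lease handoff described above.
# The in-memory "table" and helper names are hypothetical, not a real API.
table = {
    "Bob":   {"LeasedPartition": "A", "LeaseInheritor": None},
    "Alice": {"LeasedPartition": None, "LeaseInheritor": None},
}

def coordinator_reassign(table, partition, old_owner, new_owner):
    # Step 1: the coordinator records the intended inheritor on the owner's row.
    assert table[old_owner]["LeasedPartition"] == partition
    table[old_owner]["LeaseInheritor"] = new_owner

def owner_release(table, old_owner):
    # Step 2: once in-flight orchestrations drain, the old owner clears its
    # row and ownership passes to the recorded inheritor.
    partition = table[old_owner]["LeasedPartition"]
    inheritor = table[old_owner]["LeaseInheritor"]
    table[old_owner]["LeasedPartition"] = None
    table[old_owner]["LeaseInheritor"] = None
    table[inheritor]["LeasedPartition"] = partition

coordinator_reassign(table, "A", "Bob", "Alice")
owner_release(table, "Bob")
print(table["Alice"]["LeasedPartition"])  # A
```

In the real table, both steps would be ETag-guarded writes so a stale coordinator or worker can't clobber a concurrent update.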
And that's it! For data integrity, I'd say: First, non-coordinator workers can only write to this table so long as they are editing their own `WorkerName` row, or the `LeasedPartition` cell of a row where they are the `LeaseInheritor`. Second, coordinators can make any changes they want.
Let me know what you think!
@ConnorMcMahon @davidmrdavid Wondering if there are any updates on this item?
Hi @gorillapower, I believe this hasn't been prioritized yet. I'll add it as a discussion item for our next team meeting. Also, Connor is no longer working on the Durable Functions project, so just a heads up that they will most likely not be responding to pings on GitHub about this project :)
I'd like to propose an alternate design, which I think might be simpler. It uses table storage as Connor recommended but involves only one table. Let's call it `{TaskHub}Partitions`. It has the following columns:
Name | Type | Description |
---|---|---|
RowKey | String | The name of the control queue |
PartitionKey | String | Empty string |
CurrentOwner | String | The name of the worker that owns this partition |
NextOwner | String | The name of the worker that is stealing the partition or null if nobody is trying to steal it |
OwnedSince | DateTime | The date/time at which the partition was originally acquired |
LastRenewal | DateTime | The date/time at which the partition was last renewed |
ExpiresAt | DateTime | The date/time at which the partition lease expires |
There is 1 row for each partition, so if there are 4 partitions (the default) then there will be 4 rows. When the task hub is created, these rows will be created with empty values, except for the `RowKey`, which will have the partition names.
In a background loop, each worker does the following in intervals of 5 seconds (configurable):

1. Query the rows of the table.
2. If a partition has no `CurrentOwner`, the worker takes it (a worker can take all of them if available).
3. If a partition's lease has expired (based on the `ExpiresAt` column), the worker takes it (can take all if available).
4. To rebalance, a worker writes its name to the `NextOwner` column of one or more rows to gracefully "steal" the lease.
5. If a worker sees a `NextOwner` on a row it owns, it drains the partition and then releases it by setting `CurrentOwner` to null.
6. Otherwise, the owner updates the `ExpiresAt` time to renew the lease.

I've left out some details, like updating `OwnedSince` and `LastRenewal`, for brevity, but those should also be updated at the appropriate times. I've also left out details around specific time intervals for lease renewal, expiration, etc. Presumably we can reuse the configuration we already have.
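The loop could be sketched roughly like this in Python (an illustrative model, not the actual implementation: the dict-based rows, the helper name, and the lease duration are assumptions):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lease duration; real code would reuse the existing config.
LEASE_DURATION = timedelta(seconds=30)

def partition_manager_loop(me, rows, now=None):
    """One iteration of the worker's 5-second loop over the partition rows.

    `rows` stands in for the {TaskHub}Partitions table: a list of dicts with
    CurrentOwner / NextOwner / ExpiresAt keys. Real code would use the Azure
    Tables SDK with ETag-guarded updates; the load-balancing decision of when
    to set NextOwner is omitted for brevity.
    """
    now = now or datetime.now(timezone.utc)
    for row in rows:
        expired = row["ExpiresAt"] is not None and row["ExpiresAt"] <= now
        if row["CurrentOwner"] is None or expired:
            # Unowned or expired lease: take it (a worker may take them all).
            row["CurrentOwner"], row["NextOwner"] = me, None
            row["ExpiresAt"] = now + LEASE_DURATION
        elif row["CurrentOwner"] == me:
            if row["NextOwner"]:
                # Another worker is stealing: drain in-flight work, then
                # release by setting CurrentOwner to null.
                row["CurrentOwner"] = None
            else:
                # No steal pending: renew our lease.
                row["ExpiresAt"] = now + LEASE_DURATION
    return rows
```

In the real table each branch would be a conditional (ETag-guarded) update, so two workers racing for the same row can't both win.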
Some of the details could be optimized. For example, it might not be necessary to query the full table each time. We could potentially just query a subset based on some of the conditions mentioned above. However, I suggest we start by just querying all of them since the number is limited (no more than 16) and this likely keeps the code simple, mitigating the risk of query bugs.
I haven't done the cost analysis yet, but during steady state, the number of table operations per interval would be some factor of M * N where M is the total number of workers and N is the number of partitions. Writes would occur for each lease renewal and each assignment.
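To put rough numbers on that M * N factor (the interval and per-operation accounting here are assumptions, not from the analysis above):

```python
# Rough steady-state transaction estimate for the table-based design.
# Assumptions: each worker issues 1 table query per interval, and each
# owned partition costs 1 write per interval for lease renewal.
def ops_per_day(workers, partitions, interval_seconds=5):
    intervals = 86_400 / interval_seconds  # iterations of the loop per day
    reads = workers * intervals            # each worker scans the table
    writes = partitions * intervals        # one renewal write per partition
    return reads + writes

print(ops_per_day(10, 4))  # 241920.0
```

Even at a few hundred thousand table operations per day, this is pennies at table-storage rates, and crucially it avoids the per-partition blob transactions entirely.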
The thing I like about this design is that it's dramatically simpler than what we have now. It's also easier to debug since you can easily read the state of partition assignments in a single table. Of course, it's also much cheaper since we don't have to do any blob scanning.
Great! This looks much simpler and I suspect it will be much cheaper. Is there a concern about the workers racing to update the table and hitting etag failures? I guess once a worker has a lease, it will hold onto it for a while (until it's stolen or scale-in happens), so maybe it's less of a concern.
Makes me think, would this mean we could go beyond the 32 partitions we have available today?
Really like the simplicity of this new load balancing algo!
@cgillum do you know if this will be part of any of the teams upcoming work?
Update: we're starting on this work now! We currently estimate it taking 3-4 months to complete, but always take estimates with a huge grain of salt.
After investigating our relatively high blob operation costs for our Azure Functions on the Consumption plan, we found this issue and hope we can benefit from it soon. Could we get a short update on the ETA and the migration path needed for our Azure Functions? This would help me plan our work and update my stakeholders.
@Tealons ETA is still roughly the same - we're targeting June-July for this being released. The PR tracking this work is here. We're currently planning to have this be a transparent update - i.e., all apps with the Azure Storage backend will have this new partition manager by default.
/FYI @nytiannn
Thanks for the update! Looking forward to the results in production!
Any update about the rollout for consumption based functions? I still see very high storage costs for my functions in comparison to the computational costs.
@Tealons Hi, I am sorry about this. Are you using Partition Manager V3 with the Consumption plan?
For now, we recommend applying this new feature only to Dedicated/Premium plans. For the Consumption plan, the feature will be enabled once we update the scale controller repo; the ETA for that is soon.
@nytian No. Would it be possible to test it via the Partition Manager V3 setting? Otherwise we will wait until the scale controller has been updated. I'm just hoping this will be soon. Currently the costs of our storage accounts are 2500 times more than our compute costs of the consumption functions. It completely defies the purpose of consumption based functions :(
> Currently the costs of our storage accounts are 2500 times more than our compute costs of the consumption functions.
@Tealons can you tell us a bit more about why your storage bill is so high? Is there a breakdown that tells you where most of the costs are coming from? The expectation is that most of the savings are in the form of reduced background blob storage transactions.
Also, are you using a legacy V1 general purpose storage account or a V2 general purpose storage account? V2 storage accounts are significantly more expensive when using Durable Functions, which is why we strongly recommend using the legacy V1 accounts (for now) until the new table partition manager is GA.
@cgillum I discussed with my team which example we should post here. The 2500 times example is on the extreme end of our functions, so we decided to post a more common scenario here. Please note that although the shown costs are not very high, we have a single tenant architecture per customer, so there are many more instances that have this problem, making the costs a significant part of our Azure spend.
In this example we have LRS enabled on the blob storage. We also have GRS redundancy with some customers, which makes the storage costs about 20% higher. Because of the GRS option we use V2 storage accounts.
Our functions app is mainly used for performing background jobs. We selected the functions consumption plan because the usage patterns of customers show high variation. Because of the single tenant architecture the consumption plan was the perfect cost/performance ratio for us.
In the example below, the function executes a small task every 5 minutes and some longer/larger tasks every hour. As you can see, the compute costs are very low compared to the storage costs. Upon investigation we found that the Durable Functions blob operations are the main cause of the blob costs.
That was our conclusion too, we found that at idle there were a lot of queue operations. After migrating to V1 storage account our costs decreased by more than 50%.
We did some more in-depth analysis and see the same scenario that @karol-gro reported in his following comment: https://github.com/Azure/azure-functions-durable-extension/issues/1651#issuecomment-898435338
> Creating a function app with an OrchestrationTrigger will give you a $10 bill regardless of the activity of the function app.
Was there any cost benchmark done for the new partition manager? I would expect that an Azure Function with less than 1 cent of compute costs would also have at most several cents of costs for the associated storage account.
@Tealons yes, we did benchmarking and observed a minor cost decrease for storage V1 and a much bigger cost decrease for storage V2. What kind of storage accounts are you using in your measurement? Also, how many partitions do you have? And can you confirm whether you’re using the new V3 partition manager? What’s the breakdown of your $10 storage bill? Also, what SKU is your app running on (Consumption, Elastic Premium, an App Service Plan)?
@cgillum: Please see the image in my previous comment for the breakdown of $10 storage bill (the image was not the full month yet). We use V2 storage accounts, Consumption plan based functions and the default partition count of 4.
Could this cost be coming from the consumption scale controller monitoring the app, @cgillum?
Our current blob-lease-per-partition approach (or 2 blobs per partition in the non-legacy approach) means that blob costs scale multiplicatively with both the number of partitions and the number of TaskHubWorkers. Even a relatively idle app with the default 4 partitions costs ~.15 cents per day on a V1 storage account, and with V2 storage accounts increasing the cost of transactions by an order of magnitude, this makes V2 storage accounts cost-prohibitive for many customers.
With some networking features being tied to V2 storage accounts only, it would be great to have a partition management strategy that is more cost effective.
Ideally this partition management algorithm would have the following properties: