Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

New partition management strategy defaulted to in v2.3.1 increases blob costs #1651

Open olitomlinson opened 3 years ago

olitomlinson commented 3 years ago

I've been looking at the costs for running my Durable Function App.

The chart below is a daily breakdown of the storage costs for the Storage account that the Task Hub runs against.

image

I was surprised to see that pretty much all of the cost is attributed to the 'blob' tier.

Would this imply that my orchestration state is frequently exceeding the size that will fit in the usual Instances Table row and is being serialized into the *-largemessages blob storage instead?

Many thanks!

cgillum commented 3 years ago

I think it’s more likely because of the continuous blob lease polling and renewal that happens in the background to ensure partitions are balanced.
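As an aside for readers following along: the polling described here looks roughly like the loop below. This is only a sketch using the current Azure.Storage.Blobs SDK, not the extension's actual implementation; the container name, blob name, and intervals are assumptions, and the sketch assumes the lease blob already exists.

```csharp
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

class LeasePollingSketch
{
    // Each acquire/renew attempt is a billable blob transaction,
    // even when the app has no orchestration work to do.
    static async Task MaintainPartitionLeaseAsync(string connectionString)
    {
        var container = new BlobContainerClient(connectionString, "mytaskhub-leases"); // hypothetical container
        var leaseClient = container.GetBlobClient("partition-00").GetBlobLeaseClient(); // hypothetical blob

        bool ownsLease = false;
        while (true)
        {
            try
            {
                if (!ownsLease)
                {
                    // Try to take ownership of the partition.
                    await leaseClient.AcquireAsync(TimeSpan.FromSeconds(60));
                    ownsLease = true;
                }
                else
                {
                    // Keep ownership alive; this renewal repeats all day long.
                    await leaseClient.RenewAsync();
                }
            }
            catch (RequestFailedException)
            {
                // Another worker holds the lease; retry on the next poll.
                ownsLease = false;
            }

            await Task.Delay(TimeSpan.FromSeconds(30)); // assumed polling interval
        }
    }
}
```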

olitomlinson commented 3 years ago

@cgillum Thanks for the response!

So that particular day had a total storage cost of £4.27

Are you suggesting that £3.83 of that total cost is not actually due to load?

I ask because our storage costs have gone up significantly, which is in line with running many more orchestrations each day, and these blob fees, as highlighted above, are the reason why.

I'm just trying to understand whether this £3.83 for partition balancing will plateau and essentially become a rough new fixed daily cost at the scale we are at, or whether the partition-balancing cost will also grow as our traffic/workload increases and we run more orchestrations daily?

For context the app is running

olitomlinson commented 3 years ago

I've just checked when I released 2.3.1 to prod (which included the new partition management strategy): it was the 29th of October. You can see a clear correlation with the cost increase at the same time.

Is this in line with what you might anticipate with the new partition strategy? The graph below is the combined cost of 4 task hubs which were upgraded to 2.3.1 at the same time.

image

ConnorMcMahon commented 3 years ago

The number of blob lease operations scales multiplicatively based on:

- the number of partitions in the task hub
- the number of workers polling for and renewing leases
- how frequently each worker polls

It's also worth noting that the new partition management strategy introduced in v2.3.1 effectively doubles the number of blob operations, since we now have two blob leases per partition to make partition management safer.

I think I may have been overly aggressive with the polling intervals for the new set of blob leases, which may have pushed the multiplier closer to 2.5x or 3x. We can't really lower it below the 2x increase, though, since the number of blobs has genuinely doubled.

I am also starting to consider some sort of blob lease backoff strategy. This would be super helpful for the case where you have 16 partitions but something like 50 workers. Workers 17-50 are never going to need to grab a partition lease unless one of the lease owners goes down, so a lot of those blob lease operations are mostly useless. However, I would want to make sure that any backoff we did wouldn't allow a partition to be stranded for too long.
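To make the multiplicative scaling concrete, here is a rough back-of-the-envelope sketch. Every number in it (worker count, polling interval, per-transaction price) is an illustrative assumption, not a figure from the extension or from Azure pricing:

```csharp
using System;

class LeaseCostEstimate
{
    static void Main()
    {
        // All values below are illustrative assumptions.
        int partitions = 16;
        int workers = 4;                 // steady-state workers polling for leases
        int leasesPerPartition = 2;      // safe strategy: intent + ownership lease
        double pollIntervalSeconds = 10; // hypothetical polling interval

        double pollsPerDay = 86400 / pollIntervalSeconds;
        double transactionsPerDay = (double)workers * partitions * leasesPerPartition * pollsPerDay;

        // Example GPv2 pay-as-you-go rate; actual pricing varies by region and operation type.
        double pricePer10kTransactions = 0.004;
        double costPerDay = transactionsPerDay / 10_000 * pricePer10kTransactions;

        // Doubling workers or partitions doubles the transaction count, which is the
        // multiplicative scaling described above.
        Console.WriteLine($"{transactionsPerDay:N0} lease transactions/day ~ ${costPerDay:F2}/day");
    }
}
```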

olitomlinson commented 3 years ago

That all makes sense @ConnorMcMahon - Thanks!

I definitely think the baseline cost of the TaskHub storage when not under significant load (such as Saturday/Sunday) has increased significantly.

This is the cost for one TaskHub

image

And this is the App Insights telemetry for successful Orchestration executions for the functions App which uses the above TaskHub Storage (I've removed non-orchestration triggers so the results are more specific to partition-related activity)

image

Notice how the Saturday/Sunday traffic is consistently low throughout the time range, even before 2.3.1 was rolled out on the 29th of October.

However, compare the baseline cost for Saturday/Sunday: it's gone from approximately £0.30-£0.40 to somewhere in the range of £1.70-£2.20, which is a 4-8x greater baseline cost.

ConnorMcMahon commented 3 years ago

That definitely seems high @olitomlinson.

I'll take a look, but we just froze the code for v2.4.1, so this will be v2.4.2 or v2.5.0. I think I can get that down to a 2x cost, but anything more than that is highly unlikely for the current partition management strategy. Would that be satisfactory?

olitomlinson commented 3 years ago

I think that would be great @ConnorMcMahon !

Going from £1-£2 a day to £3-£6 is quite the jump.

So a 2x reduction in that would definitely be noticeable on the bottom line as I have 5 DF Apps, currently on 16 partitions. This could represent a saving of a few hundred pounds across the month! Even more when I factor in the dev environment which replicates prod!

In fact, this is another prime example of why partition count shouldn't be a host.json setting, as I can't easily set the partition count to just 1-2 for Dev and 16 for prod without doing some host.json swap-out as part of the DevOps release pipeline.

A better option would be a first-class DurableTask Resource Provider (which also encapsulates provisioning of the Durable Task storage account) - I could then use Terraform to configure my TaskHub resource as I desire per environment, rather than having a host.json bundled with the Function App code which is static across all environments.

ConnorMcMahon commented 3 years ago

@olitomlinson,

You can actually do this today by injecting your own IDurabilityProviderFactory. You can look at what we do with AzureStorageDurabilityProviderFactory and come up with a slight variant that allows you to configure it with app settings as opposed to host.json.

This is how we are lighting up our alternative backends in the near future.
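A rough sketch of how the configuration side of that could be wired up in a Functions startup class. Everything below is illustrative: the app setting name is made up, and mutating the bound DurableTaskOptions via PostConfigure is a lighter-weight stand-in for the full IDurabilityProviderFactory variant described above (it assumes the storageProvider section binds as a dictionary on the options).

```csharp
using System;
using Microsoft.Azure.Functions.Extensions.DependencyInjection;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.DependencyInjection;

[assembly: FunctionsStartup(typeof(MyApp.Startup))]

namespace MyApp
{
    public class Startup : FunctionsStartup
    {
        public override void Configure(IFunctionsHostBuilder builder)
        {
            // Override the partition count from an app setting (hypothetical name),
            // instead of hard-coding it in host.json.
            builder.Services.PostConfigure<DurableTaskOptions>(options =>
            {
                var raw = Environment.GetEnvironmentVariable("TASKHUB_PARTITION_COUNT");
                if (int.TryParse(raw, out var partitionCount))
                {
                    options.StorageProvider["partitionCount"] = partitionCount;
                }
            });

            // Alternatively, as suggested above, register your own factory variant:
            // builder.Services.AddSingleton<IDurabilityProviderFactory, MyStorageDurabilityProviderFactory>();
        }
    }
}
```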

olitomlinson commented 3 years ago

@ConnorMcMahon oh that’s wicked! This must have been a recent-ish development because I recall Chris saying that something wouldn’t work properly in the consumption plan if partition count came from an app setting instead of host.json.

It might be worth publishing an official doc on this as I’ve seen a few folk wishing to vary the settings in host.json across environments.

ConnorMcMahon commented 3 years ago

We've had some form of this since v2.0.0, but we definitely haven't properly documented this.

As a side note, you will want to be careful about changing partition counts for a task hub with in-flight executions, as that can cause orchestration messages to land on different control queues. Just something to be aware of if you are doing slot swaps with sticky app settings.
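To illustrate why that matters: control-queue assignment is essentially a hash of the orchestration instance ID modulo the partition count, so changing the count remaps existing instances to different queues. The hash below is a simplified stand-in, not the extension's actual hashing code:

```csharp
using System;

class ControlQueueSketch
{
    // Simplistic FNV-1a-style hash, purely for illustration.
    static int ControlQueueIndex(string instanceId, int partitionCount)
    {
        uint hash = 2166136261;
        foreach (char c in instanceId)
        {
            hash = (hash ^ c) * 16777619;
        }
        return (int)(hash % (uint)partitionCount);
    }

    static void Main()
    {
        string instanceId = "abc123";
        // The same instance generally maps to different control queues under 16 vs 4
        // partitions, so messages already enqueued for the old queue may never be read.
        Console.WriteLine(ControlQueueIndex(instanceId, 16));
        Console.WriteLine(ControlQueueIndex(instanceId, 4));
    }
}
```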

olitomlinson commented 3 years ago

@ConnorMcMahon yes I certainly wouldn’t change the partition count as I understand there is a risk of orphaning existing running orchestrations!

Great point about sticky app settings!

olitomlinson commented 3 years ago

@ConnorMcMahon

I've taken this from my dev environment, which highlights the base-cost increase even more.

image

The 4 lines represent the exact same DF App, with the exact same host.json configuration and Consumption Plan / Storage Account resource configuration.

Interestingly, 2 of the apps, which run a certain kind of low-volume workload, have a much lower base cost than the other 2 apps, which are exercised more frequently throughout the day by recurring regression tests.

It also appears that the base-cost is now more sensitive to load.

Unfortunately, I've decided to re-enable the legacy partition strategy across all of my Apps in dev and prod as I can't, in good faith, justify the increase in cost at this point. I would go as far as to say the new partition strategy is cost prohibitive for me. cc @cgillum

ConnorMcMahon commented 3 years ago

That's understandable at this point @olitomlinson. Thanks for bringing this to our attention. I'll see what I can do about driving this down for our upcoming release.

olitomlinson commented 3 years ago

Thanks so much @ConnorMcMahon ! It’s really appreciated!

olitomlinson commented 3 years ago

@ConnorMcMahon is this likely to be prioritised for 2.4.2? Cheers!

ConnorMcMahon commented 3 years ago

That is the goal. Note that the current deadline for that milestone is a placeholder.

ConnorMcMahon commented 3 years ago

@olitomlinson,

Can you provide details about your environment? Specifically the following:

- Partition count
- Azure Functions plan type
- How many workers (at rest and at scale)

I am working on the fix now, but I want to get some accurate measurements, and since you hit such a big cost difference, your scenario is what I want to test against.

olitomlinson commented 3 years ago

@ConnorMcMahon

Off the top of my head, 16 partitions on Consumption Plan.

I assume that to get the concurrent worker count I need to aggregate cloudRole_InstanceId in App Insights? If so, I can get this to you tomorrow when I have my laptop.

ConnorMcMahon commented 3 years ago

I think that is the best approach @olitomlinson. My guess is that you have enough traffic to have a steady 1 worker at most/all times, and then you have periods of load that get you much higher than that.

olitomlinson commented 3 years ago

@ConnorMcMahon

Given 5-minute aggregation windows:

I've seen up to ~55 workers during scale out for peaks of load

image

Generally, over the course of the day there are typically around 3-4 workers (processing orchestrations & activities) in any 5-minute window, dropping to zero for a few moments during a 24-hour day.

image

ConnorMcMahon commented 3 years ago

@olitomlinson,

As a status update, here are the costs from my initial experimentation, for the case of 1 worker running on Consumption with 16 partitions over a 3-day weekend with idle load.

| Strategy | Queues | Blobs | Total |
|----------------|-------|-------|-------|
| Legacy | $.23 | $.42 | $.67 |
| Safe | $.23 | $1.12 | $1.37 |
| Safe (revised) | $.25 | $.84 | $1.10 |

Looking at queues, the cost remains constant across partition management strategies, which makes sense. In fact, it should be a consistent cost for idle load regardless of how many workers you have.

Therefore, let's just look at blob cost, as that is going to be the major driver of the default cost of running Durable Functions, and it will scale roughly linearly with the number of workers you have.

The current safe partition management has a ~2.67x cost differential compared to legacy. If you look at my revision (which I will publish a PR for soon), I brought that down to a ~2x cost differential.

I will want to test the app under some sustained load to see if that changes the numbers, but this is about the lower bound of what I can decrease it to with the new algorithm, considering we've doubled the number of blobs for the safe strategy.

I do want to point out that in an earlier message you mentioned the desire to override host.json with app settings to make it easier to configure partitions on a per-environment basis. It turns out this is possible, as documented here.
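The link above hasn't been preserved in this transcript, but the mechanism referred to is most likely the standard host.json override via app settings using the `AzureFunctionsJobHost__` prefix. A sketch of what that looks like in local.settings.json (the value 2 is just an example; in Azure the same key goes into the Function App's application settings):

```json
{
  "IsEncrypted": false,
  "Values": {
    "AzureFunctionsJobHost__extensions__durableTask__storageProvider__partitionCount": "2"
  }
}
```

This lets dev environments run with a small partition count while prod keeps 16, without swapping out host.json.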

EDIT:

I realized that I've added an optimization that may actually improve these numbers slightly at load. We don't have to read any ownership blob data for workers that don't own any intent leases (i.e. workers 17+ in the case of 16 partitions), which should cut the blob reads for these workers back to the levels of the legacy partition manager.

I'm also looking into some further design changes that could potentially reduce blob read costs even more, but I need to explore those to make sure they are actually viable.

olitomlinson commented 3 years ago

@ConnorMcMahon

Great! This sounds like really positive steps! Thanks :)

ConnorMcMahon commented 3 years ago

@olitomlinson,

Great news. I was able to code up a more ambitious cost-reduction approach. Now we only download ownership leases on workers that own intent leases. This means the cost increase should be static: it grows only with the number of partitions, not with the number of workers. So with a single worker you would still see roughly the 2x cost relative to legacy, but there should be little/no increase for additional workers.

The PR is here. If you are amenable, we could put out a prerelease package for you to consume in one of your test environments to validate the real-world cost reduction for your scenario.

olitomlinson commented 3 years ago

@ConnorMcMahon excellent! Happy to take a pre-release and run it in a test environment and feedback.

ConnorMcMahon commented 3 years ago

@olitomlinson, if you reference Microsoft.Azure.DurableTask.AzureStorage 1.8.5-prerelease on our myget feed (https://www.myget.org/F/azure-appservice/api/v3/index.json), you can try out the new code that should reduce storage costs with the safe partition management strategy.

olitomlinson commented 3 years ago

@ConnorMcMahon awesome! Will get it rolled out and tested as soon as I can!

ConnorMcMahon commented 3 years ago

@olitomlinson,

The bad news is that we caught a transient regression in partition balancing during our performance-testing validation before the release, so we are going to punt this work to v2.5.0 so we can harden it. Sorry for the inconvenience here.

olitomlinson commented 3 years ago

@ConnorMcMahon

I have spied another instance of a customer walking away from DF due to insane Storage costs for Durable (specifically Entities)

The user was hitting £150 a day in storage costs - this is specifically a storage account just for hosting the TaskHub.

image

src : https://twitter.com/craftyfella/status/1376514291323506688?s=20

Interestingly, the breakdown of storage transactions in the above image shows the same transactions as I reported when opening this issue.

What kind of partition management scenario could cause such costs against one storage account?

JonCanning commented 3 years ago

@ConnorMcMahon

Here are our costs for last month as requested

image

And our host.json

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "%HubName%",
      "storageProvider": {
        "partitionCount": 16,
        "connectionStringName": "EntityStorage",
        "useLegacyPartitionManagement": false
      }
    }
  }
}
```

ConnorMcMahon commented 3 years ago

@JonCanning,

Thanks so much for the details. Just to clarify, do you know if your blob account is set to use premium block blob performance? Internally, we use block blobs mainly for lease purposes, and premium block blob operations are orders of magnitude more expensive, which could explain the high blob costs you are seeing.

Since we only use blobs for leases (and large messages that don't fit in queues/tables), premium block blob performance is largely a waste of money for Durable Functions.

JonCanning commented 3 years ago

We used a general purpose v2 storage account

ConnorMcMahon commented 3 years ago

Hmmm, that seems absurdly high for non-premium block blob usage... Do you mind sharing an orchestration instance id (or entity id) + region so we can look at your app and see what storage operations could be creating such high costs?

JonCanning commented 3 years ago

We've removed durable entities from the service now. If you can see the historical data, the storage account id was /subscriptions/8a4a4971-b728-4c5a-b800-e17fe2a330d9/resourceGroups/pod-prod-uksouth-billing/providers/Microsoft.Storage/storageAccounts/podproduksouthentity

It was handling 7 million updates a day.

olitomlinson commented 3 years ago

@JonCanning

Would you say that translates to roughly 7 million signals/calls per day? Spread across roughly how many Entities? Would you say that in your use-case each entity was receiving many signals/calls in quick succession?

I ask because I've seen that when an entity receives more than a few hundred signals in quick succession, the entity's state can no longer fit in the usual Azure Table Storage row. In that case, it falls back to blob storage for the entity state. I wonder if this explains the super-high blob cost...


The state grows because each entity tracks the signals it receives (signal id & timestamp) within a rolling time frame (I think it's the last 30 minutes by default). This is so it can dedupe and manage out-of-order signals.
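For illustration, the burst pattern described above looks something like the client code below. The entity name, operation, and counts are hypothetical, not taken from the app being discussed:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class BurstSignals
{
    [FunctionName("BurstSignals")]
    public static async Task Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req,
        [DurableClient] IDurableEntityClient client)
    {
        var entityId = new EntityId("Counter", "hot-key"); // hypothetical entity

        // Hundreds of signals to one entity in quick succession: per the explanation
        // above, each signal adds a (signal id, timestamp) entry to the entity's
        // dedup/ordering state, so the state can outgrow a table row and spill into
        // the *-largemessages blob container.
        for (int i = 0; i < 500; i++)
        {
            await client.SignalEntityAsync(entityId, "Add", 1);
        }
    }
}
```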

JonCanning commented 3 years ago

Yes, there were maybe a million entities, and some were receiving in quick succession, but the majority were not. Your explanation makes sense in this context.

olitomlinson commented 3 years ago

I appreciate you've already departed from DF at this point, so this is just noise... I'm speculating somewhat, but it's possible that your use case would have been a good candidate for the new Durable Functions storage provider called Netherite, as it is optimised for 'entities at scale'.

Unfortunately, that provider isn't GA just yet.

karol-gro commented 3 years ago

What is the status of this issue?

We recently created a simple Durable Function and found that we make about 500k requests to the Blob API per day (which by itself costs about $10/month). Strangely, it happens even in our dev environment, where the function is never called (but the app itself is running, as there are other functions as well). Not a big cost, but we're a bit worried about how it will scale.

The statistics look similar to those above (mostly List & Create Blob and Other operations).

maxime-delisle commented 3 years ago

> What is the status of this issue?
>
> We recently created a simple Durable Function and found that we make about 500k requests to the Blob API per day (which by itself costs about $10/month). Strangely, it happens even in our dev environment, where the function is never called (but the app itself is running, as there are other functions as well). Not a big cost, but we're a bit worried about how it will scale.
>
> The statistics look similar to those above (mostly List & Create Blob and Other operations).

We have the same issue on our side. It seems to be related to the use of an Event Hubs trigger, which even when unused causes approximately 1 million transactions per day.

karol-gro commented 3 years ago

We don't have an Event Hubs trigger - only OrchestrationTrigger and Timer triggers. The same app without the OrchestrationTrigger has no such issues.

ConnorMcMahon commented 3 years ago

Let me provide some context here:

The Azure Storage implementation of DTFx, which is used by Durable Functions, has always used blob leases to ensure each partition is processed by only a single VM at a time. We borrowed this implementation from Event Hubs circa 2016, which is why the two have the similar characteristics people are noticing on this thread.

This algorithm involves polling each blob (one for each partition) on each worker, at a fairly frequent interval. This happens even when the app is idle, as we need to ensure that when we do receive traffic, partitions are still only processed by a single VM at a time. This means that as you increase the partition count, or increase the number of workers that your code is running on, you will increase the number of transactions.

Note that the costs for most customers with this legacy algorithm are fairly low (~$5-10 per month per TaskHub). However, v2 storage accounts increased the cost of blob API transactions by an order of magnitude, which means that even this legacy algorithm can get quite expensive.

Our legacy algorithm was found to have issues in cases where partitions move quickly (i.e. the consumption and premium plans for Azure Functions), so we settled on a new algorithm. While this algorithm is definitely more reliable at preventing split brain (multiple workers processing a partition at the same time), it unfortunately adds a second blob lease for each partition, which obviously drives up costs.

We did find some opportunities to reduce the number of API transactions in the new algorithm from around ~2x to ~1.25x relative to legacy, but we had to revert those changes because we found bugs right before the release.

Our current approach to solving this problem is to bypass blob leases altogether, as even the legacy algorithm has high costs if using v2 storage accounts. V2 Storage accounts are required for various networking features, so this will eventually become the default for new function apps. Because of that, we are investing our efforts in a new algorithm that relies more on Azure Tables than on blob leases, as opposed to moderately improving the current partition management strategy.

This issue is being tracked here. I have a design floating around internally that we believe will work, and I will post it in that issue thread later today. If my back-of-the-envelope math is correct, this new approach should be more cost-effective on both v1 and v2 storage accounts, and hopefully just as reliable as our current dual-blob-lease algorithm.

lilyjma commented 1 year ago

Hi all, we recently released a new version of the partition manager that's in preview. It's designed to be more cost-efficient than the old version. However, it doesn't support managed identity yet (we'll be adding that soon). Please give it a try if possible and let us know of any feedback you have: https://github.com/Azure/azure-functions-durable-extension/releases/tag/v2.10.0. Thanks!

olitomlinson commented 1 year ago

@lilyjma

Glad you've finally released a solution for this!

Unfortunately, nearly 3 years have passed and I no longer work at the same organisation so I'm unable to validate.