Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License

Autoscale throughput support #1365

Closed j82w closed 4 years ago

j82w commented 4 years ago

The current design of autoscale throughput for the .NET v3 SDK.

Design:

  1. Union type. ThroughputProperties will contain values for both fixed and autoscale throughput. This allows a single read operation to always succeed.
  2. ThroughputProperties will be created through a factory to ensure the correct configuration and to avoid runtime exceptions from misconfigured ThroughputProperties.
  3. Throughput properties will be set for both autoscale and fixed throughput offerings. MaxAutoscaleThroughput and AutoUpgradeMaxThroughputIncrementPercentage will only be populated if autoscale is set.
```C#
Database database = await this.cosmosClient.CreateDatabaseAsync(
    nameof(CreateDropAutoscaleDatabase) + Guid.NewGuid().ToString(),
    ThroughputProperties.CreateAutoscaleProvisionedThroughput(5000));

ThroughputResponse autoscale = await database.ReadThroughputAsync(requestOptions: null);
Assert.IsNotNull(autoscale);
Assert.AreEqual(5000, autoscale.Resource.MaxAutoscaleThroughput);

ThroughputResponse autoscaleReplaced = await database.ReplaceThroughputPropertiesAsync(
    ThroughputProperties.CreateAutoscaleProvisionedThroughput(10000));
Assert.IsNotNull(autoscaleReplaced);
Assert.AreEqual(10000, autoscaleReplaced.Resource.MaxAutoscaleThroughput);
```
```C#
Container container = await database.CreateContainerAsync(
    new ContainerProperties(Guid.NewGuid().ToString(), "/pk"),
    ThroughputProperties.CreateAutoscaleProvisionedThroughput(5000));

ThroughputResponse autoscale = await container.ReadThroughputAsync(requestOptions: null);
Assert.IsNotNull(autoscale);
Assert.AreEqual(5000, autoscale.Resource.MaxAutoscaleThroughput);

ThroughputResponse autoscaleIfExists = await container.ReadThroughputIfExistsAsync(requestOptions: null);
Assert.IsNotNull(autoscaleIfExists);
Assert.AreEqual(5000, autoscaleIfExists.Resource.MaxAutoscaleThroughput);

ThroughputResponse autoscaleReplaced = await container.ReplaceThroughputPropertiesAsync(
    ThroughputProperties.CreateAutoscaleProvisionedThroughput(10000));
Assert.IsNotNull(autoscaleReplaced);
Assert.AreEqual(10000, autoscaleReplaced.Resource.MaxAutoscaleThroughput);
```
```C#
public class ThroughputProperties
{
    [JsonConstructor]
    private ThroughputProperties()
    {
    }

    [JsonProperty(PropertyName = Constants.Properties.ETag, NullValueHandling = NullValueHandling.Ignore)]
    public string ETag { get; private set; }

    [JsonConverter(typeof(UnixDateTimeConverter))]
    [JsonProperty(PropertyName = Constants.Properties.LastModified, NullValueHandling = NullValueHandling.Ignore)]
    public DateTime? LastModified { get; private set; }

    [JsonIgnore]
    public int? Throughput { get; private set; }

    [JsonIgnore]
    public int? MaxAutoscaleThroughput { get; private set; }

    public static ThroughputProperties CreateFixedThroughput(int throughput)
    {
        return new ThroughputProperties { Throughput = throughput };
    }

    public static ThroughputProperties CreateAutoscaleProvisionedThroughput(
        int maxAutoscaleThroughput)
    {
        return new ThroughputProperties { MaxAutoscaleThroughput = maxAutoscaleThroughput };
    }

    [JsonProperty(PropertyName = Constants.Properties.SelfLink, NullValueHandling = NullValueHandling.Ignore)]
    public string SelfLink { get; private set; }
}
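
Under the union-type design above, a single read works for either offer type, and the caller branches on which field is populated. A minimal consumer sketch (assuming the proposed property names and an existing `container` in scope; illustrative only, not a final API):

```C#
// Sketch only: one read succeeds regardless of offer type (union-type design).
ThroughputResponse response = await container.ReadThroughputAsync(requestOptions: null);

if (response.Resource.MaxAutoscaleThroughput.HasValue)
{
    // Autoscale offer: MaxAutoscaleThroughput is populated.
    Console.WriteLine($"Autoscale, scales up to {response.Resource.MaxAutoscaleThroughput} RU/s");
}
else
{
    // Fixed offer: only Throughput is populated.
    Console.WriteLine($"Fixed at {response.Resource.Throughput} RU/s");
}
```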
jahmai-ca commented 4 years ago

Can you explain what "AutoUpgradeMaxThroughputIncrementPercentage" does? I haven't seen any feature described in Autopilot docs that matches that terminology.

deborahc commented 4 years ago

@j82w - can you add a sample that shows how the conversion between autoscale and manual throughput would go?

j82w commented 4 years ago

I see the following options:

  1. Implicit conversion, where we handle it internally in the SDK

```C#
Database database = await this.cosmosClient.CreateDatabaseAsync(
    "testDb",
    ThroughputProperties.CreateFixedThroughput(5000));

ThroughputResponse fixedThroughput = await database.ReadThroughputAsync(requestOptions: null);
Assert.AreEqual(5000, fixedThroughput.Resource.Throughput);

ThroughputResponse autoscaleReplaced = await database.ReplaceThroughputPropertiesAsync(
    ThroughputProperties.CreateAutoscaleProvisionedThroughput(10000));
```

  2. Explicit conversion with explicit APIs

```C#
Database database = await this.cosmosClient.CreateDatabaseAsync(
    "testDb",
    ThroughputProperties.CreateFixedThroughput(5000));

ThroughputResponse autoscale = await database.UpdateThroughputTypeToAutoscale();

ThroughputResponse autoscaleReplaced = await database.ReplaceThroughputPropertiesAsync(
    ThroughputProperties.CreateAutoscaleProvisionedThroughput(10000));

ThroughputResponse fixedThroughput = await database.UpdateThroughputTypeToFixed();
```

  3. Explicit conversion with an enum

```C#
Database database = await this.cosmosClient.CreateDatabaseAsync(
    "testDb",
    ThroughputProperties.CreateFixedThroughput(5000));

ThroughputResponse autoscale = await database.UpdateThroughputType(ThroughputType.Autoscale);

ThroughputResponse autoscaleReplaced = await database.ReplaceThroughputPropertiesAsync(
    ThroughputProperties.CreateAutoscaleProvisionedThroughput(10000));

ThroughputResponse fixedThroughput = await database.UpdateThroughputType(ThroughputType.Fixed);
```

deborahc commented 4 years ago

I am in favor of an explicit conversion. When a user does the conversion from manual -> autoscale, the backend will set a max RU/s. The user can then change that in a follow-up call, as in options 2 and 3.

With option 1, it's possible the user could try to set an invalid value. Is it possible that the conversion to autoscale succeeds, but setting the new value doesn't? Or is it atomic?

j82w commented 4 years ago

@jahmai I removed AutoUpgradeMaxThroughputIncrementPercentage from the current design. It's a new autoscale feature that is still being discussed. The feature is meant for advanced users: it allows Cosmos DB to scale MaxThroughput by a specified percentage, so a database/container would never get stuck being throttled.

kirankumarkolli commented 4 years ago

One other option is to overload Update/Replace with a forceMigrate flag. It keeps the models to the data contracts only, and the operations are still generic.

j82w commented 4 years ago

@kirankumarkolli The problem with that design is that the backend does not allow the properties to be updated in the same call that does the conversion. This would require two network calls, and it gets kind of ugly if the conversion isn't instant. We would also have to handle the scenario where the conversion succeeds but the updated value the user gave is not valid. Do we convert back to fixed, or do we leave it in autoscale but with the wrong user value?

kirankumarkolli commented 4 years ago

Two network calls for migration is not bad.

Migration/scale-up/scale-down are not guaranteed to be instant and can be async. Even the existing sync ones are kind of an implementation detail and are bound to change.

j82w commented 4 years ago

This will require the SDK to keep polling until the conversion is complete and then make another call to update the value. So it could be a lot more than two network calls.
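
The flow being described could be sketched roughly as follows. Note that `ConvertToAutoscaleAsync` and `IsConversionPendingAsync` are hypothetical names invented for illustration (no such SDK APIs exist); the point is only to show why the call count grows well beyond two:

```C#
// Hypothetical sketch of the multi-call migration flow described above.
// ConvertToAutoscaleAsync / IsConversionPendingAsync are made-up placeholders
// for whatever backend conversion operation would be exposed.
await database.ConvertToAutoscaleAsync();                 // call 1: start the conversion

while (await database.IsConversionPendingAsync())         // calls 2..N: poll until complete
{
    await Task.Delay(TimeSpan.FromSeconds(5));
}

await database.ReplaceThroughputPropertiesAsync(          // final call: apply the user's value
    ThroughputProperties.CreateAutoscaleProvisionedThroughput(10000));
```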

j82w commented 4 years ago

@deborahc what if only the portal and CLI supported migration scenarios? Is there really a scenario where the user would do the conversion with the SDK? It's a one-time operation. Doing a check or calling the migration every time their application starts would be bad. The better user experience would be to do it through the portal or CLI.

jahmai-ca commented 4 years ago

Just for a small piece of user input on this discussion:

I can't imagine a scenario where we would migrate between manual <-> auto as a part of the standard operation of our application (as in response to some application condition). To my mind, Autopilot is there to avoid the need for realtime adjustments in throughput capacity.

I can see the need for occasional bulk changing of containers through in-house CLI tools, and in those cases, I think it's perfectly fine for the operation to be multiple calls or even polling for status for the migration to complete (initiate migration, then poll for completion, then adjust RUs). If it was a one-off conversion of a single container, we would probably just use the portal.

To be perfectly selfish, the MVP for me is creating a container that is already configured for Autopilot and be able to retrieve the container properties that inform me that it is configured as an Autopilot container and what the RUs are (whether that's both min/max, or just min assuming that the range is well defined and static).

kirankumarkolli commented 4 years ago

@deborahc I'm kind of aligned with @jahmai.

The CLI is a better fit for migration and also more practical. Can migration be limited to, and started with, just the CLI?

deborahc commented 4 years ago

The GA has 2 main features - 1) custom max RU/s and 2) ability to convert existing from manual -> autoscale.

It's a fair point it will likely be a one-time operation, mostly from manual -> autoscale. However, I think for completeness, it makes sense to support the migration scenario in the SDK. For developers who are already doing container/database management with SDK, it will be more convenient than switching to CLI and scripting. Otherwise, we have to say that feature 2) is available everywhere (Portal, CLI, PS, ARM) except for SDK.

I'm reaching out to a few customers who asked for the migration scenario to see if they would be blocked by it not in the SDK.

Converting an existing database/container to autoscale is conceptually a 2 step process - first you convert, then after it is successful, you change the max RU/s if you want (Portal also does it like this).

Is the worry with SDK support that the conversion operation will not be instant, and require a lot of polling? Based on how the system calculates the new max RU/s, I do not see a scenario where a scale-up/split is required - but I will add Padma or Rav to comment here.

FabianMeiswinkel commented 4 years ago

+1 to Deborah's comment - there are customers still using the SDK for CI and container/database management.

But I also agree that CLI support for migration would be more important than SDK supporting it.

I would expect that, long-term, more and more customers (at least enterprise customers) will move management tasks towards the control plane to be able to use AAD integration, audit logs, etc.

padmaaradhyula commented 4 years ago

> Is the worry with SDK support that the conversion operation will not be instant, and require a lot of polling? Based on how the system calculates the new max RU/s, I do not see a scenario where a scale-up/split is required - but I will add Padma or Rav to comment here.

There are some corner-case scenarios which may require a split during migration from manual to autopilot.

My 2c: currently we don't support any such billing-impacting migrations via SDK (multi-master to single-master, free tier to non-free tier). Limiting the migration support to CLI/PS and the portal sounds good to me.

ravgill commented 4 years ago

As Padma said, yes, there will be a small subset of containers/databases that may encounter splits when migration is triggered. Max RU/s can't be changed while splits are going on.

deborahc commented 4 years ago

In that case - how about we go with supporting the create new scenario for now, and add the conversion based on feedback later?

jahmai-ca commented 4 years ago

@j82w I tried this out using ThroughputProperties.CreateAutoscaleThroughput(4000) against a new account/database/container in the Australia East region, and it just created a manual throughput container with 400 RU/s. Is this expected at this time? Is there some server update that needs to happen first?

j82w commented 4 years ago

@jahmai how are you checking whether it's a manual throughput container? The portal changes are currently being deployed, which is why we haven't released 3.9.0-preview yet. The emulator has a bug, which is being fixed, in reading the container throughput.

Check ThroughputProperties.Throughput, which is the currently provisioned throughput, and ThroughputProperties.MaxAutoscaleThroughput, which is the maximum the container will scale to.

jahmai-ca commented 4 years ago

@j82w Using the portal after creating the container through code. Looking at the provisioned throughput in Data Explorer "Scale and Settings" tab it is set to Manual with 400 RUs. It doesn't allow me to select Autopilot (which I know Manual -> Autopilot is currently a missing feature). If I delete and re-create the container manually in the portal then Autopilot is selected as expected and I can change the max Autopilot max RU's as expected through the portal.

ravgill commented 4 years ago

@jahmai the autoscale offer resource that the service creates from the preview SDK is slightly different from the offer that the portal creates today for an autoscale container with the same autoscale max throughput. The portal doesn't understand this difference yet, and that's why it is not showing a way to change the max throughput or switch back to manual (not because it thinks this container has been configured with manual throughput and there is no way to make it autoscale today). If you see a banner that asks you to refresh the browser cache under the "Scale and Settings" tab in the portal, that's an indication that the collection is autoscale-enabled and was created using the preview SDK. As @j82w mentioned, the portal experience will get updated early this week.

jahmai-ca commented 4 years ago

@ravgill Ahhh ok makes sense :) We're just doing migration testing at the moment so data will likely get blown away in the next week or so before we do production testing so I'll probably wait for that update. Thanks!

jahmai-ca commented 4 years ago

@ravgill Any updated release date on the portal update? Containers created with the preview SDK are still appearing as manual in the portal (but the RU/s automatically go up and down with load). Also, I have a container that didn't automatically switch from 4,000 RU/s to 20,000 RU/s with Autopilot when I exceeded the 50 GB mark. I have a case open with Azure support, but could this be related to the fact that I created the container with the preview SDK?

deborahc commented 4 years ago

@jahmai - thanks for trying this out! When you create the container with the preview SDK, the way the system handles the max RU/s in relation to storage is a bit different from the Autopilot tiers model, which was a step function. Based on the max RU/s, the storage (in GB) you can have before the max RU/s increases is max RU/s / 100. For example, if you start with a maximum of 50,000 RU/s (scales between 5,000 - 50,000 RU/s), you can store up to 500 GB of data. If you exceed 500 GB - e.g. storage is now 600 GB - the new maximum will be 60,000 RU/s (scales between 6,000 - 60,000 RU/s).
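
To make the arithmetic concrete, here is a tiny restatement of that rule as plain C# (my own helper names, not SDK code): the storage allowance in GB is max RU/s / 100, and once storage exceeds it, the effective max RU/s becomes storage GB × 100.

```C#
using System;

// Restating the storage rule from the comment above (not SDK code):
// storage allowance in GB = max RU/s / 100; once storage exceeds that,
// the effective max RU/s becomes storage (GB) * 100.
static int StorageAllowanceGb(int maxThroughput) => maxThroughput / 100;

static int EffectiveMaxThroughput(int maxThroughput, int storageGb) =>
    Math.Max(maxThroughput, storageGb * 100);

// 50,000 max RU/s allows up to 500 GB before the max is raised.
Console.WriteLine(StorageAllowanceGb(50000));          // 500
// At 600 GB the effective max becomes 60,000 RU/s.
Console.WriteLine(EffectiveMaxThroughput(50000, 600)); // 60000
```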

This will all be documented in an upcoming release, which will include the portal. Please stay tuned :)

jahmai-ca commented 4 years ago

@deborahc Ahhh, that actually sounds better than the tiering model. Very cool!

wahyuen commented 4 years ago

@deborahc @j82w Just wanted to provide some feedback from a vendor/developer perspective.

We currently have a number of clients with CosmosDB deployed, and for quite a while we have provided custom application scaling of throughput to allow customers to control their cost. Given that our product is often deployed in a customer's tenant and we often do not have portal access, our only option for pushing our infrastructure changes is through initialization code (via SDK) or via ARM/CLI scripts (via a locked-down Service Principal in an Azure DevOps Release Pipeline).

It would be awesome to have some migration path that could be programmatically performed so that we could roll these out to customers (not fussed if it's ARM/SDK/CLI; we are used to using a combination of these tools quite often now!)

padmaaradhyula commented 4 years ago

Migration will be supported in CLI by late June.
