Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License

[Internal] AllVersionsAndDeletes on change feed pull model for partition split-handling caching design documentation. Compute Gateway. #3681

Closed philipthomas-MSFT closed 1 year ago

philipthomas-MSFT commented 1 year ago

Purpose statement

This document describes an enhancement to the Cosmos DB experience aimed at achieving even higher performance.

Description: This is a plan to improve latency and overall performance for change feed requests in the Azure Cosmos DB pull model while in AllVersionsAndDeletes (preview) change feed mode, by introducing a caching strategy, local to Compute Gateway, for a collection's physical partition archival lineage. The archival lineage is a routing map that instructs Compute Gateway on how to drain documents when a change feed request is received. It is driven by each physical partition's minimum and maximum log sequence numbers, which are obtained through a change feed information request to the Backend API. This will support all SDK languages, specifically the .NET and Java SDKs. Tenant configuration feature flags and additional diagnostic logging will also need to be implemented. This issue will be split into multiple PRs (caching, feature flag, and logging). The tenant configuration feature flag is a fail-safe in case the caching strategy does not work as expected.
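As a rough illustration of that fail-safe, here is a minimal sketch (all type and flag names are hypothetical, not the actual tenant configuration API): the gateway checks the flag before touching the cache and falls back to the uncached path when the flag is off.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical sketch only: the feature flag gates the lineage cache so the
// uncached code path remains available as a fail-safe.
public sealed class ArchivalLineageProvider
{
    private readonly Func<string, bool> isFeatureEnabled;                      // tenant feature-flag lookup
    private readonly Func<string, Task<string>> buildLineageFromBackend;       // expensive backend construction
    private readonly ConcurrentDictionary<string, Task<string>> cache = new(); // keyed by container resource id

    public ArchivalLineageProvider(
        Func<string, bool> isFeatureEnabled,
        Func<string, Task<string>> buildLineageFromBackend)
    {
        this.isFeatureEnabled = isFeatureEnabled;
        this.buildLineageFromBackend = buildLineageFromBackend;
    }

    public Task<string> GetLineageAsync(string containerResourceId)
    {
        // Fail-safe: when the flag is off, always take the uncached path.
        if (!this.isFeatureEnabled("EnableArchivalLineageCache"))
        {
            return this.buildLineageFromBackend(containerResourceId);
        }

        return this.cache.GetOrAdd(containerResourceId, this.buildLineageFromBackend);
    }
}
```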

Level-setting

Stakeholders

Resources

Out of scope

Scope of work

The Microsoft Azure Cosmos DB .NET SDK Version 3 needs to achieve optimal performance by implementing a local caching strategy in Compute Gateway for all change feed requests while in AllVersionsAndDeletes (preview) change feed mode. Introducing a caching strategy, along with additional trace logging and a feature flag, for the collection's physical partition archival lineage will improve performance by accessing a cache that is local to Compute Gateway.

Criteria for caching

Caching applies when a collection's physical partition has split. Whether to construct the collection's physical partition archival lineage is determined solely by whether that physical partition has returned an HTTP status code 410 (Gone) when change feed items were requested.
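Expressed as code, the criterion is small; a minimal sketch (hypothetical method, not the actual Compute Gateway handler):

```csharp
using System.Net;

// Hypothetical sketch: archival lineage construction is triggered only when
// the physical partition answers a change feed request with 410 Gone,
// i.e. it has split.
public static class ArchivalLineageCriteria
{
    public static bool ShouldBuildArchivalLineage(HttpStatusCode backendStatus)
        => backendStatus == HttpStatusCode.Gone; // HTTP 410
}
```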

Current baseline architecture

Too many unnecessary Backend requests affect latency and overall performance

Currently, we do not support any caching strategy, and every change feed request while in AllVersionsAndDeletes (preview) change feed mode requests additional minimum and maximum log sequence numbers for every partition that exists within a collection's partition archival lineage. If change feed requests are repeatedly sent to the same collection to exhaust change feed items, then a cache of that collection's physical partition archival lineage, local to Compute Gateway, should be fetched and used to determine the physical partition routing strategy for draining documents on a live physical partition. Today, the collection's physical partition archival lineage is constructed and traversed for every change feed request while in AllVersionsAndDeletes (preview) change feed mode. This construction increases latency because it makes additional network hops to the Backend services for change feed information requests that return the minimum and maximum log sequence numbers for a physical partition. For example, if a collection has a physical partition that has split, you now have 2 child physical partitions, and a change feed information request is made 2 times to get minimum and maximum log sequence numbers for each child physical partition. If those child physical partitions split, the number increases, and so on. The more splits that occur, the more network hops to the Backend services to request change feed information, and the higher the latency of change feed requests while in AllVersionsAndDeletes (preview) change feed mode.
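To make that growth concrete, a hedged sketch (hypothetical types; the exact request count depends on the implementation): one change feed information request per partition in the lineage, repeated on every change feed request when nothing is cached.

```csharp
using System.Collections.Generic;

// Hypothetical illustration of the uncached baseline: every change feed
// request pays one change feed information request (min/max LSNs) per
// partition in the archival lineage.
public sealed class LineageNode
{
    public string PartitionKeyRangeId { get; init; } = "";
    public List<LineageNode> Children { get; } = new();
}

public static class LineageCost
{
    public static int CountChangeFeedInfoRequests(LineageNode node)
    {
        int requests = 1; // min/max LSN lookup for this partition
        foreach (LineageNode child in node.Children)
        {
            requests += CountChangeFeedInfoRequests(child);
        }

        return requests;
    }

    // A single split (root plus 2 children) costs 3 info requests per change
    // feed call under this sketch; splitting both children raises it to 7.
}
```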

Proposed solution

```mermaid
graph TD;
    A(get account endpoint)-->B;
    B(get collectionRid)-->C;
    C(create containerResourceId)-->D;
```


Branch

Cache

Key (ContainerResourceId + PartitionKeyRangeIds + Account)

```json
{
  "Value": "pwd(I8Xc*9+=",
  "PartitionKeyRanges": [
    {
      "minInclusive": "00",
      "maxExclusive": "MM",
      "ridPrefix": null,
      "throughputFraction": 0.0,
      "status": "Invalid",
      "lsn": 0,
      "parents": ["0"],
      "id": "1",
      "_rid": null,
      "_self": null,
      "_ts": 0,
      "_etag": null
    },
    {
      "minInclusive": "MM",
      "maxExclusive": "FF",
      "ridPrefix": null,
      "throughputFraction": 0.0,
      "status": "Invalid",
      "lsn": 0,
      "parents": ["0"],
      "id": "2",
      "_rid": null,
      "_self": null,
      "_ts": 0,
      "_etag": null
    }
  ],
  "AccountEndpoint": "http://testaccount.documents.azure.com"
}
```

How to determine the Caching Key

Scenario 1: Container A that has no split partitions.
Scenario 2: Container B that has split partitions.
Scenario 3: Container C that has forest with no split partitions.
Scenario 4: Container D that has forest with split partitions.

Containers A and C will not build archival trees because there are no split partitions.
Containers A and C will not have caching strategies because there are no archival trees.

Because containers B and D have split partitions, B and D will build archival trees.
Because containers B and D have archival trees, B and D will have caching strategies.
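That decision can be read straight off the PartitionKeyRanges; a minimal sketch, assuming the PartitionKeyRange shape from the cache key example above (the parents list):

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: a container needs an archival tree (and therefore a
// caching strategy) only if at least one partition key range records a
// parent, i.e. a split has occurred. Containers A and C return false;
// containers B and D return true.
public sealed class PartitionKeyRange
{
    public string Id { get; init; } = "";
    public List<string> Parents { get; init; } = new();
}

public static class ArchivalTreeCriteria
{
    public static bool HasSplitPartitions(IEnumerable<PartitionKeyRange> ranges)
        => ranges.Any(range => range.Parents.Count > 0);
}
```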

What is the difference between 2 containers that follow type Container B?

The unique identifier of the container.
The unique identifier of the account that the container belongs to.
Refresh if the number of partitions that have been split has changed, or the actual partitionKeyRangeIds
that exist for that container have changed.

NOTE:

The incomingPartitionKeyRangeId does not affect the uniqueness for this case
because they all share the same root parentPartitionKeyRangeId. And since they share the
same parentPartitionKeyRangeId, the archival tree would be the same for every incomingPartitionKeyRangeId.

Proposal:

The caching key should be composed of container identifier and account identifier with a refresh when partitionKeyRangeIds change.

What is the difference between 2 containers that follow type Container D?

The unique identifier of the container.
The unique identifier of the account that the container belongs to.
Refresh if the number of partitions that have been split has changed, or the actual partitionKeyRangeIds
that exist for that container have changed.

NOTE:

If the incomingPartitionKeyRangeIds share the same root parentPartitionKeyRangeId, then incomingPartitionKeyRangeId does not affect the uniqueness.
If the incomingPartitionKeyRangeIds do not share the same root parentPartitionKeyRangeId, then incomingPartitionKeyRangeId does affect the uniqueness.

Proposal

The caching key should be composed of container identifier, account identifier, and incomingPartitionKeyRangeId with a refresh when partitionKeyRangeIds change.

What is the difference between 2 containers where one container follows type Container B and the other follows the structure of Container D?

The unique identifier of the container.
The unique identifier of the account that the container belongs to.
Refresh if the number of partitions that have been split has changed, or the actual partitionKeyRangeIds
that exist for that container have changed.

NOTE:

If the incomingPartitionKeyRangeIds share the same root parentPartitionKeyRangeId, then incomingPartitionKeyRangeId does not affect the uniqueness.
If the incomingPartitionKeyRangeIds do not share the same root parentPartitionKeyRangeId, then incomingPartitionKeyRangeId does affect the uniqueness.

Proposal

The caching key should be composed of container identifier, account identifier, and incomingPartitionKeyRangeId with a refresh when partitionKeyRangeIds change.

Final proposal

The caching key should be composed of container identifier, account identifier, and incomingPartitionKeyRangeId with a refresh when partitionKeyRangeIds change. This handles all container types.
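A minimal sketch of such a key (hypothetical type; the real composition may differ): a C# record gives value-based equality and hashing for free, which is exactly what a dictionary-backed cache lookup needs.

```csharp
// Hypothetical sketch of the final-proposal caching key. Records compare by
// value, so two requests with the same coordinates produce equal keys.
public sealed record ArchivalLineageCacheKey(
    string ContainerResourceId,       // unique container identifier
    string AccountEndpoint,           // unique account identifier
    int IncomingPartitionKeyRangeId); // distinguishes forests with distinct roots

// Usage:
// var a = new ArchivalLineageCacheKey("pwd(I8Xc*9+=", "http://testaccount.documents.azure.com", 1);
// var b = new ArchivalLineageCacheKey("pwd(I8Xc*9+=", "http://testaccount.documents.azure.com", 1);
// a == b is true, so both requests resolve to the same cached lineage.
```

The refresh when partitionKeyRangeIds change is not captured by the key itself in this sketch; the later discussion in this thread folds the range ids into the key instead.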

Concerns

The duplication of archival tree cached items: every incomingPartitionKeyRangeId that shares the same root parentPartitionKeyRangeId will have an identical cached item.

Value

```json
{
    "ContainerResourceId": {
        "Value": "vBVWAK+-HQY="
    },
    "DateCreated": "2022-07-25T11:42:24.5483782Z",
    "DrainRoute": {
        "ParentToRoutePartitionItems": {
            "0": {
                "AdditionalContext": "Partition 0 uses (RouteToPartitionKeyRangeId: 2, UseArchivalPartition: True) and yields a MinLsn of 0 and a MaxLsn of 5000.",
                "CurrentPartitionKeyRangeId": {
                    "Value": 0
                },
                "MaxExclusive": "FF",
                "MaxLsn": 5000,
                "MinInclusive": "00",
                "MinLsn": 0,
                "RouteToPartitionKeyRangeId": {
                    "Value": 2
                },
                "UseArchivalPartition": true
            },
            "1": {
                "AdditionalContext": "Partition 1 uses (RouteToPartitionKeyRangeId: 4, UseArchivalPartition: True) and yields a MinLsn of 5001 and a MaxLsn of 7500.",
                "CurrentPartitionKeyRangeId": {
                    "Value": 1
                },
                "MaxExclusive": "MM",
                "MaxLsn": 7500,
                "MinInclusive": "00",
                "MinLsn": 5001,
                "RouteToPartitionKeyRangeId": {
                    "Value": 4
                },
                "UseArchivalPartition": true
            },
            "2": {
                "AdditionalContext": "Partition 2 uses (RouteToPartitionKeyRangeId: 2, UseArchivalPartition: False) and yields a MinLsn of 5001 and a MaxLsn of 10000.",
                "CurrentPartitionKeyRangeId": {
                    "Value": 2
                },
                "MaxExclusive": "FF",
                "MaxLsn": 10000,
                "MinInclusive": "MM",
                "MinLsn": 5001,
                "RouteToPartitionKeyRangeId": {
                    "Value": 2
                },
                "UseArchivalPartition": false
            },
            "3": {
                "AdditionalContext": "Partition 3 uses (RouteToPartitionKeyRangeId: 3, UseArchivalPartition: False) and yields a MinLsn of 7501 and a MaxLsn of 15000.",
                "CurrentPartitionKeyRangeId": {
                    "Value": 3
                },
                "MaxExclusive": "GG",
                "MaxLsn": 15000,
                "MinInclusive": "00",
                "MinLsn": 7501,
                "RouteToPartitionKeyRangeId": {
                    "Value": 3
                },
                "UseArchivalPartition": false
            },
            "4": {
                "AdditionalContext": "Partition 4 uses (RouteToPartitionKeyRangeId: 4, UseArchivalPartition: False) and yields a MinLsn of 7501 and a MaxLsn of 20000.",
                "CurrentPartitionKeyRangeId": {
                    "Value": 4
                },
                "MaxExclusive": "MM",
                "MaxLsn": 20000,
                "MinInclusive": "GG",
                "MinLsn": 7501,
                "RouteToPartitionKeyRangeId": {
                    "Value": 4
                },
                "UseArchivalPartition": false
            }
        },
        "SplitGraph": {
            "Root": {
                "PartitionKeyRangeId": 0,
                "MinInclusive": "00",
                "MaxExclusive": "FF",
                "Children": [
                    {
                        "PartitionKeyRangeId": 1,
                        "MinInclusive": "00",
                        "MaxExclusive": "MM",
                        "Children": [
                            {
                                "PartitionKeyRangeId": 3,
                                "MinInclusive": "00",
                                "MaxExclusive": "GG",
                                "Children": []
                            },
                            {
                                "PartitionKeyRangeId": 4,
                                "MinInclusive": "GG",
                                "MaxExclusive": "MM",
                                "Children": []
                            }
                        ]
                    },
                    {
                        "PartitionKeyRangeId": 2,
                        "MinInclusive": "MM",
                        "MaxExclusive": "FF",
                        "Children": []
                    }
                ]
            },
            "SplitLineages": [
                {
                    "Item1": [
                        0,
                        1
                    ],
                    "Item2": {
                        "Id": "1",
                        "Parents": [
                            "0"
                        ],
                        "MinInclusive": "00",
                        "MaxExclusive": "MM"
                    }
                },
                {
                    "Item1": [
                        0,
                        2
                    ],
                    "Item2": {
                        "Id": "2",
                        "Parents": [
                            "0"
                        ],
                        "MinInclusive": "MM",
                        "MaxExclusive": "FF"
                    }
                },
                {
                    "Item1": [
                        0,
                        1,
                        3
                    ],
                    "Item2": {
                        "Id": "3",
                        "Parents": [
                            "0",
                            "1"
                        ],
                        "MinInclusive": "00",
                        "MaxExclusive": "GG"
                    }
                },
                {
                    "Item1": [
                        0,
                        1,
                        4
                    ],
                    "Item2": {
                        "Id": "4",
                        "Parents": [
                            "0",
                            "1"
                        ],
                        "MinInclusive": "GG",
                        "MaxExclusive": "MM"
                    }
                }
            ]
        }
    },
    "IncomingPartitionKeyRangeId": {
        "Value": 1
    }
}
```

The collection's partition archival lineage is constructed per collection, keyed by ContainerResourceId. It contains the DateCreated, the DrainRoute, and the IncomingPartitionKeyRangeId. There is a question as to whether the IncomingPartitionKeyRangeId is necessary, so it may go away. If so, I will update this document accordingly, but at the time of writing, it exists.
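For orientation, a hedged sketch of the cached value's shape as pictured above (hypothetical C# type names, not the actual Compute Gateway types):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the cached value shown above. DrainRoute maps a
// partition key range id to its routing entry.
public sealed class ArchivalLineageCacheValue
{
    public string ContainerResourceId { get; init; } = "";
    public DateTime DateCreated { get; init; }
    public Dictionary<string, DrainRouteEntry> ParentToRoutePartitionItems { get; init; } = new();
    public int IncomingPartitionKeyRangeId { get; init; } // may be removed, per the note above
}

public sealed class DrainRouteEntry
{
    public int CurrentPartitionKeyRangeId { get; init; }
    public string MinInclusive { get; init; } = "";
    public string MaxExclusive { get; init; } = "";
    public long MinLsn { get; init; }
    public long MaxLsn { get; init; }
    public int RouteToPartitionKeyRangeId { get; init; }
    public bool UseArchivalPartition { get; init; } // true => drain from the archival (split) parent
}
```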

ContainerResourceId: {"Value":"pwd(I8Xc*9+="}

DateCreated: "2022-07-25T11:42:24.5483782Z"

DrainRoute

```json
"0": {
    "AdditionalContext": "Partition 0 uses (RouteToPartitionKeyRangeId: 2, UseArchivalPartition: True) and yields a MinLsn of 0 and a MaxLsn of 5000.",
    "CurrentPartitionKeyRangeId": { "Value": 0 },
    "MaxExclusive": "FF",
    "MaxLsn": 5000,
    "MinInclusive": "00",
    "MinLsn": 0,
    "RouteToPartitionKeyRangeId": { "Value": 2 },
    "UseArchivalPartition": true
},
"1": {
    "AdditionalContext": "Partition 1 uses (RouteToPartitionKeyRangeId: 4, UseArchivalPartition: True) and yields a MinLsn of 5001 and a MaxLsn of 7500.",
    "CurrentPartitionKeyRangeId": { "Value": 1 },
    "MaxExclusive": "MM",
    "MaxLsn": 7500,
    "MinInclusive": "00",
    "MinLsn": 5001,
    "RouteToPartitionKeyRangeId": { "Value": 4 },
    "UseArchivalPartition": true
},
"2": {
    "AdditionalContext": "Partition 2 uses (RouteToPartitionKeyRangeId: 2, UseArchivalPartition: False) and yields a MinLsn of 5001 and a MaxLsn of 10000.",
    "CurrentPartitionKeyRangeId": { "Value": 2 },
    "MaxExclusive": "FF",
    "MaxLsn": 10000,
    "MinInclusive": "MM",
    "MinLsn": 5001,
    "RouteToPartitionKeyRangeId": { "Value": 2 },
    "UseArchivalPartition": false
},
"3": {
    "AdditionalContext": "Partition 3 uses (RouteToPartitionKeyRangeId: 3, UseArchivalPartition: False) and yields a MinLsn of 7501 and a MaxLsn of 15000.",
    "CurrentPartitionKeyRangeId": { "Value": 3 },
    "MaxExclusive": "GG",
    "MaxLsn": 15000,
    "MinInclusive": "00",
    "MinLsn": 7501,
    "RouteToPartitionKeyRangeId": { "Value": 3 },
    "UseArchivalPartition": false
},
"4": {
    "AdditionalContext": "Partition 4 uses (RouteToPartitionKeyRangeId: 4, UseArchivalPartition: False) and yields a MinLsn of 7501 and a MaxLsn of 20000.",
    "CurrentPartitionKeyRangeId": { "Value": 4 },
    "MaxExclusive": "MM",
    "MaxLsn": 20000,
    "MinInclusive": "GG",
    "MinLsn": 7501,
    "RouteToPartitionKeyRangeId": { "Value": 4 },
    "UseArchivalPartition": false
}
```
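Reading the DrainRoute back out, a hedged sketch (hypothetical helper, reusing the DrainRouteEntry shape sketched earlier; the real routing lives in Compute Gateway): the incoming partition's entry names the partition to drain from, whether to use the archival copy, and the LSN window it yields.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: look up the incoming partition's route entry and use
// its LSN window. In the example above, incoming partition 1 with an LSN of
// 6000 hits entry "1" (5001..7500) and drains from partition 4's archival
// copy. An LSN below MinLsn belongs to an ancestor's window, which requires
// walking the split graph (omitted here).
public static class DrainRouting
{
    public static DrainRouteEntry? FindDrainTarget(
        IReadOnlyDictionary<string, DrainRouteEntry> parentToRoutePartitionItems,
        string incomingPartitionKeyRangeId,
        long incomingLsn)
    {
        if (!parentToRoutePartitionItems.TryGetValue(incomingPartitionKeyRangeId, out DrainRouteEntry? entry))
        {
            return null; // unknown partition key range id
        }

        // Past MaxLsn this window is exhausted; draining continues elsewhere.
        return incomingLsn <= entry.MaxLsn ? entry : null;
    }
}
```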

Performance

Security

Areas of impact

Estimation for deliverables

Supportability

Client telemetry
Distributed tracing

TBD

Diagnostic logging

Testing

Use case/scenarios

Concerns



Why would this flag be set? Because the collection might have been recreated (deleted and created again with the same name), so the partitions might be different.

If this is the case, a request to refresh Partitions might fail with a 404 because the RID that we have does not exist anymore.

This is a tricky situation because there are 2 scenarios:

1. The request has no CollectionRID header: we use the collectionCache and the PKRange cache. If we need to refresh, then refreshing the collectionCache will give us the updated RID, and refreshing the PKRange cache will give us the new partitions.
2. The request has the CollectionRID header: we do not use the collectionCache. Refreshing the PKRange cache might still result in a 404 for this scenario, so what do we do? I believe the SDKs have a retry policy that handles this scenario (https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RenameCollectionAwareClientRetryPolicy.java#L59-L61 and https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/RenameCollectionAwareClientRetryPolicy.cs#L85-L86), and it seems that if we return 404/1002 the client will refresh its own RID cache and retry.
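For context, a simplified sketch of the retry shape those linked policies implement (hypothetical signatures; not the actual RenameCollectionAwareClientRetryPolicy code): a 404 with substatus 1002 (NameCacheIsStale) refreshes the client's cached collection RID and retries.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

// Simplified, hypothetical sketch of the rename-aware retry described above.
public static class RenameAwareRetrySketch
{
    private const int NameCacheIsStale = 1002;

    public static async Task<HttpStatusCode> ExecuteAsync(
        Func<Task<(HttpStatusCode Status, int SubStatus)>> sendRequest,
        Func<Task> refreshCollectionRidCache)
    {
        (HttpStatusCode status, int subStatus) = await sendRequest();
        if (status == HttpStatusCode.NotFound && subStatus == NameCacheIsStale)
        {
            await refreshCollectionRidCache(); // collection was recreated; its RID changed
            (status, _) = await sendRequest(); // single retry with the refreshed RID
        }

        return status;
    }
}
```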
philipthomas-MSFT commented 1 year ago

Gopal and Hemeswari need to review.

philipthomas-MSFT commented 1 year ago

Breaking up PR

kirankumarkolli commented 1 year ago

Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/FFCF/FullFidelityChangeFeedHandler

Life of cache begins at the handler. Introducing AsyncCache. Is it the end-state for caching?

kirankumarkolli commented 1 year ago

From offline sync-up

(image from the offline sync-up not preserved)
ealsur commented 1 year ago

Food for thought

PartitionKeyRange response contains a Parents property, but within it there is no hierarchy.

For example: let's say that PKRange 0 split into 1 and 2, then 2 split into 3 and 4, and 1 split into 5 and 6.

If I send a request to 0 (because the client was old) and I get a 410 because it's gone, and I then obtain the PartitionKeyRanges (TryOverlappingRanges), I would get PartitionKeyRanges 3, 4, 5, and 6, where the Parents would be "0,2" for 3 and 4, and "0,1" for 5 and 6.

So after the 410, I would then attempt to route to, for example, 3 from the client.

If the cache has:

  • "0" Routes to "2" (some min/max LSN)
  • "0" Routes to "1" (some other min/max LSN)
  • "1" Routes to "5" (some min/max LSN)
  • "1" Routes to "6" (some other min/max LSN)
  • "2" Routes to "3" (some min/max LSN)
  • "2" Routes to "4" (some other min/max LSN)
  • "3" Routes to "3" (some min/max LSN)
  • "4" Routes to "4" (some min/max LSN)
  • "5" Routes to "5" (some min/max LSN)
  • "6" Routes to "6" (some min/max LSN)

I just wondered:

  • When I come with Incoming 3 with an LSN from before the split, does the algorithm need to be recursive in the sense that it needs to support going back several levels, because maybe some of the partitions it would go to for the Archival also had a split?
  • If the cache is empty, how would it be constructed after the split (only 3,4,5,6 are live; how does it know that 2 and 1 were before, and that 0 was before that, if the Parents property only contains a list with no hierarchy level)?

If the answer is that we would not support this grandfathering, that's also ok, just knowing that this is a known gap.

philipthomas-MSFT commented 1 year ago

Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/FFCF/FullFidelityChangeFeedHandler

Life of cache begins at the handler. Introducing AsyncCache. Is it the end-state for caching?

cc @kirankumarkolli The handler's static method caller is RawRequestExtensions which is a static type. All requests start there, so the same new AsyncCache instance will be used for all requests. I will have tests to support this.
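A stand-in sketch of that lifetime (AsyncCache is internal to the SDK, so a ConcurrentDictionary stands in here; names are hypothetical): a static type's static field is initialized once per process, so every request that enters through it shares the same cache instance.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical stand-in: a static entry-point type holds a single cache
// instance for the process lifetime, shared by all requests.
public static class RawRequestExtensionsSketch
{
    // Initialized once when the type is first used; never recreated per request.
    private static readonly ConcurrentDictionary<string, Task<string>> LineageCache = new();

    public static Task<string> GetOrBuildLineageAsync(
        string cacheKey,
        Func<string, Task<string>> buildLineage)
        => LineageCache.GetOrAdd(cacheKey, buildLineage);
}
```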

philipthomas-MSFT commented 1 year ago

Food for thought

PartitionKeyRange response contains a Parents property, but within it there is no hierarchy.

For example: let's say that PKRange 0 split into 1 and 2, then 2 split into 3 and 4, and 1 split into 5 and 6.

If I send a request to 0 (because the client was old) and I get a 410 because it's gone, and I then obtain the PartitionKeyRanges (TryOverlappingRanges), I would get PartitionKeyRanges 3, 4, 5, and 6, where the Parents would be "0,2" for 3 and 4, and "0,1" for 5 and 6.

So after the 410, I would then attempt to route to, for example, 3 from the client.

If the cache has:

  • "0" Routes to "2" (some min/max LSN)
  • "0" Routes to "1" (some other min/max LSN)
  • "1" Routes to "5" (some min/max LSN)
  • "1" Routes to "6" (some other min/max LSN)
  • "2" Routes to "3" (some min/max LSN)
  • "2" Routes to "4" (some other min/max LSN)
  • "3" Routes to "3" (some min/max LSN)
  • "4" Routes to "4" (some min/max LSN)
  • "5" Routes to "5" (some min/max LSN)
  • "6" Routes to "6" (some min/max LSN)

I just wondered:

  • When I come with Incoming 3 with an LSN from before the split, does the algorithm need to be recursive in the sense that it needs to support going back several levels, because maybe some of the partitions it would go to for the Archival also had a split?
  • If the cache is empty, how would it be constructed after the split (only 3,4,5,6 are live; how does it know that 2 and 1 were before, and that 0 was before that, if the Parents property only contains a list with no hierarchy level)?

If the answer is that we would not support this grandfathering, that's also ok, just knowing that this is a known gap.

cc @ealsur So I just want to correct something first. This is the correct route for your example; I did a strikethrough for the invalid ones.

"When I come with Incoming 3 with an LSN from before the split, does the algorithm need to be recursive in the sense that, it needs to support going back several levels because maybe some of the partitions it would go to for the Archival also had a split?"

Irrespective of the IncomingPartitionKeyRangeId, the IncomingLSN always tries to find the correct partition's min/max LSN. So even if the request's IncomingPartitionKeyRangeId is 3, the IncomingLSN determines where it actually routes to. If the IncomingLSN belongs to the IncomingPartitionKeyRangeId, that is just a normal passthrough.

"If the cache is empty, how would it be constructed after the split (only 3,4,5,6 are live, how does it know that 2 and 1 were before and that 0 was before that if the Parents property only contains a list with no hierarchy level)."

If you look at the tree example in this document, "UseArchivalPartition": true indicates that the partition was split and is a parent to some child, which has "UseArchivalPartition": false.

gopalrander commented 1 year ago

@philipthomas-MSFT DrainRoute in the cache is invalidated when Compute detects that there is a partition split, correct? This needs to be mentioned in the cache eviction policy.

gopalrander commented 1 year ago

From what I understand, a valid cache at a given moment would be usable by all PK range ids. So why do we need a PK range id in the cache key? Please clarify if I am missing any detail here. Thanks.

philipthomas-MSFT commented 1 year ago

@philipthomas-MSFT DrainRoute in the cache is invalidated when Compute detects that there is a partition split, correct? This needs to be mentioned in the cache eviction policy.

There is no eviction policy. The archival tree cache gets created if it needs to be built and lives in-memory until Compute Gateway discards it, like a reboot or something.

philipthomas-MSFT commented 1 year ago

From what I understand, a valid cache at a given moment would be usable by all PK range ids. So why do we need a PK range id in the cache key? Please clarify if I am missing any detail here. Thanks.

cc @gopalrander For now, every incoming partition key range id and container resource id will have a cached item. I am playing around with the possibility of just caching the drain route information for the incoming partition key range at some point.

gopalrander commented 1 year ago

@philipthomas-MSFT DrainRoute in the cache is invalidated when Compute detects that there is a partition split, correct? This needs to be mentioned in the cache eviction policy.

There is no eviction policy. The archival tree cache gets created if it needs to be built and lives in-memory until Compute Gateway discards it, like a reboot or something.

I meant that the cache might be invalid after a split. In the above cache example, if partition 4 splits, then the whole routing map needs to be updated, right?

philipthomas-MSFT commented 1 year ago

@philipthomas-MSFT DrainRoute in the cache is invalidated when Compute detects that there is a partition split, correct? This needs to be mentioned in the cache eviction policy.

There is no eviction policy. The archival tree cache gets created if it needs to be built and lives in-memory until Compute Gateway discards it, like a reboot or something.

I meant that the cache might be invalid after a split. In the above cache example, if partition 4 splits, then the whole routing map needs to be updated, right?

@gopalrander, I think I see your point, and it is probably why I wanted to use all the PartitionKeyRangeIds for a collection, instead of just the IncomingPartitionKeyRangeId, as a key along with the ContainerResourceId. I brought this up in our walkthrough but could not remember this case.

ContainerResourceId: 5dWn+9OCNxn= IncomingPartitionKeyRangeId: 2

CacheKey: 5dWn+9OCNxn=2 CacheValue: 0 -> 1, 2

If PartitionKeyRangeId 2 splits at some later time,

ContainerResourceId: 5dWn+9OCNxn= IncomingPartitionKeyRangeId: 2

Compute Gateway would unfortunately continue to use the original cache item,

CacheKey: 5dWn+9OCNxn=2 CacheValue: 0 -> 1, 2

But it should create a new cache item instead,

CacheKey: Unknown CacheValue: 0 -> 1, (2 -> 3, 4)

So that means that either the original cache item gets invalidated,

CacheKey: 5dWn+9OCNxn=2 CacheValue: 0 -> 1, (2 -> 3, 4)

or, because we want the cached items to be immutable, we will create a new cached item, which would work if we use all the Collection's PartitionKeyRangeIds along with the ContainerResourceId as the CacheKey.

CacheKey: 5dWn+9OCNxn=1234 CacheValue: 0 -> 1, (2 -> 3, 4)

I am going to make this change over the weekend, but @kirankumarkolli, @ealsur, @jcocchi, you were all in the meeting, and I believe there were some concerns about the size of the CacheKey, but it seems I may have to go back to this idea.
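A small sketch of that revised key composition (hypothetical helper): because the key embeds every PartitionKeyRangeId the collection currently has, a split changes the key itself, the stale entry is simply never hit again, and cached items stay immutable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the revised cache key: after 2 splits into 3 and 4,
// the set of range ids changes, so the composed key changes too and the
// pre-split cached item is naturally bypassed.
public static class RevisedCacheKey
{
    public static string Compose(string containerResourceId, IEnumerable<string> partitionKeyRangeIds)
        => containerResourceId + string.Concat(partitionKeyRangeIds.OrderBy(id => id, StringComparer.Ordinal));
}

// Compose("5dWn+9OCNxn=", new[] { "1", "2", "3", "4" }) => "5dWn+9OCNxn=1234"
```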

philipthomas-MSFT commented 1 year ago

Also, the PR is here.

https://msdata.visualstudio.com/DefaultCollection/CosmosDB/_git/CosmosDB/pullrequest/1088615

I did note in the PR's description that I need to fix the CacheKey this weekend. See the previous comment for why.

cc @gopalrander @kirankumarkolli @ealsur @jcocchi