Azure / azure-cosmos-dotnet-v3

.NET SDK for Azure Cosmos DB for the core SQL API
MIT License

[Internal] AllVersionsAndDeletes on change feed pull model for partition split-handling caching design documentation. Compute Gateway. #3681

Closed philipthomas-MSFT closed 1 year ago

philipthomas-MSFT commented 1 year ago

Purpose statement

This document describes an enhancement to the Cosmos DB experience aimed at achieving even higher performance.

Description: This is a plan to improve latency and overall performance for change feed requests in the Azure Cosmos DB pull model while in AllVersionsAndDeletes (preview) change feed mode, by introducing a caching strategy, local to Compute Gateway, for a collection's physical partition archival lineage. The archival lineage is a routing map that instructs Compute Gateway on how to drain documents when a change feed request is received. It is driven by each physical partition's minimum and maximum log sequence numbers, which are obtained through a change feed information request to the Backend API. This will support all SDK languages, specifically the .NET and Java SDKs. Tenant configuration feature flags and additional diagnostic logging will also need to be implemented. This issue will be split into multiple PRs (caching, feature flag, and logging). The tenant configuration feature flag is a fail-safe in case the caching strategy does not work as expected.
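As a rough illustration of that fail-safe, here is a minimal sketch (all type and flag names are hypothetical, not the actual tenant configuration API): the gateway checks the flag before touching the cache and falls back to the uncached path when the flag is off.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical sketch only: the feature flag gates the lineage cache so the
// uncached code path remains available as a fail-safe.
public sealed class ArchivalLineageProvider
{
    private readonly Func<string, bool> isFeatureEnabled;                      // tenant feature-flag lookup
    private readonly Func<string, Task<string>> buildLineageFromBackend;       // expensive backend construction
    private readonly ConcurrentDictionary<string, Task<string>> cache = new(); // keyed by container resource id

    public ArchivalLineageProvider(
        Func<string, bool> isFeatureEnabled,
        Func<string, Task<string>> buildLineageFromBackend)
    {
        this.isFeatureEnabled = isFeatureEnabled;
        this.buildLineageFromBackend = buildLineageFromBackend;
    }

    public Task<string> GetLineageAsync(string containerResourceId)
    {
        // Fail-safe: when the flag is off, always take the uncached path.
        if (!this.isFeatureEnabled("EnableArchivalLineageCache"))
        {
            return this.buildLineageFromBackend(containerResourceId);
        }

        return this.cache.GetOrAdd(containerResourceId, this.buildLineageFromBackend);
    }
}
```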

Level-setting

Stakeholders

Resources

Out of scope

Scope of work

The Microsoft Azure Cosmos DB .NET SDK Version 3 needs to achieve optimal performance by implementing a local caching strategy in Compute Gateway for all change feed requests while in AllVersionsAndDeletes (preview) change feed mode. Introducing a caching strategy, along with additional trace logging and a feature flag, for the collection's physical partition archival lineage will improve performance by accessing a cache that is local to Compute Gateway.

Criteria for caching

Caching applies when a collection's physical partition has split. Whether to construct the collection's physical partition archival lineage is determined solely by whether that physical partition has returned an HTTP status code 410 (Gone) when change feed items were requested.
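Expressed as code, the criterion is small; a minimal sketch (hypothetical method, not the actual Compute Gateway handler):

```csharp
using System.Net;

// Hypothetical sketch: archival lineage construction is triggered only when
// the physical partition answers a change feed request with 410 Gone,
// i.e. it has split.
public static class ArchivalLineageCriteria
{
    public static bool ShouldBuildArchivalLineage(HttpStatusCode backendStatus)
        => backendStatus == HttpStatusCode.Gone; // HTTP 410
}
```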

Current baseline architecture

Too many unnecessary Backend requests affect latency and overall performance

Currently, we do not support any caching strategy, and every change feed request while in AllVersionsAndDeletes (preview) change feed mode requests additional minimum and maximum log sequence numbers for every partition that exists within a collection's partition archival lineage. If change feed requests are repeatedly sent to the same collection to exhaust change feed items, then a cache of that collection's physical partition archival lineage, local to Compute Gateway, should be fetched and used to determine the physical partition routing strategy for draining documents on a live physical partition. Today, the collection's physical partition archival lineage is constructed and traversed for every change feed request while in AllVersionsAndDeletes (preview) change feed mode. This construction increases latency because it makes additional network hops to the Backend services for change feed information requests that return the minimum and maximum log sequence numbers for a physical partition. For example, if a collection has a physical partition that has split, you now have 2 child physical partitions, and a change feed information request is made 2 times to get minimum and maximum log sequence numbers for each child physical partition. If those child physical partitions split, the number increases, and so on. The more splits that occur, the more network hops to the Backend services to request change feed information, and the higher the latency of change feed requests while in AllVersionsAndDeletes (preview) change feed mode.
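To make that growth concrete, a hedged sketch (hypothetical types; the exact request count depends on the implementation): one change feed information request per partition in the lineage, repeated on every change feed request when nothing is cached.

```csharp
using System.Collections.Generic;

// Hypothetical illustration of the uncached baseline: every change feed
// request pays one change feed information request (min/max LSNs) per
// partition in the archival lineage.
public sealed class LineageNode
{
    public string PartitionKeyRangeId { get; init; } = "";
    public List<LineageNode> Children { get; } = new();
}

public static class LineageCost
{
    public static int CountChangeFeedInfoRequests(LineageNode node)
    {
        int requests = 1; // min/max LSN lookup for this partition
        foreach (LineageNode child in node.Children)
        {
            requests += CountChangeFeedInfoRequests(child);
        }

        return requests;
    }

    // A single split (root plus 2 children) costs 3 info requests per change
    // feed call under this sketch; splitting both children raises it to 7.
}
```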

Proposed solution

```mermaid
graph TD;
    A(get account endpoint)-->B;
    B(get collectionRid)-->C;
    C(create containerResourceId)-->D;
```


Branch

Cache

Key (ContainerResourceId + PartitionKeyRangeIds + Account)

```json
{
  "Value": "pwd(I8Xc*9+=",
  "PartitionKeyRanges": [
    {
      "minInclusive": "00",
      "maxExclusive": "MM",
      "ridPrefix": null,
      "throughputFraction": 0.0,
      "status": "Invalid",
      "lsn": 0,
      "parents": ["0"],
      "id": "1",
      "_rid": null,
      "_self": null,
      "_ts": 0,
      "_etag": null
    },
    {
      "minInclusive": "MM",
      "maxExclusive": "FF",
      "ridPrefix": null,
      "throughputFraction": 0.0,
      "status": "Invalid",
      "lsn": 0,
      "parents": ["0"],
      "id": "2",
      "_rid": null,
      "_self": null,
      "_ts": 0,
      "_etag": null
    }
  ],
  "AccountEndpoint": "http://testaccount.documents.azure.com"
}
```

How to determine the Caching Key

Scenario 1: Container A that has no split partitions.
Scenario 2: Container B that has split partitions.
Scenario 3: Container C that has forest with no split partitions.
Scenario 4: Container D that has forest with split partitions.

Containers A and C will not build archival trees because there are no split partitions.
Containers A and C will not have caching strategies because there are no archival trees.

Because containers B and D have split partitions, B and D will build archival trees.
Because containers B and D have archival trees, B and D will have caching strategies.
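That decision can be read straight off the PartitionKeyRanges; a minimal sketch, assuming the PartitionKeyRange shape from the cache key example above (the parents list):

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: a container needs an archival tree (and therefore a
// caching strategy) only if at least one partition key range records a
// parent, i.e. a split has occurred. Containers A and C return false;
// containers B and D return true.
public sealed class PartitionKeyRange
{
    public string Id { get; init; } = "";
    public List<string> Parents { get; init; } = new();
}

public static class ArchivalTreeCriteria
{
    public static bool HasSplitPartitions(IEnumerable<PartitionKeyRange> ranges)
        => ranges.Any(range => range.Parents.Count > 0);
}
```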

What is the difference between 2 containers that follow type Container B?

The unique identifier of the container.
The unique identifier of the account that the container belongs to.
Refresh if the number of partitions that have been split has changed, or the actual partitionKeyRangeIds
that exist for that container have changed.

NOTE:

The incomingPartitionKeyRangeId does not affect the uniqueness for this case
because they all share the same root parentPartitionKeyRangeId. And since they share the
same parentPartitionKeyRangeId, the archival tree would be the same for every incomingPartitionKeyRangeId.

Proposal:

The caching key should be composed of container identifier and account identifier with a refresh when partitionKeyRangeIds change.

What is the difference between 2 containers that follow type Container D?

The unique identifier of the container.
The unique identifier of the account that the container belongs to.
Refresh if the number of partitions that have been split has changed, or the actual partitionKeyRangeIds
that exist for that container have changed.

NOTE:

If the incomingPartitionKeyRangeIds share the same root parentPartitionKeyRangeId, then incomingPartitionKeyRangeId does not affect the uniqueness.
If the incomingPartitionKeyRangeIds do not share the same root parentPartitionKeyRangeId, then incomingPartitionKeyRangeId does affect the uniqueness.

Proposal

The caching key should be composed of container identifier, account identifier, and incomingPartitionKeyRangeId with a refresh when partitionKeyRangeIds change.

What is the difference between 2 containers where one container follows type Container B and the other follows the structure of Container D?

The unique identifier of the container.
The unique identifier of the account that the container belongs to.
Refresh if the number of partitions that have been split has changed, or the actual partitionKeyRangeIds
that exist for that container have changed.

NOTE:

If the incomingPartitionKeyRangeIds share the same root parentPartitionKeyRangeId, then incomingPartitionKeyRangeId does not affect the uniqueness.
If the incomingPartitionKeyRangeIds do not share the same root parentPartitionKeyRangeId, then incomingPartitionKeyRangeId does affect the uniqueness.

Proposal

The caching key should be composed of container identifier, account identifier, and incomingPartitionKeyRangeId with a refresh when partitionKeyRangeIds change.

Final proposal

The caching key should be composed of container identifier, account identifier, and incomingPartitionKeyRangeId with a refresh when partitionKeyRangeIds change. This handles all container types.
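A minimal sketch of such a key (hypothetical type; the real composition may differ): a C# record gives value-based equality and hashing for free, which is exactly what a dictionary-backed cache lookup needs.

```csharp
// Hypothetical sketch of the final-proposal caching key. Records compare by
// value, so two requests with the same coordinates produce equal keys.
public sealed record ArchivalLineageCacheKey(
    string ContainerResourceId,       // unique container identifier
    string AccountEndpoint,           // unique account identifier
    int IncomingPartitionKeyRangeId); // distinguishes forests with distinct roots

// Usage:
// var a = new ArchivalLineageCacheKey("pwd(I8Xc*9+=", "http://testaccount.documents.azure.com", 1);
// var b = new ArchivalLineageCacheKey("pwd(I8Xc*9+=", "http://testaccount.documents.azure.com", 1);
// a == b is true, so both requests resolve to the same cached lineage.
```

The refresh when partitionKeyRangeIds change is not captured by the key itself in this sketch; the later discussion in this thread folds the range ids into the key instead.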

Concerns

The duplication of archival tree cached items: every incomingPartitionKeyRangeId that shares the same root parentPartitionKeyRangeId will have an identical cached item.

Value

```json
{
    "ContainerResourceId": {
        "Value": "vBVWAK+-HQY="
    },
    "DateCreated": "2022-07-25T11:42:24.5483782Z",
    "DrainRoute": {
        "ParentToRoutePartitionItems": {
            "0": {
                "AdditionalContext": "Partition 0 uses (RouteToPartitionKeyRangeId: 2, UseArchivalPartition: True) and yields a MinLsn of 0 and a MaxLsn of 5000.",
                "CurrentPartitionKeyRangeId": {
                    "Value": 0
                },
                "MaxExclusive": "FF",
                "MaxLsn": 5000,
                "MinInclusive": "00",
                "MinLsn": 0,
                "RouteToPartitionKeyRangeId": {
                    "Value": 2
                },
                "UseArchivalPartition": true
            },
            "1": {
                "AdditionalContext": "Partition 1 uses (RouteToPartitionKeyRangeId: 4, UseArchivalPartition: True) and yields a MinLsn of 5001 and a MaxLsn of 7500.",
                "CurrentPartitionKeyRangeId": {
                    "Value": 1
                },
                "MaxExclusive": "MM",
                "MaxLsn": 7500,
                "MinInclusive": "00",
                "MinLsn": 5001,
                "RouteToPartitionKeyRangeId": {
                    "Value": 4
                },
                "UseArchivalPartition": true
            },
            "2": {
                "AdditionalContext": "Partition 2 uses (RouteToPartitionKeyRangeId: 2, UseArchivalPartition: False) and yields a MinLsn of 5001 and a MaxLsn of 10000.",
                "CurrentPartitionKeyRangeId": {
                    "Value": 2
                },
                "MaxExclusive": "FF",
                "MaxLsn": 10000,
                "MinInclusive": "MM",
                "MinLsn": 5001,
                "RouteToPartitionKeyRangeId": {
                    "Value": 2
                },
                "UseArchivalPartition": false
            },
            "3": {
                "AdditionalContext": "Partition 3 uses (RouteToPartitionKeyRangeId: 3, UseArchivalPartition: False) and yields a MinLsn of 7501 and a MaxLsn of 15000.",
                "CurrentPartitionKeyRangeId": {
                    "Value": 3
                },
                "MaxExclusive": "GG",
                "MaxLsn": 15000,
                "MinInclusive": "00",
                "MinLsn": 7501,
                "RouteToPartitionKeyRangeId": {
                    "Value": 3
                },
                "UseArchivalPartition": false
            },
            "4": {
                "AdditionalContext": "Partition 4 uses (RouteToPartitionKeyRangeId: 4, UseArchivalPartition: False) and yields a MinLsn of 7501 and a MaxLsn of 20000.",
                "CurrentPartitionKeyRangeId": {
                    "Value": 4
                },
                "MaxExclusive": "MM",
                "MaxLsn": 20000,
                "MinInclusive": "GG",
                "MinLsn": 7501,
                "RouteToPartitionKeyRangeId": {
                    "Value": 4
                },
                "UseArchivalPartition": false
            }
        },
        "SplitGraph": {
            "Root": {
                "PartitionKeyRangeId": 0,
                "MinInclusive": "00",
                "MaxExclusive": "FF",
                "Children": [
                    {
                        "PartitionKeyRangeId": 1,
                        "MinInclusive": "00",
                        "MaxExclusive": "MM",
                        "Children": [
                            {
                                "PartitionKeyRangeId": 3,
                                "MinInclusive": "00",
                                "MaxExclusive": "GG",
                                "Children": []
                            },
                            {
                                "PartitionKeyRangeId": 4,
                                "MinInclusive": "GG",
                                "MaxExclusive": "MM",
                                "Children": []
                            }
                        ]
                    },
                    {
                        "PartitionKeyRangeId": 2,
                        "MinInclusive": "MM",
                        "MaxExclusive": "FF",
                        "Children": []
                    }
                ]
            },
            "SplitLineages": [
                {
                    "Item1": [
                        0,
                        1
                    ],
                    "Item2": {
                        "Id": "1",
                        "Parents": [
                            "0"
                        ],
                        "MinInclusive": "00",
                        "MaxExclusive": "MM"
                    }
                },
                {
                    "Item1": [
                        0,
                        2
                    ],
                    "Item2": {
                        "Id": "2",
                        "Parents": [
                            "0"
                        ],
                        "MinInclusive": "MM",
                        "MaxExclusive": "FF"
                    }
                },
                {
                    "Item1": [
                        0,
                        1,
                        3
                    ],
                    "Item2": {
                        "Id": "3",
                        "Parents": [
                            "0",
                            "1"
                        ],
                        "MinInclusive": "00",
                        "MaxExclusive": "GG"
                    }
                },
                {
                    "Item1": [
                        0,
                        1,
                        4
                    ],
                    "Item2": {
                        "Id": "4",
                        "Parents": [
                            "0",
                            "1"
                        ],
                        "MinInclusive": "GG",
                        "MaxExclusive": "MM"
                    }
                }
            ]
        }
    },
    "IncomingPartitionKeyRangeId": {
        "Value": 1
    }
}
```

The collection's partition archival lineage is constructed per collection, keyed by ContainerResourceId. It contains the DateCreated, the DrainRoute, and the IncomingPartitionKeyRangeId. There is a question as to whether the IncomingPartitionKeyRangeId is necessary, so it may go away. If so, I will update this document accordingly, but at the time of writing, it exists.
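For orientation, a hedged sketch of the cached value's shape as pictured above (hypothetical C# type names, not the actual Compute Gateway types):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the cached value shown above. DrainRoute maps a
// partition key range id to its routing entry.
public sealed class ArchivalLineageCacheValue
{
    public string ContainerResourceId { get; init; } = "";
    public DateTime DateCreated { get; init; }
    public Dictionary<string, DrainRouteEntry> ParentToRoutePartitionItems { get; init; } = new();
    public int IncomingPartitionKeyRangeId { get; init; } // may be removed, per the note above
}

public sealed class DrainRouteEntry
{
    public int CurrentPartitionKeyRangeId { get; init; }
    public string MinInclusive { get; init; } = "";
    public string MaxExclusive { get; init; } = "";
    public long MinLsn { get; init; }
    public long MaxLsn { get; init; }
    public int RouteToPartitionKeyRangeId { get; init; }
    public bool UseArchivalPartition { get; init; } // true => drain from the archival (split) parent
}
```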

ContainerResourceId: {"Value":"pwd(I8Xc*9+="}

DateCreated: "2022-07-25T11:42:24.5483782Z"

DrainRoute

```json
"0": {
    "AdditionalContext": "Partition 0 uses (RouteToPartitionKeyRangeId: 2, UseArchivalPartition: True) and yields a MinLsn of 0 and a MaxLsn of 5000.",
    "CurrentPartitionKeyRangeId": { "Value": 0 },
    "MaxExclusive": "FF",
    "MaxLsn": 5000,
    "MinInclusive": "00",
    "MinLsn": 0,
    "RouteToPartitionKeyRangeId": { "Value": 2 },
    "UseArchivalPartition": true
},
"1": {
    "AdditionalContext": "Partition 1 uses (RouteToPartitionKeyRangeId: 4, UseArchivalPartition: True) and yields a MinLsn of 5001 and a MaxLsn of 7500.",
    "CurrentPartitionKeyRangeId": { "Value": 1 },
    "MaxExclusive": "MM",
    "MaxLsn": 7500,
    "MinInclusive": "00",
    "MinLsn": 5001,
    "RouteToPartitionKeyRangeId": { "Value": 4 },
    "UseArchivalPartition": true
},
"2": {
    "AdditionalContext": "Partition 2 uses (RouteToPartitionKeyRangeId: 2, UseArchivalPartition: False) and yields a MinLsn of 5001 and a MaxLsn of 10000.",
    "CurrentPartitionKeyRangeId": { "Value": 2 },
    "MaxExclusive": "FF",
    "MaxLsn": 10000,
    "MinInclusive": "MM",
    "MinLsn": 5001,
    "RouteToPartitionKeyRangeId": { "Value": 2 },
    "UseArchivalPartition": false
},
"3": {
    "AdditionalContext": "Partition 3 uses (RouteToPartitionKeyRangeId: 3, UseArchivalPartition: False) and yields a MinLsn of 7501 and a MaxLsn of 15000.",
    "CurrentPartitionKeyRangeId": { "Value": 3 },
    "MaxExclusive": "GG",
    "MaxLsn": 15000,
    "MinInclusive": "00",
    "MinLsn": 7501,
    "RouteToPartitionKeyRangeId": { "Value": 3 },
    "UseArchivalPartition": false
},
"4": {
    "AdditionalContext": "Partition 4 uses (RouteToPartitionKeyRangeId: 4, UseArchivalPartition: False) and yields a MinLsn of 7501 and a MaxLsn of 20000.",
    "CurrentPartitionKeyRangeId": { "Value": 4 },
    "MaxExclusive": "MM",
    "MaxLsn": 20000,
    "MinInclusive": "GG",
    "MinLsn": 7501,
    "RouteToPartitionKeyRangeId": { "Value": 4 },
    "UseArchivalPartition": false
}
```
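Reading the DrainRoute back out, a hedged sketch (hypothetical helper, reusing the DrainRouteEntry shape sketched earlier; the real routing lives in Compute Gateway): the incoming partition's entry names the partition to drain from, whether to use the archival copy, and the LSN window it yields.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: look up the incoming partition's route entry and use
// its LSN window. In the example above, incoming partition 1 with an LSN of
// 6000 hits entry "1" (5001..7500) and drains from partition 4's archival
// copy. An LSN below MinLsn belongs to an ancestor's window, which requires
// walking the split graph (omitted here).
public static class DrainRouting
{
    public static DrainRouteEntry? FindDrainTarget(
        IReadOnlyDictionary<string, DrainRouteEntry> parentToRoutePartitionItems,
        string incomingPartitionKeyRangeId,
        long incomingLsn)
    {
        if (!parentToRoutePartitionItems.TryGetValue(incomingPartitionKeyRangeId, out DrainRouteEntry? entry))
        {
            return null; // unknown partition key range id
        }

        // Past MaxLsn this window is exhausted; draining continues elsewhere.
        return incomingLsn <= entry.MaxLsn ? entry : null;
    }
}
```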

Performance

Security

Areas of impact

Estimation for deliverables

Supportability

Client telemetry
Distributed tracing

TBD

Diagnostic logging

Testing

Use case/scenarios

Concerns



Why would this flag be set? Because the collection might have been recreated (deleted and created again with the same name), so the partitions might be different.

If this is the case, a request to refresh Partitions might fail with a 404 because the RID that we have does not exist anymore.

This is a tricky situation because there are 2 scenarios:

1. The request has no CollectionRID header: we use the collectionCache and the PKRange cache. If we need to refresh, then refreshing the collectionCache will give us the updated RID, and refreshing the PKRange cache will give us the new partitions.
2. The request has the CollectionRID header: we do not use the collectionCache. Refreshing the PKRange cache might still result in a 404 for this scenario, so what do we do? I believe the SDKs have a retry policy that handles this scenario (https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/RenameCollectionAwareClientRetryPolicy.java#L59-L61 and https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/RenameCollectionAwareClientRetryPolicy.cs#L85-L86), and it seems that if we return 404/1002 the client will refresh its own RID cache and retry.
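For context, a simplified sketch of the retry shape those linked policies implement (hypothetical signatures; not the actual RenameCollectionAwareClientRetryPolicy code): a 404 with substatus 1002 (NameCacheIsStale) refreshes the client's cached collection RID and retries.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

// Simplified, hypothetical sketch of the rename-aware retry described above.
public static class RenameAwareRetrySketch
{
    private const int NameCacheIsStale = 1002;

    public static async Task<HttpStatusCode> ExecuteAsync(
        Func<Task<(HttpStatusCode Status, int SubStatus)>> sendRequest,
        Func<Task> refreshCollectionRidCache)
    {
        (HttpStatusCode status, int subStatus) = await sendRequest();
        if (status == HttpStatusCode.NotFound && subStatus == NameCacheIsStale)
        {
            await refreshCollectionRidCache(); // collection was recreated; its RID changed
            (status, _) = await sendRequest(); // single retry with the refreshed RID
        }

        return status;
    }
}
```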
philipthomas-MSFT commented 1 year ago

Gopal and Hemeswari need to review.

philipthomas-MSFT commented 1 year ago

Breaking up PR

kirankumarkolli commented 1 year ago

Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/FFCF/FullFidelityChangeFeedHandler

Life of cache begins at the handler. Introducing AsyncCache. Is it the end-state for caching?

kirankumarkolli commented 1 year ago

From offline sync-up

(image from the offline sync-up not preserved)
ealsur commented 1 year ago

Food for thought

PartitionKeyRange response contains a Parents property, but within it there is no hierarchy.

For example: let's say that PKRange 0 split into 1 and 2, then 2 split into 3 and 4, and 1 split into 5 and 6.

If I send a request to 0 (because the client was old) and I get a 410 because it's gone, and I then obtain the PartitionKeyRanges (TryOverlappingRanges), I would get PartitionKeyRanges 3, 4, 5, and 6, where the Parents would be "0,2" for 3 and 4, and "0,1" for 5 and 6.

So after the 410, I would then attempt to route to, for example, 3 from the client.

If the cache has:

  • "0" Routes to "2" (some min/max LSN)
  • "0" Routes to "1" (some other min/max LSN)
  • "1" Routes to "5" (some min/max LSN)
  • "1" Routes to "6" (some other min/max LSN)
  • "2" Routes to "3" (some min/max LSN)
  • "2" Routes to "4" (some other min/max LSN)
  • "3" Routes to "3" (some min/max LSN)
  • "4" Routes to "4" (some min/max LSN)
  • "5" Routes to "5" (some min/max LSN)
  • "6" Routes to "6" (some min/max LSN)

I just wondered:

  • When I come with Incoming 3 with an LSN from before the split, does the algorithm need to be recursive in the sense that it needs to support going back several levels, because maybe some of the partitions it would go to for the Archival also had a split?
  • If the cache is empty, how would it be constructed after the split (only 3,4,5,6 are live; how does it know that 2 and 1 were before, and that 0 was before that, if the Parents property only contains a list with no hierarchy level)?

If the answer is that we would not support this grandfathering, that's also ok, just knowing that this is a known gap.

philipthomas-MSFT commented 1 year ago

Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/FFCF/FullFidelityChangeFeedHandler

Life of cache begins at the handler. Introducing AsyncCache. Is it the end-state for caching?

cc @kirankumarkolli The handler's static method caller is RawRequestExtensions which is a static type. All requests start there, so the same new AsyncCache instance will be used for all requests. I will have tests to support this.
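A stand-in sketch of that lifetime (AsyncCache is internal to the SDK, so a ConcurrentDictionary stands in here; names are hypothetical): a static type's static field is initialized once per process, so every request that enters through it shares the same cache instance.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical stand-in: a static entry-point type holds a single cache
// instance for the process lifetime, shared by all requests.
public static class RawRequestExtensionsSketch
{
    // Initialized once when the type is first used; never recreated per request.
    private static readonly ConcurrentDictionary<string, Task<string>> LineageCache = new();

    public static Task<string> GetOrBuildLineageAsync(
        string cacheKey,
        Func<string, Task<string>> buildLineage)
        => LineageCache.GetOrAdd(cacheKey, buildLineage);
}
```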

philipthomas-MSFT commented 1 year ago

Food for thought

PartitionKeyRange response contains a Parents property, but within it there is no hierarchy.

For example: let's say that PKRange 0 split into 1 and 2, then 2 split into 3 and 4, and 1 split into 5 and 6.

If I send a request to 0 (because the client was old) and I get a 410 because it's gone, and I then obtain the PartitionKeyRanges (TryOverlappingRanges), I would get PartitionKeyRanges 3, 4, 5, and 6, where the Parents would be "0,2" for 3 and 4, and "0,1" for 5 and 6.

So after the 410, I would then attempt to route to, for example, 3 from the client.

If the cache has:

  • "0" Routes to "2" (some min/max LSN)
  • "0" Routes to "1" (some other min/max LSN)
  • "1" Routes to "5" (some min/max LSN)
  • "1" Routes to "6" (some other min/max LSN)
  • "2" Routes to "3" (some min/max LSN)
  • "2" Routes to "4" (some other min/max LSN)
  • "3" Routes to "3" (some min/max LSN)
  • "4" Routes to "4" (some min/max LSN)
  • "5" Routes to "5" (some min/max LSN)
  • "6" Routes to "6" (some min/max LSN)

I just wondered:

  • When I come with Incoming 3 with an LSN from before the split, does the algorithm need to be recursive in the sense that it needs to support going back several levels, because maybe some of the partitions it would go to for the Archival also had a split?
  • If the cache is empty, how would it be constructed after the split (only 3,4,5,6 are live; how does it know that 2 and 1 were before, and that 0 was before that, if the Parents property only contains a list with no hierarchy level)?

If the answer is that we would not support this grandfathering, that's also ok, just knowing that this is a known gap.

cc @ealsur So I just want to correct something first. This is the correct route for your example; I did a strikethrough for the invalid ones.

"When I come with Incoming 3 with an LSN from before the split, does the algorithm need to be recursive in the sense that, it needs to support going back several levels because maybe some of the partitions it would go to for the Archival also had a split?"

Irrespective of the IncomingPartitionKeyRangeId, the IncomingLSN always tries to find the correct partition's min/max LSN. So even if the request's IncomingPartitionKeyRangeId is 3, the IncomingLSN determines where it actually routes to. If the IncomingLSN belongs to the IncomingPartitionKeyRangeId, that is just a normal passthrough.

"If the cache is empty, how would it be constructed after the split (only 3,4,5,6 are live, how does it know that 2 and 1 were before and that 0 was before that if the Parents property only contains a list with no hierarchy level)."

If you look at the tree example in this document, "UseArchivalPartition": true indicates that the partition was split and is a parent to some child, which has "UseArchivalPartition": false.

gopalrander commented 1 year ago

@philipthomas-MSFT DrainRoute in the cache is invalidated when Compute detects that there is a partition split, correct? This needs to be mentioned in the cache eviction policy.

gopalrander commented 1 year ago

From what I understand, a valid cache at a given moment would be usable by all PK range ids. So why do we need a PK range id in the cache key? Please clarify if I am missing any detail here. Thanks.

philipthomas-MSFT commented 1 year ago

@philipthomas-MSFT DrainRoute in the cache is invalidated when Compute detects that there is a partition split, correct? This needs to be mentioned in the cache eviction policy.

There is no eviction policy. The archival tree cache gets created if it needs to be built and lives in-memory until Compute Gateway discards it, like a reboot or something.

philipthomas-MSFT commented 1 year ago

From what I understand, a valid cache at a given moment would be usable by all PK range ids. So why do we need a PK range id in the cache key? Please clarify if I am missing any detail here. Thanks.

cc @gopalrander For now, every incoming partition key range id and container resource id will have a cached item. I am playing around with the possibility of just caching the drain route information for the incoming partition key range at some point.

gopalrander commented 1 year ago

@philipthomas-MSFT DrainRoute in the cache is invalidated when Compute detects that there is a partition split, correct? This needs to be mentioned in the cache eviction policy.

There is no eviction policy. The archival tree cache gets created if it needs to be built and lives in-memory until Compute Gateway discards it, like a reboot or something.

I meant that the cache might be invalid after a split. In the above cache example, if partition 4 splits, then the whole routing map needs to be updated, right?

philipthomas-MSFT commented 1 year ago

@philipthomas-MSFT DrainRoute in the cache is invalidated when Compute detects that there is a partition split, correct? This needs to be mentioned in the cache eviction policy.

There is no eviction policy. The archival tree cache gets created if it needs to be built and lives in-memory until Compute Gateway discards it, like a reboot or something.

I meant that the cache might be invalid after a split. In the above cache example, if partition 4 splits, then the whole routing map needs to be updated, right?

@gopalrander, I think I see your point, and it is probably why I wanted to use all the PartitionKeyRangeIds for a collection, instead of just the IncomingPartitionKeyRangeId, as a key along with the ContainerResourceId. I brought this up in our walkthrough but could not remember this case.

ContainerResourceId: 5dWn+9OCNxn= IncomingPartitionKeyRangeId: 2

CacheKey: 5dWn+9OCNxn=2 CacheValue: 0 -> 1, 2

If PartitionKeyRangeId 2 splits at some later time,

ContainerResourceId: 5dWn+9OCNxn= IncomingPartitionKeyRangeId: 2

Compute Gateway would unfortunately continue to use the original cache item,

CacheKey: 5dWn+9OCNxn=2 CacheValue: 0 -> 1, 2

But it should create a new cache item instead,

CacheKey: Unknown CacheValue: 0 -> 1, (2 -> 3, 4)

So that means that either the original cache item gets invalidated,

CacheKey: 5dWn+9OCNxn=2 CacheValue: 0 -> 1, (2 -> 3, 4)

or, because we want the cached items to be immutable, we will create a new cached item, which would work if we use all the Collection's PartitionKeyRangeIds along with the ContainerResourceId as the CacheKey.

CacheKey: 5dWn+9OCNxn=1234 CacheValue: 0 -> 1, (2 -> 3, 4)

I am going to make this change over the weekend, but @kirankumarkolli, @ealsur, @jcocchi, you were all in the meeting, and I believe there were some concerns about the size of the CacheKey, but it seems I may have to go back to this idea.
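A small sketch of that revised key composition (hypothetical helper): because the key embeds every PartitionKeyRangeId the collection currently has, a split changes the key itself, the stale entry is simply never hit again, and cached items stay immutable.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the revised cache key: after 2 splits into 3 and 4,
// the set of range ids changes, so the composed key changes too and the
// pre-split cached item is naturally bypassed.
public static class RevisedCacheKey
{
    public static string Compose(string containerResourceId, IEnumerable<string> partitionKeyRangeIds)
        => containerResourceId + string.Concat(partitionKeyRangeIds.OrderBy(id => id, StringComparer.Ordinal));
}

// Compose("5dWn+9OCNxn=", new[] { "1", "2", "3", "4" }) => "5dWn+9OCNxn=1234"
```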

philipthomas-MSFT commented 1 year ago

Also, the PR is here.

https://msdata.visualstudio.com/DefaultCollection/CosmosDB/_git/CosmosDB/pullrequest/1088615

I did note in the PR's description that I need to fix the CacheKey this weekend. See the previous comment for why.

cc @gopalrander @kirankumarkolli @ealsur @jcocchi