Gopal and Hemeswari need to review.
Breaking up PR
Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/FFCF/ FullFidelityChangeFeedHandler
Life of cache begins at the handler. Introducing AsyncCache. Is it the end-state for caching?
From offline sync-up
> Product/SDK/.net/Microsoft.Azure.Cosmos.Friends/FFCF/ FullFidelityChangeFeedHandler
> Life of cache begins at the handler. Introducing AsyncCache. Is it the end-state for caching?
cc @kirankumarkolli The handler's static method caller is RawRequestExtensions, which is a static type. All requests start there, so the same new AsyncCache is used for all of them.
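For reference, a minimal sketch of that shared-cache shape, assuming an AsyncCache-like GetAsync(key, factory) surface; the type and member names below are illustrative, not the actual handler code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative stand-in for an AsyncCache<TKey, TValue>: one value-building task
// per key, shared by every caller that asks for the same key.
public sealed class AsyncCacheSketch<TKey, TValue>
{
    private readonly ConcurrentDictionary<TKey, Lazy<Task<TValue>>> entries = new();

    public Task<TValue> GetAsync(TKey key, Func<Task<TValue>> valueFactory)
    {
        // GetOrAdd + Lazy means the factory task is started at most once per key,
        // even when many concurrent requests ask for the same key.
        return this.entries.GetOrAdd(key, _ => new Lazy<Task<TValue>>(valueFactory)).Value;
    }
}

// Because RawRequestExtensions is static and every request flows through it,
// a single static cache instance ends up shared by all requests.
public static class FullFidelityChangeFeedHandlerSketch
{
    private static readonly AsyncCacheSketch<string, string> ArchivalLineageCache = new();

    public static Task<string> GetLineageAsync(string cacheKey, Func<Task<string>> buildLineage)
        => ArchivalLineageCache.GetAsync(cacheKey, buildLineage);
}
```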
Food for thought
The PartitionKeyRange response contains a Parents property, but within it there is no hierarchy.
For example: let's say that PKRange 0 split into 1 and 2, then 2 split into 3 and 4, and 1 split into 5 and 6.
If I send a request to 0 (because the client was old) and I get a 410 because it's gone, and I then obtain the PartitionKeyRanges (TryOverlappingRanges), I would get PartitionKeyRanges 3, 4, 5, 6, where the Parents would be "0,2" for 3 and 4, and "0,1" for 5 and 6.
So after the 410, I would then attempt to route to, for example, 3 from the client.
If the cache has:
- "0" Routes to "2" (some min/max LSN)
- "0" Routes to "1" (some other min/max LSN)
- "1" Routes to "5" (some min/max LSN)
- "1" Routes to "6" (some other min/max LSN)
- "2" Routes to "3" (some min/max LSN)
- "2" Routes to "4" (some other min/max LSN)
- "3" Routes to "3" (some min/max LSN)
- "4" Routes to "4" (some min/max LSN)
- "5" Routes to "5" (some min/max LSN)
- "6" Routes to "6" (some min/max LSN)
I just wondered:
- When I come in with incoming PartitionKeyRangeId 3 and an LSN from before the split, does the algorithm need to be recursive, in the sense that it needs to support going back several levels, because maybe some of the partitions it would go to for the archival data also had a split?
- If the cache is empty, how would it be constructed after the split? Only 3, 4, 5, 6 are live, so how does it know that 2 and 1 came before, and that 0 came before that, if the Parents property only contains a list with no hierarchy level?
If the answer is that we would not support this grandfathering, that's also OK; I just want to know that this is a known gap.
cc @ealsur So I just want to correct something first. This is the correct route for your example. I used strikethrough for the invalid ones.
"When I come with Incoming 3 with an LSN from before the split, does the algorithm need to be recursive in the sense that, it needs to support going back several levels because maybe some of the partitions it would go to for the Archival also had a split?"
Irrespective of IncomingPartitionKeyRangeId, the IncomingLSN always tries to find the correct partition's min/max LSN. So just because the request's IncomingPartitionKeyRangeId is 3, the IncomingLSN determines where it actually routes to. If the IncomingLSN belongs to the IncomingPartitionKeyRangeId , that is just a normal passthrough.
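A minimal sketch of that routing rule (the types here are illustrative, not the actual handler code): pick the lineage node whose min/max LSN window contains the IncomingLSN, and fall back to the incoming partition itself, i.e. a plain passthrough, when no archival window matches.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative lineage node: a partition key range id plus the LSN window for
// which its (possibly archival) data is the right place to drain from.
public sealed record LineageNode(string PartitionKeyRangeId, long MinLsn, long MaxLsn);

public static class DrainRouting
{
    // Routing is driven by the IncomingLSN, not by the IncomingPartitionKeyRangeId:
    // the node whose LSN window contains the incoming LSN wins. If the LSN already
    // belongs to the incoming partition, this is just a normal passthrough.
    public static LineageNode Route(IReadOnlyList<LineageNode> lineage, string incomingPkRangeId, long incomingLsn)
    {
        LineageNode match = lineage.FirstOrDefault(node => incomingLsn >= node.MinLsn && incomingLsn <= node.MaxLsn);
        return match ?? lineage.Single(node => node.PartitionKeyRangeId == incomingPkRangeId);
    }
}
```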
"If the cache is empty, how would it be constructed after the split (only 3,4,5,6 are live, how does it know that 2 and 1 were before and that 0 was before that if the Parents property only contains a list with no hierarchy level)."
If you look at the tree example in this document, "UseArchivalPartition": true
indicates that it was split and is a parent to some child, which is or "UseArchivalPartition": false
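A hedged sketch of how the flat Parents lists could be folded back into that tree; it assumes the Parents list on each live range is ordered from the oldest ancestor to the immediate parent (as in the "0,2" / "0,1" example above), and every name below is illustrative.

```csharp
using System.Collections.Generic;

// Illustrative archival-tree node. UseArchivalPartition mirrors the flag above:
// true for a split (parent) partition whose data is archival, false for a live child.
public sealed class ArchivalNode
{
    public string PartitionKeyRangeId { get; init; } = "";
    public bool UseArchivalPartition { get; init; }
    public List<ArchivalNode> Children { get; } = new();
}

public static class ArchivalTreeBuilder
{
    // Assumption: parents are listed oldest-ancestor-first, so the index in the
    // Parents list recovers the hierarchy level that the flat property does not expose.
    public static ArchivalNode Build(IReadOnlyDictionary<string, IReadOnlyList<string>> parentsByLiveRange)
    {
        var nodes = new Dictionary<string, ArchivalNode>();
        ArchivalNode root = null;

        ArchivalNode GetOrAdd(string id, bool archival)
        {
            if (!nodes.TryGetValue(id, out ArchivalNode node))
            {
                node = new ArchivalNode { PartitionKeyRangeId = id, UseArchivalPartition = archival };
                nodes[id] = node;
            }

            return node;
        }

        foreach (KeyValuePair<string, IReadOnlyList<string>> liveRange in parentsByLiveRange)
        {
            ArchivalNode previous = null;
            foreach (string parentId in liveRange.Value) // oldest ancestor first
            {
                ArchivalNode parent = GetOrAdd(parentId, archival: true);
                root ??= parent;
                if (previous != null && !previous.Children.Contains(parent))
                {
                    previous.Children.Add(parent);
                }

                previous = parent;
            }

            previous?.Children.Add(GetOrAdd(liveRange.Key, archival: false));
        }

        return root;
    }
}

// Build(new Dictionary<string, IReadOnlyList<string>>
// {
//     ["3"] = new[] { "0", "2" }, ["4"] = new[] { "0", "2" },
//     ["5"] = new[] { "0", "1" }, ["6"] = new[] { "0", "1" },
// })
// yields 0 -> { 2 -> { 3, 4 }, 1 -> { 5, 6 } }.
```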
@philipthomas-MSFT DrainRoute in the cache is invalidated when Compute detects that there is a partition split, correct? This needs to be mentioned in the cache eviction policy.
From what I understand, a valid cache at a given moment would be usable by all PK range ids. So why do we need a PK range id in the cache key? Please clarify if I am missing any detail here. Thanks.
> @philipthomas-MSFT DrainRoute in the cache is invalidated when Compute detects that there is a partition split, correct? This needs to be mentioned in the cache eviction policy.

There is no eviction policy. The archival tree cache gets created when it needs to be built and lives in memory until Compute Gateway discards it, for example on a reboot.

> From what I understand, a valid cache at a given moment would be usable by all PK range ids. So why do we need a PK range id in the cache key? Please clarify if I am missing any detail here. Thanks.
cc @gopalrander For now, every incoming partition key range id and container resource id will have a cached item. I am playing around with the possibility of just caching the drain route information for the incoming partition key range at some point.
> @philipthomas-MSFT DrainRoute in the cache is invalidated when Compute detects that there is a partition split, correct? This needs to be mentioned in the cache eviction policy.

> There is no eviction policy. The archival tree cache gets created when it needs to be built and lives in memory until Compute Gateway discards it, for example on a reboot.

I meant that the cache might be invalid after a split. In the above cache example, if partition 4 splits, then the whole routing map needs to be updated, right?
> @philipthomas-MSFT DrainRoute in the cache is invalidated when Compute detects that there is a partition split, correct? This needs to be mentioned in the cache eviction policy.

> There is no eviction policy. The archival tree cache gets created when it needs to be built and lives in memory until Compute Gateway discards it, for example on a reboot.

> I meant that the cache might be invalid after a split. In the above cache example, if partition 4 splits, then the whole routing map needs to be updated, right?

@gopalrander, I think I see your point, and it is probably why I wanted to use all the PartitionKeyRangeIds for a collection, instead of just the IncomingPartitionKeyRangeId, as a key along with the ContainerResourceId. I brought this up in our walkthrough, but I could not remember this case.
ContainerResourceId: 5dWn+9OCNxn= IncomingPartitionKeyRangeId: 2
CacheKey: 5dWn+9OCNxn=2 CacheValue: 0 -> 1, 2
If PartitionKeyRangeId 2 splits at some later time,
ContainerResourceId: 5dWn+9OCNxn= IncomingPartitionKeyRangeId: 2
Compute Gateway would unfortunately continue to use the original cache item,
CacheKey: 5dWn+9OCNxn=2 CacheValue: 0 -> 1, 2
But it should create a new cache item instead,
CacheKey: Unknown CacheValue: 0 -> 1, (2 -> 3, 4)
So that means that either the original cache item gets invalidated,
CacheKey: 5dWn+9OCNxn=2 CacheValue: 0 -> 1, (2 -> 3, 4)
or, because we want the cached items to be immutable, we will create a new cached item, which would work if we use all the collection's PartitionKeyRangeIds along with the ContainerResourceId as the CacheKey.
CacheKey: 5dWn+9OCNxn=1234 CacheValue: 0 -> 1, (2 -> 3, 4)
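A small sketch of that CacheKey idea, with a hypothetical helper name: because the key contains every PartitionKeyRangeId of the collection, a split changes the key, a new immutable cache item is created, and the stale one is simply never hit again.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ArchivalLineageCacheKey
{
    // Hypothetical helper: the key is the ContainerResourceId followed by every
    // PartitionKeyRangeId of the collection. After a split the set of range ids
    // changes, so a new key (and a new, immutable cache item) is produced and the
    // old, now-stale item is simply never looked up again.
    public static string Create(string containerResourceId, IEnumerable<string> partitionKeyRangeIds)
        => containerResourceId + string.Join(string.Empty, partitionKeyRangeIds.OrderBy(id => id, StringComparer.Ordinal));
}

// Matching the example above:
// Create("5dWn+9OCNxn=", new[] { "1", "2", "3", "4" }) == "5dWn+9OCNxn=1234"
```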
I am going to make this change over the weekend, but @kirankumarkolli, @ealsur, @jcocchi, you all were in the meeting, and I believe there were some concerns about the size of the CacheKey, but it seems like I may have to go back to this idea.
Also, the PR is here.
https://msdata.visualstudio.com/DefaultCollection/CosmosDB/_git/CosmosDB/pullrequest/1088615
I noted in the PR's description that I need to fix the CacheKey this weekend. See the previous comment for why.
cc @gopalrander @kirankumarkolli @ealsur @jcocchi
Purpose statement
This document describes a plan to enhance the Azure Cosmos DB experience by achieving even higher performance.
Description: the plan is to improve latency and overall performance for change feed requests in the Azure Cosmos DB pull model while in AllVersionsAndDeletes (preview) change feed mode, by introducing a caching strategy, local to Compute Gateway, for a collection's physical partition archival lineage. The collection's physical partition archival lineage is a routing map that instructs Compute Gateway on how to drain documents when a change feed request is received. It is driven by each physical partition's minimum and maximum log sequence numbers, which are obtained through a change feed information request to the Backend API. This will support all SDK languages, specifically the .NET and Java SDKs. Tenant configuration feature flags and additional diagnostic logging will need to be implemented as well. This issue will be split into multiple PRs (caching, feature flag, and logging). The tenant configuration feature flag is a fail-safe in case the caching strategy does not work as expected.
Level-setting
Tasks
Stakeholders
Resources
Out of scope
Scope of work
The Microsoft Azure Cosmos DB .NET SDK Version 3 needs to achieve optimal performance by implementing a local caching strategy in Compute Gateway for all change feed requests while in AllVersionsAndDeletes (preview) change feed mode. Introducing a caching strategy, with additional trace logging and a feature flag, for the collection's physical partition archival lineage will improve performance by accessing a cache that is local to Compute Gateway.
Criteria for caching
Caching applies when a collection's physical partition has split. The logic to construct the collection's physical partition archival lineage is driven solely by whether that physical partition has returned HTTP status code 410 (Gone) when change feed items were requested.
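A minimal sketch of that criterion, with illustrative helper names rather than the actual Compute Gateway code: only a 410 (Gone) response triggers building, and therefore caching, the archival lineage.

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

public static class ArchivalLineageCaching
{
    // Sketch only: helper names are illustrative, not the actual Compute Gateway code.
    // Only a 410 Gone from the physical partition (i.e. it has split) triggers
    // building, and therefore caching, the collection's partition archival lineage.
    public static async Task<TLineage> GetLineageIfSplitAsync<TLineage>(
        HttpStatusCode changeFeedStatus,
        string cacheKey,
        Func<string, Func<Task<TLineage>>, Task<TLineage>> getOrAddFromCache,
        Func<Task<TLineage>> buildLineageFromBackend)
        where TLineage : class
    {
        if (changeFeedStatus != HttpStatusCode.Gone)
        {
            return null; // partition is live: normal passthrough, no lineage needed
        }

        return await getOrAddFromCache(cacheKey, buildLineageFromBackend);
    }
}
```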
Current baseline architecture
Too many unnecessary Backend requests that affect latency and overall performance
Currently, we do not support any caching strategy, and every change feed request while in AllVersionsAndDeletes (preview) change feed mode requests additional minimum and maximum log sequence numbers for every partition that exists within the collection's partition archival lineage. If change feed requests keep being sent to the same collection to exhaust change feed items, then a cache of that collection's physical partition archival lineage, local to Compute Gateway, should be fetched and used to determine the physical partition routing strategy for draining documents on a live physical partition.

Today, the collection's physical partition archival lineage is constructed and traversed for every change feed request while in AllVersionsAndDeletes (preview) change feed mode. Constructing the lineage increases latency because it requires additional network hops to the Backend services to make change feed information requests that return the minimum and maximum log sequence numbers for a physical partition. For example, if a collection has a physical partition that has split, there are now 2 child physical partitions, and a change feed information request is made 2 times to get minimum and maximum log sequence numbers, once for each child physical partition. If those child physical partitions split, the number increases, and so on. The more splits that occur, the more network hops to the Backend services to request change feed information, and the higher the latency of change feed requests while in AllVersionsAndDeletes (preview) change feed mode.
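To make the cost concrete, a small illustrative model (an assumption for illustration: every partition in every generation splits): the number of per-request Backend change feed information calls grows with each generation of splits when nothing is cached.

```csharp
public static class ChangeFeedInfoCalls
{
    // Illustrative cost model based on the example above: with no cache, every
    // change feed request pays one min/max LSN change feed information call per
    // child partition in the archival lineage. If every partition in every
    // generation splits, the children accumulate as 2 + 4 + ... + 2^g = 2^(g+1) - 2.
    public static int CallsPerRequest(int splitGenerations) => (1 << (splitGenerations + 1)) - 2;
}

// CallsPerRequest(1) == 2  (one split, two child partitions, as in the example above)
// CallsPerRequest(2) == 6  (both children split again)
```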
Proposed solution
Branch
Cache
Key (ContainerResourceId + PartitionKeyRangeIds + Account)
{"Value":"pwd(I8Xc*9+=","PartitionKeyRanges":[{"minInclusive":"00","maxExclusive":"MM","ridPrefix":null,"throughputFraction":0.0,"status":"Invalid","lsn":0,"parents":["0"],"id":"1","_rid":null,"_self":null,"_ts":0,"_etag":null},{"minInclusive":"MM","maxExclusive":"FF","ridPrefix":null,"throughputFraction":0.0,"status":"Invalid","lsn":0,"parents":["0"],"id":"2","_rid":null,"_self":null,"_ts":0,"_etag":null}],"AccountEndpoint":"http://testaccount.documents.azure.com"}
Value
The collection's partition archival lineage is constructed per collection, or ContainerResourceId. The collection's partition archival lineage contains the DateCreated, the DrainRoute, and the IncomingPartitionKeyRangeId. There is a question as to whether the IncomingPartitionKeyRangeId is necessary, so it may go away. If so, I will update this document accordingly, but at the time of writing, it exists.
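A hedged sketch of the cached value's shape; only the property names come from this document, the types are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Illustrative shape of the cached value; only the property names come from this
// document, the types are assumptions.
public sealed record PartitionArchivalLineage(
    DateTime DateCreated,                       // e.g. "2022-07-25T11:42:24.5483782Z"
    string IncomingPartitionKeyRangeId,         // may be dropped if the key no longer needs it
    IReadOnlyList<DrainRouteStep> DrainRoute);  // routing instructions for draining documents

// Illustrative: one hop of the drain route, bounded by that partition's min/max LSNs.
public sealed record DrainRouteStep(
    string PartitionKeyRangeId,
    long MinLsn,
    long MaxLsn,
    bool UseArchivalPartition);
```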
ContainerResourceId
{"Value":"pwd(I8Xc*9+=","PartitionKeyRanges"}
DateCreated
"DateCreated": "2022-07-25T11:42:24.5483782Z"
DrainRoute
Performance
Security
Areas of impact
Estimation for deliverables
Supportability
Client telemetry
Distributed tracing
TBD
Diagnostic logging
EnableFullFidelityChangeFeedSplitHandlingArchivalTreeCaching: true
Testing
Use case/scenarios
Unit (Gated pipeline)
Emulator (Gated pipeline)
Performance/Benchmarking (Gated pipeline)
Security/Penetration (Gated pipeline)
Concerns