philipthomas-MSFT commented 5 months ago

Description

Investigating Compute Gateway AVAD Archival Tree Cache Eviction Caching.

Problem statement

Currently, Archival caching is unbounded for CF AVAD on Compute Gateway. Investigation is needed to determine if and when bounded caching is necessary.

Stackholders

Dmitri Melnikov
Gary Fang
Rodimiro Cerrato Espinal
Justine Cocchi

Tasks

[x] Cache must be tied to tenant container (should be evicted when container is deleted)
[x] Trim cached data to minimum required set (consider abbreviations, etc.)
[x] Investigate max size of our cache and implement some limits
[x] Publish counters specific to cache information
[ ] Code complete, signoffs, delivery and release schedules.

Diagram

draft1-archivalTreeCache-eviction-publish-design

Cache must be tied to tenant container (should be evicted when container is deleted)

How frequently are tenant containers evicted?
- Follow-up: With engineer, Gary Fang, that can give me telemetry for this.
  - ~~Actionable items~~
  - ~~[ ] Observability/KUSTO queries on TenantContainer percentages for eviction times.~~
  - ~~[ ] KUSTO queries on when TenantContainer's where last evicted.~~
    - _TenantContainer_s are evicted automatically at least 3 weeks when Upgrades occur. So, no TenantContainer life last longer than 3 weeks.
  - ~~[ ] Observability/KUSTO queries on size of specific caches or activities in memory on TenantContainer.~~
  - ~~[ ] Model for adding ArchivalTreeCaching to TenantContainer.~~
  - ~~No support for Graphana or Application Insights.~~
What constitutes or justifies a container deletion and who owns this?
- TenantContainer are evicted after not seeing new requests for ~5 hours on the node.
  - This is configurable account level and federation level.
  - Gary is point of contact for Compute.
    - Actionable item:
      - [x] Locating the configuration.
      - TenantConfiguration.SocketConnectionIdleTimeoutInSeconds
- We may have to include code to evict the archival tree cache ourselves, it is not done automagically.
- ArchivalTreeCache needs to be owned by the TenantContainer object, not the FullFidelityChangeFeedHandler object, so some refactoring needs to happen and since TenantContainer owns other caches, we can potentially model this from the others as a point of reference.
~~Are other global caches deleted, or stored, in the event of a tenant container deletion?~~
- ~~Yes. (Ex. ComputeRecyclableMemoryManager).~~
- ~~Rules for creating caches owned by TenantContainer.~~
  - ~~It loads all the necessary service at the initialization.~~
  - ~~It can refresh once every 5 minutes allowing to change the state using feature flags.~~
  - ~~It must dispose all of the data on dispose being called once tenant went idle for 5+ hours.~~
  - ~~Frequently accessed services could be received using a separate interface using direct (bypass dictionary lookup).~~
~~Should archival tree caches be evicted based only on the deletion of tenant container?~~
- ~~Reconvene with Kiran again on this.~~
Since the archival tree cache is primary based on the state of the tenant container's partition key range cache, should archival tree caches be evicted based on changes or refreshes to the partition key range cache?
- ~~Reconvene with Kiran again on this.~~
~~Should archival tree caches be evicted based on changes or refreshes to the tenant container cache (not necessarily related to partition key range)?~~
- ~~Reconvene with Kiran again on this.~~

~~#### Trim cached data to minimum required set (consider abbreviations, etc.)~~ ~~Note: The impact of change needs to be assessed more closely. There may be a situation where this is changed on a more iterative release as part of a phase 2~~ ~~1. What cached elements are essential?~~ ~~2. What cached elements are informational?~~ ~~3. What cached elements can be abbreviated or removed?~~ ~~4. Should we revisit and refactor the archival tree structure?~~ ~~5. What is the baseline size under the most extreme cases and how do we measure and test this?~~ ~~6. What is the future size under the most extreme cases if a refactoring is done?~~

Investigate max size of our cache and implement some limits

Check with Rodimiro.
- Actionable items:
  - [x] Get the most extreme case of a partitions in a single container. let minSplitTreeDepth = 10; let agoDuration = 30m; ReportQuota5M | where TIMESTAMP >= ago(agoDuration) | where array_length(parse_json(PartitionKeyRangeParentIds)) >= minSplitTreeDepth | project TIMESTAMP, Tenant, DatabaseName, CollectionName, PartitionId, PartitionKeyRangeParentIds
  - ~~[ ] Take extreme case, generate a sample ArchivalTree, cache it, measure it, and create a tests that uses this cache to observe the performance.~~

[x] Samples

1 split (2982 bytes)
2 splits (4538 bytes)
3 splits (6146 bytes)
6 splits (12484 bytes)
10 splits (18964 bytes)

[x] Highest split count for all containers, not exclusive to FFCF accounts was 258 splits using KUSTO

let minSplitTreeDepth = 258;
let agoDuration = 30m;
ReportQuota5M
| where TIMESTAMP >= ago(agoDuration)
| where array_length(parse_json(PartitionKeyRangeParentIds)) >= minSplitTreeDepth
| project TIMESTAMP, Tenant, DatabaseName, CollectionName, PartitionId, PartitionKeyRangeParentIds
| summarize Count=count()

[x] Highest split count for all containers exclusive to FFCF accounts was 12 splits using KUSTO

let minSplitTreeDepth = 12;
let agoDuration = 30m;
let accountsWithFFCFEnabled = MgmtGlobalDatabaseAccountTrace
| where Status !in ("CreationFailed", "DeletionFailed")
| extend EnableFFCFWithPITR=iff(extractjson("$.enableffcfwithpitr", tolower(ConfigurationOverrides)) == "true", "True", "False")
| extend EnableFFCFWithPITRCopy=iff(extractjson("$.enableffcfwithpitrcopy", tolower(ConfigurationOverrides)) == "true", "True", "False")
| summarize arg_max(TIMESTAMP, EnableFFCFWithPITR, EnableFFCFWithPITRCopy) by SubscriptionId, GlobalDatabaseAccount
| where EnableFFCFWithPITR == "True" or EnableFFCFWithPITRCopy == "True"
| project GlobalDatabaseAccountName=GlobalDatabaseAccount;
ReportQuota5M
| where TIMESTAMP >= ago(agoDuration)
| where array_length(parse_json(PartitionKeyRangeParentIds)) >= minSplitTreeDepth
| join kind=inner accountsWithFFCFEnabled on GlobalDatabaseAccountName
| project TIMESTAMP, Tenant, DatabaseName, CollectionName, PartitionId, PartitionKeyRangeParentIds

Publish counters specific to cache information

Observability on metrics. How to measure and get notifications. Supportability.
- [x] Log in memory size of all collective caches when a cache entry is
  - [x] added.
  - [x] removed.
  - NOTE: Trickier than it seems.
  - AsyncCache does not give access to the actual value, ArchivalTree, so even when serialized, it only gives you the key (Ex. {"Keys":[{"Item1":"%d+q,PHgqbz=","Item2":4}]}). The only access you have to the ArchivalTree is by way of AsyncCache.GetAsync, and you need the key for that.
    - Option 1: We can check memory before and after adding and removing cached item using GC.GetTotalMemory(true); and do the subtraction and that can be the best guess. Verdict: More favorable and easy to implement.
    - Option 2: Read all keys in AsyncCache. Iterate through all the keys to call Async.GetAsync. Measure each ArchivalTree by converting it to a MemoryStream and using MemoryStream.Length incrementing a long size value and use that. Verdict: Do not like this one. Seems a bit much.
    - [x] Option 3: new AsyncCache object that has ClearCache, GetAsync, TotalMemorySize capabilities. Object is also responsible for calculating the object size and logic to publish counters. Cache is calculated usingthe already existing UserDefinedTypesExtensions.ConvertToStream{T}(T) in Compute Gateway to get the length of the memory stream. It is only calculating the size once after the archival tree is built and added to the cache. See diagram above.
- [ ] Produce KUSTO query to view memory size.
- [x] Uses DefaultTrace.TraceInformation logger.
- [ ] What logger to use? Friends does not have access to CosmosDBTrace which is widely used by TenantContainer and is based on Microsoft.Azure.Cosmos.Compute.Core.Diagnostics.ITrace. Friends uses Microsoft.Azure.Cosmos.Tracing.ITrace.
  - [ ] Which KUSTO table does these loggers actually log to?
  - ~~[ ] Suggestion was made by Rodimiro to log to NodeCounter5MRoleInstanceEvent. Waiting on response on which logger writes to this KUSTO table.~~

Code complete, testing, signoffs, delivery and release schedules.

Proof of concept
- [ ] Also dealing with dependencies while moving internal static ConcurrentDictionary<Uri, AsyncCache<(string, int), ArchivalTree>> ArchivalCacheTree { get; } = new(); to TenantConfiguration from FullFidelityChangeFeedHandler.
- ~~[ ] Create and test feature flag.~~
- [x] Code and assembly.info changes to give TenantContainer access to Friends library.
Release and delivery
- mid-June?
- Schedule: Once a month, or every 3 weeks?
  - I need more confirmation.
- Is this tied to Private Preview, Public Preview, or GA?
Signoffs
- Soham can sign off for Compute
- Rodimiro can sign off for Backend
Testing
- [ ] Create a CTL in test environment.

kirankumarkolli commented 4 months ago

In Tasks, below two are the high priority and MUST

Cache must be tied to tenant container (should be evicted when container is deleted) Publish counters specific to cache information

On when to emit cache full size information

Size gets impacted by addition/deletes => On add/delete
Periodic emitter => Gary shared a related code reference

Please share the design of the cache size computation (need to cover the CPU/resurce constraints)

kirankumarkolli commented 4 months ago

Is the proposal above final solution? Or current stage?

Also there is an open fundamental question about "Is JSON serialized size the right representation of in-memory size"? Can you please also cover that aspect?

philipthomas-MSFT commented 4 months ago

Is the proposal above final solution? Or current stage?

Also there is an open fundamental question about "Is JSON serialized size the right representation of in-memory size"? Can you please also cover that aspect?

@kirankumarkolli I choose this path because I feel that it is probably the best, which is why, in my documentation, I talked briefly about the other options, and I did strike through them.

After checking many sources, there is no accurate means to determine the actual size of an instanced managed object in .NET. Using Stream.Length is an approximation of the data contained within the object without the overhead. Memory is usually going to be slightly larger.

If we were dealing with structs, there are things like Marshall.SizeOf that we could use, but it is still just an approximation.

GC.GetTotalMemory(true) before and after caches are created or removed can give us a number of what was just add or created, but not the actual size of the caches without subtracting afterSize and beforeSize (afterSize - beforeSize). I don't really have a way of asserting this in a test other that the value is greater every time you add to the cache, and lesser every time you remove from the cache. And 0 if the cache is totally cleared. With this approach you get results like this from one of my tests

One cache item, one service endpoint

Post MemorySizes.TryAdd, the total memory size of AsyncArchivalTreeCache.Caches: (1496) bytes.
The memory size removed for https://localhost/ with key((7P[NqJ9[AUL=, 1)) was (1496) bytes.
Post MemorySizes.TryRemove, the total memory size of AsyncArchivalTreeCache.Caches: (0) bytes.

Two caches, one service endpoint

Post MemorySizes.TryAdd, the total memory size of AsyncArchivalTreeCache.Caches: (1736) bytes.
Post MemorySizes.TryAdd, the total memory size of AsyncArchivalTreeCache.Caches: (1752) bytes.
The memory size removed for https://localhost/ with key((#M1u8uTBIXw=, 2)) was (16) bytes.
The memory size removed for https://localhost/ with key((4Z.(u!yQnb>=, 1)) was (1736) bytes.
Post MemorySizes.TryRemove, the total memory size of AsyncArchivalTreeCache.Caches: (0) bytes.

Two caches, two service endpoints.

Post MemorySizes.TryAdd, the total memory size of AsyncArchivalTreeCache.Caches: (1408) bytes.
Post MemorySizes.TryAdd, the total memory size of AsyncArchivalTreeCache.Caches: (1424) bytes.
The memory size removed for https://localhost/ with key((YHI0O1mybjR=, 1)) was (1408) bytes.
Post MemorySizes.TryRemove, the total memory size of AsyncArchivalTreeCache.Caches: (16) bytes.
The memory size removed for https://localhost2/ with key((x+A]($XTR!j=, 2)) was (16) bytes.
Post MemorySizes.TryRemove, the total memory size of AsyncArchivalTreeCache.Caches: (0) bytes.

There is talk that BenchmarkDotNet and ObjectLayoutInspector has some functions that supposedly does the trick, but not sure this is something we want to introduce.

kirankumarkolli commented 4 months ago

GC is for the whole process and there might be other concurrent allocations/de-allocations happening right? Our goal is to find just ArchivalTree approximation for the cache, process memory counters are already captured.

ObjectLayoutInspector: ArchivalTree is composed of DrainRoute, SplitGraph types, it's a fixed set of types (even if it evolves it will be a bounded set). If for each type how much space is in-memory, then we can compose the total size right?

Now for each of those types we can pre-compile/model the size and use as a possibility thoughts?

philipthomas-MSFT commented 4 months ago

GC is for the whole process and there might be other concurrent allocations/de-allocations happening right? Our goal is to find just ArchivalTree approximation for the cache, process memory counters are already captured.

ObjectLayoutInspector: ArchivalTree is composed of DrainRoute, SplitGraph types, it's a fixed set of types (even if it evolves it will be a bounded set). If for each type how much space is in-memory, then we can compose the total size right?

Now for each of those types we can pre-compile/model the size and use as a possibility thoughts?

@kirankumarkolli

I agree with you sentiments on GC, which is why I did not wish to go that route but included it in the PR just as a point of reference just like I did for Stream.Length.

I do not know much about ObjectLayoutInspector. Maybe I am falsely assuming that this was not a viable option bringing in such a dependency in Compute Gateway so I do not fully understand your question or the direction you are going in when you make references to DrainRoute, SplitGraph types. If you are suggesting that ObjectLayoutInspector is not a deal breaker, then I will look into it more, but my next steps were to start reaching out to others for recommendations, but I think that we are just drawing this out but I am 100% sure you will disagree with whatever stance I take.

The reason why Stream.Length wins for me, at least for this iteration, is that it is a consistent number. It goes up. It goes down. It is something we can use to measure. It is not the accuracy of the number. It is what the number represents. A counter does not have to be memory. But sure, we can keep going down this rabbit hole.

I have CTL test environment to deal with today. I will continue to research a solution.

Azure / azure-cosmos-dotnet-v3

ChangeFeed: Investigating Compute Gateway AVAD Archival Tree Cache Eviction #4474

Description

Investigating Compute Gateway AVAD Archival Tree Cache Eviction Caching.

Problem statement

Stackholders

Tasks

Diagram

Cache must be tied to tenant container (should be evicted when container is deleted)

Investigate max size of our cache and implement some limits

Publish counters specific to cache information

Code complete, testing, signoffs, delivery and release schedules.