Closed: Ryanman closed this issue 4 years ago.
I have a Pull Request (#14), but I'm worried that this duct-tapes something fundamental on initial RunAsync(). Please let me know your thoughts.
Thanks @Ryanman. We will look at your pull request and let you know what we think.
I see the PR was approved. I'd like us to keep this issue open until I'm able to report back that a) this solves the issue and b) the cache is able to recover from being replicated. Given how long it takes the problem to occur, that may be a day or two.
If we're doing something wrong to cause this hopefully I'll have info for that as well.
Sounds good. I was able to get the error in my local testing when I set MaxCacheSizeInMegabytes to a negative value so that all cached items would be removed. With your fix the error no longer happens, even when MaxCacheSizeInMegabytes is set to a negative value.
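For reference, this is roughly what that repro boils down to: force the size limit negative so the LRU cleanup in RunAsync tries to evict every item. A minimal sketch, assuming MaxCacheSizeInMegabytes is an overridable property on DistributedCacheStoreService and that the base constructor takes the service context plus a log callback; treat both as placeholders if the real signatures differ.

```csharp
using System.Fabric;
using SoCreate.Extensions.Caching.ServiceFabric;

// Hypothetical repro service: a negative MaxCacheSizeInMegabytes makes any non-empty
// cache look "over size", so the RunAsync cleanup loop evicts every item it finds.
class EvictEverythingCacheStore : DistributedCacheStoreService
{
    public EvictEverythingCacheStore(StatefulServiceContext context)
        : base(context, message => { /* log callback placeholder */ })
    {
    }

    // Placeholder override; the actual property name/visibility may differ.
    protected override int MaxCacheSizeInMegabytes => -1;
}
```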
Hey all - Ran some tests last night. Our two different environments encountered the issue around 6 hours after our deployment and test. This is also the length of our longest sliding expiration.
Are y'all able to push the latest update to NuGet relatively soon? There's a chance we'll uncover another issue with some of the data we're putting in with the long sliding expiration, but to be frank we'd been loading and retrieving those items for a while before the issue began. Perhaps there are other changes we've made that have indirectly caused the deletion of expired items to bomb out.
Time | Log Scope | Event |
---|---|---|
3:21:32 | Release | Smoke Test Succeeded |
9:23:36 | Operational | Transient Fault detected for Cache node EventName: ApiReportFaultOperational Category: StateTransition ActivityId: 0e08b908-f83d-463c-ae1e-8cd63b7c8923:0 PartitionId: {267b4330-42f9-4b0c-b1be-b5e9bec824c7} FaultType: Transient NodeId: a3784be1d81710242ed0a9632647b4f7 ReplicaId: 132386231100152448 ReplicaInstanceId: 132386231100152448 EventInstanceId: {2f95758e-7536-4cb5-80f0-555a3ad4dfc9} |
9:23:37 | Application | First Failure in App |
9:23:37 | Admin | ServiceHostProcess: XXX.DistributedCache.exe started at 2020-07-07T19:19:22.580147200Z with Process Id 4860 for ApplicationId 692c6b8d-f7e5-44f1-b69c-938a8c9af749 terminated unexpectedly with exit code 255 on node id a3784be1d81710242ed0a9632647b4f7 |
9:23:39 | Operational | Failover 1 Completed |
9:23:42 | Admin | Data Loss Reported |
We shifted our longest expiration time from 6 hours to 5, and the failures then occurred after 5 hours instead of 6. So it's fairly well confirmed that this occurs when the cache empties out during long periods of inactivity.
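For context, the usage pattern that drains the cache is just standard IDistributedCache writes with a long sliding window; a minimal sketch of what we do (the key name and exact window are illustrative only):

```csharp
using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

static class CacheUsageSketch
{
    public static Task StoreReportAsync(IDistributedCache cache, string reportId, string json)
    {
        var options = new DistributedCacheEntryOptions
        {
            // Longest sliding window in play; after roughly this much inactivity every
            // entry has expired and the cache store is completely empty.
            SlidingExpiration = TimeSpan.FromHours(6)
        };

        return cache.SetAsync($"report:{reportId}", Encoding.UTF8.GetBytes(json), options);
    }
}
```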
Sorry, I thought your PR had already gone through and the NuGet package had been pushed, but it looks like your test is failing the build:
System.NullReferenceException : Object reference not set to an instance of an object.
Stack trace
at SoCreate.Extensions.Caching.ServiceFabric.LinkedDictionaryHelper.AddLast(CacheStoreMetadata cacheStoreMetadata, String cacheItemKey, CachedItem cachedItem, Byte[] newValue) in /home/vsts/work/1/s/src/SoCreate.Extensions.Caching.ServiceFabric/LinkedDictionaryHelper.cs:line 77
at SoCreate.Extensions.Caching.ServiceFabric.DistributedCacheStoreService.<>c__DisplayClass20_0.<<SetCachedItemAsync>b__0>d.MoveNext() in /home/vsts/work/1/s/src/SoCreate.Extensions.Caching.ServiceFabric/DistributedCacheStoreService.cs:line 136
--- End of stack trace from previous location where exception was thrown ---
at SoCreate.Extensions.Caching.ServiceFabric.RetryHelper.<>c__DisplayClass4_0.<<ExecuteWithRetry>b__0>d.MoveNext() in /home/vsts/work/1/s/src/SoCreate.Extensions.Caching.ServiceFabric/RetryHelper.cs:line 72
--- End of stack trace from previous location where exception was thrown ---
at SoCreate.Extensions.Caching.ServiceFabric.RetryHelper.ExecuteWithRetry[TResult](Func`3 operation, Object state, CancellationToken cancellationToken, Int32 maxAttempts, Nullable`1 initialDelay)
at SoCreate.Extensions.Caching.ServiceFabric.RetryHelper.ExecuteWithRetry(IReliableStateManager stateManager, Func`4 operation, Object state, CancellationToken cancellationToken, Int32 maxAttempts, Nullable`1 initialDelay) in /home/vsts/work/1/s/src/SoCreate.Extensions.Caching.ServiceFabric/RetryHelper.cs:line 83
at SoCreate.Extensions.Caching.ServiceFabric.DistributedCacheStoreService.SetCachedItemAsync(String key, Byte[] value, Nullable`1 slidingExpiration, Nullable`1 absoluteExpiration) in /home/vsts/work/1/s/src/SoCreate.Extensions.Caching.ServiceFabric/DistributedCacheStoreService.cs:line 105
at SoCreate.Extensions.Caching.Tests.DistributedCacheStoreServiceTest.RemoveCachedItemAsync_RemoveItemsFromLinkedDictionary_RemovalWorksWithMalformedMetadata(Mock`1 stateManager, Mock`1 cacheItemDict, Mock`1 metadataDict, ServiceFabricDistributedCacheStoreService cacheStore) in /home/vsts/work/1/s/tests/SoCreate.Extensions.Caching.Tests/DistributedCacheStoreServiceTest.cs:line 343
at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__139_0(Object state)
That's embarrassing - my apologies. I got all the tests to pass, then "cleaned up" before opening the PR and removed the setup for the cachedItems object because it had no direct follow-up references in the test.
Re-added the necessary memory store setup call: #16
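Roughly the shape of the setup that had been removed, as a hedged sketch only; the dictionary interface, overload, and helper name below are inferred from the stack traces in this thread rather than copied from the actual test code:

```csharp
using System;
using System.Threading;
using Microsoft.ServiceFabric.Data;
using Microsoft.ServiceFabric.Data.Collections;
using Moq;
using SoCreate.Extensions.Caching.ServiceFabric;

static class TestSetupSketch
{
    // Stubs the mocked cache-item dictionary so TryGetValueAsync hands back a stored
    // CachedItem instead of the mock's default; without a setup along these lines the
    // code under test has nothing to dereference and fails with a NullReferenceException.
    public static void SetupCachedItem(
        Mock<IReliableDictionary<string, CachedItem>> cacheItemDict,
        string key,
        CachedItem cachedItem)
    {
        cacheItemDict
            .Setup(d => d.TryGetValueAsync(
                It.IsAny<ITransaction>(),
                key,
                It.IsAny<LockMode>(),
                It.IsAny<TimeSpan>(),
                It.IsAny<CancellationToken>()))
            .ReturnsAsync(new ConditionalValue<CachedItem>(true, cachedItem));
    }
}
```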
In other news, we tested the new code by temporarily referencing the package source as a project, and the problem appears to be solved in our lower environments.
First - I really appreciate the work y'all have put in; I know maintaining this package isn't easy.
I believe this issue is separate from the timeout issue being discussed.
Our cache service encounters an unrecoverable error around 6 hours after being deployed. We do have some variable sliding expiration times for items in the cache; however, the expiration times don't line up exactly and the cache itself isn't actively being used. For example: we deploy and test the app that uses the cache at 4:50 PM, and this exception is encountered at 11:02 PM. The cluster has the cache across 3 nodes, each of which fails at nearly identical times. It also appears that this error occurs while a replica of the service is being built.
Looking at the source code, it appears to be line 204 of DistributedCacheStoreService.
Other things to note: Data Loss was detected on these replicas seconds after recovery began, which makes sense if the replicas did not spin up.
Please let me know if there's any other way I can help or provide information.
Stack Trace Highlight:
Message: RunAsync failed due to an unhandled exception causing the host process to crash: System.ArgumentNullException: Value cannot be null. (Parameter 'key')
at System.Fabric.Store.TStore`5.ConditionalGetAsync(IStoreTransaction transaction, TKey key, ReadMode readMode, TimeSpan timeout, CancellationToken cancellationToken)
at System.Fabric.Store.TStore`5.GetAsync(IStoreTransaction transaction, TKey key, ReadMode readMode, TimeSpan timeout, CancellationToken cancellationToken)
at Microsoft.ServiceFabric.Data.Collections.DistributedDictionary`2.TryGetValueAsync(Transaction tx, TKey key, LockMode lockMode, TimeSpan timeout, CancellationToken cancellationToken)
at SoCreate.Extensions.Caching.ServiceFabric.DistributedCacheStoreService.<>c__DisplayClass24_1.<<RemoveLeastRecentlyUsedCacheItemWhenOverMaxCacheSize>b__1>d.MoveNext()
Stack trace here:
Application: Secret.DistributedCache.exe
CoreCLR Version: 4.700.19.56402
.NET Core Version: 3.1.0
Description: The application requested process termination through System.Environment.FailFast(string message).
Message: RunAsync failed due to an unhandled exception causing the host process to crash: System.ArgumentNullException: Value cannot be null. (Parameter 'key')
at System.Fabric.Store.TStore`5.ConditionalGetAsync(IStoreTransaction transaction, TKey key, ReadMode readMode, TimeSpan timeout, CancellationToken cancellationToken)
at System.Fabric.Store.TStore`5.GetAsync(IStoreTransaction transaction, TKey key, ReadMode readMode, TimeSpan timeout, CancellationToken cancellationToken)
at Microsoft.ServiceFabric.Data.Collections.DistributedDictionary`2.TryGetValueAsync(Transaction tx, TKey key, LockMode lockMode, TimeSpan timeout, CancellationToken cancellationToken)
at SoCreate.Extensions.Caching.ServiceFabric.DistributedCacheStoreService.<>c__DisplayClass24_1.<<RemoveLeastRecentlyUsedCacheItemWhenOverMaxCacheSize>b__1>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at SoCreate.Extensions.Caching.ServiceFabric.DistributedCacheStoreService.<>c__DisplayClass24_0.<<RemoveLeastRecentlyUsedCacheItemWhenOverMaxCacheSize>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at SoCreate.Extensions.Caching.ServiceFabric.RetryHelper.<>c__DisplayClass4_0.<<ExecuteWithRetry>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at SoCreate.Extensions.Caching.ServiceFabric.RetryHelper.ExecuteWithRetry[TResult](Func`3 operation, Object state, CancellationToken cancellationToken, Int32 maxAttempts, Nullable`1 initialDelay)
at SoCreate.Extensions.Caching.ServiceFabric.RetryHelper.ExecuteWithRetry(IReliableStateManager stateManager, Func`4 operation, Object state, CancellationToken cancellationToken, Int32 maxAttempts, Nullable`1 initialDelay)
at SoCreate.Extensions.Caching.ServiceFabric.DistributedCacheStoreService.RemoveLeastRecentlyUsedCacheItemWhenOverMaxCacheSize(CancellationToken cancellationToken)
at SoCreate.Extensions.Caching.ServiceFabric.DistributedCacheStoreService.RunAsync(CancellationToken cancellationToken)
at Microsoft.ServiceFabric.Services.Runtime.StatefulServiceReplicaAdapter.ExecuteRunAsync(CancellationToken runAsyncCancellationToken)
Stack:
at System.Environment.FailFast(System.String)
at Microsoft.ServiceFabric.Services.Runtime.ServiceHelper+<>c__DisplayClass18_0.<HandleRunAsyncUnexpectedException>b__0()
at System.Threading.Tasks.Task.InnerInvoke()
at System.Threading.Tasks.Task+<>c.<.cctor>b__274_0(System.Object)
at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)
at System.Threading.Tasks.Task.ExecuteEntryUnsafe(System.Threading.Thread)
at System.Threading.Tasks.Task.ExecuteFromThreadPool(System.Threading.Thread)
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()
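Reading that trace, the RunAsync cleanup appears to follow the LRU metadata to a key that no longer exists once everything has expired, and the null key is what blows up TryGetValueAsync and FailFasts the replica. Below is a minimal, standalone sketch of that failure mode and the kind of guard that would avoid it; the member names (FirstCacheKey, SizeInBytes) are placeholders I haven't confirmed against the actual CacheStoreMetadata type, and this is not the library's real implementation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Standalone illustration only; not the library's actual code.
class LruCleanupSketch
{
    // Placeholder for the persisted metadata that tracks the LRU linked list.
    // FirstCacheKey goes null once every item has expired out of the cache.
    class CacheStoreMetadata
    {
        public string FirstCacheKey { get; set; }
        public long SizeInBytes { get; set; }
    }

    static async Task RemoveLeastRecentlyUsedWhenOverMaxSizeAsync(
        Func<CancellationToken, Task<CacheStoreMetadata>> getMetadataAsync,
        Func<string, CancellationToken, Task> removeItemAsync,
        long maxSizeInBytes,
        CancellationToken cancellationToken)
    {
        var metadata = await getMetadataAsync(cancellationToken);

        while (metadata != null && metadata.SizeInBytes > maxSizeInBytes)
        {
            // Guard: once the sliding expirations have emptied the cache there is no
            // "least recently used" key left. Passing that null key straight into
            // IReliableDictionary.TryGetValueAsync is what throws the
            // ArgumentNullException seen above, so bail out of the loop instead.
            if (metadata.FirstCacheKey == null)
            {
                return;
            }

            await removeItemAsync(metadata.FirstCacheKey, cancellationToken);

            // Re-read the metadata after the removal commits before looping again.
            metadata = await getMetadataAsync(cancellationToken);
        }
    }
}
```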