Open rikbosch opened 5 years ago
@sebastianburckhardt What do you think?
PR #3773 contains some fixes to achieve a better retry logic, and also an API that lets you define your own error handling and diagnostics behavior for handling storage connection problems. Perhaps we could prioritize it.
As @rikbosch wrote:
When there is an exception during
ReadStateFromStorage
inLogViewAdaptor.ReadAsync
which is unrecoverable, the JournalledGrain ends up trying to load in an endless loop.
This is true but at least the endless loop seems currently to be using Exponential backoff.
On the other hand situation is a lot worse, if an exception is thrown from ApplyUpdatesToStorage
but ApplyUpdatesToStorage
does not throw an exception. Then the result is an endless fast loop without Exponential backoff.
One could try to mitigate the endless fast loop by adding something like:
protected override void OnConnectionIssue(ConnectionIssue issue)
{
issue.RetryDelay = new TimeSpan(0, 0, 10); // Retry delay of 10 seconds
base.OnConnectionIssue(issue);
}
...but that would mess up the Exponential backoff in the case of exception in ReadStateFromStorage
.
So if for some reason the exception gets thrown from ApplyUpdatesToStorage
but ReadStateFromStorage
is working ok the system will potentially suffocate the DB with incoming read requests.
Here's a concrete example implementation of ICustomStorageInterface
:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans.EventSourcing.CustomStorage;
using Orleans.EventSourcing;
using Orleans.LogConsistency;
namespace Example
{
[Serializable]
public class ExampleState
{
public string ExampleValue { get; set; }
}
abstract public class ExampleStorage
: JournaledGrain<ExampleState, String>,
ICustomStorageInterface<ExampleState, String>
{
protected override void OnConnectionIssue(ConnectionIssue issue)
{
Console.WriteLine("!!! Connection Issue");
//issue.RetryDelay = new TimeSpan(0, 0, 10);
base.OnConnectionIssue(issue);
}
protected override void OnConnectionIssueResolved(ConnectionIssue issue)
{
Console.WriteLine("!!! Connection Issue Resolved");
base.OnConnectionIssueResolved(issue);
}
Task<bool> ICustomStorageInterface<ExampleState, String>.ApplyUpdatesToStorage(IReadOnlyList<String> updates, int expectedversion)
{
Console.WriteLine("!!! ApplyUpdatesToStorage");
// An exception thrown here will cause FAST INFINITE LOOP OF RETRIES if GetVersion and GetEvents do NOT throw exceptions.
// This would potentially suffocate the DB with incoming read requests.
// This is due to Orleans: https://github.com/dotnet/orleans/blob/main/src/Orleans.EventSourcing/CustomStorage/LogViewAdaptor.cs#L193
throw new Exception("test yourself");
// Do nothing as this is just an example.
return Task.FromResult(true);
}
Task<KeyValuePair<int, ExampleState>> ICustomStorageInterface<ExampleState, String>.ReadStateFromStorage()
{
Console.WriteLine("!!! ReadStateFromStorage");
// An exception thrown here will cause loop of retries using Exponential backoff. This is ok.
//throw new Exception("test yourself");
var state = new ExampleState { ExampleValue = "Example value" };
return Task.FromResult(new KeyValuePair<int, ExampleState>(0, state));
}
}
}
This would print to console:
!!! ReadStateFromStorage
!!! ApplyUpdatesToStorage
!!! Connection Issue
!!! ReadStateFromStorage
!!! Connection Issue Resolved
!!! ApplyUpdatesToStorage
!!! Connection Issue
!!! ReadStateFromStorage
!!! Connection Issue Resolved
!!! ApplyUpdatesToStorage
!!! Connection Issue
!!! ReadStateFromStorage
!!! Connection Issue Resolved
...
...and so on.
Sorry, a typing mistake here:
On the other hand situation is a lot worse, if an exception is thrown from ApplyUpdatesToStorage but [ApplyUpdatesToStorage] does not throw an exception. Then the result is an endless fast loop without Exponential backoff.
This should have been:
On the other hand situation is a lot worse, if an exception is thrown from ApplyUpdatesToStorage but ReadStateFromStorage does not throw an exception. Then the result is an endless fast loop without Exponential backoff.
We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.
Glad to see that this is on the backlog. I just got bit by exactly what @roikonen observed in ApplyUpdatesToStorage
's lack of exponential backoff, and that led to a more fundamental question: handling non-transient write failures in a journaled grain.
Scenario:
RaiseEvent(poisonEvent); await ConfirmEvents();
throw
or return false
because the event is unacceptable for some reasonResult: A pegged CPU and a flooded log until the silo is restarted.
There seems to be an assumption in PrimaryBasedLogViewAdaptor
that any failures are either transient or due to resolvable inconsistency:
https://github.com/dotnet/orleans/blob/7720582a44723e3c8bb9ba3c371984b4b8f537be/src/Orleans.EventSourcing/Common/PrimaryBasedLogViewAdaptor.cs#L593-L595
(Side note: would this be a good place for the framework to await LastPrimaryIssue.DelayBeforeRetry()
before looping?)
In the event of unresolved failures, @jason-bragg 's recommendation makes sense but feels at odds with the PrimaryBasedLogViewAdaptor
's logic:
The safest path, imo, is to always deactivate the grain on storage operation (read, write, or clear) failure (of any kind). https://github.com/dotnet/orleans/issues/6128#issuecomment-555642545
So, in a simple scenario such as mine (no grain replication, re-entrancy, or unconfirmed events), is a viable workaround strategy:
ApplyUpdatesToStorage
implementation uses a finite retry policy and catches any exceptionsIsInconsistent
) but reports success, becauseprivate async Task TryWriteEvent(TEvent maybePoisonEvent)
{
RaiseEvent(maybePoisonEvent);
await ConfirmEvents();
if (IsInconsistent) throw new InconsistentStateException();
}
try
{
await context.Invoke();
}
catch (InconsistentStateException)
{
// set the context.Response.Result
}
If that strategy is flawed or just too Rube Goldberg, what is the next-best option? Implement a custom LogViewAdaptor?
It does appear that even if an incoming call filter catches the InconsistentStateException
, the grain gets recycled by the runtime. DeactivateOnIdle()
seemed like a viable alternative to throwing, but since the runtime goes to the trouble of handling InconsistentStateException
s, that's probably the better option?
TL;DR - Any major flaws with letting the LogViewAdaptor temporarily think a write succeeded (so it stops retrying) and immediately burning the grain down afterwards?
When there is an exception during
ReadStateFromStorage
inLogViewAdaptor.ReadAsync
which is unrecoverable, theJournalledGrain
ends up trying to load in an endless loop.https://github.com/dotnet/orleans/blob/36faf869b555407e42b6f4db9e34db3848eab7d0/src/Orleans.EventSourcing/CustomStorage/LogViewAdaptor.cs#L119-L169
We would like to deactivate the grain (same behavior as InconsistentState exception).
I've tried deactivating the grain inside the
ICustomStorageInterface<T, object>.ReadStateFromStorage()
but that method is not run inside a grain context, but from a BatchWorker.