dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License
9.89k stars 2.01k forks source link

CustomStorage.LogViewAdaptor could potentially cause an endless loop #5107

Open rikbosch opened 5 years ago

rikbosch commented 5 years ago

When there is an exception during ReadStateFromStorage inLogViewAdaptor.ReadAsync which is unrecoverable, the JournalledGrain ends up trying to load in an endless loop.

https://github.com/dotnet/orleans/blob/36faf869b555407e42b6f4db9e34db3848eab7d0/src/Orleans.EventSourcing/CustomStorage/LogViewAdaptor.cs#L119-L169

We would like to deactivate the grain (same behavior as InconsistentState exception).

I've tried deactivating the grain inside the ICustomStorageInterface<T, object>.ReadStateFromStorage() but that method is not run inside a grain context, but from a BatchWorker.

sergeybykov commented 5 years ago

@sebastianburckhardt What do you think?

sebastianburckhardt commented 5 years ago

PR #3773 contains some fixes to achieve a better retry logic, and also an API that lets you define your own error handling and diagnostics behavior for handling storage connection problems. Perhaps we could prioritize it.

roikonen commented 2 years ago

As @rikbosch wrote:

When there is an exception during ReadStateFromStorage in LogViewAdaptor.ReadAsync which is unrecoverable, the JournalledGrain ends up trying to load in an endless loop.

This is true but at least the endless loop seems currently to be using Exponential backoff.

On the other hand situation is a lot worse, if an exception is thrown from ApplyUpdatesToStorage but ApplyUpdatesToStorage does not throw an exception. Then the result is an endless fast loop without Exponential backoff.

One could try to mitigate the endless fast loop by adding something like:

protected override void OnConnectionIssue(ConnectionIssue issue)
{
        issue.RetryDelay = new TimeSpan(0, 0, 10); // Retry delay of 10 seconds
        base.OnConnectionIssue(issue);
}

...but that would mess up the Exponential backoff in the case of exception in ReadStateFromStorage.

So if for some reason the exception gets thrown from ApplyUpdatesToStorage but ReadStateFromStorage is working ok the system will potentially suffocate the DB with incoming read requests.

Here's a concrete example implementation of ICustomStorageInterface:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans.EventSourcing.CustomStorage;
using Orleans.EventSourcing;
using Orleans.LogConsistency;

namespace Example
{
    [Serializable]
    public class ExampleState
    {
        public string ExampleValue { get; set; }
    }

    abstract public class ExampleStorage
        : JournaledGrain<ExampleState, String>,
        ICustomStorageInterface<ExampleState, String>
    {

        protected override void OnConnectionIssue(ConnectionIssue issue)
        {
            Console.WriteLine("!!! Connection Issue");
            //issue.RetryDelay = new TimeSpan(0, 0, 10);
            base.OnConnectionIssue(issue);
        }

        protected override void OnConnectionIssueResolved(ConnectionIssue issue)
        {
            Console.WriteLine("!!! Connection Issue Resolved");
            base.OnConnectionIssueResolved(issue);
        }

        Task<bool> ICustomStorageInterface<ExampleState, String>.ApplyUpdatesToStorage(IReadOnlyList<String> updates, int expectedversion)
        {
            Console.WriteLine("!!! ApplyUpdatesToStorage");

            // An exception thrown here will cause FAST INFINITE LOOP OF RETRIES if GetVersion and GetEvents do NOT throw exceptions.
            // This would potentially suffocate the DB with incoming read requests.
            // This is due to Orleans: https://github.com/dotnet/orleans/blob/main/src/Orleans.EventSourcing/CustomStorage/LogViewAdaptor.cs#L193
            throw new Exception("test yourself");

            // Do nothing as this is just an example.

            return Task.FromResult(true);

        }

        Task<KeyValuePair<int, ExampleState>> ICustomStorageInterface<ExampleState, String>.ReadStateFromStorage()
        {

            Console.WriteLine("!!! ReadStateFromStorage");

            // An exception thrown here will cause loop of retries using Exponential backoff. This is ok.
            //throw new Exception("test yourself");

            var state = new ExampleState { ExampleValue = "Example value" };

            return Task.FromResult(new KeyValuePair<int, ExampleState>(0, state));
        }
    }
}

This would print to console:

!!! ReadStateFromStorage
!!! ApplyUpdatesToStorage
!!! Connection Issue
!!! ReadStateFromStorage
!!! Connection Issue Resolved
!!! ApplyUpdatesToStorage
!!! Connection Issue
!!! ReadStateFromStorage
!!! Connection Issue Resolved
!!! ApplyUpdatesToStorage
!!! Connection Issue
!!! ReadStateFromStorage
!!! Connection Issue Resolved
...

...and so on.

roikonenvisma commented 2 years ago

Sorry, a typing mistake here:

On the other hand situation is a lot worse, if an exception is thrown from ApplyUpdatesToStorage but [ApplyUpdatesToStorage] does not throw an exception. Then the result is an endless fast loop without Exponential backoff.

This should have been:

On the other hand situation is a lot worse, if an exception is thrown from ApplyUpdatesToStorage but ReadStateFromStorage does not throw an exception. Then the result is an endless fast loop without Exponential backoff.

ghost commented 1 year ago

We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.

jamesbperry commented 11 months ago

Glad to see that this is on the backlog. I just got bit by exactly what @roikonen observed in ApplyUpdatesToStorage's lack of exponential backoff, and that led to a more fundamental question: handling non-transient write failures in a journaled grain.

Scenario:

  1. [in some method of a journaled grain] RaiseEvent(poisonEvent); await ConfirmEvents();
  2. [ApplyUpdatesToStorage] always throw or return false because the event is unacceptable for some reason

Result: A pegged CPU and a flooded log until the silo is restarted.

There seems to be an assumption in PrimaryBasedLogViewAdaptor that any failures are either transient or due to resolvable inconsistency: https://github.com/dotnet/orleans/blob/7720582a44723e3c8bb9ba3c371984b4b8f537be/src/Orleans.EventSourcing/Common/PrimaryBasedLogViewAdaptor.cs#L593-L595

(Side note: would this be a good place for the framework to await LastPrimaryIssue.DelayBeforeRetry() before looping?)

In the event of unresolved failures, @jason-bragg 's recommendation makes sense but feels at odds with the PrimaryBasedLogViewAdaptor's logic:

The safest path, imo, is to always deactivate the grain on storage operation (read, write, or clear) failure (of any kind). https://github.com/dotnet/orleans/issues/6128#issuecomment-555642545

So, in a simple scenario such as mine (no grain replication, re-entrancy, or unconfirmed events), is a viable workaround strategy:

If that strategy is flawed or just too Rube Goldberg, what is the next-best option? Implement a custom LogViewAdaptor?

It does appear that even if an incoming call filter catches the InconsistentStateException, the grain gets recycled by the runtime. DeactivateOnIdle() seemed like a viable alternative to throwing, but since the runtime goes to the trouble of handling InconsistentStateExceptions, that's probably the better option?

TL;DR - Any major flaws with letting the LogViewAdaptor temporarily think a write succeeded (so it stops retrying) and immediately burning the grain down afterwards?