dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

InconsistentStateException with ADO.NET SQL Server grain storage #4697

Closed martinothamar closed 2 years ago

martinothamar commented 6 years ago

We've been getting this exception for a while in various grains, particularly one which is activated in a StartupTask and should never be deactivated. If I've understood correctly, this can happen when there are duplicate activations of a grain (perhaps during deployments) - so the question is, what do you do when that happens? Do you have to catch it and call DeactivateOnIdle? Do you call ReadStateAsync and lose the pending data? We had a grain that, for the rest of the activation's lifetime, could never again write its state, so I'm wondering how you recover from it. I couldn't find info on this in the docs; maybe I missed it.

veikkoeeva commented 6 years ago

@martinothamar Just to be clear, was the situation such that it started after the first failure and then kept repeating, or would you like to avoid this situation in this particular grain that should never be deactivated? I'm afraid grains are automatically deactivated upon InconsistentStateException, see https://github.com/dotnet/orleans/issues/1609.

Just in case, can you share the database vendor? This exception means there was neither an UPDATE nor an INSERT, which means that if there is already state in the table, the WHERE condition failed for one reason or another. A version (ETag) violation is likely the most common reason.

Let's page @sergeybykov too. I haven't in fact had problems with this and hence haven't paid attention. I should. :)

martinothamar commented 6 years ago

Yes, it kept repeating. The thing is that I already had a try/catch around the block in which I call WriteStateAsync, so the activation is never deactivated. I just log the error and move on (the rest of the call chain moves on). Wondering what the best practice is here.

Database vendor is MSSQL

veikkoeeva commented 6 years ago

@martinothamar Did you try to re-read the state to have the version updated?

martinothamar commented 6 years ago

No, right now it just logs and forgets that error - I wasn't expecting to run into this as often as I do. Is there any way to find the root cause of how I get into that state? If there's a duplicate activation, how can I find out and handle it? What other causes are there?

Xanatus commented 6 years ago

Recently I had to deal with this issue as well. For me the trouble came from interleaving code in stateful grains. In particular, Orleans timers can be a source of this problem.

As soon as two calls end up in WriteStateAsync() at the same time, there's a good chance of this error happening.

In my opinion, ETags should prevent different activations from writing to the same state, but the current design doesn't even allow the same grain to have multiple writes in flight at once.

martinothamar commented 6 years ago

Right, one of the grains I've experienced this the most with is in fact triggered by a timer quite often.

Another grain I'm now experiencing this with subscribes to multiple streams (5+) and will write to state upon every message from every stream (maybe this is wrong).

Wouldn't expect interleaving in either case though?

martinothamar commented 6 years ago

Stacktrace:

Exc level 0: Orleans.Storage.InconsistentStateException: Version conflict (WriteState): ServiceId=68e89d8e-60d5-43ba-992e-4759d46921345 ProviderName=SqlServerState GrainType=Grains.SomeGrain GrainId=980726 ETag=46.

Xanatus commented 6 years ago

Timers execute interleaved. Not sure about streams.

I also want to add that this issue has been discussed before: #2565. A possible workaround via self-invocation for timers was posted here: #2574.
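
For reference, the self-invocation workaround mentioned in #2574 looks roughly like the sketch below: instead of doing the work directly inside the timer callback (which runs interleaved), the callback calls the grain back through its own grain reference, so the work is scheduled as a regular, non-interleaved grain call. This is only a sketch against Orleans 2.x-era APIs; IPriceGrain, PriceGrain and DoWork are made-up names.

// Sketch only - interface, class and method names are hypothetical.
public interface IPriceGrain : IGrainWithGuidKey
{
    Task DoWork();
}

public class PriceGrain : Grain, IPriceGrain
{
    public override Task OnActivateAsync()
    {
        // The timer callback itself runs interleaved, so keep it trivial:
        // re-enter the grain via its reference. That call is queued like
        // any other grain call and will not interleave.
        RegisterTimer(
            _ => this.AsReference<IPriceGrain>().DoWork(),
            null,
            TimeSpan.FromSeconds(5),
            TimeSpan.FromSeconds(5));

        return base.OnActivateAsync();
    }

    public Task DoWork()
    {
        // Runs as a normal turn; safe to call WriteStateAsync() here.
        return Task.CompletedTask;
    }
}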

martinothamar commented 6 years ago

Interesting, that definitely solves one of my issues, thanks. I would think stream messages are queued like normal; if not, I have some refactoring to do.

sergeybykov commented 6 years ago

I would think stream messages are queued like normal

They are. Timers are the only interleaving bug that has become a feature that some people now depend on.

martinothamar commented 6 years ago

Then I'm not sure what could go wrong. What could cause this exception for a grain functioning roughly like the one below? Is it always due to duplicate activations?

[StorageProvider(ProviderName = "SqlServerState")]
[ImplicitStreamSubscription("Namespace")]
[ImplicitStreamSubscription("Namespace2")]
[ImplicitStreamSubscription("Namespace3")]
[ImplicitStreamSubscription("Namespace4")]
[ImplicitStreamSubscription("Namespace5")]
public class SomeGrain : Grain<SomeGrainState>, ISomeGrain
{
    public override async Task OnActivateAsync()
    {
        var sp = GetStreamProvider("SMSProvider");
        var key = GetPrimaryKey();
        await Task.WhenAll(
            sp.GetStream<Thing>(key, "Namespace").SubscribeAsync((msg, _) => DoStuff()),
            sp.GetStream<Thing2>(key, "Namespace2").SubscribeAsync((msg, _) => DoStuff()),
            sp.GetStream<Thing3>(key, "Namespace3").SubscribeAsync((msg, _) => DoStuff()),
            sp.GetStream<Thing4>(key, "Namespace4").SubscribeAsync((msg, _) => DoStuff()),
            sp.GetStream<Thing5>(key, "Namespace5").SubscribeAsync((msg, _) => DoStuff())
        );
    }
    public async Task DoStuff()
    {
        try {
            ....
            await Task.WhenAll(someOtherStream.OnNextAsync(...), WriteStateAsync()); // throws InconsistentStateException
        } catch (Exception ex) {
            // Log exception
        }
    }
}

Hmm, looking at it now I'm wondering: since we pass delegates which call DoStuff in that way, do I get interleaving here after all? There's a bit of work awaiting db calls, other grain calls and such in the .... inside the try/catch block.

sergeybykov commented 6 years ago

It does look like you'll get multiple WriteStateAsync() calls interleaving because they are at the tails of fan-out calls that get joined via Task.WhenAll().

martinothamar commented 6 years ago

What, does the order of the calls in Task.WhenAll matter? (SomeGrain does not subscribe to someOtherStream if that's what you mean) How do I make DoStuff not interleave?

sergeybykov commented 6 years ago

Sorry, I think I might have misread the code. If the grain is non-reentrant (the default), then an InconsistentStateException could only be thrown if there's more than one WriteStateAsync() call in this try block that can interleave.

try {
    ....
    await Task.WhenAll(someOtherStream.OnNextAsync(...), WriteStateAsync()); // throws InconsistentStateException
}

Unless, of course, I'm missing something here.

martinothamar commented 6 years ago

That's the thing, WriteStateAsync is only called right there, and the grain is non-reentrant

sergeybykov commented 6 years ago

Hmm... that's puzzling. I wonder if @jason-bragg or @xiazen have some thoughts.

To answer your original question (one of them):

so the question is, what do you do when that happens? Do you have to catch it and call DeactivateOnIdle? Do you call ReadStateAsync and lose the pending data?

If you let InconsistentStateException escape, the grain should get automatically deactivated. That's the simplest solution, essentially equivalent to calling DeactivateOnIdle.

Alternatively, you can choose to refresh state via ReadStateAsync. Just make sure there aren't any private variables that would conflict with the refreshed state.
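
To illustrate the two options, here is a minimal sketch of a grain method that handles the conflict either by re-reading state or by deactivating. This is only one possible shape of the pattern discussed above; the LastPrice field on the state is hypothetical.

public async Task UpdateAsync(decimal latestPrice)
{
    State.LastPrice = latestPrice;
    try
    {
        await WriteStateAsync();
    }
    catch (Orleans.Storage.InconsistentStateException)
    {
        // Option 1: refresh state so the in-memory ETag matches storage again,
        // then re-apply the change and retry once.
        await ReadStateAsync();
        State.LastPrice = latestPrice;
        await WriteStateAsync();

        // Option 2 (instead of the above): give up on this activation and let
        // a fresh one load the current state.
        // DeactivateOnIdle();
        // throw;
    }
}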

jason-bragg commented 6 years ago

If you're catching the InconsistentStateException and not rethrowing it, the grain will not deactivate so it will stay in an inconsistent state forever. You either need to reload the state or kill the grain on InconsistentStateException. Otherwise, the pattern looks fine, as far as I can tell.

What stream provider are you using? It shouldn't affect the WriteStateAsync behavior but different stream providers support different recoverability patterns.

martinothamar commented 6 years ago

The SMSProvider currently; what is the recommended persistent stream these days?

martinothamar commented 6 years ago

That's kind of what I thought in terms of deactivating or reading state, but then I still don't know why it happens in the first place, and the state in that grain is pretty important. It's supposed to hold and store some last known numerical values; history isn't important. I'd be tempted to just override the ETag checks, but that's probably not possible in the provider as is.

jason-bragg commented 6 years ago

What is the recommended persistent stream these days?

The Event Hub stream provider is the most widely used. If recoverability is desired, but data loss is acceptable during silo crashes, the memory stream provider could be considered as well.

I still don't know why it happens in the first place

If it happens before the silo joins the cluster, it's more likely that duplicate activations will occur. When is the bootstrap logic called in the lifecycle?

Some services have used the subscription manager to set up subscriptions without activating the grain, allowing activity on the stream to activate the instance. Using implicit subscriptions would also allow such behavior.

Taking a second look at the activation logic, subscriptions are created on every activation, which could lead to duplicate subscriptions. I'd advocate checking for existing subscriptions prior to creating new ones. If existing subscriptions exist, resuming processing of the stream from those subscriptions is recommended, or at minimum removing the old subscriptions.
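
For explicit subscriptions, a check like the sketch below is one way to resume existing subscription handles instead of creating new ones on every activation (as noted further down in the thread, this does not apply to implicit subscriptions, which always have a single subscription). Provider name, namespace and handler are taken from the earlier example and are placeholders.

public override async Task OnActivateAsync()
{
    var stream = GetStreamProvider("SMSProvider")
        .GetStream<Thing>(this.GetPrimaryKey(), "Namespace");

    // Reuse any subscriptions left over from a previous activation
    // instead of piling up new ones.
    var handles = await stream.GetAllSubscriptionHandles();
    if (handles.Count > 0)
    {
        foreach (var handle in handles)
            await handle.ResumeAsync((msg, token) => DoStuff());
    }
    else
    {
        await stream.SubscribeAsync((msg, token) => DoStuff());
    }

    await base.OnActivateAsync();
}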

What storage provider are you using for your pubsub state? Are streams being generated or consumed from Orleans clients, or is this all silo-to-silo streaming?

veikkoeeva commented 6 years ago

@martinothamar I would be curious to know the reason too. Can you estimate the update frequency?

Regardless of the reason, it seems that the grain will be stuck once it gets a wrong version number. That can happen in some rare situations. Would it be more robust to deactivate or reload in the problem situation in any case?

martinothamar commented 6 years ago

@jason-bragg

The Event Hub stream provider is the most widely used. If recoverability is desired, but data loss is acceptable during silo crashes, the memory stream provider could be considered as well.

Thanks! Are there any examples on how to configure the Memory stream provider? I tried looking once but didn't find it.

If it happens before the silo joins the cluster it's more likely that duplicate activations will occur.

I thought of this too, but this happened when the cluster had already been up for hours; I'm pretty sure the activation was relatively new (minutes).

I'll refactor to resume existing streams.

Using memory grain storage for the stream provider with

sOpts.FireAndForgetDelivery = true;
sOpts.OptimizeForImmutableData = true;

It's all silo-to-silo communication.

@veikkoeeva

This grain has to do with sports events (think football/soccer), so it is called through streams every time something happens in a sporting event. The update frequency can therefore be high, though less than once a second - every 5 seconds, for example.

I'm not sure yet whether deactivating or reloading would be best. Maybe read, set the values, and then write again. I know this can be a little unsafe.

jason-bragg commented 6 years ago

Are there any examples on how to configure the Memory stream provider?

No examples. :/ Only test code. https://github.com/dotnet/orleans/blob/master/test/Tester/StreamingTests/MemoryStreamProviderClientTests.cs

The memory streams were developed for dev/test use, allowing developers of persistent streams to run and test their logic locally without each developer having to have their own configured persistent queue (which can get cumbersome and expensive). I am unaware of any team using them in production, but there is no reason they can't be used in production, assuming one understands that all the events are stored in memory so data loss can occur during silo failure.
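
In the absence of an official example, silo configuration for memory streams looks roughly like the sketch below. The API names here are my best recollection of the Orleans 2.x-era builder extensions and may differ between versions; "StockStream" is a placeholder name. Streams also need pubsub grain storage registered under "PubSubStore".

var silo = new SiloHostBuilder()
    // Pubsub state for stream subscriptions.
    .AddMemoryGrainStorage("PubSubStore")
    // In-memory stream provider; events are lost if the silo crashes.
    .AddMemoryStreams<DefaultMemoryMessageBodySerializer>("StockStream")
    .Build();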

Recoverability: For some insights into recovering from failures using recoverable streams, you may want to check out our stream recoverability test harness and some of the error faults generated in the grains. All of the cases we cover in the tests come from cases found by teams running production services where recovery was needed (using Event Hub).

Test Harness: https://github.com/dotnet/orleans/blob/master/test/Tester/StreamingTests/ImplicitSubscritionRecoverableStreamTestRunner.cs

EventHub tests using stream recoverablity test harness: https://github.com/dotnet/orleans/blob/master/test/Extensions/ServiceBus.Tests/Streaming/EHImplicitSubscriptionStreamRecoveryTests.cs

Grains simulating recovery error cases: https://github.com/dotnet/orleans/blob/master/test/Grains/TestGrains/ImplicitSubscription_NonTransientError_RecoverableStream_CollectorGrain.cs https://github.com/dotnet/orleans/blob/master/test/Grains/TestGrains/ImplicitSubscription_RecoverableStream_CollectorGrain.cs https://github.com/dotnet/orleans/blob/master/test/Grains/TestGrains/ImplicitSubscription_TransientError_RecoverableStream_CollectorGrain.cs

The above grains and tests use implicit subscriptions, so they may not always check for subscription prior to subscribing (they should, but may not), but you'll need to if you're using explicit subscriptions. The error cases may help you think through some possible error scenarios prior to going into production, especially those related to deduping events.

For recoverable streams, events are delivered from (inclusive) the sequence token provided during subscribe (or resume). Any error returned by OnNext will trigger redelivery of the same event until the operation succeeds or the retry logic gives up, in which case OnError will receive an error telling the grain that the event could not be delivered; the event will then be skipped and the stream will move to the next event, unless the grain takes some action (like subscribing to an earlier point, unsubscribing, ...?). This allows a grain to accumulate results for some period, persisting the accumulated state and the sequence token periodically (rather than on each call). This is possible because any error which causes a deactivation of the grain will rewind the stream to the previous token, allowing reprocessing of the events since the last checkpoint.
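
A rough sketch of that checkpointing idea, assuming a rewindable provider (e.g. Event Hub streams) and a grain state holding an accumulator plus the last persisted sequence token. All names here are made up; the recoverable-stream grains linked above are the authoritative examples.

public interface ICollectorGrain : IGrainWithGuidKey { }

public class CollectorState
{
    public int Count { get; set; }
    public StreamSequenceToken LastToken { get; set; }
}

public class CollectorGrain : Grain<CollectorState>, ICollectorGrain
{
    private int eventsSinceCheckpoint;

    public override async Task OnActivateAsync()
    {
        var stream = GetStreamProvider("EventHubProvider")
            .GetStream<Thing>(this.GetPrimaryKey(), "Namespace");

        // Resume from the last checkpointed token (null means "from now").
        await stream.SubscribeAsync(OnNext, State.LastToken);

        await base.OnActivateAsync();
    }

    private async Task OnNext(Thing item, StreamSequenceToken token)
    {
        // Accumulate in memory on every event.
        State.Count++;
        eventsSinceCheckpoint++;

        // Persist state + token only periodically; if the grain fails,
        // the stream rewinds to the last persisted token and the events
        // since that checkpoint are redelivered.
        if (eventsSinceCheckpoint >= 100)
        {
            State.LastToken = token;
            await WriteStateAsync();
            eventsSinceCheckpoint = 0;
        }
    }
}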

jason-bragg commented 6 years ago

@martinothamar, ack... sometimes my brain just doesn't work. You're using implicit subscriptions; for some reason I got it in my head that you were using explicit subscriptions. Please ignore my comments regarding resuming from existing subscriptions and the subscription manager. Implicit subscriptions always have only a single subscription, so your code is fine. I apologize for any confusion. :/

eramosr16 commented 3 years ago

Any updates here? I'm having this error blowing up in my face constantly, to the point that I might switch away from Orleans to another framework. I have a grain with a reminder that triggers every minute; the grain pulls a stock price, saves it in state, and raises an event. I tried Reentrant, AlwaysInterleave, everything - reading the state, not reading it at all. Very frustrating.

amccool commented 3 years ago

We use SMS (heavily) and ADO.NET storage on SQL Server. At one point we were using fire-and-forget delivery, and under heavier load we would sometimes get those inconsistent state exceptions. We switched to fire-and-forget = FALSE, and that reduced the exceptions to appearing only under ludicrous load (10k+ events simultaneously to the same grains). From what I understand about SMS, it was really about not fully awaiting a grain call, which in Orleans (IMHO) will lead to state problems, hence the inconsistent state exceptions.
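
For reference, the setting in question lives on the simple message stream provider options; a configuration sketch in the Orleans 2.x/3.x builder style (provider name and the siloBuilder variable are placeholders) might look like:

siloBuilder.AddSimpleMessageStreamProvider("SMSProvider", options =>
{
    // Await delivery to consumers before the producer's OnNextAsync completes,
    // instead of firing events off without waiting.
    options.FireAndForgetDelivery = false;
    options.OptimizeForImmutableData = true;
});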

eramosr16 commented 3 years ago

My temporary solution was moving to stateless grains and implementing separate storage for state. I found in the process that the database column where the state is stored had different values when the grain was stateless and when it was not; it seems to me that the issue is probably that adding state to the grain corrupts the data in that field - I mean, probably missing props or serialization not working properly.
Perhaps leaving one column only for the state of the grain and another for the underlying information the grain needs would do the trick.

Now the errors have gone away and everything works like a charm.

veikkoeeva commented 3 years ago

Is there something to solve here? This is quite an old thread, but I think the concerns should be recorded and hopefully written up in the documentation explaining why and how things work the way they do. Maybe we can also discover improvements to how the storage works.

Likely the issue here is that the storage is called with an etag/version number that does not match the one in the storage, in which case the version number check fails. That means some other instance has updated it. As the storage is indexed by {grain type, grain identifier, service identifier}, it probably means there are two or more of those, which should happen fairly rarely with ordinary grains. If the case here is stateless worker grains, Orleans can create many of them, and most likely some of them end up holding stale versions (= old version numbers) and fail to update the DB as a consequence: https://dotnet.github.io/orleans/docs/grains/stateless_worker_grains.html.
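
Conceptually, the write path behaves like the optimistic concurrency sketch below (an illustration of the idea only, not the actual ADO.NET provider code; GrainStateRecord and its fields are hypothetical): the update only succeeds when the caller's version matches what is stored, and a mismatch surfaces as InconsistentStateException.

// Illustration only - not the real storage provider implementation.
public class GrainStateRecord
{
    public int Version { get; set; }
    public string PayloadJson { get; set; }
}

Task WriteAsync(GrainStateRecord stored, string incomingETag, string newPayloadJson)
{
    if (stored != null && stored.Version.ToString() != incomingETag)
    {
        // Another activation (or an earlier write) bumped the version:
        // neither UPDATE nor INSERT applies, so the write is rejected.
        throw new Orleans.Storage.InconsistentStateException(
            $"Version conflict (WriteState): stored={stored.Version} provided={incomingETag}");
    }

    // Otherwise persist newPayloadJson and increment the version (new ETag).
    return Task.CompletedTask;
}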

eramosr16 commented 3 years ago

Well, I appreciate your explanation of what might be the issue. My intention was to call attention to this particular problem, which causes the grain to go into a corrupted state and causes trouble activating or calling the grain afterwards until the cluster resets the status. To me it seems like a bug and a big issue, especially in critical applications where you have to make sure your code executes - like in this particular case, which is checking stock market prices. I believe the storage provider implementation could be improved; the evidence is that I went on to use my own storage provider to handle the state and made all grains stateless, and the server started working OK. I appreciate all your efforts and please keep up the good work. Have a great day.

veikkoeeva commented 3 years ago

@eramosr16 When you have time, do share thoughts and ideas how things could be improved. It's of course possible there is something that could work also in a general case.

That being noted, I do not know what "corrupted status" means here. Checking the version while writing is done to prevent data corruption. Clearing the state or removing the row should fix it (but clearing also checks the state version - maybe it shouldn't?).

amccool commented 3 years ago

@veikkoeeva IMHO the InconsistentStateException only happens when grain awaits are not handled properly in application code, and under thread starvation situations. (Yes, you will likely run out of SQL pool connections too. :) )

veikkoeeva commented 3 years ago

@amccool That is good to know and record also here. :)

Running out of connections is a good thing to note. I'll link https://github.com/dotnet/orleans/issues/6944, and specifically https://github.com/dotnet/orleans/issues/6944#issuecomment-776005911, which may help ensure there are some connections available for important cases.

eramosr16 commented 3 years ago

My apologies, English is not my native language. "Corrupt state" for me means that the grain keeps raising an exception due to the invalid state. The solutions you suggest are OK, but the goal is to have something a little more robust that doesn't fall into that kind of state; for example, this issue caused me to miss several price checks until the grain restarted.

I was not doing anything fancy or hardcore programming, just a grain with a reminder that pulls prices every minute, stores those prices, and sends an event afterwards.

Anyway, if you think there is nothing we can do here, feel free to ignore my comments.

Have a great day!!.

veikkoeeva commented 3 years ago

@eramosr16 No problem, mine either. It's just that I'm trying to see if there's something we could improve.

Maybe one strategy you could try is combining timers and reminders to decrease load on the database, as described at https://dotnet.github.io/orleans/docs/grains/timers_and_reminders.html#combining-timers-and-reminders. Further, add indexes to the values that are being read, e.g. https://github.com/dotnet/orleans/blob/master/src/AdoNet/Orleans.Reminders.AdoNet/SQLServer-Reminders.sql#L103. These actions should reduce load on the database and make queries faster. Also avoid operations that introduce etag mismatches (e.g. don't use persistence with stateless worker grains).
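
A rough sketch of the combined timer/reminder approach for the price-polling case (grain, interface, state and the FetchPriceFromApi helper are all hypothetical names): the reminder only guarantees the grain is periodically re-activated, a cheap in-memory timer does the frequent polling, and state is written far less often than once per poll.

public interface IStockGrain : IGrainWithGuidKey
{
    Task PollPrice();
}

public class StockState
{
    public decimal LastPrice { get; set; }
}

public class StockGrain : Grain<StockState>, IStockGrain, IRemindable
{
    private int pollsSincePersist;

    public override async Task OnActivateAsync()
    {
        // Coarse-grained reminder: keeps the grain coming back across deactivations.
        await RegisterOrUpdateReminder("keep-alive", TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(5));

        // Fine-grained timer: polls frequently, re-entering via the grain
        // reference so the work does not interleave.
        RegisterTimer(_ => this.AsReference<IStockGrain>().PollPrice(), null,
            TimeSpan.FromMinutes(1), TimeSpan.FromMinutes(1));

        await base.OnActivateAsync();
    }

    public async Task PollPrice()
    {
        State.LastPrice = await FetchPriceFromApi();
        if (++pollsSincePersist >= 5)
        {
            await WriteStateAsync(); // persist only every 5th poll to cut DB load
            pollsSincePersist = 0;
        }
    }

    public Task ReceiveReminder(string reminderName, TickStatus status) => Task.CompletedTask;

    private Task<decimal> FetchPriceFromApi() => Task.FromResult(0m); // placeholder for the real price lookup
}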

For the record, I think the DB should be indexed by default. I should finally investigate this if no one else will. At least as a temporary measure, I think any index will help; one can later think about refactoring those tables and introducing purpose-built indexes.

eramosr16 commented 3 years ago

Thanks for your suggestions. I'll keep monitoring the silo closely; so far the trick has been using my own store to keep state and moving all grains to stateless, and I'll keep it that way in the meantime.

Thanks again for your time and help. Have an excellent day.

ghost commented 2 years ago

We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.

ghost commented 2 years ago

This issue has been marked stale for the past 30 days and is being closed due to lack of activity.