dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

WriteStateAsync() works a lot slower since upgrading to Orleans2 #4612

Closed Gabrn closed 2 years ago

Gabrn commented 6 years ago

Since upgrading to Orleans 2 we have seen general slowness in writing grain state. We tried to pin down the issue and measured the time it takes us to write JSON blobs of about 1 MB each to Azure Blob storage; the mean went up from roughly 100 ms to 400 ms at peak times. We checked the blob storage to see whether the data had grown, but it was about the same size (1 MB). We suspected the upgrade to Orleans 2 was the cause, and when we downgraded back to Orleans 1 the write time returned to normal (this was the only change we made), so if the cause were something else the downgrade should not have affected it.

This is the graph of the WriteStateAsync() time from one of our grains, "expert-grain": [screenshot: WriteStateAsync latency graph, 2018-05-20]

We noticed that after restarting the nodes the write time goes down for a few hours before going up again (we used a script to restart the nodes every 4 hours for that reason).

Any suggestions on how we can investigate this further?

sergeybykov commented 6 years ago

Sorry about the delayed response.

Are you running on .NET Core or Full Framework?

In theory, the Orleans runtime shouldn't get in the way of writing state to storage. That's almost a straight pass-through.

There was an upgrade in WindowsAzure.Storage from 7.0 to 8.2.1. Can you try running the Orleans 1.x version (BTW, which one are you running there) with WindowsAzure.Storage 8.2.1 to eliminate the possibility that this upgrade caused something?

Are you seeing any memory leak in those few hours?

talarari commented 6 years ago

Hi @sergeybykov, I work with @Gabrn. Looks like we're using WindowsAzure.Storage 8.6.0; the latest version is 9.2.0. We're on Orleans 1.5.2.

sergeybykov commented 6 years ago

So with WindowsAzure.Storage 8.6.0, 1.5.2 runs fine, but 2.0 with the same 8.6.0 on the full .NET Framework runs slower?

talarari commented 6 years ago

We're experiencing some slowdown with WriteStateAsync on 1.5.2 as well. It might be the same problem, but I'm not sure.

As traffic grows we make a lot of WriteStateAsync calls; we're using Azure Blob storage. We see write requests to blob storage averaging around 150 ms, but slowing to 3+ seconds every now and then. After a while our monitoring shows WriteStateAsync getting slower and slower until we hit grain timeouts and nothing works anymore. Restarting all the cluster nodes fixes this for a short time, until it happens again.

I suspect these slow requests tie up threads or something and the scheduler gets overloaded to the point where things break.

The way our system currently uses SMS streams also means that when a grain is stuck writing to blob storage it becomes a bottleneck, since we await the publish from the grain that brings work into the cluster.

If you can shed some light on what could be done here, that would be awesome.

talarari commented 6 years ago

Any way for us to monitor this and confirm it's really the problem would also be great; this is our current hunch, but we'd like to verify it.

We're also working on switching to a more performant storage, but that will take a while.

jason-bragg commented 6 years ago

@talarari, @Gabrn, unfortunately our Azure Blob storage provider does not have much detailed logging in it, even at higher logging levels. To get more information about what's causing this, I think you'll need a custom storage provider.

I'd suggest either copying the AzureBlobGrainStorage implementation or modifying it in your own branch and adding timing logging around the lowest-level Azure calls. With measurements of how long the grain's write-state call takes and how long the Azure call takes, you'll be able to determine whether the performance degradation is coming from the provider or from Azure.
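As a rough sketch of that approach (assuming the Orleans 2.x IGrainStorage interface; TimedGrainStorage and the 500 ms threshold are arbitrary names/values, not anything from the thread), a decorator over the existing provider can time each write:

```csharp
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;
using Orleans;
using Orleans.Runtime;
using Orleans.Storage;

// Hypothetical timing decorator around an existing IGrainStorage instance
// (e.g. the AzureBlobGrainStorage implementation you copied or resolved from DI).
public class TimedGrainStorage : IGrainStorage
{
    private readonly IGrainStorage inner;
    private readonly ILogger<TimedGrainStorage> logger;

    public TimedGrainStorage(IGrainStorage inner, ILogger<TimedGrainStorage> logger)
    {
        this.inner = inner;
        this.logger = logger;
    }

    public Task ReadStateAsync(string grainType, GrainReference grainReference, IGrainState grainState)
        => this.inner.ReadStateAsync(grainType, grainReference, grainState);

    public Task ClearStateAsync(string grainType, GrainReference grainReference, IGrainState grainState)
        => this.inner.ClearStateAsync(grainType, grainReference, grainState);

    public async Task WriteStateAsync(string grainType, GrainReference grainReference, IGrainState grainState)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            await this.inner.WriteStateAsync(grainType, grainReference, grainState);
        }
        finally
        {
            stopwatch.Stop();
            // Arbitrary threshold: only log writes that look slow.
            if (stopwatch.ElapsedMilliseconds > 500)
            {
                this.logger.LogWarning(
                    "Slow WriteStateAsync for {GrainType} {Grain}: {ElapsedMs} ms",
                    grainType, grainReference, stopwatch.ElapsedMilliseconds);
            }
        }
    }
}
```

Comparing these numbers with a second stopwatch placed directly around the blob upload call inside the copied provider should show whether the time is spent in the provider/scheduler or in Azure itself.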

talarari commented 6 years ago

That's a good idea, we'll try that. I see that the Azure Blob storage implementation uses ConfigureAwait(false) quite a lot. According to this: https://dotnet.github.io/orleans/Documentation/Advanced-Concepts/External-Tasks-and-Grains.html that's supposed to break out of the Orleans scheduler, so why is it used in the storage provider?

jason-bragg commented 6 years ago

Why is this (ConfigureAwait(false)) used in the storage provider then?

It is expected to be faster, though I've not profiled it. It should be safe to return on a non-Orleans threading context because that provider doesn't (shouldn't?) have any logic that expects or relies upon the Orleans thread protections. I haven't looked closely at the provider, but it should be written in a thread-safe manner. If it is not, that may be an issue we should look at.
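For context, the pattern in question inside a storage provider looks roughly like this (illustrative only; the helper and class names are made up, while CloudBlockBlob.UploadTextAsync is part of the WindowsAzure.Storage package):

```csharp
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Blob;

static class BlobWriteSketch
{
    // Awaiting the blob call with ConfigureAwait(false) lets the continuation
    // resume on a thread-pool thread instead of hopping back onto the Orleans
    // task scheduler. That is safe as long as the code after the await does not
    // rely on Orleans' single-threaded grain execution guarantees.
    public static async Task WriteBlobAsync(CloudBlobContainer container, string blobName, string json)
    {
        CloudBlockBlob blob = container.GetBlockBlobReference(blobName);
        await blob.UploadTextAsync(json).ConfigureAwait(false);
    }
}
```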

talarari commented 6 years ago

OK, that makes sense, thanks.

jason-bragg commented 6 years ago

@Gabrn, @talarari, Anything new on this?

MittalNitish commented 6 years ago

@jason-bragg, we are also facing intermittent slowness while executing WriteStateAsync(). In our case it is a bit weird. I enabled Azure Storage Analytics to check whether there is a latency issue (server or E2E) but didn't find any delay due to latency. I also enabled TRACE logging for the Orleans.Storage.AzureTableGrainStorageFactory.BlobStore namespace. Following is my analysis:

EventName="DirectWrite" Message="2018-07-08 19:42:34.656 +00:00 [Verbose] Orleans.Storage.AzureTableGrainStorageFactory.BlobStore Writing: GrainType=Platform.AwsomeGrainClass Grainid=GrainReference:*grn/91DE5597/b9d62ab7 ETag="0x8D5E50AE1FBD528" to BlobName=Platform.AwsomeGrainClass-GrainReference=42470cfd94f0477e32abd484b9d62ab703ffffff91de5597.json in Container=grainstate141"

EventName="DirectWrite" Message="2018-07-08 19:44:40.210 +00:00 [Verbose] Orleans.Storage.AzureTableGrainStorageFactory.BlobStore Written: GrainType=Platform.AwsomeGrainClass Grainid=GrainReference:*grn/91DE5597/b9d62ab7 ETag="0x8D5E50B3EB36A27" to BlobName=Platform.AwsomeGrainClass-GrainReference=42470cfd94f0477e32abd484b9d62ab703ffffff91de5597.json in Container=grainstate141"

It took almost 2m6s to write the grain state. From the Storage Analytics logs I found this (matched by ETag): write start time 2018-07-08T19:44:40.2113752Z, E2E latency 7 ms, server latency 7 ms. From the logs I see that the write operation from Orleans was supposed to start at around 19:42:34 but actually started at 19:44:40.2113752Z. Since the E2E latency is only 7 ms, I guess the request was stuck on the Orleans host side. The weird thing is that whenever it happens, the delay is always 2m6s. It would be awesome if you could help identify the cause of the slowness. We are using Orleans 2.0 and .NET 4.6.2.

jason-bragg commented 6 years ago

@MittalNitish hmm, not seeing this on my side. I added a benchmark test which calls a persistent grain as many times as it can in a 30-second window. Storage calls against Azure Blob take, on average, between 50-60 ms. That does seem high to me, as I expected 10-30 ms, but it's still not close to what you're seeing.

Please check out #4763 when you get time. It would be very helpful if you could help identify what's different about your environment.
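For anyone wanting to reproduce that kind of measurement against their own cluster, a client-side loop along these lines works (this is not the benchmark added in #4763; IExpertGrain, WriteSomethingAsync, and WriteLatencyProbe are placeholders for whatever grain and client wiring you already have):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;
using Orleans;

// Placeholder grain interface: a method that calls WriteStateAsync internally.
public interface IExpertGrain : IGrainWithGuidKey
{
    Task WriteSomethingAsync();
}

public static class WriteLatencyProbe
{
    // Calls the grain as many times as possible for 30 seconds and reports latencies.
    public static async Task RunAsync(IClusterClient client)
    {
        var grain = client.GetGrain<IExpertGrain>(Guid.NewGuid());
        var latencies = new List<long>();
        var deadline = DateTime.UtcNow + TimeSpan.FromSeconds(30);
        var sw = new Stopwatch();

        while (DateTime.UtcNow < deadline)
        {
            sw.Restart();
            await grain.WriteSomethingAsync();
            latencies.Add(sw.ElapsedMilliseconds);
        }

        Console.WriteLine($"calls={latencies.Count} avg={latencies.Average():F1} ms max={latencies.Max()} ms");
    }
}
```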

talarari commented 6 years ago

I think I can add some more info here. We added a 3-second client timeout using a cancellation token with the blob client, and we're hitting that timeout here and there, mostly under load. I did some digging and looked at how AzCopy (Azure's data-movement CLI) works so fast compared to what we're seeing. Grain states are really small, so we're uploading a bunch of small files; there's a setting called ServicePointManager.DefaultConnectionLimit that defaults to 2, and AzCopy sets it to 8 times the number of cores.
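To illustrate the knob being described (a sketch, not something from the thread; the ×8 multiplier just mirrors the AzCopy heuristic above): on the full .NET Framework the per-endpoint connection cap is ServicePointManager.DefaultConnectionLimit, and it can be raised at process startup before any storage clients are created.

```csharp
using System;
using System.Net;

// Raise the per-endpoint HTTP connection limit early in startup, before the
// Azure Storage client opens any connections. The framework default for a
// console/silo process is 2, which throttles many small concurrent blob uploads.
ServicePointManager.DefaultConnectionLimit = Environment.ProcessorCount * 8;
```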

Haven't tested that out in production but that could explain some things.

For the benchmark I would stress the whole system, not just one grain; do as much work as you can and you'll start seeing things slow down.

MittalNitish commented 6 years ago

@talarari, I checked that; it looks like PerformanceTuningOptions sets DefaultConnectionLimit to 200 by default. I also checked whether it is getting overwritten anywhere in the application, but that is not the case either. https://github.com/dotnet/orleans/blob/9614355b68d42521b6c385b73e55a1cad6bdaef6/src/Orleans.Core/Configuration/Options/PerformanceTuningOptions.cs
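For reference, a sketch of overriding that value through the Orleans options pattern, assuming the Orleans 2.x SiloHostBuilder and the Orleans.Configuration namespace for PerformanceTuningOptions (the ×8 value is just the AzCopy-style heuristic from above, not a recommendation):

```csharp
using System;
using Orleans.Configuration;
using Orleans.Hosting;

// Sketch: bump the connection limit via PerformanceTuningOptions instead of
// setting ServicePointManager directly, so Orleans applies it during startup.
var builder = new SiloHostBuilder()
    .Configure<PerformanceTuningOptions>(options =>
        options.DefaultConnectionLimit = Environment.ProcessorCount * 8);
```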

Plasma commented 5 years ago

This may not be the cause, but the blob storage client will checksum/MD5 data before upload (CPU time), and if you're using any synchronous storage methods rather than their async versions, that can slow things down.
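A sketch of both points, assuming the WindowsAzure.Storage client (UploadTextAsync and BlobRequestOptions.StoreBlobContentMD5 are APIs of that package; the helper name is made up). Whether skipping the MD5 content hash is acceptable depends on your integrity requirements:

```csharp
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Blob;

static class BlobUploadSketch
{
    // Prefer the async upload over the synchronous UploadText, and optionally
    // skip computing/storing the MD5 content hash to save CPU on each write.
    public static Task UploadStateAsync(CloudBlockBlob blob, string json)
    {
        var options = new BlobRequestOptions { StoreBlobContentMD5 = false };
        // encoding, access condition, and operation context are left as defaults
        return blob.UploadTextAsync(json, null, null, options, null);
    }
}
```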

ghost commented 2 years ago

We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.

ghost commented 2 years ago

This issue has been marked stale for the past 30 days and is being closed due to lack of activity.