Closed — @Gabrn closed this issue 2 years ago.
Sorry about delayed response.
Are you running on .NET Core or Full Framework?
In theory, the Orleans runtime shouldn't get in the way of writing state to storage. That's almost a straight pass-through.
There was an upgrade in WindowsAzure.Storage from 7.0 to 8.2.1. Can you try running the Orleans 1.x version (BTW, which one are you running there?) with WindowsAzure.Storage 8.2.1 to eliminate the possibility that this upgrade caused something?
Are you seeing any memory leak in those few hours?
Hi @sergeybykov, I work with @Gabrn. It looks like we're using WindowsAzure.Storage 8.6.0; the latest version is 9.2.0. We're using Orleans 1.5.2.
So with WindowsAzure.Storage 8.6.0, Orleans 1.5.2 runs fine, but 2.0 with the same 8.6.0 on the full .NET Framework runs slower?
We're experiencing some slowdown in WriteStateAsync on 1.5.2 as well. It might be the same problem, but I'm not sure.
As we get higher traffic we make a lot of WriteStateAsync calls; we're using Azure Blob storage. We see write requests to blob storage averaging 150 ms, but they slow to 3+ seconds every now and then. After a while, our monitoring of WriteStateAsync shows slower and slower times until we hit a point where we get grain timeouts and nothing works anymore. Restarting all the cluster nodes fixes this for a short time until it happens again.
I'm suspecting these slow requests take up threads or something and the scheduler gets overloaded to a point where things break.
The way our system currently works with SMS streams also means that when a grain is stuck writing to blob storage it acts as a bottleneck, since we await the publish from the grain that brings work into the cluster.
If you could shed some light on what could be done here, that would be awesome. A way for us to monitor this and verify that this is really the problem would also be great; this is our current hunch, but we'd like to confirm it.
We're also working on switching to a more performant storage, but that will take a while.
@talarari, @Gabrn, unfortunately our Azure Blob storage provider does not have much detailed logging in it, even at higher logging levels. To get more info on what's causing this, I think you'll need a custom storage provider.
I suggest either copying the AzureBlobGrainStorage implementation or modifying it in your own branch, adding timing logging around the lowest-level Azure calls. With measurements of how long the grain's write-state call takes versus how long the Azure call takes, you'll be able to determine whether the performance degradation is coming from the provider or from Azure.
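For illustration, a minimal sketch of that kind of instrumentation, assuming the WindowsAzure.Storage `CloudBlockBlob` API that the provider uses; `Serialize`, `BlobName`, `container`, and `logger` stand in for the copied provider's own members:

```csharp
// Hypothetical timing wrapper inside a copied AzureBlobGrainStorage.
// Only Stopwatch and the CloudBlockBlob calls are real APIs here; the
// helper names are placeholders for the provider's existing members.
public async Task WriteStateAsync(string grainType, GrainReference grainRef, IGrainState grainState)
{
    var total = System.Diagnostics.Stopwatch.StartNew();

    var json = this.Serialize(grainState);  // provider's existing serialization step
    var blob = this.container.GetBlockBlobReference(BlobName(grainType, grainRef));

    var azureCall = System.Diagnostics.Stopwatch.StartNew();
    await blob.UploadTextAsync(json).ConfigureAwait(false);
    azureCall.Stop();
    total.Stop();

    // Comparing the two durations shows whether time is lost in the
    // provider/scheduler or in the Azure call itself.
    this.logger.Info(
        "WriteStateAsync for {0}: {1}ms total, {2}ms in the Azure call",
        grainRef, total.ElapsedMilliseconds, azureCall.ElapsedMilliseconds);
}
```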
That's a good idea; we'll try that. I see that the Azure Blob storage implementation uses ConfigureAwait(false) quite a lot. According to https://dotnet.github.io/orleans/Documentation/Advanced-Concepts/External-Tasks-and-Grains.html, that's supposed to break out of the Orleans scheduler. Why is it used in the storage provider, then?
> Why is this (ConfigureAwait(false)) used in the storage provider then?
It is expected to be faster, though I've not profiled it. It should be safe to resume on a non-Orleans threading context because that provider doesn't (shouldn't?) have any logic that expects or relies upon the Orleans thread protections. I've not looked closely at the provider, but it should be written in a thread-safe manner. If it is not, that may be an issue we should look at.
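A small sketch of the pattern under discussion, assuming the `CloudBlockBlob` API; the method name is illustrative:

```csharp
// After ConfigureAwait(false), the continuation may resume on a plain
// thread-pool thread rather than the Orleans scheduler. That is safe here
// only because nothing after the await touches grain state or any other
// construct that relies on Orleans' single-threaded execution guarantee.
private static async Task<string> ReadBlobAsync(CloudBlockBlob blob)
{
    // Before the await: runs on the caller's context
    // (possibly an Orleans worker thread).
    var text = await blob.DownloadTextAsync().ConfigureAwait(false);

    // After the await: may run on an arbitrary thread-pool thread,
    // so only context-free code belongs here.
    return text;
}
```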
Ok that makes sense thanks
@Gabrn, @talarari, Anything new on this?
@jason-bragg, we are also facing intermittent slowness while executing WriteStateAsync(). In our case it is a bit weird. I enabled Azure Storage Analytics to check whether there was any latency issue (server or E2E) but didn't find any delay due to latency. I also enabled TRACE logging for the Orleans.Storage.AzureTableGrainStorageFactory.BlobStore namespace. The following is my analysis:
EventName="DirectWrite" Message="2018-07-08 19:42:34.656 +00:00 [Verbose] Orleans.Storage.AzureTableGrainStorageFactory.BlobStore Writing: GrainType=Platform.AwsomeGrainClass Grainid=GrainReference:*grn/91DE5597/b9d62ab7 ETag="0x8D5E50AE1FBD528" to BlobName=Platform.AwsomeGrainClass-GrainReference=42470cfd94f0477e32abd484b9d62ab703ffffff91de5597.json in Container=grainstate141"
EventName="DirectWrite" Message="2018-07-08 19:44:40.210 +00:00 [Verbose] Orleans.Storage.AzureTableGrainStorageFactory.BlobStore Written: GrainType=Platform.AwsomeGrainClass Grainid=GrainReference:*grn/91DE5597/b9d62ab7 ETag="0x8D5E50B3EB36A27" to BlobName=Platform.AwsomeGrainClass-GrainReference=42470cfd94f0477e32abd484b9d62ab703ffffff91de5597.json in Container=grainstate141"
It took almost 2m6s to write the grain state. From the storage analytics logs I found this (based on the ETag): write start time 2018-07-08T19:44:40.2113752Z, E2E latency 7ms, server latency 7ms. From the logs I see that the write operation from Orleans was supposed to start at around 19:42:34 but actually started at 19:44:40.2113752Z. Since the E2E latency is only 7ms, I guess the request was stuck on the Orleans host side. The weird thing is that whenever it happens, the delay is always 2m6s. It would be awesome if you could help identify the cause of the slowness. We are using Orleans 2.0 and .NET 4.6.2.
@MittalNitish hmm, not seeing this on my side. I added a benchmark test which calls a persistent grain as many times as it can in a 30-second window. Storage calls against Azure Blob take, on average, between 50-60ms. This does seem high to me, as I expected 10-30ms, but it's still not close to what you're seeing.
Please check out #4763 when you get time. It would be very helpful if you could help identify what's different about your environment.
I think I can add some more info here. We added a 3-second client timeout using a cancellation token with the blob client, and we're hitting that timeout here and there, mostly under load. I did some digging and looked at how AzCopy (Azure's data-movement CLI) works so fast compared to what we're seeing. Grain states are really small, so we're uploading a bunch of small files, and there's a setting, ServicePointManager.DefaultConnectionLimit, that defaults to 2; AzCopy sets it to 8 times the number of cores.
Haven't tested that out in production but that could explain some things.
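For reference, a minimal sketch of that experiment on .NET Framework, borrowing the multiplier AzCopy uses; the extra two settings are common storage-client tuning suggestions, not something we've verified here:

```csharp
using System;
using System.Net;

// Hypothetical tuning applied before the silo starts. On .NET Framework,
// DefaultConnectionLimit caps concurrent HTTP connections per endpoint;
// a low limit can queue up many small parallel blob uploads.
ServicePointManager.DefaultConnectionLimit = 8 * Environment.ProcessorCount;

// Often suggested alongside it for small-payload storage traffic:
ServicePointManager.Expect100Continue = false;   // skip the 100-continue handshake
ServicePointManager.UseNagleAlgorithm = false;   // don't batch small writes
```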
For the benchmark, I would stress the whole system, not just one grain; do as much work as you can and you'll start seeing things become slow.
@talarari, I checked that; it looks like PerformanceTuningOptions sets DefaultConnectionLimit to 200 by default. I also checked whether it is getting overwritten anywhere in the application, but that is not the case either. https://github.com/dotnet/orleans/blob/9614355b68d42521b6c385b73e55a1cad6bdaef6/src/Orleans.Core/Configuration/Options/PerformanceTuningOptions.cs
This may not be the cause, but the blob storage client will checksum (MD5) the data before upload (CPU time), and if you're using any synchronous storage methods rather than their async versions, that can slow things down.
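A sketch of both points, assuming the WindowsAzure.Storage `CloudBlockBlob` API; the `blob` and `json` variables are illustrative:

```csharp
using System.Text;
using Microsoft.WindowsAzure.Storage.Blob;

// The client-side content-MD5 calculation can be turned off per request
// via BlobRequestOptions, and the async upload avoids blocking a thread
// the way the synchronous UploadText would.
var options = new BlobRequestOptions { StoreBlobContentMD5 = false };

await blob.UploadTextAsync(
    json,
    Encoding.UTF8,
    accessCondition: null,
    options: options,
    operationContext: null).ConfigureAwait(false);
```

Whether skipping the MD5 computation is acceptable depends on whether you rely on it for transport-integrity checks, so treat this as an experiment rather than a recommendation.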
We are marking this issue as stale due to the lack of activity in the past six months. If there is no further activity within two weeks, this issue will be closed. You can always create a new issue based on the guidelines provided in our pinned announcement.
This issue has been marked stale for the past 30 days and is being closed due to lack of activity.
Since upgrading to Orleans 2 we have had general slowness in writing grain state. We tried to pin down the issue: we measured the time it took to write JSON documents of about 1 MB each to Azure Blob storage, and the mean went up from about 100 ms to 400 ms at peak times. We looked in the blob storage to see whether the data had gotten any larger, but it was about the same size (1 MB). We suspected the upgrade to Orleans 2 to be the cause, and when we downgraded back to Orleans 1 the write time went back to normal (this was the only change we made), so if the cause were something else the downgrade should not have affected it.
This is the graph of the WriteStateAsync() time from one of our grains, "expert-grain" (graph image not reproduced here). We noticed that after restarting the nodes the write time goes down for a few hours before going up again (we used a script to restart the nodes every 4 hours for that reason).
Any suggestions to how we can investigate this further?