Open turowicz opened 10 hours ago
Hi @turowicz , are you able to provide more info? Situations like this are hard to diagnose from counters alone, but they are a good first indicator. CPU profiling traces or a memory dump from the malperforming process would be very useful. With that, we should be able to quickly identify the issue. If it's an Orleans issue, it's most likely the directory cache, which we are hoping to replace soon. I wouldn't be surprised if it were something else, though, like logging or IO.
We have noticed a large correlation between the .NET contention rate and Orleans storage write latency. We are using a custom StorageProvider that is fully async and are trying to pinpoint where are all the threads being locked. We get logs that .NET is hanging.
The situation is caused by temporarily slow storage backend, that we are in progress of upgrading. That said, I don't think we should be having a high contention rate just because some async I/O calls are taking longer than usual.
Are you able to tell if there is something inside Orleans that is causing a sync call that saturate the thread pool?
As you can see below, as soon there is any discernable latency bump, the contention explodes through the roof.
We have a high throughput system using BroadcastChannels. Grains are long lived and all state writes happen on 5 minute timers.
Latency last 2 days:
Contention last 2 days: (some of them have gaps because contention caused .NET to freeze completely.)
cc @ReubenBond