dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Performance issues using Streams as backchannels to Frontend Servers #5625

Closed mikkoairaksinen closed 5 years ago

mikkoairaksinen commented 5 years ago

We are currently load testing our Orleans cluster and are experiencing some performance issues with streams. The general architecture is split into three major components: Mobile Clients, Frontend Servers and Silohosts.

Mobile clients connect to Frontend Servers over a TCP socket. Each client has one PlayerGrain in the Silohost containing the state of the player. Each player may belong to at most one Guild; Guilds are represented by GuildGrains. Once a Mobile Client opens a TCP connection to a Frontend Server, a PlayerConnection object is created in the Frontend, which then calls SubscribeAsync on a stream (SimpleMessageStreamProvider) in the Player namespace and on a stream in the Guild namespace. These streams are then used by PlayerGrains and GuildGrains to push data to the mobile clients. When the PlayerConnection object is disposed, we call Unsubscribe on the stream handles.
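
For illustration, a minimal sketch of that subscription lifecycle (the ChatMessage type, the provider name "SMSProvider", and the PlayerConnection member names here are assumptions, not our actual code):

```csharp
using System;
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

// Hypothetical payload type, for illustration only.
public class ChatMessage { public string Text { get; set; } }

// One of these is created per TCP connection on the Frontend Server.
public class PlayerConnection
{
    private readonly IClusterClient _client;
    private StreamSubscriptionHandle<ChatMessage> _playerSub;
    private StreamSubscriptionHandle<ChatMessage> _guildSub;

    public PlayerConnection(IClusterClient client) => _client = client;

    public async Task SubscribeAsync(Guid playerId, Guid guildId)
    {
        var provider = _client.GetStreamProvider("SMSProvider"); // provider name is an assumption

        // One subscription in the Player namespace...
        _playerSub = await provider
            .GetStream<ChatMessage>(playerId, "Player")
            .SubscribeAsync((msg, token) => PushToSocketAsync(msg));

        // ...and one in the Guild namespace.
        _guildSub = await provider
            .GetStream<ChatMessage>(guildId, "Guild")
            .SubscribeAsync((msg, token) => PushToSocketAsync(msg));
    }

    // Called when the PlayerConnection is disposed (the socket closed).
    public async Task UnsubscribeAsync()
    {
        if (_playerSub != null) await _playerSub.UnsubscribeAsync();
        if (_guildSub != null) await _guildSub.UnsubscribeAsync();
    }

    // Writes the message to the mobile client's TCP socket (stubbed out here).
    public Task PushToSocketAsync(ChatMessage msg) => Task.CompletedTask;
}
```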

In the load test, we create mock mobile clients that register a new player, join a guild, and start sending chat messages every ~5 seconds. When the GuildGrain receives a message, it pushes it to the guild stream, and all subscribed PlayerConnections forward it to their mobile clients via the TCP socket.
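
The producer side looks roughly like this (same assumptions, reusing the ChatMessage type from the sketch above):

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

public interface IGuildGrain : IGrainWithGuidKey
{
    Task SendChatAsync(ChatMessage message);
}

// Sketch of the producer: the GuildGrain pushes each chat message onto its
// Guild-namespace stream; every subscribed PlayerConnection then receives it.
public class GuildGrain : Grain, IGuildGrain
{
    private IAsyncStream<ChatMessage> _guildStream;

    public override Task OnActivateAsync()
    {
        var provider = GetStreamProvider("SMSProvider"); // same assumed provider name
        _guildStream = provider.GetStream<ChatMessage>(this.GetPrimaryKey(), "Guild");
        return base.OnActivateAsync();
    }

    public Task SendChatAsync(ChatMessage message)
        // With SMS there is no queue in between, so this await typically completes
        // only after the subscribers have been invoked.
        => _guildStream.OnNextAsync(message);
}
```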

The weird behaviour shows up when we change the number of players per guild between 10 and 30. The performance of the Silohost drops drastically (approximately a 50% drop in concurrently online player capacity) if we increase the players per guild to 30. The only effect this has is that each chat message is now sent to 3x as many subscribers on the Frontend; all other variables should be the same. The Frontend Server's performance doesn't seem to suffer much.

Is this possibly caused by the fact that the Frontend Server has 10 (or 30) separate subscriptions to the same Guild stream? If so, how can the subscription be shared without building some kind of a Hub that holds the subscriptions? What also worries me is that if sharing the subscriptions is the answer, then at very large scale, when players in a guild are load balanced to different Frontend Servers, I assume the problem will reappear.

One other performance-related issue we've seen with streams is in scenarios where a large number of PlayerGrains are deactivated (around 600 or so), we can see very large latencies (~300ms) for the UnregisterProducer method in the Dashboard, causing massive CPU spikes in the Silohost.

Do any of these symptoms sound like we have somehow incorrectly implemented streams and/or is there any specific diagnostic data we could provide to help investigate?

jsteinich commented 5 years ago

We have observed similar issues. I'm planning to switch to the client hub approach you mentioned, but haven't done so yet.

sergeybykov commented 5 years ago

@mikkoairaksinen It does seem to me that the performance drop you are seeing is likely caused by the increase in the fan-out factor for each stream. Every stream event is delivered to every subscriber independently, and the extra messaging isn't cheap performance-wise. A single subscriber per stream per frontend server would limit the number of subscriptions. I guess that's what you mean by a Hub.
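
A rough sketch of what such a per-frontend hub could look like (hypothetical type names, reusing the ChatMessage/PlayerConnection shapes from the sketches in the issue description; one Orleans subscription per guild stream, fanned out to local connections):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans;
using Orleans.Streams;

// Hypothetical hub: holds a single Orleans subscription per guild stream on this
// frontend server and fans events out locally. Detach/unsubscribe handling omitted.
public class GuildStreamHub
{
    private readonly IClusterClient _client;
    private readonly ConcurrentDictionary<Guid, GuildEntry> _guilds =
        new ConcurrentDictionary<Guid, GuildEntry>();

    public GuildStreamHub(IClusterClient client) => _client = client;

    public Task AttachAsync(Guid guildId, PlayerConnection connection)
    {
        var entry = _guilds.GetOrAdd(guildId, id => new GuildEntry(this, id));
        lock (entry.Connections) entry.Connections.Add(connection);
        // Only the first attach for a guild actually creates the Orleans subscription.
        return entry.Subscription.Value;
    }

    private sealed class GuildEntry
    {
        public readonly List<PlayerConnection> Connections = new List<PlayerConnection>();
        public readonly Lazy<Task<StreamSubscriptionHandle<ChatMessage>>> Subscription;

        public GuildEntry(GuildStreamHub hub, Guid guildId)
        {
            Subscription = new Lazy<Task<StreamSubscriptionHandle<ChatMessage>>>(() =>
                hub._client.GetStreamProvider("SMSProvider")
                           .GetStream<ChatMessage>(guildId, "Guild")
                           .SubscribeAsync((msg, token) => FanOutAsync(msg)));
        }

        // Delivers one stream event to every locally connected guild member.
        private async Task FanOutAsync(ChatMessage msg)
        {
            PlayerConnection[] targets;
            lock (Connections) targets = Connections.ToArray();
            foreach (var conn in targets)
                await conn.PushToSocketAsync(msg); // local call, no extra silo messaging
        }
    }
}
```

With something like this, each frontend holds at most one subscription per guild stream, regardless of how many of that guild's players are connected to it.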

Note also that SMS doesn't have an actual queue between producers and consumers. So they end up being tightly coupled, with every producer call awaiting all subscribers' processing of an event. With a persistent queue instead of SMS, you could potentially reduce the load on the silos, because they would write a single event per stream to the queue instead of fanning out 30 messages for each event. The frontends would have to do more work to pull events from the queues, etc., plus there would be the extra cost of communicating with a remote queueing system. I'm not suggesting this as an alternative, just highlighting the tradeoff.

> One other performance-related issue we've seen with streams is in scenarios where a large number of PlayerGrains are deactivated (around 600 or so), we can see very large latencies (~300ms) for the UnregisterProducer method in the Dashboard, causing massive CPU spikes in the Silohost.

In this case, do those 600 PlayerGrains (or most of them) belong to the same guild? In theory, UnregisterProducer should result in pretty cheap calls to the respective PubSubRendezvousGrains. If most of those calls were to go to a single PubSubRendezvousGrain, that could explain the latency.

The other reason for latency I can think of is if the PubSubRendezvousGrains were deactivated and hence needed to get activated to handle the UnregisterProducer calls. Do you change the default activation collection time for all grain types by chance?

It would be interesting to look at the silo logs if you can share them.

mikkoairaksinen commented 5 years ago

@sergeybykov Thank you for the suggestions. The "single subscription per frontend per stream" model is something we could experiment with; it seems like, at least at non-massive scale, that should be enough to alleviate the pressure caused by the fan-out.

> The other reason for latency I can think of is if the PubSubRendezvousGrains were deactivated and hence needed to get activated to handle the UnregisterProducer calls. Do you change the default activation collection time for all grain types by chance?

We have actually changed the collection time for grains down to 1 minute, since for player grains an inactivity period of 1 minute almost certainly means the client has disconnected. I had not considered the effects this has on system grains such as PubSubRendezvous. Our load tests have also shown that CPU is the first bottleneck and memory pressure is relatively low, so we could probably increase this. I will make configuration changes to increase the global collection timeout to something much higher, lower it specifically for PlayerGrains to something around 5 or 10 minutes, and see if this helps.
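
Roughly the configuration change I have in mind, as a sketch (assuming the GrainCollectionOptions API; the PlayerGrain type name and the exact values here are placeholders):

```csharp
using System;
using Orleans.Configuration;
using Orleans.Hosting;

public static class CollectionAgeConfig
{
    // Sketch: a long global idle-collection timeout, with a shorter one for player grains only.
    public static ISiloHostBuilder ConfigureCollectionAges(this ISiloHostBuilder builder)
        => builder.Configure<GrainCollectionOptions>(options =>
        {
            // Raise the default so system grains such as PubSubRendezvousGrain
            // are not constantly deactivated and reactivated.
            options.CollectionAge = TimeSpan.FromMinutes(30);

            // Keep player grains short-lived, since idle means disconnected.
            options.ClassSpecificCollectionAge["MyGame.Grains.PlayerGrain"] =
                TimeSpan.FromMinutes(5);
        });
}
```

If I understand correctly, the [CollectionAgeLimit] attribute on the grain class would be an alternative way to set the per-type value.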

> It would be interesting to look at the silo logs if you can share them.

Is there a specific class / namespace that could provide useful information? I could look into enabling either all Orleans internal logging or just more targeted parts.

sergeybykov commented 5 years ago

> We have actually changed the collection time for grains down to 1 minute, since for player grains an inactivity period of 1 minute almost certainly means the client has disconnected. I had not considered the effects this has on system grains such as PubSubRendezvous. Our load tests have also shown that CPU is the first bottleneck and memory pressure is relatively low, so we could probably increase this. I will make configuration changes to increase the global collection timeout to something much higher, lower it specifically for PlayerGrains to something around 5 or 10 minutes, and see if this helps.

Okay. So maybe indeed the need to reactivate PubSubRendezvousGrains to unsubscribe is the cause of latency here. It would be interesting to confirm that.

> Is there a specific class / namespace that could provide useful information? I could look into enabling either all Orleans internal logging or just more targeted parts.

We usually look at full Info level silo logs. That provides better context than just individual Error/Warning messages. But if changing collection time for PubSubRendezvousGrain solves the latency problem, then there is no need for logs.
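
For reference, full Info-level silo logging can be enabled through the standard logging configuration, something along these lines (the console sink is just an example):

```csharp
using Microsoft.Extensions.Logging;
using Orleans.Hosting;

public static class SiloLoggingConfig
{
    // Sketch: capture Information-level logs from all Orleans internals.
    public static ISiloHostBuilder ConfigureInfoLogging(this ISiloHostBuilder builder)
        => builder.ConfigureLogging(logging =>
            logging.AddConsole()
                   .SetMinimumLevel(LogLevel.Information));
}
```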

sergeybykov commented 5 years ago

@mikkoairaksinen Were you able to solve this?

mikkoairaksinen commented 5 years ago

@sergeybykov We increased the default collection age to 30 minutes and the player grains' to 5 minutes. This did at least solve the massive latencies experienced during mass deactivations of PlayerGrains. We have not yet implemented a stream hub or another stream provider (such as SQS), but those are on our shortlist of things to try if we keep running into issues.

We are still seeing occasional latency spikes in PubSubRendezvousGrain methods (image attached), but it's hard to be sure whether this is due to the load test runner generating an unnaturally spiky load or to something wrong in the framework. We are approaching launch and will soon have data with real-world load; I will get back to this once we see it in action.

[attached image: latency spikes in PubSubRendezvousGrain methods]

shlomiw commented 5 years ago

We experience the same occasional spikes in these methods. We're using SMS. As far as I understand, it might happen when the PubSubRendezvousGrain tries to communicate with one of its producers/consumers and can't reach it, waiting up to a timeout. During that time the PubSubRendezvousGrain won't process other messages. So, for example, a new producer/consumer would be stuck until the PubSubRendezvousGrain is free again. I'm not 100% sure about this.

jason-bragg commented 5 years ago

A rather low-effort test you could try is to use MemoryStreams rather than SMS. These streams are (like SMS) not reliable, but they do decouple producers and consumers: they use the persistent streams feature set but write to an in-memory queue rather than a persistent queue.
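
Roughly (a sketch assuming the current hosting APIs and the provider name from the earlier sketches; the client side would need a matching AddMemoryStreams registration):

```csharp
using Orleans.Hosting;
using Orleans.Providers;

public static class MemoryStreamConfig
{
    // Sketch: register MemoryStreams in place of SimpleMessageStreamProvider.
    // The provider name and the use of an in-memory PubSub store are assumptions.
    public static ISiloHostBuilder UseMemoryStreamsInstead(this ISiloHostBuilder builder)
        => builder
            .AddMemoryGrainStorage("PubSubStore")                                  // stream pub-sub state
            .AddMemoryStreams<DefaultMemoryMessageBodySerializer>("SMSProvider");  // in-memory queues per silo
}
```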

sergeybykov commented 5 years ago

@mikkoairaksinen Any update?

sergeybykov commented 5 years ago

Closing due to inactivity for now. Feel free to reopen.