dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License

Co-hosting Orleans and ASP.NET in the same process - performance recommendations #5421

Closed: JorgeCandeias closed this issue 5 years ago

JorgeCandeias commented 5 years ago

As requested on Gitter, this is an open-ended thread to discuss the general performance benefits and drawbacks of co-hosting an ASP.NET front-end (e.g. a RESTful API on Kestrel with ASP.NET Core) in the same process (and, by consequence, the same physical host) as an Orleans silo. The question takes special aim at Orleans' use of its own cooperative scheduler versus ASP.NET's use of the thread pool, and how the two would compete for host resources.

Specifically, what are the engineering team's thoughts regarding:

  1. How Orleans' cooperative scheduler and ASP.NET's thread pool compete for CPU when sharing a host?

  2. Memory contention between the front-end and the silo?

  3. How the benefits and drawbacks change as cluster size increases?

The idea is to have some good-practice pointers for Orleans users like me, to help decide when (if ever) it is a good idea to host ASP.NET and Orleans together in the same process (and thereby the same physical host), or to keep them separate in a client-silo relationship across processes (and machines, via the wire), and under what conditions.

If there is already official documentation on this, then apologies for raising the issue, and "preemptive" thanks for redirecting me to it. If not, I'm happy to gather the relevant input and PR the docs.

ReubenBond commented 5 years ago

There will be some noisy-neighbor effects from having two thread pools (one fixed and one dynamic) and from having more code on one CPU (both in terms of cache eviction and context switching). I haven't measured the negative effect of that. Of course, there are also some significant boosts:

  1. Co-hosting allows for smarter routing, and there's no need for a gateway, so there is one less network hop and one less serialization + deserialization cycle.

  2. In some cases, requests can be served entirely locally, particularly in smaller clusters or with [PreferLocal]/[StatelessWorker] placement (see the sketch below).
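
To make point 2 concrete, here is a minimal sketch of the two placement attributes; the grain and interface names are hypothetical, and the attributes are assumed to live in their usual Orleans.Placement / Orleans.Concurrency namespaces:

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Concurrency;
using Orleans.Placement;

public interface ILookupGrain : IGrainWithStringKey
{
    Task<string> GetValueAsync();
}

// [PreferLocalPlacement] asks Orleans to activate the grain on the calling
// silo when possible, so a co-hosted frontend can often avoid a network hop.
[PreferLocalPlacement]
public class LookupGrain : Grain, ILookupGrain
{
    public Task<string> GetValueAsync() => Task.FromResult("value");
}

public interface IFanOutWorker : IGrainWithIntegerKey
{
    Task ProcessAsync(string item);
}

// [StatelessWorker] grains always activate on the calling silo and may have
// multiple activations per silo, which also keeps calls in-process.
[StatelessWorker]
public class FanOutWorkerGrain : Grain, IFanOutWorker
{
    public Task ProcessAsync(string item) => Task.CompletedTask;
}
```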

A round-trip grain ping from a localhost client to a localhost silo via TCP typically takes around 190µs on my machine. The same call via the hosted client takes under 10µs. There is room for improvement in both cases: I have two branches, each with 25-35% improvements, one which applies only to the networked case (since it's the networking rewrite branch) and one which also applies to the local case (a core RPC rework).
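
This is not the benchmark used for the numbers above, but a rough sketch of how such a ping could be timed; IPingGrain and PingBenchmark are hypothetical names:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Orleans;

public interface IPingGrain : IGrainWithIntegerKey
{
    Task Ping();
}

public static class PingBenchmark
{
    // Measures the mean round-trip time of a no-op grain call. Pass the
    // IGrainFactory of an external client to measure the TCP path, or the
    // silo's own factory to measure the hosted-client path.
    public static async Task<TimeSpan> MeasureAsync(IGrainFactory factory, int iterations = 10_000)
    {
        var grain = factory.GetGrain<IPingGrain>(0);
        await grain.Ping(); // warm-up: activate the grain and JIT the call path

        var stopwatch = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++)
        {
            await grain.Ping();
        }
        stopwatch.Stop();

        return TimeSpan.FromTicks(stopwatch.Elapsed.Ticks / iterations);
    }
}
```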

I intend to add support for a hybrid thread pool approach, where we use the dedicated Orleans thread pool for system targets and high-importance items. That will allow the vast majority of work to be performed on the shared thread pool while still avoiding starvation for time-sensitive work (e.g., membership pings). It may turn out that we can offload all work onto the shared thread pool with no detriment (perhaps until the silo hits extremely high load). This is quite easy for us to implement on .NET Core. One of the larger reasons we kept a separate, dedicated thread pool for Orleans 2.1+ (when the scheduler rewrite was merged) was that .NET Standard / .NET Framework didn't support some important optimizations which .NET Core did.
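
As a toy illustration of that hybrid idea (this is not the Orleans scheduler, just a sketch of the principle): time-sensitive work runs on a dedicated thread that the shared thread pool cannot starve, while everything else uses Task.Run as usual.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class DedicatedThreadScheduler : TaskScheduler
{
    private readonly BlockingCollection<Task> _tasks = new BlockingCollection<Task>();

    public DedicatedThreadScheduler()
    {
        // One dedicated thread drains the queue, so this work is isolated
        // from any starvation of the shared ThreadPool.
        var thread = new Thread(() =>
        {
            foreach (var task in _tasks.GetConsumingEnumerable())
            {
                TryExecuteTask(task);
            }
        })
        { IsBackground = true, Name = "high-priority" };
        thread.Start();
    }

    protected override void QueueTask(Task task) => _tasks.Add(task);

    // Never inline: work queued here must stay on the dedicated thread.
    protected override bool TryExecuteTaskInline(Task task, bool taskWasPreviouslyQueued) => false;

    protected override IEnumerable<Task> GetScheduledTasks() => _tasks.ToArray();
}

// Usage sketch (method names hypothetical): queue time-sensitive work to the
// dedicated scheduler, ordinary work to the shared pool.
// var highPriority = new DedicatedThreadScheduler();
// Task.Factory.StartNew(SendMembershipPing, CancellationToken.None,
//     TaskCreationOptions.None, highPriority);
// Task.Run(HandleOrdinaryRequest);
```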

When you say memory contention, do you mean false sharing or memory bandwidth saturation? I don't imagine we will see much of either issue, but there's an opportunity to analyze and improve upon the current situation regardless, since I have not specifically looked into those areas.

Regarding benefits/drawbacks as cluster size increases: there are largely only benefits, primarily because you skip a hop in messaging. There are benefits to separating different kinds of work onto different CPUs (e.g., on separate machines), but I have not performed the analysis to determine where there's room for optimization here. Generally, an architecture in which hosts act as both web servers and silos is not going to be too detrimental. If that ever did become evident (after careful analysis), you could always deploy your "frontend silos" with zero grain classes present and still gain some of the benefits of the hosted client (direct routing to the right silo, skipping the gateway) without the extra load of hosting grains locally.
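
For reference, a minimal co-hosting sketch, assuming the generic-host UseOrleans extension from the Microsoft.Orleans.Server package (Orleans 3.0+); the endpoint wiring is illustrative only:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

public static class Program
{
    public static Task Main(string[] args) =>
        Host.CreateDefaultBuilder(args)
            // The silo runs inside the same generic host as Kestrel, so the
            // web frontend talks to grains through the hosted client rather
            // than through a gateway connection.
            .UseOrleans(silo => silo.UseLocalhostClustering())
            .ConfigureWebHostDefaults(web =>
                web.Configure(app =>
                    app.Run(ctx => ctx.Response.WriteAsync("co-hosted"))))
            .Build()
            .RunAsync();
}
```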

Please let me know if that answers your questions or if there's anything else you'd like a comment on.

JorgeCandeias commented 5 years ago

Many thanks for your reply. This aligns with our anecdotal experience (i.e. the networking and productivity benefits far outweigh any compute drawbacks), though we're still gathering data to support it, hence my questions.

> A round-trip grain ping from a localhost client to a localhost silo via TCP typically takes around 190µs on my machine. The same call via the hosted client takes under 10µs. There is room for improvement in both cases: I have two branches, each with 25-35% improvements, one which applies only to the networked case (since it's the networking rewrite branch) and one which also applies to the local case (a core RPC rework).

This is where we have seen the biggest difference in one of our use cases: a real-time data projection flow with a REST front-end for queries, where the result set can vary a lot in size. Eliminating the front-end-to-silo hop (and back) makes a significant difference in end-to-end response time.

> I intend to add support for a hybrid thread pool approach, where we use the dedicated Orleans thread pool for system targets and high-importance items. That will allow the vast majority of work to be performed on the shared thread pool while still avoiding starvation for time-sensitive work (e.g., membership pings). It may turn out that we can offload all work onto the shared thread pool with no detriment (perhaps until the silo hits extremely high load). This is quite easy for us to implement on .NET Core. One of the larger reasons we kept a separate, dedicated thread pool for Orleans 2.1+ (when the scheduler rewrite was merged) was that .NET Standard / .NET Framework didn't support some important optimizations which .NET Core did.

👍

> When you say memory contention, do you mean false sharing or memory bandwidth saturation? I don't imagine we will see much of either issue, but there's an opportunity to analyze and improve upon the current situation regardless, since I have not specifically looked into those areas.

Apologies for the lack of precision. I meant transient memory usage spikes from the stateless side (e.g. memory-hungry controllers) inducing Orleans to deactivate and garbage-collect expensive grains. This question came from my incorrect understanding that Orleans activation GC responds to memory pressure, as suggested in the original white paper. I've just realized that the current docs say otherwise. Gotta keep up.
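
For anyone landing here with the same misconception, a hedged sketch of how the idle-time-based collection is tuned; GrainCollectionOptions is the real options class, but the extension method and the fifteen-minute value are illustrative:

```csharp
using System;
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

public static class SiloConfig
{
    // Activation GC in current Orleans is idle-time-based rather than
    // memory-pressure-based: an activation is deactivated after it has been
    // idle for CollectionAge (the value here is illustrative, not the default).
    public static IHostBuilder AddSilo(this IHostBuilder host) =>
        host.UseOrleans(silo => silo
            .UseLocalhostClustering()
            .Configure<GrainCollectionOptions>(options =>
                options.CollectionAge = TimeSpan.FromMinutes(15)));
}
```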

> Please let me know if that answers your questions or if there's anything else you'd like a comment on.

That's it for me; many thanks for your answers. This gives us more confidence in committing to the single-process approach, which further simplifies our workstream.

I can't seem to find a doc page for EnableDirectClient(), so if there isn't one yet and no one's working on it, I can attempt to PR this content along with some samples.
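
In case it helps such a PR, here is a hedged sketch of the pattern the hosted (direct) client enables; the controller and grain interface are hypothetical. When co-hosted, IGrainFactory is resolved from the silo's own DI container, so calls skip the gateway entirely:

```csharp
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Orleans;

public interface ILookupGrain : IGrainWithStringKey
{
    Task<string> GetValueAsync();
}

[ApiController]
[Route("api/values")]
public class ValuesController : ControllerBase
{
    private readonly IGrainFactory _grains;

    // When Orleans is co-hosted, the silo registers IGrainFactory in the
    // shared DI container, so the controller can call grains in-process.
    public ValuesController(IGrainFactory grains) => _grains = grains;

    [HttpGet("{key}")]
    public Task<string> Get(string key) =>
        _grains.GetGrain<ILookupGrain>(key).GetValueAsync();
}
```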

ReubenBond commented 5 years ago

I believe we can close this, as the question has been answered.