sergeybykov commented 8 years ago

This question has been brought up more than once - should we 'outsource' the low level networking/socket layer of the Orleans stack to a 3rd party library instead of maintaining the code that deals with socket connections, sending/receiving messages, and buffer pool management within the Orleans codebase.

Potential benefits:

Reduction of the codebase and its maintenance cost
'Free' support for features like TLS
Pluggable networking layer that would allow to swap implementations in the future easily
Ability to easily add additional network protocols, such as HTTP, AMQP, MQTT

Concerns:

Loss of control and flexibility in detecting and handling network issues
Loss of low level performance optimizations
Serialization needs buffer pools for efficiency even if we outsource buffer management for communication

Another somewhat related question that's been discussed is about opening up the Orleans messaging protocol, so that requests to grains could be sent without the need for OrleansClient and from different software stacks and languages.

The first step is to have a virtual meeting to go over the existing implementation of the networking/messaging layer to get interested people more familiar with the status quo and to answer any questions about it.

This issue is intended for accumulating initial questions and ideas in preparation for the meeting. Please comment to add or correct.

wanderq commented 8 years ago

The current network layer is battle tested, and I think, it haven't changed for the last year. So instead of throwing it away - let's just keep it among the others libs to come, so users could decide which lib suits their needs. And it can be thrown away at any time, if it's supporting cost will become too much. Also such choice will remove all concerns from the list.

wanderq commented 8 years ago

About opening up the Orleans messaging protocol - something worth looking at is https://github.com/apache/thrift. Maybe we can provide both protocols at the same time - current, heavily optimized, Orleans's internal, and thrift for external apps.

veikkoeeva commented 8 years ago

Cross-links: https://github.com/dotnet/orleans/issues/307 and https://github.com/dotnet/orleans/issues/828 for futher context for the discussion.

gabikliot commented 8 years ago

Of course I totally agree with @wanderq - we would want to refactor the existing messaging protocol implementation to fit into the new "pluggable" layer. We will probably also keep it as the default one at least for some time, until any of the other ones proves better.

Sounds like there won't be any objections to the above. It is obvious that this is a good idea and will benefit everyone. The only downsides I see are: a) the time spend by someone who is going to do the work. b) the time spend by the core team (or someone else) reviewing the changes, commenting on the design, testing, ... That of course comes at the price of lost opportunity to do something else, like geo-distribution, better deployment, version tolerance, event sourcing, CoreCLR, improved streaming APIs, physical grains, ... all we discussed in the last meetup. But we don't need to discuss it now. Its up to the people involved to decide on their priorities.

What else will we be discussing in the virtual meetup?

jthelin commented 8 years ago

+1 to what @wanderq and @gabikliot said.

I also would hate to loose the current "internal" messaging functionality which has proved itself to work well over several years worth of production usage.

But some refactoring to a "pluggable" transport layer would be useful to allow greater extensibility for specific environments.

Not least of those extensibility scenarios would be to allow someone to implement layered "secure transport" functionality -- say tunneling existing TCP socket messaging over TLS mutual auth network connection [for example #828].

So overriding of either/both "connection establishment" and/or "message transmission" parts of the messaging layer are likely useful extensibility points for the long term.

One interesting example of true "messaging extensibility" would be to allow plug-in of native code / non-C# messaging implementations, such as MSRA's rDSN. https://github.com/Microsoft/rDSN I haven't used it myself, but others in Bldg 99 swear by it [and not at-it!]

One additional thing to bear in mind is there may be different extensibility options for (1) "gateway" (external facing) and (2) "silo-to-silo" (internal) communications. Allowing external callers to talk to grains in an Orleans cluster using HTTP/JSON endpoint(s) [for example] is probably the canonical example of the "gateway" option 1 scenario. This might also require exposing additional "grain type" metadata, and not just "silo address" metadata though.

+1 to further discussions.

sergeybykov commented 8 years ago

We'll have a meetup to discuss the current implementation Thursday February 11th,

12:00-13:00 PST / 20:00-21:00 GMT

Join Skype Meeting

This is an online meeting for Skype for Business, the professional meetings and communications app formerly known as Lync. Join by phone +14257063500 (USA - Redmond Campus) English (United States) +18883203585 (USA - Redmond Campus) English (United States)
Find a local number

Conference ID: 215813217

centur commented 8 years ago

Side note: for those who doesn't have SkypeForBiz installed yet - plugin is required to join and the download size is ~8Mb ( actually it's a full "Skype for Business" but it allows you to join anonymously)

gabikliot commented 8 years ago

Sorry, but I can;t make it at this time. I can either before 13:00, or after 14:00, or Friday at 13:00.

sergeybykov commented 8 years ago

@gabikliot We definitely need you there I think.

How about we move by an hour to 12:00-13:00?

gabikliot commented 8 years ago

Yes, that would work well for me.

sergeybykov commented 8 years ago

Ok, moving to one hour earlier then. I'll update the comment above to minimize the potential confusion.

gabikliot commented 8 years ago

And I can just call in via regular phone, right? I don't have to install Skype for business?

sergeybykov commented 8 years ago

And I can just call in via regular phone, right? I don't have to install Skype for business?

Yes, you can simply dial the phone number and punch in the conference ID. Obviously, you'll miss all the visuals.

gabikliot commented 8 years ago

So why not use hangouts, like we usually do?

sergeybykov commented 8 years ago

This is not a normal meetup. We are doing it out of band for people that are interested in this specific topic.

We want to try Lync this time instead of Hangouts because of the limits it imposes and issues we've been having with hosting and joining them: two different links, inability to stop/resume, no phone access, camera issues, etc.

veikkoeeva commented 8 years ago

I can't likely make to it, but the following are a few of my personal observations compiled here haphazardly, in case they provide additional ideas.

Performance, latency and stability

Do we have figures how much resources socket code (also kernel resources) makes use of in production loads on current implementations?

Any really fast and stable implementation:

Need to provide a (pinned) buffer to the socket API.
Need to account for NUMA.
As a corollary to the previous two, be mindful of the threads operating on data due to caching.

The tuning instructions vendor have look like the Mellanox one Performance Tuning Guidelines for Mellanox Network Adapters or in more general terms Windows Server 2012 Networking Performance and Management. One nice reading for me has been Windows 8 Registered I/O Networking Extensions and its blog posts and measurements:

The first is for high performance, jitter free, low latency connections where you have a small number of connections to deal with an want the best performance possible.

The Technet page about RIO mentions

In addition, RIO API extensions provide high IOPS when you deploy many Hyper-V virtual machines (VMs) on a single physical computer.

This is something I've found difficult to verify, but if true (could someone ask from sinde MS?), it looks like RIO can take a better advantange of the VT-x (I/O) virtualization extensions to provide a more stable and performant environment on Hyper-V. It looks to me any future, high-performance socket stack should make use of RIO and the abstraction in Orleans should be planned so as to make operating efficiently with it. Maybe even so as to lean towards a 3rd party implementation that is the likeliest to implement it.

Testing and tuning

It would be great if Microsoft could provide the testing ground for setups that make use of NIC teaming when and if someone with bandwidth comes along that can provide the scripts to configure actual virtual machines.

Security

Likely a given in this world. A support for TLS is needed. This likely involves also an idea how to provide -- and store -- certificates and hence how to abstract that.

Oh, I haven't programmed sockets for a long time. I haven't used RIO, so take with a large pinch of salt. :)

gabikliot commented 8 years ago

Summary of the meetup:

@jason-bragg presented detailed walk through Orleans messaging stack. We explained the current layering: a) the difference between the Networking (IPs, sockets, byte[]) and the Messaging (RPC stack with threads and queues and Messages) layers b) 6 code places: silo to silo sender/receiver, silo to client sender/receiver, client so silo sender/receiver

Suggested plan: 1) cleanup the code, unify, layer more appropriately

[ ] cleanup redundancy in Message class
[ ] remove old batching code
[ ] unify 6 places to ideally 2, with extra wrapping layers is needed, but zero code dup

2) separate client gateway selection out of client Networking and Messaging. This belongs to "Client Routing". 3) find the right cutting across abstraction - either at the networking or messaging layers. Priority wise focusing on client to silo networking. 4) refactor current code to this abstraction 5) add new implementation to allow secure client to silo connection.

galvesribeiro commented 8 years ago

Item 3 say we should focus on client->silo. I dont think so... We cant't have (mutual-)TLS silo-to-silo with current implementation. (Something @reubenbond questioned me a while ago)

RIO is not xplat and has nothing to do with TLS actualy. It is a very low level I/O layer tha some xplat socket library should use if there is the case.

I agree with Gabi's strategy except that we should replace the current regular .Net socket library with something xplat and more specific for the job. This must be made in a configurable way so we can enable/disable TLS for example and other features as suggested by @jdom.

I think we shoukd create a new issue with checkboxes and tasks + a milestone so we can work on that just like we did with .Net Core migration.

veikkoeeva commented 8 years ago

I went through the recording. Maybe in order to clarify here about RIO is that I used it as a vehicle to expose and API that is different than one's regular socket and it takes into account concepts that I believe are important on stability and performance. In that way it looks like a good API surface to study when thinking what higher level abstractions to take. In other words, that part was not about TLS or even cross-platform more than to point out that it looks plausible that whatever cross-platform library implements it on Windows stands to gain a lot. The high-performane part will be something else on Linux, FreeBSD or Mac.

Unfortunately I'm not actively following this space to state well-defined observation, but it looks to me user-space networking might be the nex evolution in "sockets" programming. Other resoures to extract commonalities might be Intel: High Performance Packet Processing and about user-space networking, netmap, for instance.

<edit: Further interesting resources on user-space networking Cloudflare: Single RX queue kernel bypass in Netmap for high packet rate networking.

<edit 2: Some Linux specific resources Monitoring and Tuning the Linux Networking Stack: Receiving Data.

ReubenBond commented 8 years ago

For posterity, here's the recording.

sergeybykov commented 8 years ago

We started a discussion with @nayato about potentially replacing the Orleans messaging stack with DotNetty. So far looks very promising. @jason-bragg is going to try to do that in the next couple of weeks.

grahamehorner commented 8 years ago

While I understand why your looking at DotNetty I'd prefer it if the team looked at libuv at the work already done by the ASP.NET core team with regards Kestrel and performance. It also makes sense to use/extend the libuv wrapper/library created to allow .net core to send/recive udp datagrams in a platform agnostic maner, rather than push network communication onto another different 3rd party library

jason-bragg commented 8 years ago

Still ramping up on DotNetty. The examples are clear but not very advanced. Anyone know of any open source projects using DotNetty (or Netty if necessary) that I can check out to see more advanced usages?

luomai commented 8 years ago

The Akka .Net project is using DotNetty as a networking layer.

veikkoeeva commented 8 years ago

@luomai It looks like Akka is moving to Aeron (via @dVakulen), so perhaps Akka.NET will move to its .NET equivalent if/when it is done.

jason-bragg commented 8 years ago

@luomai, I don't see the comment here, but I thought I read something you wrote about networking with high performance streams. We've done a fair amount of work on this, but something you may want to be mindful of is that, at it's heart, Orleans is an RPC based system. The networking models for such a system will -never- be as fast as direct dedicated high throughput architectures.

Part of the streaming work we've done has considered using dedicated channels between silos for stream data. We've abstained from this approach for a number of reasons, but if the highest throughput is critical to the success of your project, you may want to implement a stream provider that uses dedicated network connections for streams and dispatch the events as immutable data in grain calls. This would avoid all unnecessary data copies and avoid messaging overhead of the RPC model.

luomai commented 8 years ago

@jason-bragg Thank you very much for the detailed reply. I deleted my comments as I felt that the performance numbers that I used is not really an apple-to-apple comparison yet. So it might be unfair to the current networking layer design. :) Also, thanks a lot for the suggestion of using a throughput-optimized stream provider. It sounds like a pretty reasonable option for us. I will definitely look into it.

luomai commented 8 years ago

@jason-bragg Following your suggestion, we have been playing with stream providers and move from the RPC based communication (only the throughput-intensive part) to a CPS-like model based on Orleans streams. This moment we are using the default SMS stream provider with a FireAndForget option as well as immutable data objects.

We are also wondering if there is an easy way to configure (or easily extend) the SMS stream to use dedicated network connections instead of writing our own stream implementation which may require touching a lot of internal components in the runtime. We are a bit worrying about it seems that using streams that wrap dedicated network connections may break the virtual stream (or actor) interface in the Orleans, right? Can you share some thoughts?

Thanks a lot! :)

jason-bragg commented 8 years ago

"If the highest throughput is critical to the success of your project, you may want to implement a stream provider that uses dedicated network connections for streams."

The qualification (in bold) is important, because writing a stream provider that uses dedicated connections is non-trivial. I don't see a straight forward way to modify the SMS stream provider to support this either.

The prototyping I worked on for this used a networking layer that supported different 'channels' on a connection. I had one dedicated connection to each silo (modeled off of the existing Orleans networking) and a channel per stream. Where this became a problem was that this prototype did not support features like independent delivery of stream data per grain, guaranteed delivery, or recovery. In general, it violated some of the behaviors we'd established for our streaming infrastructure. It’s viable for a more specialized system and is necessary for the best performance, but it means giving up many of the advantages of the RPC model.

It really all comes down to requirements, and I don’t know the requirements of your system.

If the service has many high throughput streams that need processed immediately, where some data loss (due to transient errors) is acceptable, piping the events into dedicated connections will give you the fastest service, at the cost of development time.

If performance requirements allow, optimizing the SMS stream provider may be a better approach. In addition to gains from micro-optimizations, bulking and batched delivery can be added, reducing the cost of the RPC model on high throughput streams. This, however, would not help much if there were large numbers of sparse streams.

If the service requires reliability and high throughput, but response time is less important, a persistent stream provider can be used that stores the events in a backend queue and processes them as fast as the cluster can consume them.

Can you provide some details regarding the performance requirements of the services streaming components?

Estimated max concurrent number of streams?
Estimated max number of events per second per stream?
Estimated average event size?
Estimated average consumers per stream?
Estimated average producers per stream?
Estimated average stream lifetime?
How much data loss can the service withstand under load?

luomai commented 8 years ago

Hi Jason, thank you very much for the detailed reply. It helps us a lot to clarify the design of our system in using Orleans in a proper way.

We are prototyping the system right now and haven't had very clear numbers for the concurrent number of streams. To roughly estimate it: the target deployment cluster has 100 - 1000s servers and each server can have 100s grain instances (we are relying on the default grain deployment strategy this moment). A grain's ingress and egress streams are around the level of 10s. In total, this can sum up to millions of streams in a cluster.

We can control the topology of the grains so that the numbers of consumers and producers can be adjusted to resolve single-node bottlenecks. The stream is expected to last for a long time because we aim for supporting continuous stream queries. We have attempted to address the data loss problem in the grain level by implementing a combination of techniques that include message acknowledgement, re-transmission, state checkpoint, and back pressure.

Like you suggested, we actually have done quite a lot of work to improve the grain performance. A major part is to mitigate the networking layer's workload, including event batching and lazy-deserialization (i.e., all deserialization are pushed into the grain level using pre-registered types parsing over unsafe byte buffers). After doing this, the messages contain basically raw byte arrays that can be passed over network in a cheap manner.

Regarding performance, in a single-server benchmark, 6 virtual CPU cores can read 10 - 20M events (20 to 32 bytes each) from a network and compute results in real-time. They are doing complex stream processing logic including stream aggregation and join. The benchmark generates around 1.5-3.2 Gbps network throughput. The current bottleneck is again the automatically generated Orleans deserializer which are heavily triggered between the Orleans clients and servers, where [Immutable] cannot help. It would be great to see if this number can be further improved when the new networking layer (DotNetty) comes into play.

Thanks again for your detailed reply and suggestions.

BTW: I noticed that in the Orleans roadmap. There is an opened to-do task of implementing declarative query processing using Orleans. This is similar to what we are doing here right now. Once our codebase become established and well tested. We will try to look at if we can port our code to support the example declarative query in the documentation, and eventually contribute back to the repository.

jason-bragg commented 8 years ago

This is great to hear. To my knowledge, this is the first time the SMS stream provider has been used in such an ambitious project under such strict performance requirements. Please let us know of any bottlenecks you encounter, as this is a great opportunity to further tune these systems.

The current bottleneck is again the automatically generated Orleans deserializer which are heavily triggered between the Orleans clients and servers, where [Immutable] cannot help

The optimizations you've described, transmitting batches of raw binary data, is likely the best one can do. If these batches are large, you may be able to get some additional performance tuning some of the networking knobs, like the default buffer size (BufferPoolBufferSize).

EDIT: I missed the "client to server" part, I was thinking grain to grain. The below is not helpful for client to server.

Something worth considering is the use of stateless workers as stream consumers. This is not usually suggested, and I'm not actually suggesting it now, as much as raising it as something to consider if the performance of the current architecture is insufficient.

Grains that are marked as stateless workers have some very interesting characteristics. First, they are always local to the caller, so no network copy is necessary. Since they are always local grains, they are not registered in the distributed grain directory, making them more resilient to silo outages and cheaper to activate/deactivate.

Despite the name, their is nothing strictly prohibiting them from being stateful, they are merely optimized for grains that have no state. Stateless workers, and their performance benefits, can be used for stateful grains given the following conditions.

Stateless worker must be configured to have only one instance. This is not the case by default, but can be specified in the attribute.
All calls to an instance of a stateless worker must be confined to a single silo, as calls from multiple silos will create a stateless worker per silo, which can make grain state management very difficult.

If your stream topology is such that stream producers and consumers can be limited to being on the same silo, using stateless workers for consumer grains may be a viable optimization. As an optimization, this is not a good general solution as it is viable for a rather narrow set of stream topologies, but for those situations it can make a significant difference.

dotnet / orleans

Outsourcing of network layer to 3rd party libraries #1372

12:00-13:00 PST / 20:00-21:00 GMT

Performance, latency and stability

Testing and tuning

Security