Support deep Service Fabric integration

ReubenBond commented 8 years ago

Service Fabric is a great platform for application hosting which offers high-density hosting, application upgrade, and state replication.

Many users, such as myself, would rather use Orleans than Service Fabric Reliable Actors. There are many reasons for this, including that Orleans is much more mature, open source, and feature-rich.

We have integration packages for hosting Orleans atop Service Fabric. This integration is simple and effective but does not allow Orleans to gain all of the benefits which Service Fabric has to offer, such as collocated state.

Orleans should be the obvious choice for Virtual Actors in any hosting environment, so there is some additional work to be done which I would like insight into and assistance with.

This issue is for discussion and tracking of that work.

Basic Integration:

[x] Hosting: Basic stateless, unpartitioned service hosting. Use the existing NuGet packages.

Integration Required for Collocated State:

[x] Gateway Provider: Implement an IGatewayListProvider based on Service Fabric's NamingService. Added in #2542
[x] Cluster Membership Provider: Silos need to be discoverable by clients & each other. ~~This should involve creating an IMembershipTable implementation which leverages Fabric's NamingService or PropertyService.~~ This involves creating an IMembershipOracle implementation which can receive updates directly from Service Fabric in addition to periodically polling partitions. Added in #2542
[ ] Grain Placement: Orleans should be partitioned and grain placement should deterministically map GrainId to partitions. Using consistent hashing means that we do not need to hold/maintain a grain directory. One potential downside to this approach is that we rely on the randomness of GrainIds to balance load across hosts and cannot perform load shedding in a granular fashion. This is mitigated by using a significant number of partitions, allowing Service Fabric to perform this load balancing at the partition level by shifting partitions between nodes. This change requires only that we configure a new PlacementStrategy/PlacementDirector which resolves GrainId to SiloAddress based upon Service Fabric partitions. We can optionally eliminate the Grain Directory when we are using Service Fabric, but keeping the Grain Directory around allows us to use different placement strategies (as long as they are stateless or use a state provider which doesn't require consistent mapping between GrainId & Fabric partition).
[ ] Partitioned hosting: This should only require a Service Fabric configuration change: Placement handles the important parts.
[ ] State Providers: deterministic grain placement allows us to use collocated, replicated state. State providers will be based upon IReliableDictionary<GrainId,byte[]>, using Service Fabric's Reliable Services model.

Extras:

[ ] Reminder Service: Reminder Services should be pluggable so that we can provide a Service Fabric-based reminder service.
[ ] Stream Provider: Service Fabric has an IReliableQueue<T> which can be used to produce an IStreamProvider.
[ ] Storage Provider: Collocated state requires the Grain Placement change, but we can also create a simple state service for in-cluster, non-collocated state. That would allow us to use the existing placement strategies.
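
To make the Grain Placement item above more concrete, here is a minimal, language-agnostic sketch in Python of deterministic placement. It is an illustration under stated assumptions, not the Orleans or Service Fabric API: `stable_hash`, `PartitionedPlacement`, and the silo names are hypothetical, and it uses a fixed number of equal hash ranges (a simpler stand-in for a consistent-hash ring). The point it shows is that the grain-to-partition mapping never changes, while the partition-to-silo mapping can change when the platform moves a partition.

```python
import hashlib
from typing import Dict

HASH_SPACE = 2 ** 64  # hypothetical 64-bit key space split evenly across partitions


def stable_hash(grain_id: str) -> int:
    """Hash a grain id to a 64-bit integer that is identical on every host."""
    digest = hashlib.sha256(grain_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class PartitionedPlacement:
    """Maps grain id -> partition (fixed) and partition -> silo (movable)."""

    def __init__(self, partition_owners: Dict[int, str]) -> None:
        count = len(partition_owners)
        # Assumption: partitions are numbered 0..count-1.
        if count == 0 or set(partition_owners) != set(range(count)):
            raise ValueError("partitions must be numbered 0..n-1")
        self.partition_count = count
        self._owners = dict(partition_owners)
        self._span = HASH_SPACE // count  # width of each partition's hash range

    def partition_for(self, grain_id: str) -> int:
        # Depends only on the grain id and the partition count, so no
        # grain directory lookup is needed.
        index = stable_hash(grain_id) // self._span
        return min(index, self.partition_count - 1)

    def silo_for(self, grain_id: str) -> str:
        return self._owners[self.partition_for(grain_id)]

    def move_partition(self, partition: int, new_owner: str) -> None:
        # Stand-in for the platform rebalancing a partition onto another node.
        self._owners[partition] = new_owner
```

With many partitions (say 64 across 4 silos), moving one partition relocates only the grains that hash into it, which is the partition-level load balancing described above. Collocated state would then be keyed by the same partition, so a grain and its state always move together.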

General Work Which Would Help:

Much of this work would be much easier if dependency injection were used more pervasively throughout Orleans. There should not be a GatewayProviderType enum and there definitely should not be a GatewayProviderType.ZooKeeper value. Those things should be configured in a strongly typed manner.
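
As a rough illustration of the difference, here is a small Python sketch of registering a provider by its abstraction instead of selecting it through an enum. All names here (`GatewayListProvider`, `StaticGatewayListProvider`, `ServiceRegistry`) are hypothetical stand-ins and not Orleans types; the sketch only shows that a new provider can be plugged in without the core having to know about it in advance.

```python
from typing import Callable, Dict, List, Type, TypeVar

T = TypeVar("T")


class GatewayListProvider:
    """Hypothetical abstraction the core depends on."""

    def get_gateways(self) -> List[str]:
        raise NotImplementedError


class StaticGatewayListProvider(GatewayListProvider):
    """Hypothetical third-party provider; the core never names this type."""

    def __init__(self, gateways: List[str]) -> None:
        self._gateways = list(gateways)

    def get_gateways(self) -> List[str]:
        return list(self._gateways)


class ServiceRegistry:
    """Tiny container: maps an abstraction to a factory, caching one instance."""

    def __init__(self) -> None:
        self._factories: Dict[type, Callable[[], object]] = {}
        self._instances: Dict[type, object] = {}

    def add_singleton(self, service: Type[T], factory: Callable[[], T]) -> None:
        self._factories[service] = factory
        self._instances.pop(service, None)  # drop any stale cached instance

    def resolve(self, service: Type[T]) -> T:
        if service not in self._instances:
            if service not in self._factories:
                raise LookupError(f"no registration for {service.__name__}")
            instance = self._factories[service]()
            if not isinstance(instance, service):
                raise TypeError(f"factory did not produce a {service.__name__}")
            self._instances[service] = instance
        return self._instances[service]  # type: ignore[return-value]
```

With an enum, every new provider needs a new enum value in the core. With registration by type, a host-specific package only has to call `add_singleton(GatewayListProvider, ...)` during startup.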

Result:

Ultimately, this will give us integration which is on-par with Service Fabric's Reliable Actors implementation. We will also have all of the added features of Orleans.

Anything I've missed? Any obvious/subtle challenges which I may have overlooked?

galvesribeiro commented 8 years ago

:+1: for this issue. :+1: for DI on all core/customizable components. :+1: for being truly host-agnostic.

I think with our current efforts on the CoreCLR move, we are cleaning up the core, but we still need "more DI".

On ASP.NET pre-CoreCLR, DI was there, but very basic and optional. Now on ASP.NET 5 and CoreCLR, for EVERYTHING that you need, you must use the internal DI system.

I think that we should add DI to core components and make Orleans a truly host-agnostic framework, as it should be.

I would like to finish the CoreCLR port and then help implement this issue.

sergeybykov commented 8 years ago

This is a great start. Thank you, @ReubenBond, for taking the initiative!

I think we should use this issue as a higher level one for tracking overall progress, and open separate issues for individual pieces.

One other thing that I don't see listed is leveraging SF for an alternate implementation of the grain directory, with replication. I think we should carefully consider AP and CP modes for it, and likely support both, maybe even at the grain type granularity.

In general, for each issue I think we need to be very explicit about the consistency/availability tradeoff, so that we don't accidentally sacrifice availability where we don't intend to, and we are clear where we consciously support a choice for stronger consistency at the expense of availability.

gabikliot commented 8 years ago

Agree that all those multiple additional integrations would be helpful. Here is my analysis of this topic.

If I had to order them by importance/most benefit, I would say:

1) Hosting. This is already done in @ReubenBond's integration packages. Hosting the Orleans silo service on SF (as opposed to a bare Azure Role) has the following advantages: a) faster deployment (presumably [I say presumably since I did not measure it myself], it is faster to deploy a new service instance to SF than a new role to Azure PaaS) and faster redeployment/upgrade. b) denser deployment. One could of course also deploy multiple processes/services into an Azure role to achieve higher density, but that requires some kind of management layer which records which service was deployed to which role instance, etc. This is exactly what SF hosting ("micro services") does.

2) Storage provider: we can leverage SF's in-memory replication and build an in-memory (or in-memory + local disk) storage provider. This would provide a good alternative store option, hopefully with higher throughput at the expense of potentially lower availability (if all replicas are down). The promise of higher throughput should be measured and verified.

3) Alternate implementation of grain directory, with replication, like @sergeybykov listed above.

In addition to those new capabilities, the other points provide tighter integration with SF, but don't provide any new capability that is not already present in Orleans. When considering each one, we should think about its pros and cons.

Specifically, one of the possible tighter integration points is liveness. We can do it like @ReubenBond suggested: "IMembershipTable implementation which leverages Fabric's NamingService or PropertyService." Alternatively, we can go even further and integrate deeper: replace the whole Orleans cluster management with the SF Naming Service. It would mean that Orleans will not send "I am alive" pings between silos and will not even use IMembershipTable (as is done now). Instead it will leverage the SF Naming Service to do all failure detection and membership. Each silo will subscribe to the Naming Service, which provides the membership view and membership change notifications. This option has pros and cons (of course, as anything), which should be considered prior to the implementation. One thing is sure: this is totally doable, and even easy, since it has been done in the past by us. We can provide all the details of how this can be implemented, plus the pros and cons. If there is interest, feel free to open a separate issue.
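
A minimal Python sketch of the subscription-based approach described above, under loud assumptions: `InMemoryNamingService` is a hypothetical in-process stand-in and not the Service Fabric Naming Service API, and `MembershipView` is not an Orleans type. It only illustrates the shape of the idea: the silo keeps a membership view that is updated by pushed notifications, with a `refresh` that re-reads a full snapshot in case a notification was missed.

```python
from typing import Callable, FrozenSet, List, Set

Listener = Callable[[str, str], None]  # called with (event, silo_address)


class InMemoryNamingService:
    """Hypothetical stand-in for an external service that tracks live silos."""

    def __init__(self) -> None:
        self._silos: Set[str] = set()
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def register(self, silo: str) -> None:
        if silo not in self._silos:
            self._silos.add(silo)
            self._notify("up", silo)

    def unregister(self, silo: str) -> None:
        if silo in self._silos:
            self._silos.remove(silo)
            self._notify("down", silo)

    def snapshot(self) -> FrozenSet[str]:
        return frozenset(self._silos)

    def _notify(self, event: str, silo: str) -> None:
        for listener in list(self._listeners):
            listener(event, silo)


class MembershipView:
    """A silo's local view of the cluster, with no silo-to-silo pinging."""

    def __init__(self, naming: InMemoryNamingService) -> None:
        self._naming = naming
        self.active: Set[str] = set()
        naming.subscribe(self._on_event)
        self.refresh()  # start from a full snapshot

    def _on_event(self, event: str, silo: str) -> None:
        if event == "up":
            self.active.add(silo)
        elif event == "down":
            self.active.discard(silo)

    def refresh(self) -> None:
        # Periodic reconciliation in case a notification was missed.
        self.active = set(self._naming.snapshot())
```

In this shape, failure detection is entirely the naming service's job; how quickly and reliably it reports a dead silo is exactly the property that would need to be measured.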

sergeybykov commented 8 years ago

> Specifically, one of the possible tighter integration points is liveness.

This is one of the potential bigger benefits - get much faster and more reliable notification from the fabric than through pinging and voting.

gabikliot commented 8 years ago

Well, fabric does pinging and some form of voting inside as well. Fundamentally, it is not different. Maybe it is faster. Need to run and inject failures and check. Can also tune Orleans's pinging frequency.

sergeybykov commented 8 years ago

Of course. I meant that it hopefully removes the need for pinging and voting at the Orleans Runtime level.

ReubenBond commented 8 years ago

I'll make the consistency/availability trade-off explicit in each of the subsequent issues/PRs.

Service Fabric support won't equate to CP; Orleans can still be AP on SF. CP is only achieved if we perform consensus on every call, which typically means writing to storage in order to replicate state & thus prove we were the primary node at the time of the call. If we truly wanted a CP grain, that would mean performing consensus on every call (even reads), which is quite doable: we would just replicate a null operation. That's another "do it later" thing: not many people will care, and those that do can work around it by forcing replication by writing state.
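
To illustrate the "replicate a null operation" idea, here is a toy Python sketch. It does not model Service Fabric's actual replication protocol; `Replica`, `Primary`, `NotPrimaryError`, and the epoch scheme are assumptions made up for this example. The only point is that a read which first replicates a no-op to a majority fails on a stale primary instead of returning stale data.

```python
from typing import List, Optional, Tuple


class NotPrimaryError(Exception):
    """Raised when a node cannot prove it is still the primary."""


class Replica:
    """Toy secondary: rejects operations from an older epoch."""

    def __init__(self) -> None:
        self.epoch = 0
        self.log: List[Tuple] = []

    def accept(self, epoch: int, op: Tuple) -> bool:
        if epoch < self.epoch:
            return False  # a newer primary has taken over
        self.epoch = epoch
        self.log.append(op)
        return True


class Primary:
    """Toy primary that must reach a majority on every call, including reads."""

    def __init__(self, epoch: int, secondaries: List[Replica]) -> None:
        self.epoch = epoch
        self.secondaries = secondaries
        self.value: Optional[str] = None

    def _replicate(self, op: Tuple) -> None:
        acks = sum(1 for r in self.secondaries if r.accept(self.epoch, op))
        total = len(self.secondaries) + 1  # secondaries plus this node
        if (acks + 1) * 2 <= total:  # no strict majority
            raise NotPrimaryError(f"epoch {self.epoch} lost its quorum")

    def write(self, value: str) -> None:
        self._replicate(("set", value))
        self.value = value

    def read(self) -> Optional[str]:
        # The null operation: replicated only to prove we are still primary.
        self._replicate(("noop",))
        return self.value
```

Without the no-op, `read` on a deposed primary would happily return its old local value, which is the AP behaviour; with it, every read pays a replication round trip.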

Consumers can have AP grains by selecting a different storage provider (although they're all currently CP...). We can select between an AP directory, a CP directory, or no directory (partitioned placement only) if we want, too. I'm not sure we gain anything with a CP directory: a CP directory doesn't make the system CP.

I want to implement these features (with the help of others) in a "progressive enhancement" fashion where we start with something solid and progressively leverage the features which Fabric can provide.

AceHack commented 6 years ago

When looking at using IReliableQueue for streaming I would be careful: they are 100% FIFO and because of that are very SLOW in many cases, in my experience. ReliableConcurrentQueue is a much better and faster implementation in my opinion, and if it could be used I would suggest using that instead.

> Reliable Concurrent Queue is offered as an alternative to Reliable Queue. It should be used in cases where strict FIFO ordering is not required, as guaranteeing FIFO requires a tradeoff with concurrency. Reliable Queue uses locks to enforce FIFO ordering, with at most one transaction allowed to enqueue and at most one transaction allowed to dequeue at a time. In comparison, Reliable Concurrent Queue relaxes the ordering constraint and allows any number of concurrent transactions to interleave their enqueue and dequeue operations. Best-effort ordering is provided, however the relative ordering of two values in a Reliable Concurrent Queue can never be guaranteed.

https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-reliable-services-reliable-concurrent-queue

AceHack commented 6 years ago

You could also go one step further and make Sagas first class, offering Sagas not only to Orleans-based actors but also to Service Fabric-based services and actors.

I see this Orleans Saga-based implementation: https://github.com/creyke/Orleans.Sagas. I'm not sure if that's exactly the same as the Sagas explained here: https://www.youtube.com/watch?v=0UTOLRTwOX0, but it would be very helpful if generalized to all of Service Fabric.

ReubenBond commented 3 years ago

I don't plan on taking this any further, and have not seen a strong need from the community for it. Users inside and outside of Microsoft do run Orleans on Service Fabric, but have not been requesting deeper integration than what we already have or what they can implement themselves.

Of course, we can open a new issue if the desire and willingness to implement it is there.

dotnet / orleans

Support deep Service Fabric integration #1059
