We want to evaluate the Orleans Actor system (https://dotnet.github.io/orleans/) and see if it is a good run-time engine for Texera.

General Architecture

Orleans Architecture taken from gigi.nullneuron.net

Terms and Concepts

Grain - A grain is another name for a virtual actor. The actor is called "virtual" because it may not be in memory when another component sends it a message. If it is not present, the grain is activated on demand.
Silo - A silo is a container for grains. Potentially millions of grains can go in one silo. Typically, we have one silo per machine.
Client - It acts as a gateway between an Orleans cluster (which is running many silos) and the outside world. The name could be misleading, because while it is a client to the Orleans cluster, it is typically also a server for external requests. For example, an Orleans client could provide a REST API that accepts HTTP requests and interacts with grains in the Orleans cluster.
Virtual Actor - Grains in Orleans differ from Akka-style actors. In Orleans, a grain or frontend (client) can call a target grain by using its logical identity (key) without the need to create or instantiate the target grain. If the target grain doesn't exist, it gets created. Also, if the grain lies idle for a long time, it is deactivated and removed from memory; its state survives only if it has been persisted to storage. When referenced again, it is activated and brought back into memory. This behavior is similar to virtual memory. Because actors are virtual, Orleans can handle failures transparently to the application logic: grains are automatically re-instantiated on another server when a failure is detected. Physical instantiations of actors are completely abstracted away and are automatically managed by the Orleans runtime. The sequence of key events in a grain lifecycle looks like the following:
- A grain or a client makes a call to a method of another grain (via a grain reference);
- The target grain gets activated (if not already activated) and an instance of the grain class (called a grain activation) is created;
- The constructor of the grain is executed, which leverages Dependency Injection if applicable;
- If Declarative Persistence is used, the grain state is read from the storage;
- If overridden, OnActivateAsync is called;
- The grain processes incoming requests;
- The grain remains idle for some time;
- The silo runtime decides to deactivate the grain;
- The silo runtime calls OnDeactivateAsync, if overridden;
- The silo runtime removes the grain from memory.

The lifecycle of a grain is shown below: [figure: lifecycle of an Orleans grain]
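
The activate-on-demand idea behind virtual actors can be illustrated without Orleans at all. The following is a toy model only: ToyDirectory and CounterActor are hypothetical names invented for this sketch, they are not part of the Orleans API, and the model has no persistence, so state is lost on deactivation.

```csharp
using System;
using System.Collections.Generic;

// Toy actor: remembers how many times it has been called since activation.
public sealed class CounterActor
{
    public string Key { get; }
    public int Calls { get; private set; }

    public CounterActor(string key) { Key = key; }

    public int Increment() => ++Calls;
}

// Toy stand-in for the runtime's view of which actors are currently in memory.
public sealed class ToyDirectory
{
    private readonly Dictionary<string, CounterActor> _active =
        new Dictionary<string, CounterActor>();

    // Number of times an actor instance had to be created.
    public int Activations { get; private set; }

    public int ActiveCount => _active.Count;

    // Callers address an actor only by its key; it is activated on first use.
    public CounterActor Get(string key)
    {
        if (!_active.TryGetValue(key, out var actor))
        {
            actor = new CounterActor(key);
            _active[key] = actor;
            Activations++;
        }
        return actor;
    }

    // Stand-in for the runtime reclaiming an idle activation.
    public bool Deactivate(string key) => _active.Remove(key);
}
```

Calling Get("a@example.com") twice returns the same instance and counts one activation; after Deactivate, the next Get creates a fresh instance. In Orleans the state would be reloaded at that point if Declarative Persistence were used.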

Split brain - Normally Orleans guarantees that one grain (with a particular ID) can have at most one instance in the cluster. However, if the silo crashes or gets killed without a proper shutdown, there is a 30-second window in which there can be more than one instance. Once that window closes, the duplicate activations are deactivated (are both instances deactivated?). We just need to consider the rare possibility of having two instances of an actor while writing our application. The persistence model guarantees that no writes to storage are blindly overwritten in such a case (what does this mean?).
Turn - A grain activation performs work in chunks and finishes each chunk before it moves on to the next. Chunks of work include method invocations in response to requests from other grains or external clients, and closures scheduled on completion of a previous chunk. The basic unit of execution corresponding to a chunk of work is known as a turn.

Features of Orleans

Stateful middle tier - Business logic entities appear as a sea of globally addressable .NET objects (grains). A grain encapsulates the state of an entity. For example, grains of type "UserProfile" can have a user's email id as the id. So, whenever we need to get the UserProfile of a user, we directly call the grain type with the user's email. This is one of the biggest advantages of the Orleans programming model - we never need to create, instantiate, or delete grains. We can write our code as if all possible grains, for example, millions of user profiles, are always in memory waiting for us to call. When a grain is called, the Orleans runtime creates an instance if it's not there. If the grain had been active before and we are using Grain Persistence, the Orleans runtime reads the saved state from the store upon activation.
Single-threaded grains - Orleans guarantees single-threaded execution of each individual grain, hence protecting the application logic from the perils of concurrency and races. Each request to a grain is processed from beginning to completion before the next request can begin processing. This is true unless the grain is explicitly marked as reentrant. By design, any sub-task spawned from grain code (for example, by using await or ContinueWith or Task.Factory.StartNew) will be dispatched on the same per-activation TPL Task Scheduler as the parent task, and therefore inherits the same single-threaded execution model as the rest of the grain code. This is the main point behind the single-threaded, turn-based concurrency of grains. This Stackoverflow question that I asked clearly exemplifies the concept of single-threaded grains in Orleans.
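
To make the "one request at a time" guarantee concrete, here is a minimal sketch in plain .NET. SerialMailbox is a hypothetical helper written for this note; Orleans itself uses a per-activation task scheduler rather than this chaining trick, so treat it as an analogy, not as how the runtime is implemented.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Runs asynchronous requests strictly one after another, in arrival order.
public sealed class SerialMailbox
{
    private readonly object _gate = new object();
    private Task _tail = Task.CompletedTask;

    // The returned Task completes when this particular request has finished.
    public Task Enqueue(Func<Task> request)
    {
        lock (_gate)
        {
            // Chain the new request behind the previous one. Unwrap turns the
            // Task<Task> produced by ContinueWith into a Task that completes
            // only when the request itself has completed.
            _tail = _tail.ContinueWith(
                _ => request(),
                CancellationToken.None,
                TaskContinuationOptions.None,
                TaskScheduler.Default).Unwrap();
            return _tail;
        }
    }
}
```

Even if a request awaits something in the middle, the next request does not start until the previous one has run to completion, which mirrors the behavior of a non-reentrant grain.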
At-most-once delivery that can be transformed to at-least-once - By default Orleans guarantees at most once delivery of messages. However, if we start to use timeout and retry features, then Orleans guarantees "at least once" delivery.
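
The timeout-and-retry idea can be sketched as follows. This is a generic helper in plain .NET, not an Orleans API: Retry.WithTimeoutAsync and its parameters are hypothetical names, and the send delegate stands in for a grain call.

```csharp
using System;
using System.Threading.Tasks;

public static class Retry
{
    // Re-sends a request until one attempt completes successfully within the
    // timeout, or until maxAttempts attempts have been made.
    public static async Task<T> WithTimeoutAsync<T>(
        Func<Task<T>> send, TimeSpan timeout, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            Task<T> call = send();
            Task winner = await Task.WhenAny(call, Task.Delay(timeout));

            // Accept the reply only if the call finished first and did not fail.
            if (winner == call && call.Status == TaskStatus.RanToCompletion)
            {
                return call.Result;
            }

            if (attempt >= maxAttempts)
            {
                throw new TimeoutException(
                    $"No successful reply after {attempt} attempts.");
            }
        }
    }
}
```

Note that an attempt that timed out may still reach the receiver later, so the receiver can see the same message more than once. That is exactly why retrying yields at-least-once rather than exactly-once delivery, and why the receiver needs some form of de-duplication (for example, sequence numbers) if duplicates matter.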
Clients as Observers - Say a client needs regular score updates of a game from grains. One way is polling the GameGrain. Another approach is to make the clients observers. This approach exposes client-side objects as grain-like targets that can be invoked by grains. However, these calls do not provide any indication of success or failure, as they are sent as one-way best-effort messages. So it is the responsibility of the application code to build a higher-level reliability mechanism on top of observers where necessary. Another mechanism that can be used for delivering asynchronous messages to clients is Streams. Streams expose indications of success or failure of delivery of individual messages, and hence enable reliable communication back to the client.
Orleans streams - supports something like subjects in Angular. An instance of a stream of a particular type is addressable by a namespace and an ID. For example:

// This stream is to deliver sms messages to a chat group with a particular GUID. 
// A use case is that a client can send messages to the group and the grain with this group_guid subscribes to this stream.

// A stream for forwarding sms messages
var streamProvider = GetStreamProvider("SMSProvider");
// Get the reference to a stream
var stream = streamProvider.GetStream<int>(group_guid, chatGroups);

Go here for how Orleans streams is superior to other stream processing engines like Apache Storm, Spark Streaming, Kafka etc. The following are use cases of Orleans streams:

Supports timers and reminders - Timers are for periodic tasks within one activation of a grain. However, reminders allow for invocation of tasks even when a grain isn't active. If the grain isn't active, reminders activate that grain.
Request Context - RequestContext is an Orleans feature that allows application metadata, such as a trace ID, to flow with requests. When a thread (whether client-side or within Orleans) sends a request, the contents of the sending thread’s RequestContext is included with the Orleans message for the request; when the grain code receives the request, the metadata is accessible from the local RequestContext.
Stateless worker grains - There are many tasks that can't be allocated to a single grain because they don't belong to an entity, e.g., unzipping a file before passing it to grains, or aggregating metrics like average VoIP call time. These can be given to stateless worker grains. These grains can be treated as an auto-managed pool of grain activations that automatically scales up and down based on the actual load. The runtime increases the number of such activations (up to a maximum level) as more requests come in and removes activations if they are idle. These grains aren't addressed by a meaningful ID, and there can be more than one activation of the same grain in the cluster. For example, to access worker class A, ID 0 is always used, and for worker class B, ID 1 is used. Two subsequent requests to a Stateless Worker grain may be processed by different activations of it. Requests made to Stateless Worker grains are always executed locally, i.e., they don't involve any remote calls. Therefore, they can be used to hold hot cache items, which are needed by many grains. They can also be used to do Reduce-style aggregations.
Inbuilt support for startup tasks which start as soon as the silo starts
Supports call filters - Incoming and outgoing grain call filters can be used to intercept grain calls. Some use cases of filters are: 1) authorization - checking whether the Request Context carries the permissions needed to invoke the method; 2) logging/telemetry; 3) intercepting exceptions and transforming them into some other kind of exception.
Grain cancellation tokens - Users can use these tokens to cancel an executing grain operation.
Orleans supports transactions (but is currently in Beta) - (What's this?)
Transparent scalability requiring no special effort from programmer side - Grains get automatically created by the Orleans runtime on servers on an as-needed basis to handle requests for those grains. Activation/deactivation of an actor does not incur the cost of registering/unregistering of a physical endpoint, such as a TCP port or a HTTP URL, or even closing a TCP connection.
High throughput with greater stability - The runtime schedules a large number of single-threaded actors across a custom thread pool with one thread per processor core. As actor code is written in a non-blocking way (async mode is a requirement for Orleans methods), CPU utilization can be very high (90%+).
Tearing down/deactivating grains - Normally it is best to let Orleans' runtime handle the deactivation because the runtime automatically detects and deactivates idle activations of a grain to reclaim system resources. Moreover, these deactivations are done in batches rather than one by one. However, if one needs to expedite the deactivation, base.DeactivateOnIdle() can be called.
Silos with new grain classes or new versions of existing grain classes can be added to a running cluster.
Automatic propagation of errors - The runtime automatically propagates unhandled errors up the call chain with the semantics of asynchronous and distributed try/catch. As a result, errors do not get lost within an application. This allows the programmer to put error handling logic at the appropriate places, without the tedious work of manually propagating errors at each level.
When a grain fails in the middle of processing a request, the caller receives an exception. The exception is returned by the Orleans runtime (I think so!!) - Note also that there is a delay between the time when a silo fails and when the Orleans cluster detects the failure. The delay is a configurable tradeoff between the speed of detection and the probability of false positives.
Cooperative multitasking - When a grain call takes too much time, the execution of the grain isn't pre-empted. Instead the runtime generates warnings that the programmer can detect. Keep in mind that grain calls should not execute any long running tasks like IO operations synchronously and should not block on other tasks to complete. All waiting should be done asynchronously using the await keyword or other asynchronous waiting mechanisms. Grains should return as soon as possible to let other grains execute for maximum throughput.
Inbuilt support for telemetry
Supports dependency injection and has unit testing framework

Open Questions

[ ] Message delivery guarantees: what should we do if only the special row (row.id == -1) gets dropped?
[ ] Message ordering guarantees: what should we do if the special row (row.id == -1) arrives before other normal rows?
[ ] How do we implement the Join operator by using intermediate operators?
[ ] How do we implement a workflow in Orleans?
[ ] Right now the Orleans retry mechanism only "nearly guarantees" at-least-once delivery by retrying many times, and we built our exactly-once delivery on top of it. Do we need to implement an Ack from the receiver to the sender to completely guarantee at-least-once delivery?
[ ] During the performance experiment, RPC implementation runs much slower than stream implementation, why?
[ ] The current sequence number implementation only allows communication between two grains. Communication from one grain to multiple grains is not allowed. As a result, the current implementation is not scalable and does not support one-to-many communication. How can we solve this issue?
[ ] If we dynamically increase the number of grains(like the implementation of Chi), how do we implement the communication between grains?
[ ] Comparison between Microsoft Orleans and Akka. For example, Orleans supports fault tolerance, does Akka support it?
[ ] Message delivery. Exactly once and FIFO guarantee.
[ ] Orleans Stateless workers.
[ ] Orleans Reactive Streams.
[x] What happens when a client or a grain calls a method on a grain? Does the caller block till the callee returns a task?
[x] How does Orleans know that more actors need to be created? Suppose there are 80 actors running, but all are fully used, how will Orleans create a new actor?
[x] Say I have already created a grain with ID 2 and some other client also creates a grain with ID 2, then the reference to the same grain will be returned. But that is not what the other client wanted. It wanted a new grain.
[x] What happens when a grain method is invoked in Orleans? Is it a straightforward method call with the caller's thread getting blocked?

1. How does Orleans know that more actors need to be created? Suppose there are 80 actors running, but all are fully used, how will Orleans create a new actor?

Orleans doesn't create actors on its own. Say a client tries to get an actor (grain) of type UserProfile and calls it with an ID, for example the user's email id. The Orleans runtime then creates a new grain with that ID in one of the silos.

2. Say I have already created a grain with ID 2 and some other client also creates a grain with ID 2, then the reference to the same grain will be returned. But that is not what the other client wanted. It wanted a new grain.

We don't randomly assign IDs to grains. The IDs are something which can map to the real world. For example, for UserProfile grain, the ID is the user's email id. If the grain doesn't exist, it gets created. If it exists, it just gets activated. In both cases, the caller gets what it wanted.

3. What happens when a grain method is invoked in Orleans? Is it a straightforward method call with the caller's thread getting blocked?

This stackoverflow post talks about method invocation in Orleans. This blog says that when a method is invoked, it is actually a message being sent. The message (the method's parameters) is deep-copied, serialized, transmitted to the correct silo, deserialized and then queued for processing by the receiving grain. Then, the method is invoked on the receiving grain. The message is passed asynchronously and thus the caller isn't immediately aware of the success/failure of the method invocation. The method invocation results in a Promise, which is implemented as a .NET Task.

The Orleans runtime schedules work as a sequence of turns. A turn is the execution of a grain until a Promise has been returned (which happens when an await statement is reached, when the closure following an await finishes, or when a completed or uncompleted task is returned). So, one request (method invocation) can result in several turns, as the method can have await in many places. As the grain is single-threaded, only one turn executes at a time. Turns of different requests aren't interleaved either. In fact, to maintain consistency, the runtime schedules all turns of a request before processing any other request.

Now consider the case where a method (request) in grain-1 calls a method (request) of grain-2. The grain-2 method returns a Task. grain-1 can then do one of two things: await the Task, or continue without awaiting it. If grain-1 awaits the Task, the current turn on grain-1 ends, but the method itself is not finished. grain-1 is therefore logically blocked: it cannot process turns of any other request until it has processed this method completely. Likewise, any sub-Tasks spawned from grain code (for example, by using await, ContinueWith, or Task.Factory.StartNew) run in the grain context by default and therefore occupy the grain's single thread of execution. If, in rare cases, you want to escape this, you have to start the new task with Task.Run() or through the endMethod delegate of Task.Factory.FromAsync, which run on the .NET ThreadPool task scheduler rather than the Orleans task scheduler. More on this here.

4. What happens when a client or a grain calls a method call to a grain? Does the caller block till the callee returns a task?

We did a few experiments for this. In the first experiment, we had a client call the grain every second, as shown below:

    while (true)
    {
        Console.WriteLine("Client giving another request");
        double temperature = random.NextDouble() * 40;
        // Always target grain 500 so that every request queues on the same grain.
        var sensor = client.GetGrain<ITemperatureSensorGrain>(500);
        // Do not await here: we only want to observe whether the call itself blocks.
        Task t = sensor.SubmitTemperatureAsync((float)temperature);
        Console.WriteLine("Task Status - " + t.Status);
        Thread.Sleep(1000);
    }

The grain method waited for 10 seconds before completing. If the call were blocking for the client, the client would issue its next request only after 10 seconds. The console output shows that this is not the case:

Client giving another request
Task Status - WaitingForActivation
500 outer received temperature: 32.29987
Client giving another request     <--------------------- client continues
Task Status - WaitingForActivation
Client giving another request
Task Status - WaitingForActivation
Client giving another request
Task Status - WaitingForActivation
Client giving another request
Task Status - WaitingForActivation
Client giving another request
Task Status - WaitingForActivation
Client giving another request
Task Status - WaitingForActivation
Client giving another request
Task Status - WaitingForActivation
Client giving another request
Task Status - WaitingForActivation
Client giving another request
Task Status - WaitingForActivation
Client giving another request
Task Status - WaitingForActivation
500 outer complete
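
The same behavior can be reproduced in plain .NET without a cluster. In the sketch below, SubmitTemperatureAsync is a local stand-in for the grain method (it is not the grain from the experiment), and the 200 ms delay is an arbitrary choice.

```csharp
using System;
using System.Threading.Tasks;

public static class NonBlockingCallDemo
{
    // Stand-in for a grain method that needs a while to finish (not Orleans code).
    public static async Task<float> SubmitTemperatureAsync(float temperature, int delayMs)
    {
        await Task.Delay(delayMs);
        return temperature;
    }

    public static async Task<float> RunAsync()
    {
        // The call returns a Task immediately, long before the work is done.
        Task<float> pending = SubmitTemperatureAsync(32.3f, 200);

        // The caller is free to do other work here.
        Console.WriteLine("Task Status - " + pending.Status);

        // Awaiting suspends only this method until the result is available.
        return await pending;
    }
}
```

The caller gets the Task back right away and only waits if and when it chooses to await it, which matches what the client observed in the experiment.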

Design Questions

How big/small should a grain be? - A grain should represent an entity that has an isolated context of state and computation, e.g., SessionManager or UserManager. A single grain can handle up to a thousand trivial calls per second. However, a grain whose logic makes it receive hundreds of requests per second should be treated with caution: reaching that point is a signal that the grain should be decomposed into smaller grains. On the other hand, if a grain's logic involves too much interaction with other grains (which leads to high messaging overhead), it is a signal that these closely interacting smaller grains should be aggregated into bigger grains, so that they invoke each other directly.
Grains are single-threaded, i.e., a single grain activation works on a single request at a time. This can cause hotspots, so we should avoid having a grain which handles a lot of requests, because all requests will pass serially through it. However, this doesn't mean that we can't have logical central points of communication. For example, say a grain is an aggregator of some statistic. Instead of all metrics going to a single grain for aggregation, we can have intermediate aggregators which each aggregate metrics from a section of grains. Then, one grain can aggregate the metrics from these intermediate grains.
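
The intermediate-aggregator idea can be sketched with plain collections. TwoLevelAggregation is a hypothetical helper for illustration; in an Orleans application, Summarize would run inside each intermediate grain and Average inside the single top-level grain. The example assumes the statistic is an average, which has to be combined from (sum, count) pairs rather than from per-section averages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TwoLevelAggregation
{
    // Level 1: an intermediate aggregator summarizes only its own section.
    public static (double Sum, int Count) Summarize(IEnumerable<double> section)
    {
        double sum = 0;
        int count = 0;
        foreach (double value in section)
        {
            sum += value;
            count++;
        }
        return (sum, count);
    }

    // Level 2: the top-level aggregator sees one partial result per section
    // instead of one message per metric.
    public static double Average(IEnumerable<IEnumerable<double>> sections)
    {
        double sum = 0;
        int count = 0;
        foreach (var partial in sections.Select(Summarize))
        {
            sum += partial.Sum;
            count += partial.Count;
        }
        return count == 0 ? 0.0 : sum / count;
    }
}
```

With N metrics spread over S sections, the top-level aggregator handles S messages instead of N, which is what relieves the hotspot.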

Important coding snippets/practices (Things to be kept in mind)

Sometimes a frontend facing grain may have to invoke tasks on many backend grains and continue when all of those tasks have been done.

List<Task> tasks = new List<Task>();
Message notification = CreateNewMessage(text);

foreach (ISubscriber subscriber in subscribers)
{
  tasks.Add(subscriber.Notify(notification));
}

// WhenAll joins a collection of tasks and returns a joined Task that is resolved when all of the
// individual notification Tasks are resolved.
Task joinedTask = Task.WhenAll(tasks);
await joinedTask;

// Execution of the rest of the method will continue asynchronously after joinedTask is resolved.

A target grain can be called from a grain or from a client. The calling syntax is the same in both cases. The major difference between making calls from client code and from within another grain is the execution model: grains are constrained to be single-threaded by the Orleans runtime, while clients may be multi-threaded. Orleans does not provide any such guarantee on the client side, so it is up to the client to manage its own concurrency using whatever synchronization constructs are appropriate for its environment: locks, events, Tasks, etc.
Grain Persistence - If the programmer wants grain state to be persisted in storage, the grain class should inherit from Grain<T>, where T is the grain state type. If a grain is marked for persistence, Orleans automatically reads its state from storage whenever it is activated, before OnActivateAsync() is called. However, writing the grain state has to be done explicitly by calling base.WriteStateAsync(). If the grain state has to be read again, we can call base.ReadStateAsync(), but this will overwrite the current in-memory state.

public class MyGrainState
{
  public int Field1 { get; set; }
  public string Field2 { get; set; }
}

[StorageProvider(ProviderName="store1")]
public class MyPersistenceGrain : Grain<MyGrainState>, IMyPersistenceGrain
{
  ...
}

// Grain state write
public Task DoWrite(int val)
{
  State.Field1 = val;
  return base.WriteStateAsync();
}

// Grain state refresh
public async Task<int> DoRead()
{
  await base.ReadStateAsync();
  return State.Field1;
}

Timers and reminders - These let developers add periodic behavior to grains. Timers are used to create periodic grain behavior that isn't required to span multiple activations. They are subject to the single-threaded execution guarantees of the grain activation in which they operate. Each activation may have zero or more timers associated with it. The runtime executes each timer routine within the runtime context of the activation that it is associated with. Reminders, on the other hand, are associated with a grain and not with an activation. So, if there is no activation when a reminder ticks, an activation will be created. Reminders are delivered by message and are subject to the same interleaving semantics as all other grain methods. Reminders need storage to work, as they are persistent.

Hello World implementation on two machines

We will be using two Ubuntu servers to run a basic HelloWorld program. One machine will run the silo and the other the client. We will provide MySQL to Orleans to maintain cluster membership. The code is a simple modification of the Gigilabs code from the link mentioned above. The client sends fake temperature readings to different grains, which simply print them to the terminal.

1. Install .Net Core on both machines

Use this to install pre-requisite libraries.
Use this to install both .Net SDK and runtime.

2. Start Silo on one machine

Create a mysql database called 'orleanstest'. Then, run the scripts MySQL-Main.sql, MySQL-Clustering.sql to create the necessary tables and insert entries in the database.

Use the following code snippet (the entire code is in the sandbox folder). Note that you have to set the username and password in the connection string. Also, we are not currently using SSL connections to the MySQL server. The code below starts the silo and then waits until some input is given to it:

        const string connectionString = "server=<hostname>;uid=<username>;pwd=<password>;database=orleanstest;SslMode=none";

        var silo = new SiloHostBuilder()
        .Configure<ClusterOptions>(options =>
        {
            options.ClusterId = "dev";
            options.ServiceId = "Orleans2GettingStarted2";
        })
        .UseAdoNetClustering(options =>
        {
            options.ConnectionString = connectionString;
            options.Invariant = "MySql.Data.MySqlClient";
        })
        .ConfigureEndpoints(siloPort: 11111, gatewayPort: 30000)
        .ConfigureLogging(builder => builder.SetMinimumLevel(LogLevel.Warning).AddConsole())
        .Build();

        await silo.StartAsync();
        // Wait for user's input, otherwise it will immediately exit.
        Console.ReadLine();

3. Start client on different machine

The client uses the same MySQL database as the silo, so the connection string should be the same.

Use the following code to start the client:

        const string connectionString = "server=<hostname>;uid=<username>;pwd=<password>;database=orleanstest;SslMode=none";
        var clientBuilder = new ClientBuilder()
            .Configure<ClusterOptions>(options =>
            {
                options.ClusterId = "dev";
                options.ServiceId = "Orleans2GettingStarted2";
            })
            .UseAdoNetClustering(options =>
            {
                options.ConnectionString = connectionString;
                options.Invariant = "MySql.Data.MySqlClient";
            })
            .ConfigureLogging(builder => builder.SetMinimumLevel(LogLevel.Warning).AddConsole());

        using (var client = clientBuilder.Build())
        {
            await client.Connect();

            var random = new Random();

            while (true)
            {
                int grainId = random.Next(0, 500);
                double temperature = random.NextDouble() * 40;
                var sensor = client.GetGrain<ITemperatureSensorGrain>(grainId);
                await sensor.SubmitTemperatureAsync((float)temperature);
            }
        }

Orleans evaluation is successful.

Microsoft Orleans evaluation #645