Evaluate the C++ Actor Framework (CAF)

1st Weekly Update

Installing CAF:

git clone https://github.com/actor-framework/actor-framework
cd actor-framework
./configure
make
make install [as root, optional]
make test

(CAF requires GCC 4.8 as the minimum compiler version.)

Issue when running a CAF program:

The CAF shared object files (.so) compiled on my Ubuntu machine differ from the ones compiled on the Google Cloud machines. When I tried to run my CAF program on the Google Cloud machines, it could not find some symbols in the .so files. After I replaced those .so files with the ones built on my Ubuntu machine, the program ran normally.

Solution: change the OS version of the Google Cloud machines from Ubuntu 14.04 to 16.04 (Xenial image).

Manager & Workers system architecture:

[Figure: Manager & Workers system architecture diagram]

How it works:

In the system shown above, we have 2 types of actors: Manager and Worker. The example has only one Manager (there may be more in the future). The Manager maintains a list of Workers (this could also be CAF's predefined actor_pool class) and initializes it when spawned by the system. Similarly, each Worker keeps the address of the Manager it serves so that it can send communications back.

We also have 3 types of communications: message, work and switch.

A "message" is a string carrying information for the receiver. Both sides can send it (e.g. a Worker sends a message when it finishes its work).

From the Manager's view: [Figure: message handling, Manager side] From the Worker's view: [Figure: message handling, Worker side]

A "work" is a general serializable object. Only the Manager can send work to Workers, telling them to do some specific task. From the Manager's view: [Figure: work handling, Manager side] From the Worker's view: [Figure: work handling, Worker side]

A "switch" is another general serializable object. Only the Manager can send a switch to Workers, telling them to change their behaviors. From the Manager's view: [Figure: switch handling, Manager side] From the Worker's view: [Figure: switch handling, Worker side]
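The three communication types can be pictured outside CAF as a tagged union. The following is only a sketch in plain C++17, not the demo code: the names `Message`, `Work`, `Switch` and `WorkerModel` are made up for illustration, and a real CAF actor would use message handlers instead of `std::variant`.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the three communication types.
struct Message { std::string text; };      // may flow in both directions
struct Work    { std::string payload; };   // Manager -> Worker only
struct Switch  { std::string behavior; };  // Manager -> Worker only

using Communication = std::variant<Message, Work, Switch>;

// A toy Worker that records what it did for each communication.
struct WorkerModel {
    std::string behavior = "default";
    std::vector<std::string> log;

    void receive(const Communication& c) {
        if (auto m = std::get_if<Message>(&c)) {
            log.push_back("message: " + m->text);
        } else if (auto w = std::get_if<Work>(&c)) {
            log.push_back("work(" + behavior + "): " + w->payload);
        } else if (auto s = std::get_if<Switch>(&c)) {
            assert(!s->behavior.empty());  // a switch must name a behavior
            behavior = s->behavior;        // changes how later work is handled
            log.push_back("switch: " + s->behavior);
        }
    }
};
```

Calling `receive(Work{"a"})`, then `receive(Switch{"fast"})`, then `receive(Work{"b"})` logs the second piece of work under the new behavior, which is the effect a switch is meant to have.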

How the system runs:

Execution starts from a fixed entry point called "caf_main" (the usual entry point of CAF programs). Before calling any user-defined code, CAF initializes itself as well as other required components (e.g. the middleman) and stores them in a class called "actor_system".

What is the middleman? The middleman is the main component of the I/O module and enables distribution. It transparently manages proxy actor instances representing remote actors, maintains connections to other nodes, and takes care of serialization of messages (page 68 of the CAF manual).
- On manager machines, the system spawns a Manager and publishes it on a user-specified port.
- On worker machines, the system spawns a Worker and tries to connect to a Manager through a user-specified host and port.
Users can send messages or work through the Manager to all connected Workers or to one specific connected Worker.

Inheritance:

CAF supports a limited form of inheritance. We can make composable behaviors and generate new behaviors by composing existing ones. This differs from real inheritance: we cannot inherit the member variables of actors, only their behaviors.
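To illustrate what "composing behaviors" means without relying on CAF's own classes, here is a small sketch in plain C++17. `Behavior` and `compose` are hypothetical names chosen for this example and are not CAF's API. The point is that the composed behavior reuses the handlers of both parts while sharing no member variables with them.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// A behavior maps an incoming message to a reply, or declines with nullopt.
using Behavior = std::function<std::optional<std::string>(const std::string&)>;

// Compose two behaviors: try `first`, and fall back to `second` if it declines.
Behavior compose(Behavior first, Behavior second) {
    assert(first && second);  // both behaviors must be callable
    return [first, second](const std::string& msg) {
        auto reply = first(msg);
        return reply ? reply : second(msg);
    };
}
```

A behavior that only answers "ping" composed with an echo behavior answers both kinds of messages, but neither part can see the other's state.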

2nd Weekly Update

Introduce RapidJSON (a third-party library under the MIT license):

RapidJSON is a JSON parser and generator for C++. It was inspired by RapidXml.

RapidJSON is small but complete. It supports both SAX and DOM style API. The SAX parser is only a half thousand lines of code.
RapidJSON is fast. Its performance can be comparable to strlen(). It also optionally supports SSE2/SSE4.2 for acceleration.
RapidJSON is self-contained and header-only. It does not depend on external libraries such as BOOST. It even does not depend on STL.
RapidJSON is memory-friendly. Each JSON value occupies exactly 16 bytes for most 32/64-bit machines (excluding text string). By default it uses a fast memory allocator, and the parser allocates memory compactly during parsing.
RapidJSON is Unicode-friendly. It supports UTF-8, UTF-16, UTF-32 (LE & BE), and their detection, validation and transcoding internally. For example, you can read a UTF-8 file and let RapidJSON transcode the JSON strings into UTF-16 in the DOM. It also supports surrogates and "\u0000" (null character).
documentation: http://rapidjson.org/
Installing RapidJSON:
```
git clone https://github.com/Tencent/rapidjson.git
cd rapidjson
cmake .
make
sudo make install
```
Workflow for Distributed System with CAF:

basic elements of the system:
Communication Types:
- message (plain-text messages, directly print out once received)
- work (send strings (serialized work) to the receiver)
- register (register a worker to a dispatcher)
- response (send the result of the work to the sender)
- check (tell a worker to start requesting work)
- request (request work, only workers -> the dispatcher)
- connect (a worker machine successfully connects to the manager machine)
- disconnect (a worker machine disconnects from the manager machine)
Operator (JSON object, a mimic of real Texera operators):
- ID (the unique identifier of the operator)
- type (the type of the operator)
- from (list of operator ID's that it takes input from)
- to (list of operator ID's that it sends output to)
- other attributes (varies among different operators)
Actor:
- Internal state (variables maintained by the actor, like member variables)
- Behaviors (define what the actor does when it receives certain communications (communication_type, args...))
- ID (the unique identifier of the actor, created by the actor system)
Manager(derived from Actor):
- Internal state:
  - list of <hostname,port> of connected machines in the cluster.
- Behaviors(When receiving):
  - (message,string): print the string out
  - (work,string): read the .json file indicated by the string, then create 1 dispatcher and several workers, send(register, the IDs of workers) to the dispatcher, then send(work, JSON as string) to the dispatcher
  - (connect,<hostname,port>): add the <hostname,port> to the list in its internal state
  - (disconnect,<hostname,port>): remove the <hostname,port> from the list in its internal state
Dispatcher(a.k.a. Agent)(derived from Actor):
- Internal state:
  - current JSON it is processing
  - a set of working workers under its control
  - a set of idle workers under its control
  - a map from operator.ID to file streams
  - a map from operator.ID to its results (JSON objects, transferred as strings)
- Behaviors(When receiving):
  - (register, worker.ID): put the worker into the set of working workers
  - (work, string): parse the string into a JSON object and create file streams for the Scan operators, then send(check) to every worker.
  - (response, operator.ID, string): put the string into map[operator.ID], then find an available operator (one whose inputs are ready) in the current JSON and send(work, operator) to the sender. If there is no available operator, put the sender into the set of idle workers. If all workers are in the set of idle workers, kill itself and all workers.
  - (request): find an available operator (one whose inputs are ready) in the current JSON and send(work, operator) to the sender. If there is no available operator, put the sender into the set of idle workers. If all workers are in the set of idle workers, kill itself and all workers.
Worker(derived from Actor):
- Internal state:
- Behaviors(When receiving):
  - (check): send(request) to the sender
  - (work, string): parse the string into an operator and "execute the operator", then send(response, operator.ID, the result) to the sender
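The dispatcher's "find an available operator" step from the lists above can be sketched in plain C++17 without CAF or RapidJSON. This is an illustration under assumptions: the `Operator` struct keeps only the `id` and `from` fields, and `next_available` and `run_sequentially` are hypothetical helper names, not functions from the demo.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified operator: only the fields needed for scheduling.
struct Operator {
    std::string id;
    std::vector<std::string> from;  // IDs of the operators it takes input from
};

// Return the first operator that has not been handed out yet and whose
// inputs have all produced results; nullopt means the worker goes idle.
std::optional<std::string> next_available(
        const std::vector<Operator>& workflow,
        const std::set<std::string>& finished,
        const std::set<std::string>& assigned) {
    for (const auto& op : workflow) {
        if (assigned.count(op.id)) continue;
        bool ready = true;
        for (const auto& input : op.from) {
            if (!finished.count(input)) { ready = false; break; }
        }
        if (ready) return op.id;
    }
    return std::nullopt;
}

// Run a workflow with a single worker and return the execution order.
std::vector<std::string> run_sequentially(const std::vector<Operator>& workflow) {
    std::set<std::string> finished, assigned;
    std::vector<std::string> order;
    while (auto id = next_available(workflow, finished, assigned)) {
        assigned.insert(*id);  // corresponds to send(work, operator)
        finished.insert(*id);  // corresponds to (response, operator.ID, result)
        order.push_back(*id);
    }
    assert(order.size() <= workflow.size());
    return order;
}
```

With several workers the same `next_available` call would serve both the (request) and the (response) behaviors. A `nullopt` result is the case where the sender is moved to the set of idle workers.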

Step-by-step demonstration:

https://www.youtube.com/watch?v=HXvK5hyRc5M

3rd Weekly Update

About architecture change

Current architecture:

[Figure: current architecture diagram]

Pros:

easy to implement the pause & resume feature
Workers don't need to know the entire workflow (saves memory)
Workers can execute every operator (versatility)

Cons:

a lot of redundant communication
the Agent has too much work to do (bottleneck)

Two ways to modify the current architecture to avoid bottleneck:

Spark-like:

[Figure: Spark-like architecture diagram]

Pros:

easy to implement the pause & resume feature
relieves the pressure on the Agent
Workers can execute every operator (versatility)

Cons:

every actor needs some space to store the entire workflow
the Agent still risks becoming the bottleneck when the workflow contains many blocking operators

Multiple Agents (a live example: DtCraft):

[Figure: multiple-Agents architecture diagram]

Pros:

solves the bottleneck problem completely
needs little extra space
once one part has completely finished, its Workers can be reused or killed

Cons:

requires a large modification of the existing codebase
hard to implement the pause & resume feature
some Workers and Agents will be completely idle at the beginning of the execution
hard to determine how to split the workflow and how many Workers should be spawned
the Manager needs to do more work

Performance measure

My machine information:

Machine type: g1-small (1 vCPU, 1.7 GB memory)
CPU platform: Intel Haswell

Actor creation:

From the benchmarks: This experiment computes 2^20 by recursively creating actors. In each step N, an actor spawns two additional actors of recursion counter N-1 and waits for the (sub)results of the recursive descent. This benchmark creates more than one million actors, primarily revealing the overhead for actor creation. Note that this algorithm does not imply the coexistence of one million actors at the same time.

My experiment: My experiment creates one million trivial actors and guarantees that they all coexist at the same time. The result is the average of 10 runs on one of my Google Cloud machines.

Result:
- average time usage: 10.465 s => 0.010465 ms per trivial actor
- average memory usage: 14676 KB = 14.3 MB
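The numbers above can be checked with a few lines of arithmetic. This sketch assumes that the benchmark's recursion stops at step 0 and that every step, including the root, is its own actor. That gives a full binary tree and is my reading of the description, not something taken from the benchmark source. The function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Number of actors created when an actor at step n spawns two children at
// step n - 1 and the recursion stops at step 0 (a full binary tree).
std::uint64_t actors_spawned(unsigned n) {
    assert(n < 63);  // keep the shift below the width of uint64_t
    return (std::uint64_t{1} << (n + 1)) - 1;
}

// Value computed by the recursion: each leaf contributes 1, so the root gets 2^n.
std::uint64_t recursion_result(unsigned n) {
    return n == 0 ? 1 : 2 * recursion_result(n - 1);
}

// Average cost per actor in milliseconds, given a total run time in seconds.
double ms_per_actor(double total_seconds, std::uint64_t actors) {
    assert(actors > 0);
    return total_seconds * 1000.0 / static_cast<double>(actors);
}
```

Under that assumption, `recursion_result(20)` is 1,048,576 and `actors_spawned(20)` is 2,097,151, consistent with "more than one million actors". `ms_per_actor(10.465, 1000000)` reproduces the 0.010465 ms per actor reported for my experiment.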

Mailbox performance in N:1 communication scenario ("bottleneck" scenario):

From the benchmarks: This experiment uses 100 actors, each sending 1,000,000 messages to a single receiver. The minimal runtime of this benchmark is the time the receiving actor needs to process its 100,000,000 messages.

My experiment: My experiment uses 100 actors, each sending 100,000 messages (since my machine only has 1.7 GB of RAM) to a single receiver, using exactly the same code as the benchmark. The result is the average of 10 runs on one of my Google Cloud machines.

Result:
- average time usage: 6.674 s
- average memory usage: 724054 KB = 707.08 MB
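For comparison with other runs it can help to turn these results into a throughput figure. A minimal sketch of the arithmetic, with hypothetical helper names and assuming 1 KB = 1024 bytes:

```cpp
#include <cassert>

// Messages handled per second by the single receiver.
double messages_per_second(long senders, long messages_each, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(senders) * static_cast<double>(messages_each) / seconds;
}

// Convert kilobytes to megabytes (1 MB = 1024 KB).
double kb_to_mb(double kb) {
    return kb / 1024.0;
}
```

With 100 senders, 100,000 messages each and 6.674 s, `messages_per_second` gives roughly 1.5 million messages per second on this g1-small machine. `kb_to_mb(724054)` reproduces the 707.08 MB figure.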

4th Weekly Update

Pause & Resume functionality

I've implemented the pause & resume functionality in my demo distributed system, based on two special message types: Pause and Resume.

The whole process in detail:

When the Manager receives a signal of "pause the execution process of Agent A", it will send a Pause message to Agent A. (Resume has exactly the same behavior)

[=](pause_atom,int idx)
{
    auto& running_agents = self->state.running_agents;
    if (running_agents.find(idx)!=running_agents.end())
        self->send(running_agents[idx], pause_atom::value);
    else
        aout(self) << "invalid Agent index" << endl;
}

The Agent then propagates the Pause message to its Workers. It iterates over its Workers, sends each one the Pause message and waits up to 200 ms for the response. If a Worker fails to respond, the Agent sends it a Pause or Resume message again, depending on the current workflow state, since the user may want to resume the workflow when only some of the Workers have paused successfully. This repeats until every Worker has responded. (Resume has exactly the same behavior)

[=](pause_atom)
{
    self->state.is_paused = true;
    for (auto i : self->state.workers)
    {
        self->request(i, chrono::milliseconds(200), pause_atom::value).await(
        [=]()
        {
            self->state.workers.erase(i);
            self->state.paused_workers.insert(i);
            if (self->state.workers.empty())
                aout(self) << "all Workers paused!" << endl;
        },
        [=](const error& err)
        {
            if (self->state.is_paused)
                self->send(self, pause_atom::value, i);
            else
                self->send(self, resume_atom::value, i);
            aout(self) << self->system().render(err) << endl;
        });
    }
    aout(self) << "sent pause to all Workers" << endl;
},
[=](pause_atom, const Worker& worker)
{
    self->request(worker, chrono::milliseconds(200), pause_atom::value).await(
    [=]()
    {
        self->state.workers.erase(worker);
        self->state.paused_workers.insert(worker);
        if (self->state.workers.empty())
            aout(self) << "all Workers paused!" << endl;
    },
    [=](const error& err)
    {
        if (self->state.is_paused)
            self->send(self, pause_atom::value, worker);
        else
            self->send(self, resume_atom::value, worker);
        aout(self) << self->system().render(err) << endl;
    });
}
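The retry logic can be tried out without CAF. In this sketch the request with a 200 ms timeout is replaced by a callback that reports whether a Worker acknowledged. `AgentModel`, `TryPause` and the `max_rounds` limit are assumptions added for the example; the CAF code above retries without an upper bound.

```cpp
#include <cassert>
#include <functional>
#include <set>

// Stand-in for "send Pause and wait up to 200 ms for the reply": returns true
// if the worker acknowledged. The signature is hypothetical.
using TryPause = std::function<bool(int worker, int attempt)>;

struct AgentModel {
    std::set<int> workers;         // workers that are still running
    std::set<int> paused_workers;  // workers that acknowledged the Pause message

    // Keep re-sending Pause to workers that did not answer. Returns the
    // number of rounds needed, or -1 if max_rounds was not enough.
    int pause_all(const TryPause& try_pause, int max_rounds) {
        assert(max_rounds > 0);
        for (int round = 1; round <= max_rounds; ++round) {
            std::set<int> failed;
            for (int w : workers) {
                if (try_pause(w, round)) {
                    paused_workers.insert(w);
                } else {
                    failed.insert(w);
                }
            }
            workers = failed;  // only the failed workers are retried
            if (workers.empty()) return round;
        }
        return -1;
    }
};
```

If one of three Workers only answers on its third attempt, `pause_all` returns 3 and all three end up in `paused_workers`.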

When the Worker receives the Pause message, it sets its need_pause flag to true. Whenever it receives new work or wants to continue working, that flag stops it until a Resume message sets the flag back to false.

[=](work_atom,int idx,vector<vector<string>>& input)
{
    // check if the work is valid
    if (idx == -1) return;
    // check if the work needs to be paused
    if (self->state.need_pause)
    {
        self->state.pending_work.emplace_back(make_pair(idx, move(input)));
        return;
    }
    // do the work...
},
[=](pause_atom) 
{
    self->state.need_pause = true;
},
[=](resume_atom)
{
    self->state.need_pause = false;
    //release the pending work
    while (!self->state.pending_work.empty())
    {
        // move the entry out of the container instead of copying it
        auto temp = move(self->state.pending_work.back());
        self->send(self, work_atom::value, temp.first,move(temp.second));
        self->state.pending_work.pop_back();
    }
}
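The Worker's flag-and-queue logic is easy to test in isolation. The sketch below models it in plain C++17. `PausableWorker` and its member functions are hypothetical names, and work items are reduced to strings. One deliberate difference from the handlers above: deferred work is released oldest-first here, while the CAF code pops from the back of the container.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <utility>
#include <vector>

struct PausableWorker {
    bool need_pause = false;
    std::deque<std::string> pending_work;  // work deferred while paused
    std::vector<std::string> done;         // work that was actually executed

    void on_work(std::string item) {
        if (need_pause) {
            pending_work.push_back(std::move(item));
            return;
        }
        done.push_back(std::move(item));  // stands in for "do the work"
    }

    void on_pause() { need_pause = true; }

    void on_resume() {
        need_pause = false;
        // Release the deferred work in arrival order.
        while (!pending_work.empty()) {
            std::string item = std::move(pending_work.front());
            pending_work.pop_front();
            on_work(std::move(item));
        }
        assert(pending_work.empty());
    }
};
```

Work received between `on_pause()` and `on_resume()` is held back and only shows up in `done` after the resume.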

Why I didn't use:

Messages With Priority

It seems that CAF does not support remotely spawning an actor with a priority-aware mailbox. After investigating the codebase, I found that the remote_spawn function eventually calls a separate spawn function. That spawn function differs from the normal one and is marked as "experimental", and it does not have the template argument called "spawn_options" (maybe it is difficult to transfer a template argument through the network). The other spawn functions all have "spawn_options", where the programmer can use the "priority_aware" flag to spawn an actor with a priority-aware mailbox. According to the manual, "Actors that should evaluate priorities must be spawned using the priority_aware flag". Thus, I believe that CAF cannot spawn an actor with a priority-aware mailbox remotely for now.

However, losing that functionality is not a big deal. Workers request more work only after they finish their current work, and the Agent never sends work to a Worker spontaneously. That means a Worker never has a large backlog of messages in its mailbox, so Pause & Resume messages are handled quickly even with normal priority. In addition, even with priority messages, there is no way to pause a Worker while it is processing an operator, since every behavior is atomic (it runs to completion before the next message is handled).
~~Simultaneous Message Sending~~ ~~It is not supported to send Pause & Resume messages to all Workers simultaneously then wait for all their response in CAF. But the current approach is similar to that.~~
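For readers unfamiliar with the idea, this is what a priority-aware mailbox buys: a Pause message overtakes work that is already queued. The sketch below is a plain C++17 model built on `std::priority_queue`. It is not how CAF implements its mailbox, and `Envelope` and `PriorityMailbox` are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical mailbox entry: a higher `priority` is delivered first and
// `seq` keeps arrival order among entries of equal priority.
struct Envelope {
    int priority;
    std::size_t seq;
    std::string content;
};

// Ordering for std::priority_queue: returns true if `a` is delivered after `b`.
struct DeliveredLater {
    bool operator()(const Envelope& a, const Envelope& b) const {
        if (a.priority != b.priority) return a.priority < b.priority;
        return a.seq > b.seq;
    }
};

class PriorityMailbox {
public:
    void push(int priority, std::string content) {
        queue_.push(Envelope{priority, next_seq_++, std::move(content)});
    }
    std::string pop() {
        assert(!queue_.empty());
        std::string content = queue_.top().content;
        queue_.pop();
        return content;
    }
    bool empty() const { return queue_.empty(); }

private:
    std::priority_queue<Envelope, std::vector<Envelope>, DeliveredLater> queue_;
    std::size_t next_seq_ = 0;
};
```

Pushing two work items with priority 0 and then a pause with priority 1 delivers the pause first and the work in its original order. As noted above, this still cannot interrupt a handler that is already running.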

Working Process Optimization

Spark-like architecture:

After careful consideration, I decided to use the Spark-like architecture mentioned in the 3rd Weekly Update to relieve the pressure on the Agent.

Connection between Agent and Workers:

After inspecting the network traffic of the Manager, I found that all packets between the Agent and the Workers go through the Manager. This is because the Agent and the Workers are all local actors: they don't expose themselves to the network, so they cannot find each other directly. Their messages must pass through an intermediate actor, which is the Manager. As a result, if the Manager shuts down unexpectedly, the bridge connecting the Agent and the Workers breaks and they cannot communicate with each other anymore.

To solve this problem, the Agent publishes itself after initialization, so that Workers can reach it directly instead of going through the Manager.

How should we design Workers (How large should an entity be)?

Just one type of Worker handles everything?

Pros:

fewer inter-Worker messages
easy to sync

Cons:

use slightly more space (only store the addresses of the functions in class instances)
high coupling?

Use different types of Worker to handle different operators?

Pros:

high traffic between actors

Cons:

Workers should be dynamically created during the workflow
Workers must know each other

5th Weekly Update

Remaining Problems

Priority-aware mailbox (This feature in CAF has a bug)

Solution 1: Use a delegate actor instead of calling remote_spawn.

Solution 2: Modify the source code to enable priority-aware mailbox by default.

template <class Handle, class T, class... Ts>
struct dyn_spawn_class_helper {
  Handle& result;
  actor_config& cfg;
  void operator()(Ts... xs) {
    CAF_ASSERT(cfg.host);
    //changing "no_spawn_options" to "priority_aware" may work
    result = cfg.host->system().spawn_class<T, no_spawn_options>(cfg, xs...);
  }
};

Instant pause

On normal workers: with the priority-aware mailbox enabled, they should be able to pause immediately.

On blocking workers (doable): create a monitor actor and a boolean flag that both the monitor actor and the blocking worker can access (but only the monitor actor can modify). Then use while(flag) in the worker's blocking behaviors.

code snippet:

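// Requires <atomic>. The flag is a std::atomic<bool> so that the worker's
// busy-wait and the Monitor's write do not form a data race.
class Blocker :public event_based_actor
{
public:
    Blocker(actor_config& cfg,actor a) : event_based_actor(cfg)
    {
        // never freed in this demo; the address is only valid inside this process
        flag_ptr = new atomic<bool>(true);
        send(a, reinterpret_cast<uintptr_t>(flag_ptr));
        send(this, 1);
    }

    behavior make_behavior() override {
        return
        {
            [=](int)
            {
                cout << "blocking start" << endl;
                // spin until the Monitor clears the flag
                while (flag_ptr->load())
                {
                        ;
                }
                cout << "blocking end" << endl;
            }
        };
    }
private:
    atomic<bool>* flag_ptr;
};

class Monitor :public blocking_actor
{
public:
    Monitor(actor_config& cfg) :blocking_actor(cfg){}
    void act() override
    {
        bool blocking=true;
        receive_while(blocking)
        (
            [&](bool flag)
            {
                // ignore updates that arrive before the flag address is known
                if (flag_ptr != nullptr)
                    flag_ptr->store(flag);
            },
            [&](uintptr_t flag)
            {
                flag_ptr = reinterpret_cast<atomic<bool>*>(flag);
            },
            [&](const exit_msg&)
            {
                blocking=false;
            }
        );
    }
private:
    atomic<bool>* flag_ptr = nullptr;
};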

6th Weekly Update

Shortcomings of CAF

Actor Creation

Hard to create actors remotely at runtime; actors must use extra space to store hostnames and their ports.
The feature of creating actors remotely is experimental and has a lot of limits. (No support for priority-aware mailboxes and class-based actors)
No stateless worker actors, which would be very useful for pipelining.

Message Passing

Streaming is under construction.

Language Limit

Hard to generate tuples dynamically according to the datatypes of rows in a file.
Hard to manage pointers.
No await, which may cause problems when running external packages in parallel.
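On the tuple point: C++ fixes the element types of `std::tuple` at compile time, which is why rows whose column types are only known after reading the file are awkward. One possible workaround, shown here only as a sketch and not evaluated in this issue, is a row of `std::variant` fields. The type tags ('i', 'd', 's') and the names `Field`, `Row`, `parse_field` and `parse_row` are assumptions made for the example. Whether such a row can be sent as a CAF message was not tested.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

using Field = std::variant<long, double, std::string>;
using Row = std::vector<Field>;

// Parse one text cell according to a type tag that is only known at run time:
// 'i' for integer, 'd' for floating point, anything else is kept as a string.
Field parse_field(char type, const std::string& text) {
    switch (type) {
        case 'i': return std::stol(text);
        case 'd': return std::stod(text);
        default:  return text;
    }
}

// Build a row from a schema string such as "ids" and the raw cells of one line.
Row parse_row(const std::string& types, const std::vector<std::string>& cells) {
    assert(types.size() == cells.size());
    Row row;
    for (std::size_t i = 0; i < cells.size(); ++i) {
        row.push_back(parse_field(types[i], cells[i]));
    }
    return row;
}
```

The cost of this design is that every access needs `std::get` or `std::visit`, so type errors surface at run time instead of at compile time.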

finished evaluation, close issue


C++ Actor Framework (CAF) evaluation #631
