akka / akka-meta

This repository is dedicated to high-level feature discussions and for persisting design decisions.
Apache License 2.0
201 stars 23 forks source link

Artery Design Ideas (new remoting) #16

Open patriknw opened 8 years ago

patriknw commented 8 years ago

A place to capture ideas for Artery, the new remoting.

patriknw commented 8 years ago

Artery Design

Aeron

The following sections assume that we are using Aeron as underlying transport layer.

We considered Aeron vs. using plain UDP and here are the pros and cons for using Aeron.

Pros:

Cons:

Quarantine

After quarantine we must not accept messages from the quarantined node. That is important because we must not deliver messages from an actor after it has been declared as terminated, i.e. after the Terminated message. That is what we often refer to as “no-zombies policy”.

It is also important to not send messages to a quarantined node. Otherwise it will get gossip updates, etc.

Quarantined nodes are explicitly notified of being quarantined.

ActorSelection should pass through quarantined state to allow probing for a restarted system.

Quarantine is performed when:

We still need the following existing configuration properties:

Handshake

When opening an association we need a handshake to exchange uid. We need the uid of the destination system to be able to quarantine it and also to properly distinguish system message redelivery/ack information coming from different system incarnations.

When the destination system is restarted and we continue to send to that address a new handshake should be initiated, i.e. new inbound association must be detected by the restarted system.

Such handshake must trigger quarantine and DeathWatchNotification of the old destination incarnation in the sending system. Gating

We should avoid the gating feature but it might be needed to protect against too frequent handshake attempts.

If we can’t establish the handshake for an outgoing connection it might make sense to gate that destination for a while, if the handshake attempt is a costly process.

Note that if the remote system initiates a handshake the gating, if any, on the other side must be removed and the handshake should be accepted.

System Messages

We must not drop system messages on the way, and if buffers get full we must quarantine.

Aeron delivery is reliable as long as the session is alive, very similar to a TCP connection. For long network partitions the session will timeout and be dropped, resulting in lost messages. This was confirmed and clarified by Martin Thompson.

Therefore we need acked delivery for system messages. It can take advantage of that the messages are ordered and a simple protocol would be:

We still need the following existing configuration properties:

Security

There is an Aeron security discussion in ticket: https://github.com/real-logic/Aeron/issues/203

One possible approach is that as part of the handshake negotiate a key with the remote system using a side-channel, using TCP+SSL and a Diffie-Hellman key-exchange (ECDH), then use symmetric encryption per sent message. The drawback is that with Aeron we have no access to the lowest level (just before packet hits UDP layer) and we can only encrypt whole messages (this leaks more information than doing this after the fragmenting/multiplexing layer). Other option is to extend the Aeron media driver to have access to the necessary layer.

We would like to avoid any kind of algorithm negotiation, if algorithms used turns out to be weak, the remoting protocol should be updated instead (explicitly breaking compatibility). Key renegotiation is also not planned.

As a first approach, every encrypted message should have

There is one drawback of the above setup is that the UID must be used to select the proper session key, before the HMAC has been verified (somewhat violating the Cryptographic Doom Principle ). No other operation is allowed to be performed on the message before the HMAC has been verified. The above design is almost a 1-to-1 correspondence with IPSec Encapsuating Security Payload with the notable difference that the sequence number is also encrypted.

Notes:

Sub-channels

System messages / priority lanes

System messages should be delivered over a separate sub-channel, since they have different delivery semantics and should not be interfered by ordinary messages.

Heartbeat messages for remote and cluster failure detection should be delivered over a separate sub-channel. They are ordinary messages, but should not be interfered by ordinary messages. This sub-channel is selected by message type (PriorityMessage marker trait) but it is not available for users’ messages.

Large messages

It should be possible to send bulk data, large messages, over a separate sub-channel to avoid head of the line blocking. This sub-channel is selected based on configured destination path.

General

For ordinary messages not falling into the above categories we can use one or several sub-channels. If we use several they would be selected by hashing on the destination ref to preserve send order per sender-receiver pair.

It should be possible to configure buffer sizes and max message size separately for each sub-channel type.

Pruning of associations

Inactive associations should be pruned. This is important because each channel/stream makes use of a considerable amount of memory. It could be a two step pruning where in the first step a shallow placeholder for the association remains but other memory usage is released. In the second step the placeholder can be removed after longer idle timeout.

Protocol Format

MessageEnvelope:

This will be encoded by hand or with SBE (needs investigation).

Handshake:

All information is in the MessageEnvelope so we only need an empty message type for the handshake. Same can be used for both request and response.

Another message type is needed for initiating a new handshake when receiving message from unknown uid.

ActorRef compression

We’ll find the most commonly communicated-with Actors (likely to use approximate data structure like Count-Min-Sketch), and the receiving end will advertise a mapping for that actor path to an identifier that it creates and stores locally in a table.

Wire format draft: is to allow either encoding ActorRef as number or the path string. That field is prefixed with a marker in which mode we are, e.g. [0][full/path/to/actor] or [1][42].

If the receiving node crashes, and gets messages from someone it must initiate a fresh handshake anyway, we piggyback onto the handshake to also mean “invalidate table”, as the new receiving end does not have a table prepared yet. We of course detect and see if a compressed ref came in and does not match the table, so we could log warnings too.

Configuration

Sane default values that should work for most users without changes.

We should try to have a few high level settings for choosing between different trade-offs, e.g. low latency vs. cpu usage. Low level settings should be derived from these high level settings.

Stable Protocol

We need a stable protocol to support rolling upgrades involving Akka version updates.

All Java serialization should be converted to protobuf serialization to make it easier to evolve the schema. We can include the new serializers in Akka 2.4.x, but not have them enabled by default. In Akka 3.0 they will be enabled by default.

Regression tests are needed.

Performance and Scalability

We aim for a solution that can handle high throughput, low latency, and good enough scalability.

Goals (given fast network, and good servers such as EC2 m4.4xlarge):

Overview of Stream Stages

artery_stages3

Out of Scope