basho-labs / riak-mesos-legacy

DEPRECATED: An Apache Mesos framework for Riak KV. Do not run in production.
Apache License 2.0
37 stars 10 forks source link

Rid ourselves of the beast that is EPMd #29

Closed sargun closed 9 years ago

sargun commented 9 years ago

Mesos prohibits cross executor interaction. Since only one executor will run epmd (at a time) this creates weird inter-cluster interaction issues.

jonmeredith commented 9 years ago

Would it be possible to run epmd outside of Mesos - kicked off by systemd/initd as part of the slave install?

sargun commented 9 years ago

No. That would violate the following Mesos design tenants:

  1. Executors MUST use dynamic ports. Ports MUST be configurable to the highest level possible. They should be requested explicitly through resource offers. In the case that dynamic ports can not be used, the Service should only launch a task if the port is available in the resource offer
  2. Service MUST NOT require anything to be pre-installed on a node. All dependencies MUST be shipped with your package. Although any given node in a given cluster might have arbitrary software installed, that software MUST NOT be relied upon. You MAY rely on the local installation of: ● A Linux kernel ● Mesos ● Docker As a common negative example, a JRE is not ​provided by DCOS. If your Service requires a JRE, you must ship one with your package.

Although we could ship an EPMd as a task to a slave before allowing Riak to run on it, it would violate pt (20).

randysecrist commented 9 years ago

Can we limit one Riak instance per node, and composite both the riak process and epmd together (as the service) controlled by the executor task?

Sounds like we have to ship EPMd with our package to best meet design principals?

jonmeredith commented 9 years ago

It's a tricky one - wonder how they solve that problem for anything that uses ONC/RPC - epmd is very similar in function to portmapper.

Also interesting about the JREs, obviously the framework scripts don't provide their own JREs looking at the chronos startup script.

sargun commented 9 years ago

@randysecrist Limiting ourselves to one Riak per node also violates the Mesos specification.

21. Executors of the same Service type MUST safely co-exist on a given
node.
Executors of the same Service type should have the ability to co­exist on the same node and
should have configurable options such that there is no interference.

We are already shipping EPMd with our process (Via the nice binary that @drewkerrigan and co. built with nodetool).

The problem is that when two Riaks run on one node. And one dies, along with its EPMd. Then that Riak becomes unreachable.

JeetKunDoug commented 9 years ago

How about setting the port differently for each? You can set the ERL_EPMD_PORT environment variable to change the port, which lets you run multiple independent epmd daemons, one for each riak.

You'd probably have to pick the port for the cluster you're building if I had to guess, but it might work?

See http://www.erlang.org/doc/man/epmd.html for more information - note you can also set ERL_EPMD_ADDRESS to a list of comma-separated IP addresses if you need to control which IPs it binds to.

jonmeredith commented 9 years ago

It would work, but would require changed rippled through nodetool too as we use a hidden node frequently to talk to the running node.

sargun commented 9 years ago

There is no guarantee that you'll have the same port available on all of the slaves. From my understanding, ERL_EPMD_PORT makes it a global thing.

Alternatively, we could replace erl_epmd, and have it talk to the store that the executor uses.

jonmeredith commented 9 years ago

epmd is like portmapper - it listens for the whole node and connects up ports for nodename@hostname.

You can have multiple epmds running, but they'll never see each other. node1@hostname on epmd port 1234 will not see node1@hostname on epmd port 4321

So you can use for clusters, but those clusters couldn't see one another.

On Thu, Jul 16, 2015 at 10:27 AM Sargun Dhillon notifications@github.com wrote:

No. That would violate the following Mesos design tenants:

1.

Executors MUST use dynamic ports. Ports MUST be configurable to the highest level possible. They should be requested explicitly through resource offers. In the case that dynamic ports can not be used, the Service should only launch a task if the port is available in the resource offer 2.

Service MUST NOT require anything to be pre-installed on a node. All dependencies MUST be shipped with your package. Although any given node in a given cluster might have arbitrary software installed, that software MUST NOT be relied upon. You MAY rely on the local installation of: ● A Linux kernel ● Mesos ● Docker As a common negative example, a JRE is not ​provided by DCOS. If your Service requires a JRE, you must ship one with your package.


Although we could ship an EPMd as a task to a slave before allowing Riak to run on it, it would violate pt (20).

— Reply to this email directly or view it on GitHub https://github.com/basho-labs/riak-mesos/issues/29#issuecomment-122011185 .

sargun commented 9 years ago

So, from experimentation, we can over ride the module erl_epmd (just put a new BEAM into basho-patches). We already have the executors running on the box. These executors talk to Zookeeper.

The API that erl_epmd exposes isn't particularly complicated. It could very easily use Zookeeper as a central store instead of EPMd. The APIs exposed at this point in initialization of the interpreter are minimal.

Opinions? I might be a crazy person.

sargun commented 9 years ago

cEPMd! Custom Erlang Port Mapper Daemon OR Crazy Erlang Port Mapper Daemon! Your pick!

All nodes that participate in Bletchley run our own code. This is a Go daemon. My plan is that all nodes that are within an instance of the framework participate in SWIM. The way they do this is by leveraging Hashicorp's SWIM library. When a scheduler comes up, it establishes a memberlist listener. When it launches a task, in that task's TaskData lies the hostname, and port which the scheduler's memberlist is running. In order to make this process somewhat more fault-tolerant, we can also supply some other framework tasks as seeds, in case for whatever reason the scheduler is unavailable. In addition, when the scheduler starts up, it tries to join all of the current tasks memberlist instances.

In order to proceed to starting the Riak Node, a task must be able to join the group membership service. This should prevent the situation from getting into a hairy split-brained situation. We can potentially also add some more fault-tolerance here, and say that it can also be considered to successfully join the cluster if it can reach more than 1/2 of the nodes.

In order to integrate into Erlang, we'll "monkey-patch" erl_epmd, by dropping in our own module that sits earlier in the path, so it gets loaded instead of the actual erl_epmd. This module provides the following APIs:

Most of these functions don't translate to cEPMd.

Instead, the way that the APIs work would be the following:

If we use virtual hostnames, in order to get address portability, we also need to implement a name resolution system that can leverage the same channel.

sargun commented 9 years ago

This is mostly done as part of https://github.com/basho-labs/riak-mesos/pull/49. One change was that more than erl_epmd had to be modified. erl_epmd, even in long names only gets the first part of the hostname. We had to change net_kernel, and inet_tcp_dist to pass the full node name.

Unfortuately, we use nodetool in order to monitor riak_kv starting, and running. This automatically starts up epmd no matter what: https://github.com/basho/node_package/blob/develop/priv/base/nodetool#L156-L174 -- we're going to replace the EPMd binary with a simple echo binary.