Closed sargun closed 9 years ago
Would it be possible to run epmd outside of Mesos - kicked off by systemd/initd as part of the slave install?
No. That would violate the following Mesos design tenants:
Although we could ship an EPMd as a task to a slave before allowing Riak to run on it, it would violate pt (20).
Can we limit one Riak instance per node, and composite both the riak process and epmd together (as the service) controlled by the executor task?
Sounds like we have to ship EPMd with our package to best meet design principals?
It's a tricky one - wonder how they solve that problem for anything that uses ONC/RPC - epmd is very similar in function to portmapper.
Also interesting about the JREs, obviously the framework scripts don't provide their own JREs looking at the chronos startup script.
@randysecrist Limiting ourselves to one Riak per node also violates the Mesos specification.
21. Executors of the same Service type MUST safely co-exist on a given
node.
Executors of the same Service type should have the ability to coexist on the same node and
should have configurable options such that there is no interference.
We are already shipping EPMd with our process (Via the nice binary that @drewkerrigan and co. built with nodetool).
The problem is that when two Riaks run on one node. And one dies, along with its EPMd. Then that Riak becomes unreachable.
How about setting the port differently for each? You can set the ERL_EPMD_PORT environment variable to change the port, which lets you run multiple independent epmd daemons, one for each riak.
You'd probably have to pick the port for the cluster you're building if I had to guess, but it might work?
See http://www.erlang.org/doc/man/epmd.html for more information - note you can also set ERL_EPMD_ADDRESS to a list of comma-separated IP addresses if you need to control which IPs it binds to.
It would work, but would require changed rippled through nodetool too as we use a hidden node frequently to talk to the running node.
There is no guarantee that you'll have the same port available on all of the slaves. From my understanding, ERL_EPMD_PORT makes it a global thing.
Alternatively, we could replace erl_epmd, and have it talk to the store that the executor uses.
epmd is like portmapper - it listens for the whole node and connects up ports for nodename@hostname.
You can have multiple epmds running, but they'll never see each other. node1@hostname on epmd port 1234 will not see node1@hostname on epmd port 4321
So you can use for clusters, but those clusters couldn't see one another.
On Thu, Jul 16, 2015 at 10:27 AM Sargun Dhillon notifications@github.com wrote:
No. That would violate the following Mesos design tenants:
1.
Executors MUST use dynamic ports. Ports MUST be configurable to the highest level possible. They should be requested explicitly through resource offers. In the case that dynamic ports can not be used, the Service should only launch a task if the port is available in the resource offer 2.
Service MUST NOT require anything to be pre-installed on a node. All dependencies MUST be shipped with your package. Although any given node in a given cluster might have arbitrary software installed, that software MUST NOT be relied upon. You MAY rely on the local installation of: ● A Linux kernel ● Mesos ● Docker As a common negative example, a JRE is not provided by DCOS. If your Service requires a JRE, you must ship one with your package.
Although we could ship an EPMd as a task to a slave before allowing Riak to run on it, it would violate pt (20).
— Reply to this email directly or view it on GitHub https://github.com/basho-labs/riak-mesos/issues/29#issuecomment-122011185 .
So, from experimentation, we can over ride the module erl_epmd (just put a new BEAM into basho-patches). We already have the executors running on the box. These executors talk to Zookeeper.
The API that erl_epmd exposes isn't particularly complicated. It could very easily use Zookeeper as a central store instead of EPMd. The APIs exposed at this point in initialization of the interpreter are minimal.
Opinions? I might be a crazy person.
cEPMd! Custom Erlang Port Mapper Daemon OR Crazy Erlang Port Mapper Daemon! Your pick!
All nodes that participate in Bletchley run our own code. This is a Go daemon. My plan is that all nodes that are within an instance of the framework participate in SWIM. The way they do this is by leveraging Hashicorp's SWIM library. When a scheduler comes up, it establishes a memberlist listener. When it launches a task, in that task's TaskData lies the hostname, and port which the scheduler's memberlist is running. In order to make this process somewhat more fault-tolerant, we can also supply some other framework tasks as seeds, in case for whatever reason the scheduler is unavailable. In addition, when the scheduler starts up, it tries to join all of the current tasks memberlist instances.
In order to proceed to starting the Riak Node, a task must be able to join the group membership service. This should prevent the situation from getting into a hairy split-brained situation. We can potentially also add some more fault-tolerance here, and say that it can also be considered to successfully join the cluster if it can reach more than 1/2 of the nodes.
In order to integrate into Erlang, we'll "monkey-patch" erl_epmd, by dropping in our own module that sits earlier in the path, so it gets loaded instead of the actual erl_epmd. This module provides the following APIs:
Most of these functions don't translate to cEPMd.
Instead, the way that the APIs work would be the following:
If we use virtual hostnames, in order to get address portability, we also need to implement a name resolution system that can leverage the same channel.
This is mostly done as part of https://github.com/basho-labs/riak-mesos/pull/49. One change was that more than erl_epmd had to be modified. erl_epmd, even in long names only gets the first part of the hostname. We had to change net_kernel, and inet_tcp_dist to pass the full node name.
Unfortuately, we use nodetool in order to monitor riak_kv starting, and running. This automatically starts up epmd no matter what: https://github.com/basho/node_package/blob/develop/priv/base/nodetool#L156-L174 -- we're going to replace the EPMd binary with a simple echo binary.
Mesos prohibits cross executor interaction. Since only one executor will run epmd (at a time) this creates weird inter-cluster interaction issues.