ProjectZetta / RemoteFutures

Remote execution of futures in a distributed system.
http://remotefutures.org

Nearly done ... #60

Closed MartinSenne closed 10 years ago

ghost commented 10 years ago

Thank you for the huge chunk of work. Looking at Travis, compilation apparently failed because of a ConfigurationException in DistributedWorkerSpec.scala.

Stack-trace:

DistributedWorkerSpec:

[info] Distributed workers
[info] - should perform work and publish results *** FAILED ***
[info]   akka.ConfigurationException: ActorSystem [akka://DistributedWorkerSpec] needs to have a 'ClusterActorRefProvider' enabled in the configuration, currently uses [akka.actor.LocalActorRefProvider]
[info]   at akka.cluster.Cluster.<init>(Cluster.scala:79)
[info]   at akka.cluster.Cluster$.createExtension(Cluster.scala:42)
[info]   at akka.cluster.Cluster$.createExtension(Cluster.scala:37)
[info]   at akka.actor.ActorSystemImpl.registerExtension(ActorSystem.scala:711)
[info]   at akka.actor.ExtensionId$class.apply(Extension.scala:79)
[info]   at akka.cluster.Cluster$.apply(Cluster.scala:37)
[info]   at org.remotefutures.core.impl.akka.pullingworker.DistributedWorkerSpec$$anonfun$1.apply$mcV$sp(DistributedWorkerSpec.scala:62)
[info]   at org.remotefutures.core.impl.akka.pullingworker.DistributedWorkerSpec$$anonfun$1.apply(DistributedWorkerSpec.scala:61)
[info]   at org.remotefutures.core.impl.akka.pullingworker.DistributedWorkerSpec$$anonfun$1.apply(DistributedWorkerSpec.scala:61)
[info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info]   ...

[error] Compilation failed
[error] Total time: 4 s, completed Sep 6, 2014 8:28:05 AM
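For context, this exception is Akka's standard complaint when a clustered ActorSystem is created from a configuration that still uses the default local provider. A minimal fix, assuming the spec loads its own configuration, would be to enable the cluster provider there (hostname/port values are illustrative):

```hocon
# Enable the cluster provider for the test ActorSystem (Akka 2.3-era config).
akka {
  actor {
    provider = "akka.cluster.ClusterActorRefProvider"
  }
  remote.netty.tcp {
    hostname = "127.0.0.1"
    port = 0   # pick a random free port, useful for tests
  }
}
```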

MartinSenne commented 10 years ago

Hey Marvin,

I'm still doing code changes :)

Right now, the controller concept is implemented. See multi-jvm -> MasterWorker for how it will work. A controller is responsible for starting up and shutting down a node.
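A minimal sketch of what such a controller interface might look like (the names are illustrative, not the actual RemoteFutures API):

```scala
// Hypothetical sketch of the controller concept: a controller owns the
// lifecycle (start-up and shut-down) of a single node.
trait NodeController {
  def startup(): Unit
  def shutdown(): Unit
}

// Example: a worker-node controller. A real implementation would create
// the node's ActorSystem in startup() and terminate it in shutdown().
final class WorkerNodeController extends NodeController {
  @volatile private var running = false
  def startup(): Unit  = { running = true }   // would bootstrap the ActorSystem here
  def shutdown(): Unit = { running = false }  // would terminate it here
  def isRunning: Boolean = running
}
```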

Cheers,

Martin

MartinSenne commented 10 years ago

Right now, reading the config file (remotefutures.conf) is not working.

I moved all configuration files into one (remotefutures.conf) in order to avoid having a separate config file (plus Akka application.conf) per node type.
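A merged remotefutures.conf along these lines could hold one section per node type; the layout and values below are purely illustrative, not the project's actual file:

```hocon
# Hypothetical single-file layout: one section per node type instead of
# a separate application.conf per node.
remotefutures {
  master {
    hostname = "127.0.0.1"
    port     = 2551
  }
  worker {
    contact-points = ["akka.tcp://Master@127.0.0.1:2551"]
  }
  frontend {
    # frontend-specific settings
  }
}
```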

MartinSenne commented 10 years ago

The test is also broken. I will fix that soon, so we can hopefully merge tonight.

ghost commented 10 years ago

Hi Martin,

merging everything into one config file certainly simplifies the setup in the long run. While reading through your latest changeset, I noticed your comment about ensuring that a cluster (master & worker) is up & running. Currently, using actors, there isn't terribly much one can do except define a meta-protocol for cluster status & change tracking. This gets really ugly and IMHO should not be resolved at this level of abstraction.

A different thought would be an ad-hoc registration mechanism built into a "node-runner" a level below, using reactive variables from Scala.Rx, which essentially gives you a protocol state machine for free. It does not need to be Scala.Rx, but the idea is simple:

1) Build a "Runner" that wraps master / nodes for "Start / Stop / Restart"
2) The "Runner" communicates in real time through a simple protocol
3) Everything has to be "below" the actor implementation to make sure it is not affected by any software crash.
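The runner idea could be sketched as a small state machine, independent of any actor code. This is a rough illustration only (no Scala.Rx, names invented for the sketch); a real runner would fork and supervise the node process:

```scala
// Rough sketch of the "Runner": a tiny state machine below the actor layer
// that reacts to Start/Stop/Restart commands.
sealed trait RunnerState
case object Stopped extends RunnerState
case object Running extends RunnerState

sealed trait Command
case object Start   extends Command
case object Stop    extends Command
case object Restart extends Command

object Runner {
  def step(state: RunnerState, cmd: Command): RunnerState = (state, cmd) match {
    case (Stopped, Start)   => Running   // launch the wrapped master/worker process
    case (Running, Stop)    => Stopped   // kill the process
    case (Running, Restart) => Running   // kill, then relaunch
    case (s, _)             => s         // ignore commands that do not apply
  }
}
```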

A good analogy would be the OBP (OpenBoot PROM) system on SPARC servers that monitors the hardware and boots the Solaris OS. The real value of OBP is its interactive interface for configuring, testing and debugging a box regardless of the OS status. Each SPARC server has a separate LAN port that connects the OBP to the admin network, so you can actually do remote debugging, diagnosis and hard restarts over ssh.

Going back to a tiny "Runner" might offer a few neat side-effects, such as:

The last point implies fully automatic error handling, which, after all, is devops-friendly.

In case the runner no longer responds, the container / AWS instance is presumed dead and should either be rebooted or blacklisted.

Just imagine you want to deploy a cluster of, let's say, ~15k nodes within one hour. How do we do that? Use a Docker/AWS template, flash it onto 15k boxes pre-configured to auto-start the runner, which then pulls a config saying which Jar to fetch from which source. Once the correct Jar is pulled, the runner just starts it; the Jar then, if it's a node, registers itself with a master through a domain name placed in the node config inside the Jar.

That would be one possible solution ensuring that a cluster (master & worker) is up, running and monitored in real-time regardless of the actual number of nodes.

Scala.Rx: https://github.com/lihaoyi/scala.rx
OBP intro: http://solarisfacts.blogspot.co.at/p/solaris-10-boot-process.html
Nailgun: http://martiansoftware.com/nailgun/

MartinSenne commented 10 years ago

Hi Marvin,

I'm not sure if I understand you completely.

What is the benefit of using Scala.Rx for deployment/provisioning/node start? I see no interconnection.

You speak about a runner (and you are totally right that this should stay away from the cluster (Akka etc.) stuff). But why do we need that?

I do not want to reinvent the wheel, as there is Docker to start/download containers and Chef for environment config. Are you going to implement a runner?

And even if we implement a custom runner, we are faced with the chicken-and-egg ("Henne-Ei") problem: we must at least start a daemon which receives commands to start/stop/restart the real application. This can be a custom one, but the ssh daemon also does a perfect job, IMO.

For now, I would stick to Docker and ssh. What is your opinion about that? Are there issues we do not cover with this approach?

One comment about commits:

I changed the implementation:

1) The NodeControllers instance is now created by fully qualified class name (FQCN) via reflection (see remotefutures.conf).
2) NodeControllers returns a specific NodeController dependent on the node type (worker, master, frontend).
3) Certain NodeControllers (like Frontend) provide the RemoteExecutionContext if it is available for that kind of node.

What do you think?
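The FQCN-via-reflection idea might look roughly like this; the type and method names below are illustrative stand-ins, not the actual RemoteFutures code:

```scala
// Sketch: the controller class name comes from configuration
// (remotefutures.conf) and is instantiated reflectively.
trait NodeController { def nodeType: String }
final class MasterController extends NodeController { val nodeType = "master" }

object NodeControllers {
  // fqcn would normally be read from remotefutures.conf
  def fromFqcn(fqcn: String): NodeController =
    Class.forName(fqcn)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[NodeController]
}
```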

At the moment, the master is not in an operable state, as no workers are started. Will proceed here today ...

Cheers, Martin

ghost commented 10 years ago

Hi Martin,

thank you for all your considerations. Answering your questions:

1) "What is the benefit of using Scala.Rx for deployment/provisioning/node start?"

Real-time event propagation with minimal fuss. We can leave it out.

2) "You speak about a runner (and you are totally right that this should stay away from the cluster (Akka etc.) stuff) But why do we need that?"

There is no "need" to have one; while thinking about cluster deployment, it was one possible solution. BTW, the runner concept is something I already sketched roughly two years ago for this project stage, before Docker took off.

3) "In the first place, I would stick to docker and ssh. What is your opinion about that?"

I am perfectly fine with using Docker, Chef and SSH instead. I am with you in terms of avoiding reinventing the wheel.

4) "Are there issues we do not cover with this approach?"

I guess we have to figure that out along the way. The only issue I can imagine at the moment is the "loss" of control sequences. Rethinking the loss of passing commands to nodes, I would say it's not a big deal. In the end, all you might need is to start, stop or restart a node, and Docker already does all this perfectly well.

I installed Docker today and, while working with it, I came to believe it will do the job for our deployment. There are already Ubuntu/JDK8 containers on Docker Hub, and over the next few days I am working on a reproducible setup that is usable for testing and, if things go well, for AWS deployment. I still need to read through all the docs and tutorials, but that's just a matter of doing it.

Any thoughts or comments?
