bitwalker / libcluster

Automatic cluster formation/healing for Elixir applications
MIT License

[WIP] Add Nomad strategy #181

Open mikenomitch opened 2 years ago

mikenomitch commented 2 years ago

Summary of changes

This PR adds a strategy for clustering on HashiCorp Nomad using Nomad's (relatively new) native service discovery feature.
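
For readers unfamiliar with the idea, here is a rough sketch of how service registrations could be turned into Erlang node names. It is not the code in this PR. It assumes each registration carries `Address` and `Port` fields, as in Nomad's service listing; the `node_names` helper, the `myapp` basename, and the sample payload are made up for illustration.

```python
import json


def node_names(registrations: list[dict], basename: str) -> list[str]:
    """Build sorted, de-duplicated Erlang node names from service registrations."""
    # Only the address is needed: the distribution port is looked up via epmd.
    addresses = {reg["Address"] for reg in registrations if reg.get("Address")}
    return sorted(f"{basename}@{addr}" for addr in addresses)


# Illustrative payload shaped like a service listing; the values are made up.
sample = json.loads("""
[
  {"ServiceName": "myapp", "Address": "10.0.0.5", "Port": 21034},
  {"ServiceName": "myapp", "Address": "10.0.0.7", "Port": 28811},
  {"ServiceName": "myapp", "Address": "10.0.0.5", "Port": 25110}
]
""")

print(node_names(sample, "myapp"))  # ['myapp@10.0.0.5', 'myapp@10.0.0.7']
```

A clustering strategy would poll for such a list periodically and connect to any names it has not seen yet.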

This is something I am planning to use on my own projects, and I figured I would open a WIP PR in case this is something you want to upstream. If so, I will clean up the code and add better testing, docs, and typespecs.

I am a bit biased (I work for HashiCorp), but I would love to see Nomad alongside K8s and Rancher as a natively supported orchestrator for libcluster. :)

If you're interested, let me know and I can make this prod-ready. If not, feel free to close it out and I'll just maintain my fork.

HammamSamara commented 2 years ago

Thanks for your work, @mikenomitch. As someone who is also looking for a strategy that works with Nomad, I would like to know whether you have used it in a production environment, and whether I can do so as well. Much appreciated!

mikenomitch commented 2 years ago

Hey @HammamSamara, I haven't gotten this working (even in testing) yet but I'm pretty sure it is close. Unfortunately, this was something I was doing for fun on parental leave and now I am back at work full time, so I haven't had a chance to get it done yet.

I believe I ended up getting stuck at the networking stage and couldn't get the apps connecting correctly. I think this was due to me misconfiguring either the Erlang port range (inet_dist_listen_min/inet_dist_listen_max), my AWS port security rules, or my Nomad networking rules. Got cut off mid-debug though!
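
For anyone picking up the debugging, a minimal sketch of a TCP probe that can help tell these three causes apart. It checks whether a peer's epmd port (4369 by default) and a pinned distribution port range are reachable from another host or container. The helper names are hypothetical, and it assumes the peer was started with a fixed inet_dist_listen_min/inet_dist_listen_max range.

```python
import socket

EPMD_PORT = 4369  # default port of the Erlang port mapper daemon


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def probe(host: str, dist_min: int, dist_max: int) -> dict[int, bool]:
    """Check epmd plus every port in the configured distribution range."""
    ports = [EPMD_PORT, *range(dist_min, dist_max + 1)]
    return {port: port_open(host, port) for port in ports}
```

Calling something like `probe("10.0.0.12", 9100, 9105)` (a made-up address and range) from a second machine shows which ports answer. If epmd is reachable but no port in the range is, the security rules or the Nomad port mapping are the more likely culprits than the Erlang settings.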

If you or anybody else wants to take a crack at this, I can push up my latest code and the jobspecs I was using to test it, in case that is helpful.

Side note: I think we'll get around to this task soon on the Nomad team, which should make it easier to open up the whole port range properly.

EDIT: for anybody who would be interested in running with this: Here is a link to my nomad jobspec and parts of my elixir app where I tried getting this working: https://gist.github.com/mikenomitch/14f3214789f5b3335b466b42721682e4

Also - hope you're liking Nomad! :)

HammamSamara commented 2 years ago

Thank you for the great details, and for pointing out the port range issue as well. This is my first time using Nomad with Elixir, and getting this to work is key to convincing Elixir teams to use it more often.

I will use your excellent gist as a starting point to test your work on my setup, and will post back once I have any tangible results.

P.S. Do you think it would make any difference to use a Docker image containing a production release instead of running Elixir in development mode (iex -S mix)? It's probably irrelevant, but I'm pointing it out anyway since I'm used to precompiled Elixir releases in production, which offer control over VM flags and commands for connecting to the running system remotely.

sukidhar commented 1 year ago

After going through some testing and understanding the problem, I have figured out that it has more to do with container networking than with the Erlang port range. I have personally tested this and found that Docker containers within the same network bridge are able to discover and connect to each other.

I found that Nomad assumes host-based networking by default. According to Docker's documentation, containers on the host are isolated from the network. To get them connected, the Nomad jobspec has to use bridge mode (in the case of the docker driver) and the network bridging has to be configured. Even if we use Consul and try a DNS lookup or its HTTP API instead of the Nomad HTTP API, the situation is the same as long as the containers are not bridged to each other. Kubernetes does some heavy lifting by setting up a network bridge between containers on multiple hosts. However, with the exec driver I had luck deploying Elixir applications without the network bridge approach, which seemed like a bit of a hassle to set up.