datacleaner / DataCleaner

The premier open source Data Quality solution
GNU Lesser General Public License v3.0

Automatic discovery of nodes in cluster module #28

Closed kaspersorensen closed 8 years ago

kaspersorensen commented 10 years ago

Right now our cluster setup in DataCleaner requires a configured list of node URLs in the cluster. This is fine if the cluster is constant, but it would be even better if nodes could be added automatically, so we should instead maintain a mutable list of nodes in the master(s) of the cluster.
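To make the idea concrete, here is a minimal sketch of what a mutable node list on the master might look like, assuming nodes announce themselves with periodic heartbeats and are dropped once they go silent. The class name `NodeRegistry`, its methods, and the timeout value are hypothetical illustrations and not part of DataCleaner's actual API:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical registry kept on the master; not an existing DataCleaner class.
class NodeRegistry {
    // Maps each node URL to the time of its most recent heartbeat.
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();
    private final Duration timeout;

    NodeRegistry(Duration timeout) {
        this.timeout = timeout;
    }

    // Called when a node joins the cluster or reports that it is still alive.
    public void heartbeat(String nodeUrl, Instant now) {
        lastSeen.put(nodeUrl, now);
    }

    // Removes nodes whose last heartbeat is older than the timeout.
    public void evictStale(Instant now) {
        lastSeen.entrySet().removeIf(
                e -> Duration.between(e.getValue(), now).compareTo(timeout) > 0);
    }

    // Returns the URLs of nodes currently considered alive, sorted for stable output.
    public List<String> activeNodes(Instant now) {
        evictStale(now);
        return lastSeen.keySet().stream().sorted().collect(Collectors.toList());
    }
}
```

The current time is passed in explicitly so the eviction logic can be exercised without waiting on a real clock. How nodes would actually reach the master (HTTP call, multicast, shared store) is left open here.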

Consider getting inspiration from discovery modules of e.g. ElasticSearch; described here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html

maphysics commented 9 years ago

Have you considered using sniffing on the transport port? The client will try to connect to other nodes advertised in the publish_address. I've seen some issues with firewall settings on hosted clusters. So it may be good to leave the ability to add hosts by hand too.

From [1]:

The client allows to sniff the rest of the cluster, and add those into its list of machines to use. In this case, note that the IP addresses used will be the ones that the other nodes were started with (the "publish" address). In order to enable it, set the client.transport.sniff to true:

    Settings settings = ImmutableSettings.settingsBuilder()
            .put("client.transport.sniff", true).build();
    TransportClient client = new TransportClient(settings);

[1] http://www.elastic.co/guide/en/elasticsearch/client/java-api/1.x/client.html

kaspersorensen commented 9 years ago

Ah, this is an interesting suggestion, but I think you may have misread the story as being about Elasticsearch clusters, which is not what I had in mind. I was talking about a DC monitor server cluster. In this server app we currently support setting up a cluster, but only by specifying the IP addresses of each slave on the master instance. The suggestion here was indeed to make the cluster configuration elastic, in the sense that more DC servers could dynamically join the cluster at runtime.

kaspersorensen commented 8 years ago

Closing this story since this functionality will be provided by Spark or YARN in DataCleaner 5.0.