fabric8io / fabric8

fabric8 is an open source microservices platform based on Docker, Kubernetes and Jenkins
http://fabric8.io/

Zookeeper Ensemble #1547

Open ghost opened 10 years ago

ghost commented 10 years ago

Hi all,

I have used the prepackaged distribution, and I have also downloaded and fully compiled the source. I still have the same issue.

How do I create a multi-node zookeeper ensemble? Here is as close as I have come:

Setting: three (virtual) hosts, ssh keys configured, and etc/system.properties modified to name the first container registry1.

Issue these commands:

fabric:create (successfully creates registry1)
fabric:container-create-ssh --host host2 --user root registry2
fabric:container-create-ssh --host host3 --user root registry3
shell:watch fabric:container-list (shows all are successful and running)

fabric:create --verbose --clean registry1 registry2 registry3

This successfully creates a three-node zookeeper ensemble, with one exception. Node 1 already had a zookeeper instance running, so the second instance on that node does not start on the standard port of 2181. Instead, that node uses port 2182.

What is the proper way to create a three or five node ensemble where all nodes have one instance of zookeeper running on the default port?
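
To confirm which port each node's zookeeper instance is actually answering on, something like the following could be used. This is only a minimal sketch and not part of fabric8: it assumes ZooKeeper's four-letter `ruok` command is enabled (newer ZooKeeper releases may require whitelisting it), and the function names and the candidate ports 2181/2182 are illustrative choices.

```python
import socket

DEFAULT_CLIENT_PORT = 2181  # ZooKeeper's conventional client port


def zk_is_ok(host, port=DEFAULT_CLIENT_PORT, timeout=2.0):
    """Return True if a server at host:port answers 'ruok' with 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            # The server is expected to reply and then close the connection.
            while True:
                data = sock.recv(64)
                if not data:
                    break
                chunks.append(data)
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False
    return b"".join(chunks) == b"imok"


def find_client_ports(hosts, ports=(2181, 2182)):
    """Map each host to the candidate ports on which a server replied 'imok'."""
    return {host: [p for p in ports if zk_is_ok(host, p)] for host in hosts}
```

Calling `find_client_ports(["host1", "host2", "host3"])` (placeholder host names) returns a dictionary showing, per host, which of the candidate ports replied.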

Thanks in advance for any thoughts,

P.S. I've been completely unsuccessful using fabric:ensemble-add so if that is the answer, can you please give specific details?

davsclaus commented 10 years ago

Have you tried creating only 2 and 3, and then joining them together? There is a fabric:join command.

ghost commented 10 years ago

Hi,

Thanks for the response. I have found the basics of creating an ensemble to be very frustrating. I have been successful, but it is not straightforward. Based on your comments above, I tried something new. Here is what I did and I'll discuss the issues. I have screenshots that I've attached as well:

Built fabric8 from source. This produces a .zip file that I copied to 5 virtual machines (running on 4 physical hosts).

1) Unzipped on all 5 hosts
2) Edited the etc/system.properties file to name each something better than root (used zoo0x for this example)
3) bin/start on each of the 5 hosts
4) bin/client on each of the 5 hosts (now ready to issue commands etc.)
5) on zoo01: fabric:create (wait for it to complete)
6) on zoo02-zoo05: fabric:join --zookeeper-password admin zoo01:2181 zoo0[2-5] (proper number for each). See the attached screenshot showing success:

screen shot 1

7) fabric:ensemble-add zoo02 zoo03 zoo04 zoo05 (and select yes at the prompt)
8) The resulting string shows the first node, zoo01, using a non-standard zookeeper port (see screenshot)

screen shot 2

9) For some reason two of the 5 failed to join. (see screenshot)

screen shot 3

10) Trying to stop the container with fabric:container-stop resulted in this error:

screen shot 4

11) After manually stopping and starting the service on zoo03 and zoo04, they came back up and I got success. See screenshot:

screen shot 5

12) However, when logging into the management web interface, I get varying versions of the information about what is installed. This changes over time (icons to the right):

screen shot 6
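
The check in step 8 can also be done mechanically. The sketch below assumes the string printed after the ensemble change is a standard ZooKeeper connect string of comma-separated host:port entries. The exact output is only visible in the screenshot, so the sample value in the comment is hypothetical, and a chroot suffix is not handled.

```python
DEFAULT_CLIENT_PORT = 2181  # ZooKeeper's conventional client port


def parse_connect_string(connect_string):
    """Split 'host1:2181,host2:2181' into a list of (host, port) tuples."""
    servers = []
    for entry in connect_string.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep:
            # No port given for this entry: assume the default client port.
            host, port = entry, DEFAULT_CLIENT_PORT
        servers.append((host, int(port)))
    return servers


def non_default_ports(connect_string, expected=DEFAULT_CLIENT_PORT):
    """Return the (host, port) pairs whose port differs from the expected one."""
    return [(h, p) for h, p in parse_connect_string(connect_string) if p != expected]


# Hypothetical string shaped like the situation described in step 8:
# non_default_ports("zoo01:2182,zoo02:2181,zoo03:2181,zoo04:2181,zoo05:2181")
```

An empty result would mean every listed server uses the default port, which is the layout being asked for here.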

Issues:

The first node in the cluster uses port 2182 instead of 2181 (probably because the previous zookeeper process was using 2181). However, this means it is difficult to manage with outside tools because one of the systems is running on a non-standard port. Is there any way to fix this?

If I try to specify the container name as the final (optional) argument in step 6 (instead of editing the etc/system.properties file), I lose contact.

Step 6 fails if you do not use the --zookeeper-password argument, even though I'm using the default "admin" password on all systems. This argument seems to be mandatory.

Once the cluster is up, I can watch it from the client with: shell:watch fabric:container-list. However, if I take nodes up and down in the cluster, the client will often freeze. This happens even though zk-smoketest shows that the zookeeper cluster is still running. I typically run the client in a shell on all 5 machines. The clients on all 4 remaining machines freeze when I make cluster changes. Sometimes they come back alive and sometimes they do not.
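
When taking nodes up and down, it may help to rule out loss of quorum first. ZooKeeper needs a strict majority of the ensemble to be up in order to keep serving. The following sketch just does that arithmetic; the function names are made up for illustration, and it does not by itself explain the client freezes described above.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size, servers_up):
    """True if the number of running servers still forms a majority."""
    return servers_up >= quorum_size(ensemble_size)
```

For the 5-node ensemble in this test, `quorum_size(5)` is 3, so up to 2 nodes can be stopped at once; a 3-node ensemble tolerates only 1.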

In summary... I'd like to see the process of creating the ensemble work more easily... it is at the core of the whole system... This is just my $0.00125. I love where this is going and I hope that my feedback may help. I'm not sure that I have the skill set to dive in and find out why things only partially work...