ClusterHQ / unofficial-flocker-tools

A tool to make installing Flocker with container orchestration tools easier and more fun
https://clusterhq.com/

uft-cluster-config hangs #51

Open exarkun opened 8 years ago

exarkun commented 8 years ago
$ uft-flocker-config cluster.yml
Initialized cluster CA.
Created control cert.
Generated fc43dc40-8a58-41c8-a664-05405f1e073b for ....
Generated 4c763764-834b-42a8-917c-9bac4817c02d for ....
Generated f96c7b9f-b7e2-46d1-b66b-33f62c3e568e for ....
Created user key for coreuser
Making /etc/flocker directory on all nodes
Uploading keys to respective nodes:
 * Uploaded control cert & key to control node.
 * Uploaded cluster cert to ....
 * Uploaded agent.yml to ....
 * Uploaded node key to ....
 * Uploaded node key to ....
 * Uploaded node crt to ....
 * Uploaded node crt to ...
 * Uploaded cluster cert to ....
 * Uploaded agent.yml to ....
 * Uploaded node key to ....
 * Uploaded cluster cert to ....
 * Uploaded control key to ....
 * Uploaded control crt to ....
 * Uploaded cluster cert to ....
 * Uploaded agent.yml to ....
 * Uploaded node crt to ....

After the last line, many minutes pass with no further progress. The ssh process being run by the uft-cluster-config container is blocked on select().

wallnerryan commented 8 years ago

Does this happen every time you run it? Just curious: I ran it last night and haven't come across this issue.

exarkun commented 8 years ago

If I try to run it again it fails with an error very quickly:

$ uft-flocker-config cluster.yml
Error: Unable to write certificate file. File exists /pwd/cluster.crt
main function encountered error
Traceback (most recent call last):
  File "/opt/flocker/bin/flocker-config", line 9, in <module>
    load_entry_point('UnofficialFlockerTools==0.5', 'console_scripts', 'flocker-config')()
  File "/opt/flocker/lib/python2.7/site-packages/unofficial_flocker_tools/config.py", line 147, in _main
    react(main, sys.argv[1:])
  File "/opt/flocker/lib/python2.7/site-packages/twisted/internet/task.py", line 875, in react
    finished = main(_reactor, *argv)
  File "/opt/flocker/lib/python2.7/site-packages/twisted/internet/defer.py", line 1253, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/opt/flocker/lib/python2.7/site-packages/twisted/internet/defer.py", line 1107, in _inlineCallbacks
    result = g.send(result)
  File "/opt/flocker/lib/python2.7/site-packages/unofficial_flocker_tools/config.py", line 21, in main
    c.run("flocker-ca initialize %s" % (c.config["cluster_name"],))
  File "/opt/flocker/lib/python2.7/site-packages/unofficial_flocker_tools/utils.py", line 208, in run
    result = subprocess.check_output(command, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command 'flocker-ca initialize jp_coreos_flocker_testing' returned non-zero exit status 1

Perhaps I'll move all the generated authentication files aside and see if that lets me run it again...

exarkun commented 8 years ago

If I do that then I get a different error:

Retrying running ['-o', 'LogLevel=error', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'StrictHostKeyChecking=no', '-i', '/host/tmp/flocker-coreos/jean-paulcalderoneinsecure-temporary.pem', 'root@....', "bash -c 'echo; echo\necho > /tmp/flocker-command-log\ndocker run --restart=always -d --net=host --privileged \\\n    -v /etc/flocker:/etc/flocker \\\n    -v /var/run/docker.sock:/var/run/docker.sock \\\n    --name=flocker-container-agent \\\n    clusterhq/flocker-container-agent\ndocker run --restart=always -d --net=host --privileged \\\n    -e DEBUG=1 \\\n    -v /tmp/flocker-command-log:/tmp/flocker-command-log \\\n    -v /flocker:/flocker -v /:/host -v /etc/flocker:/etc/flocker \\\n    -v /dev:/dev \\\n    --name=flocker-dataset-agent \\\n    clusterhq/flocker-dataset-agent\n'"] on .... given result 'Process exited with error code 1: \n\nError response from daemon: Conflict. The name "flocker-container-agent" is already in use by container e296946b20e2. You have to delete (or rename) that container to be able to reuse that name.\nError response from daemon: Conflict. The name "flocker-dataset-agent" is already in use by container 09b2ba7ae2e4. You have to delete (or rename) that container to be able to reuse that name.\n'...

wallnerryan commented 8 years ago

Yeah this is an issue because it doesn't clean up certs on re-run. +1 to making it re-runnable. Relates to or may duplicate https://github.com/ClusterHQ/unofficial-flocker-tools/issues/42
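
A minimal sketch of what "re-runnable" could look like for the two failures above, assuming Python 3. The helpers `run_once` and `docker_run_fresh` are hypothetical names for illustration and are not part of unofficial-flocker-tools. The idea is to skip certificate generation when the output file already exists, and to remove a container with a conflicting name before starting it again.

```python
import os
import subprocess


def run_once(marker_path, command, runner=subprocess.check_call):
    """Run `command` only if `marker_path` does not already exist.

    Hypothetical helper: returns True if the command ran, False if it
    was skipped because the marker file is already there.
    """
    if os.path.exists(marker_path):
        return False
    runner(command)
    return True


def docker_run_fresh(name, run_args, runner=subprocess.call):
    """Remove any existing container called `name`, then start a new one.

    Hypothetical helper: `docker rm -f` may exit non-zero when no such
    container exists, so its exit status is deliberately ignored.
    """
    runner(["docker", "rm", "-f", name])
    return runner(["docker", "run", "--name=" + name] + list(run_args))
```

For example, `run_once("cluster.crt", ["flocker-ca", "initialize", cluster_name])` would avoid the "File exists" error on a second run, at the cost of silently reusing whatever certificate is already on disk. Whether reusing or regenerating is the right behaviour is a design choice this sketch does not settle.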

exarkun commented 8 years ago

Nevertheless, uft-flocker-volumes now succeeds... Is the config command just expected to take a long time and not generate any progress updates?

wallnerryan commented 8 years ago

Yeah, it can take a little while. I've seen it hang before, but not for long (under 1 to 2 minutes). It would be nice to have progress bars or updates.

lukemarsden commented 8 years ago

I've seen it hang like this before. Maybe if we try and scp at just the wrong time, TCP screws us. We could add timeouts on each network operation.
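
A rough sketch of the timeout idea, assuming Python 3 (the `timeout` argument to `subprocess.check_output` does not exist in the Python 2.7 standard library that the tracebacks above show). `run_with_timeout` and `SSH_TIMEOUT_OPTS` are hypothetical names, not existing code in this repo. The `-o` settings are standard OpenSSH client options that bound connection setup and detect a dead peer.

```python
import subprocess
import time


def run_with_timeout(command, timeout=60, retries=3, delay=1.0):
    """Run `command`, killing it after `timeout` seconds, with retries.

    Hypothetical helper: returns the command's stdout on success and
    re-raises the last error once every attempt has failed.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return subprocess.check_output(command, timeout=timeout)
        except (subprocess.TimeoutExpired,
                subprocess.CalledProcessError) as e:
            last_error = e
            print("attempt %d/%d failed: %s" % (attempt, retries, e))
            if attempt < retries:
                time.sleep(delay)
    raise last_error


# OpenSSH options that limit how long ssh/scp wait on a stuck connection.
SSH_TIMEOUT_OPTS = [
    "-o", "ConnectTimeout=10",
    "-o", "ServerAliveInterval=5",
    "-o", "ServerAliveCountMax=3",
]
```

With something like this, a stalled upload would surface as a retry message after the timeout instead of an ssh process sitting in select() indefinitely. The specific timeout values here are placeholders.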

lukemarsden commented 8 years ago

Also, +1 for fixing #42