BD2KGenomics / cgcloud

Image and VM management for Jenkins, Spark and Mesos clusters in EC2

cluster authentication issues after register-key --force #175

Open Jeltje opened 8 years ago

Jeltje commented 8 years ago

I followed the READMEs for cgcloud-core and cgcloud-toil to set up on my (firewalled) podcloud VM.

Because I already had a key registered (from my old VM, which crashed and took its id_rsa.pub with it), I used cgcloud register-key --force ~/.ssh/id_rsa.pub

I then ran

cgcloud create-cluster --leader-instance-type m3.medium --instance-type c3.8xlarge --share shared/ --spot-bid 1.0 -s 1 toil

which failed at the rsync step that copies from shared/, so I ran the same command without the --share option. That time the cluster was created:

cgcloud list toil-leader

INFO: Using zone 'us-west-2a' and namespace '/jeltje.van.baren/'
i-abcb3770      jeltje.van.baren_toil-leader    0       172.31.31.92    52.40.118.17    i-abcb3770      2016-05-26T17:48:29.000Z        running

However, cgcloud ssh toil-leader fails with an ssh error (full error pasted below). I can't ping the machine either.

Ping and ssh to other machines work fine from the VM, so I'm assuming the authentication at EC2 is somehow messed up?

Full error:

INFO: Using zone 'us-west-2a' and namespace '/jeltje.van.baren/'
INFO: Binding to instance ...
INFO: ... waiting for instance i-abcb3770 ...
INFO: ... running, waiting for assignment of public IP ...
INFO: ... assigned, waiting for SSH port ...
INFO: ... open ...
INFO: ... instance ready.
Permission denied (publickey).
Traceback (most recent call last):
  File "/home/ubuntu/cgcloud/bin/cgcloud", line 9, in <module>
    load_entry_point('cgcloud-core==1.3.8', 'console_scripts', 'cgcloud')()
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cli.py", line 49, in main
    app.run( args )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/lib/util.py", line 300, in run
    command.run( options )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 81, in run
    return self.run_in_ctx( options, ctx )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 105, in run_in_ctx
    return self.run_on_role( options, ctx, role )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 124, in run_on_role
    return self.run_on_box( options, box )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 164, in run_on_box
    self.run_on_instance( options, box )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 232, in run_on_instance
    self.ssh( options, box )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 219, in ssh
    status = box.ssh( user=self._user( box, options ), command=options.command )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/box.py", line 1050, in ssh
    raise RuntimeError( 'ssh failed' )
RuntimeError: ssh failed

hannes-ucsc commented 8 years ago

Delete your instances. Delete your key pair in the EC2 console and try register-key again, but without --force.

Jeltje commented 8 years ago

I tried it. Same error:

INFO: === Copying the contents of /home/ubuntu/production/shared/ to ~/shared on leader ===
Connection closed by 52.40.186.164

hannes-ucsc commented 8 years ago

You didn't delete the key pair because I can still see the old one.

hannes-ucsc commented 8 years ago

You may also want to start from scratch with a new SSH key pair locally. Maybe the private key doesn't match the public key.
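
One way to test the suspicion above is to derive the public key from the private key and compare it with the stored .pub file. The sketch below assumes OpenSSH's ssh-keygen is on the PATH. The helper names (key_blob, same_key, private_matches_public) are made up for illustration and are not part of cgcloud. A passphrase-protected key will make ssh-keygen prompt for the passphrase.

```python
import subprocess


def key_blob(public_key_line):
    """Return (key type, base64 blob) of an OpenSSH public key line.

    The trailing comment (often an email address) is ignored, because it
    is not part of the key itself.
    """
    fields = public_key_line.split()
    if len(fields) < 2:
        raise ValueError("not an OpenSSH public key line")
    return fields[0], fields[1]


def same_key(line_a, line_b):
    """True if two public key lines carry the same key material."""
    return key_blob(line_a) == key_blob(line_b)


def private_matches_public(private_path, public_path):
    """Compare the key derived from a private key file with a .pub file.

    `ssh-keygen -y -f FILE` prints the public key that belongs to the
    private key in FILE.
    """
    derived = subprocess.check_output(
        ["ssh-keygen", "-y", "-f", private_path]
    ).decode()
    with open(public_path) as handle:
        stored = handle.read()
    return same_key(derived, stored)
```

Calling private_matches_public on the paths of the private key and of id_rsa.pub would return False if the two files do not belong together.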

Jeltje commented 8 years ago

I tried a few new key pairs, with and without password protection. I verified that the key pair fingerprint changed on EC2 after running register-key. Below is the error I get when I try to create a cluster using --share:

INFO: .
INFO: ... cloud-init done.
INFO: === Copying the contents of /home/ubuntu/production/shared/ to ~/shared on leader ===
Connection closed by 52.40.25.136
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(226) [sender=3.1.1]
INFO: Terminating instance ...
Traceback (most recent call last):
  File "/home/ubuntu/cgcloud/bin/cgcloud", line 9, in <module>
    load_entry_point('cgcloud-core==1.3.8', 'console_scripts', 'cgcloud')()
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cli.py", line 49, in main
    app.run( args )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/lib/util.py", line 300, in run
    command.run( options )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 115, in run
    super( CreateClusterCommand, self ).run( options )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 81, in run
    return self.run_in_ctx( options, ctx )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 37, in run_in_ctx
    self.run_on_cluster_type( ctx, options, cluster_type )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 121, in run_on_cluster_type
    self.run_on_role( options, ctx, self.cluster.leader_role )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 124, in run_on_role
    return self.run_on_box( options, box )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 471, in run_on_box
    box.terminate( wait=False )
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 467, in run_on_box
    self.run_on_creation( box, options )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 128, in run_on_creation
    leader.rsync( args=[ '-r', local_path, ":shared/" ], ssh_opts=options.ssh_opts )
  File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/box.py", line 1057, in rsync
    subprocess.check_call( [ 'rsync', '-e', ' '.join( ssh_args ) ] + args )
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['rsync', '-e', u'ssh mesosbox@ec2-52-40-25-136.us-west-2.compute.amazonaws.com -A', '-r', '/home/ubuntu/production/shared/', ':shared/']' returned non-zero exit status 12

Jeltje commented 8 years ago

When I start the cluster without --share, I can ssh ubuntu@52.40.39.137 just fine. But ssh mesosbox@52.40.39.137 gets Permission denied (publickey).

ssh -vvv mesosbox@52.40.39.137 full log output here

hannes-ucsc commented 8 years ago

What's CGCLOUD_KEYPAIRS set to?

Jeltje commented 8 years ago

On the toil-leader, cat /home/ubuntu/.ssh/authorized_keys shows two different ssh-rsa keys, both ending with my email. The second key matches my id_rsa.pub.

/home/mesosbox/.ssh/authorized_keys shows only the first key, which explains why it won't let me log on.
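
The manual comparison above can be sketched as a small check that a given public key appears in the contents of an authorized_keys file. This is an illustration only: authorized_key_blobs and is_authorized are hypothetical names, and the option handling is simplified (it looks for the first field that resembles a key type, which could be fooled by unusual option strings).

```python
KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-", "sk-")


def authorized_key_blobs(authorized_keys_text):
    """Yield (key type, base64 blob) for each key line in the text.

    Blank lines and '#' comments are skipped. Leading options, if any,
    are passed over by searching for the first field that looks like a
    key type.
    """
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        for index, field in enumerate(fields[:-1]):
            if field.startswith(KEY_TYPE_PREFIXES):
                yield field, fields[index + 1]
                break


def is_authorized(public_key_line, authorized_keys_text):
    """True if the key in public_key_line is listed in the file text."""
    fields = public_key_line.split()
    wanted = (fields[0], fields[1])
    return wanted in set(authorized_key_blobs(authorized_keys_text))
```

Fed the contents of both files, is_authorized would return True for the ubuntu user's file and False for the mesosbox one in the situation described above.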

Jeltje commented 8 years ago

CGCLOUD_KEYPAIRS on the master? Or on my VM? echo $CGCLOUD_KEYPAIRS gives nothing on either.

hannes-ucsc commented 8 years ago

Then you don't have it set.

hannes-ucsc commented 8 years ago

Upon investigation on the actual box, it turns out that dots in the namespace prevented cgcloudagent from creating the SQS queue. We should tweak the __me__ derivation to strip dots. We should also tighten the regex that validates namespaces to disallow dots.
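
A minimal sketch of the two proposed changes, for illustration only. It is not cgcloud's actual code: NAMESPACE_RE, is_valid_namespace and namespace_from_user are hypothetical names, and the exact character set cgcloud should accept is an assumption. The pattern is limited to characters that SQS queue names accept (letters, digits, hyphens and underscores).

```python
import re

# Hypothetical stricter pattern: a leading slash followed by zero or more
# components, each made of lowercase letters, digits, '_' or '-' and
# terminated by a slash. Dots are deliberately not allowed.
NAMESPACE_RE = re.compile(r"/([a-z0-9_-]+/)*")


def is_valid_namespace(namespace):
    """True if the whole namespace string matches the stricter pattern."""
    return NAMESPACE_RE.fullmatch(namespace) is not None


def namespace_from_user(user_name):
    """Derive a default namespace from a user name, stripping dots."""
    return "/" + user_name.replace(".", "") + "/"
```

With this sketch, a namespace such as /jeltje.van.baren/ would be rejected up front instead of failing later when the queue is created.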

The workaround for now is to set CGCLOUD_NAMESPACE=/foo/

Jeltje commented 8 years ago

Changing the namespace hasn't fixed the problem.

export CGCLOUD_NAMESPACE=/jeltje/
cgcloud create -IT toil-box
cgcloud create-cluster --leader-instance-type m3.medium --instance-type c3.8xlarge --spot-bid 1.0 -s 1 toil

cgcloud list toil-leader

INFO: Using zone 'us-west-2a' and namespace '/jeltje/'
i-19eef3b5      jeltje_toil-leader      0       172.31.46.57    52.34.135.67    i-19eef3b5      2016-05-27T16:40:23.000Z        running

But I can't ssh to it. Yesterday I was at least able to ssh ubuntu@52.34.135.67 (but not ssh mesosbox@52.34.135.67), but that no longer works either, so I can't see what's going on with the ssh keys.

hannes-ucsc commented 8 years ago

The most recent failure was the result of a misconfiguration on the user's end (multiple SSH agent instances).
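
When several ssh-agent instances are running, different shells can point at different agents through SSH_AUTH_SOCK, so a key added in one shell may be invisible in another. The sketch below is a hypothetical helper, not part of cgcloud. It only reports whether the agent socket named in a given environment exists, which is a first thing to check in each shell.

```python
import os
import stat


def agent_socket_status(env):
    """Classify the SSH agent socket referenced by an environment mapping.

    Returns 'unset' if SSH_AUTH_SOCK is absent or empty, 'missing' if the
    path cannot be stat'ed, 'not-a-socket' if it exists but is not a Unix
    socket, and 'ok' otherwise.
    """
    path = env.get("SSH_AUTH_SOCK")
    if not path:
        return "unset"
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return "missing"
    return "ok" if stat.S_ISSOCK(mode) else "not-a-socket"
```

Running agent_socket_status(os.environ) in each shell, followed by ssh-add -l to list the keys that agent holds, would show whether the shells agree on a single agent.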