Open Jeltje opened 8 years ago
Delete your instances. Delete your key pair in the EC2 console and try register-key again, but without --force.
I tried it. Same error:
INFO: === Copying the contents of /home/ubuntu/production/shared/ to ~/shared on leader ===
Connection closed by 52.40.186.164
You didn't delete the key pair because I can still see the old one.
You may also want to start from scratch with a new SSH key pair locally. Maybe the private key doesn't match the public key.
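One way to test the suspicion that the private key doesn't match the public key is to derive the public key from the private one with ssh-keygen -y and compare it to the stored .pub file. The following is a minimal sketch, not part of cgcloud; the helper names are made up for illustration, and it assumes ssh-keygen is on the PATH.

```python
import subprocess
from pathlib import Path


def key_fields(public_key_line):
    """Return (key type, base64 blob) of an OpenSSH public key line.

    The trailing comment (often an email address) is ignored, because it
    does not affect whether two keys are the same.
    """
    parts = public_key_line.split()
    if len(parts) < 2:
        raise ValueError("not an OpenSSH public key line")
    return parts[0], parts[1]


def derive_public_key(private_key_path):
    """Ask ssh-keygen for the public key belonging to a private key.

    ssh-keygen may prompt for a passphrase if the key is protected.
    """
    result = subprocess.run(
        ["ssh-keygen", "-y", "-f", str(private_key_path)],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def private_matches_public(private_key_path, public_key_path):
    """True if the .pub file holds the public half of the private key."""
    derived = derive_public_key(private_key_path)
    stored = Path(public_key_path).read_text().strip()
    return key_fields(derived) == key_fields(stored)
```

For the default key location this would be called as private_matches_public(Path.home() / ".ssh" / "id_rsa", Path.home() / ".ssh" / "id_rsa.pub"). A False result would suggest regenerating the key pair before running register-key again.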
I tried a few new key pairs, with and without password protection. I verified that the key pair fingerprint changed on EC2 after running register-key. Below is the error I get from trying to create a cluster using --shared
INFO: .
INFO: ... cloud-init done.
INFO: === Copying the contents of /home/ubuntu/production/shared/ to ~/shared on leader ===
Connection closed by 52.40.25.136
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(226) [sender=3.1.1]
INFO: Terminating instance ...
Traceback (most recent call last):
File "/home/ubuntu/cgcloud/bin/cgcloud", line 9, in <module>
load_entry_point('cgcloud-core==1.3.8', 'console_scripts', 'cgcloud')()
File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cli.py", line 49, in main
app.run( args )
File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/lib/util.py", line 300, in run
command.run( options )
File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 115, in run
super( CreateClusterCommand, self ).run( options )
File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 81, in run
return self.run_in_ctx( options, ctx )
File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 37, in run_in_ctx
self.run_on_cluster_type( ctx, options, cluster_type )
File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 121, in run_on_cluster_type
self.run_on_role( options, ctx, self.cluster.leader_role )
File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 124, in run_on_role
return self.run_on_box( options, box )
File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 471, in run_on_box
box.terminate( wait=False )
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/commands.py", line 467, in run_on_box
self.run_on_creation( box, options )
File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/cluster_commands.py", line 128, in run_on_creation
leader.rsync( args=[ '-r', local_path, ":shared/" ], ssh_opts=options.ssh_opts )
File "/home/ubuntu/cgcloud/local/lib/python2.7/site-packages/cgcloud/core/box.py", line 1057, in rsync
subprocess.check_call( [ 'rsync', '-e', ' '.join( ssh_args ) ] + args )
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['rsync', '-e', u'ssh mesosbox@ec2-52-40-25-136.us-west-2.compute.amazonaws.com -A', '-r', '/home/ubuntu/production/shared/', ':shared/']' returned non-zero exit status 12
When I start the cluster without --shared, I can ssh ubuntu@52.40.39.137 just fine. But ssh mesosbox@52.40.39.137 gets Permission denied (publickey).
ssh -vvv mesosbox@52.40.39.137: full log output here
What's CGCLOUD_KEYPAIRS set to?
On the toil-leader, cat /home/ubuntu/.ssh/authorized_keys shows two different ssh-rsa keys, both ending with my email. The second key matches my id_rsa.pub. /home/mesosbox/.ssh/authorized_keys shows only the first key, which explains why it won't let me log on.
CGCLOUD_KEYPAIRS on the master? Or on my VM? echo $CGCLOUD_KEYPAIRS gives nothing on either.
Then you don't have it set.
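The authorized_keys comparison described above (ubuntu has both keys, mesosbox only the first) can be scripted instead of eyeballed. This is a rough sketch with hypothetical helper names; it handles plain key lines and simple option prefixes, but not options that contain quoted spaces.

```python
def key_blobs(authorized_keys_text):
    """Map each key's base64 blob to its comment for authorized_keys content."""
    keys = {}
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        # A line may start with options, so look for the key type field;
        # the blob is the field right after it.
        for i, field in enumerate(fields[:-1]):
            if field.startswith(("ssh-", "ecdsa-", "sk-")):
                keys[fields[i + 1]] = " ".join(fields[i + 2:])
                break
    return keys


def missing_keys(reference_text, other_text):
    """Return keys that appear in reference_text but not in other_text."""
    reference = key_blobs(reference_text)
    other = key_blobs(other_text)
    return {blob: comment for blob, comment in reference.items() if blob not in other}
```

Feeding it the contents of /home/ubuntu/.ssh/authorized_keys as the reference and /home/mesosbox/.ssh/authorized_keys as the other file would list the keys that never reached the mesosbox account.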
Upon investigation on the actual box, it turns out that dots in the namespace prevented cgcloudagent from creating the SQS queue. We should tweak the __me__ derivation to strip dots. We should also tighten the regex that validates namespaces to disallow dots.
Workaround for now is to set CGCLOUD_NAMESPACE=/foo/
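The two proposed fixes could look roughly like the sketch below. The pattern and function names are hypothetical and are not taken from cgcloud's source; the sketch only assumes that namespaces are slash-delimited like /foo/ and that the derived name is a plain string. SQS queue names accept only alphanumeric characters, hyphens, and underscores, which is why a dot breaks queue creation.

```python
import re

# Hypothetical pattern, not cgcloud's actual one: a leading slash followed by
# zero or more components made of lowercase letters, digits, underscores, and
# hyphens, each terminated by a slash. Dots are deliberately not allowed.
NAMESPACE_RE = re.compile(r"/(?:[a-z0-9_-]+/)*")


def is_valid_namespace(namespace):
    """True if the whole string matches the dot-free namespace pattern."""
    return NAMESPACE_RE.fullmatch(namespace) is not None


def strip_dots(name):
    """Remove dots from a derived name before it is used in a namespace."""
    return name.replace(".", "")
```

With this pattern, /jeltje/ and /foo/ pass validation while a namespace such as /j.doe/ is rejected up front instead of failing later on the box.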
Changing the namespace hasn't fixed the problem.
export CGCLOUD_NAMESPACE=/jeltje/
cgcloud create -IT toil-box
cgcloud create-cluster --leader-instance-type m3.medium --instance-type c3.8xlarge --spot-bid 1.0 -s 1 toil
cgcloud list toil-leader
INFO: Using zone 'us-west-2a' and namespace '/jeltje/'
i-19eef3b5 jeltje_toil-leader 0 172.31.46.57 52.34.135.67 i-19eef3b5 2016-05-27T16:40:23.000Z running
But I can't ssh to it. Yesterday I was at least able to ssh ubuntu@52.34.135.67 (but not ssh mesosbox@52.34.135.67) but that no longer works either. So I can't see what's going on with the ssh keys.
Most recent failure was the result of misconfiguration on user's end (multiple SSH agent instances).
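Since the rsync step connects with agent forwarding (ssh ... -A in the traceback), it can help to check how many ssh-agent processes are running and which one SSH_AUTH_SOCK points at. The sketch below is only an illustration with made-up helper names; it assumes a POSIX ps is available.

```python
import os
import subprocess


def agent_pids(ps_output):
    """Return the PIDs of ssh-agent processes found in ps output.

    Expects one "PID COMMAND" pair per line, as printed by
    `ps -e -o pid= -o comm=`.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue
        # Some systems print the full path of the executable.
        if os.path.basename(fields[1].strip()) == "ssh-agent":
            pids.append(int(fields[0]))
    return pids


def running_agent_pids():
    """List ssh-agent PIDs on this machine by calling ps."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        check=True,
        capture_output=True,
        text=True,
    )
    return agent_pids(result.stdout)
```

If running_agent_pids() returns more than one PID, comparing os.environ.get("SSH_AUTH_SOCK") in each shell and running ssh-add -l there shows which agent actually holds the registered key.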
I followed the READMEs for cgcloud-core and cgcloud-toil to set up on my (firewalled) podcloud VM. Because I already had a key registered (from my old VM, which crashed and took its id_rsa.pub with it), I used
cgcloud register-key --force ~/.ssh/id_rsa.pub
cgcloud create-cluster --leader-instance-type m3.medium --instance-type c3.8xlarge --share shared/ --spot-bid 1.0 -s 1 toil
failed at the rsync step to copy from shared/, so I tried the same command without that option. The cluster was created: cgcloud list toil-leader
However, cgcloud ssh toil-leader gets an ssh error (full error pasted below). I can't ping the machine either. Ping and ssh to other machines work fine from the VM, so I'm assuming the authentication at EC2 is somehow messed up?
Full error: