amplab / training

Training materials for Strata, AMP Camp, etc

rsync: recv_generator: mkdir "/root/spark-ec2" failed: Permission denied (13) #149

Closed ahoffer closed 10 years ago

ahoffer commented 10 years ago

I am unable to diagnose this problem.

(The command I ran is the first line of the output below; the rsync error and the traceback appear at the end.)


ahoffer@ubuntu:~/repos/training-scripts$ ./spark-ec2 -i ~/.ssh/ampcamp-key.pem -k ampcamp-key -t m1.medium -u ec2-user --copy launch amplab-training
Setting up security groups...
Searching for existing cluster amplab-training...
Latest Spark AMI: ami-19474270
Launching instances...
Launched 5 slaves in us-east-1b, regid = r-fbe711da
Launched master in us-east-1b, regid = r-86f90fa7
Waiting for instances to start up...
Waiting 120 more seconds...
Copying SSH key /home/ahoffer/.ssh/ampcamp-key.pem to master...
ssh: connect to host ec2-54-83-73-166.compute-1.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/ahoffer/.ssh/ampcamp-key.pem ec2-user@ec2-54-83-73-166.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 60
ssh: connect to host ec2-54-83-73-166.compute-1.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/ahoffer/.ssh/ampcamp-key.pem ec2-user@ec2-54-83-73-166.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 60
ssh: connect to host ec2-54-83-73-166.compute-1.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/ahoffer/.ssh/ampcamp-key.pem ec2-user@ec2-54-83-73-166.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 60
Warning: Permanently added 'ec2-54-83-73-166.compute-1.amazonaws.com,54.83.73.166' (ECDSA) to the list of known hosts.
Connection to ec2-54-83-73-166.compute-1.amazonaws.com closed.
Connection to ec2-54-83-73-166.compute-1.amazonaws.com closed.
Cloning into 'spark-ec2'...
remote: Counting objects: 1371, done.
remote: Compressing objects: 100% (657/657), done.
remote: Total 1371 (delta 438), reused 1371 (delta 438)
Receiving objects: 100% (1371/1371), 214.25 KiB | 0 bytes/s, done.
Resolving deltas: 100% (438/438), done.
Connection to ec2-54-83-73-166.compute-1.amazonaws.com closed.
Deploying files to master...
sending incremental file list
root/spark-ec2/
rsync: recv_generator: mkdir "/root/spark-ec2" failed: Permission denied (13)
*** Skipping any contents from this failed directory ***

sent 115 bytes  received 171 bytes  114.40 bytes/sec
total size is 679  speedup is 2.37
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1183) [sender=3.1.0]
Traceback (most recent call last):
  File "./spark_ec2.py", line 926, in <module>
    main()
  File "./spark_ec2.py", line 767, in main
    setup_cluster(conn, master_nodes, slave_nodes, zoo_nodes, opts, True)
  File "./spark_ec2.py", line 423, in setup_cluster
    zoo_nodes, modules)
  File "./spark_ec2.py", line 691, in deploy_files
    subprocess.check_call(command, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'rsync -rv -e 'ssh -o StrictHostKeyChecking=no -i /home/ahoffer/.ssh/ampcamp-key.pem' '/tmp/tmpzzR6Qt/' 'ec2-user@ec2-54-83-73-166.compute-1.amazonaws.com:/'' returned non-zero exit status 23
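
For what it's worth, the tail of the traceback shows that deploy_files simply shells out to rsync via subprocess.check_call, so the rsync permission failure (exit code 23) aborts the whole setup. A rough reconstruction of that call, based only on the traceback above and not the actual spark_ec2.py source:

import subprocess

# Rough reconstruction from the traceback, not the real spark_ec2.py code.
# The staged files are rsync'd to "/" on the master, so they land under
# /root/spark-ec2; a non-root login such as ec2-user cannot create that
# directory, which produces the "Permission denied (13)" above.
command = (
    "rsync -rv -e 'ssh -o StrictHostKeyChecking=no -i ~/.ssh/ampcamp-key.pem' "
    "'/tmp/tmpzzR6Qt/' 'ec2-user@ec2-54-83-73-166.compute-1.amazonaws.com:/'"
)
subprocess.check_call(command, shell=True)  # raises CalledProcessError if rsync fails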

shivaram commented 10 years ago

From the following line, it looks like the machines didn't launch successfully. You can check the EC2 console to see if they are up.

ssh: connect to host ec2-54-83-73-166.compute-1.amazonaws.com port 22: Connection refused
ahoffer commented 10 years ago

The machines were running, and I could SSH to them. I had to use the spark-ec2 -u parameter to change the user from "root" to "ec2-user". Could that be related?

shivaram commented 10 years ago

Hmm, that is unusual, as the training AMIs should have SSH access enabled for the root user. Did you launch the cluster using the scripts?

ahoffer commented 10 years ago

I did use the scripts. I dug through the Python and found:

LATEST_AMI_URL = http://s3.amazonaws.com/ampcamp-amis/latest-ampcamp3

I am following the instructions found in the "mini-course":

http://ampcamp.berkeley.edu/big-data-mini-course-home/

ahoffer commented 10 years ago

Here is the command I used to clone the repository:

git clone git://github.com/amplab/training-scripts.git -b ampcamp4

I found it on this page:

http://ampcamp.berkeley.edu/big-data-mini-course/launching-a-bdas-cluster-on-ec2.html

ahoffer commented 10 years ago

It must have taken a LONG time for the SSH daemon to start. The instances were running for a couple of minutes before SSH would work. I edited spark_ec2.py to increase the timeouts from 30 seconds to 60 seconds and to increase the number of retries from 2 to 4. With those settings, SSH would usually become available on the third try.
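
In case it helps anyone else, the change boils down to waiting longer and retrying more often before giving up. A minimal sketch of that pattern (illustrative only, not the actual spark_ec2.py code; the helper name and defaults are mine):

import subprocess
import time

# Sketch of a "wait for sshd" retry loop. Not the real spark_ec2.py code;
# the retry count and sleep interval mirror the values I ended up using
# (4 attempts, 60 seconds apart).
def wait_for_ssh(host, key_file, user="root", retries=4, sleep_secs=60):
    cmd = [
        "ssh", "-t", "-o", "StrictHostKeyChecking=no",
        "-i", key_file, "%s@%s" % (user, host), "true",
    ]
    for _ in range(retries):
        if subprocess.call(cmd) == 0:
            return True          # sshd is up and accepting the key
        time.sleep(sleep_secs)   # give the instance more time to boot
    return False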

shivaram commented 10 years ago

Yeah EC2 startup times have been getting longer. I usually set the wait time to 180s to avoid this issue. You can also restart the scripts with the --resume option. That should retry things on machines that were launched before.
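
For example, reusing the flags from the commands earlier in this thread (the key file, key name, and cluster name below are just the ones used above), the relaunch and the resume would look roughly like:

./spark-ec2 -i ~/.ssh/ampcamp-key.pem -k ampcamp-key -t m1.medium -w 180 --copy launch amplab-training
./spark-ec2 -i ~/.ssh/ampcamp-key.pem -k ampcamp-key --copy --resume launch amplab-training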

ahoffer commented 10 years ago

Hi Shivaram, I'm getting close to making this work. I was able to launch the Spark cluster and SSH to the master. However, the data was not in the ephemeral HDFS.

I ran spark-ec2 with the action 'copy-data':

./spark-ec2 -i ~/.ssh/ampcamp-key.pem -k ampcamp-key --wait=15 --copy --s3-stats-bucket=S3_STATS_BUCKET copy-data amplab-training

The error I get says the data /ampcamp-data/movielens does not exist. Should the data be part of the AMI, or should it have been copied to the VM?

'ssh -t -o StrictHostKeyChecking=no -i /home/ahoffer/.ssh/ampcamp-key.pem root@ec2-54-242-164-169.compute-1.amazonaws.com '/root/ephemeral-hdfs/bin/hadoop fs -copyFromLocal /ampcamp-data/movielens /movielens'' returned non-zero exit status 255

ahoffer@ubuntu:~/repos/training-scripts$ ssh -t -o StrictHostKeyChecking=no -i /home/ahoffer/.ssh/ampcamp-key.pem root@ec2-54-242-164-169.compute-1.amazonaws.com '/root/ephemeral-hdfs/bin/hadoop fs -copyFromLocal /ampcamp-data/movielens /movielens'
copyFromLocal: File /ampcamp-data/movielens does not exist.
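
To narrow this down, one option (using the hostname and paths from the error above) is to SSH to the master, check whether the directory exists at all, and if it does, rerun the copy by hand:

ssh -i ~/.ssh/ampcamp-key.pem root@ec2-54-242-164-169.compute-1.amazonaws.com
ls /ampcamp-data
/root/ephemeral-hdfs/bin/hadoop fs -copyFromLocal /ampcamp-data/movielens /movielens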

ahoffer commented 10 years ago

I destroyed the cluster and used this command to rebuild it. It successfully copied the data:

./spark-ec2 -i ~/.ssh/ampcamp-key.pem -k ampcamp-key -t m1.medium --wait=360 \
  --s3-stats-bucket=S3_STATS_BUCKET --copy launch amplab

itissid commented 10 years ago

So I was playing with 0.9.1. I see that you are using the AMP Camp scripts here, but the same issue seems to exist in the spark-ec2 script distributed with Spark: rsync: recv_generator: mkdir "/root/spark-ec2" failed: Permission denied (13)

Note that this is not the SSH issue caused by EC2 latency. Do I just need to use the right AMI here, or is this a bug when the user is not root? In any case, shouldn't the scripts write to a location owned by the user passed to the spark-ec2 script?

shivaram commented 10 years ago

Unfortunately the scripts are not expected to work with any other AMI or other user ids.

ahoffer commented 10 years ago

What is the correct user id? root?

shivaram commented 10 years ago

root is the correct user id

xbsd commented 10 years ago

I have also been getting the same error message. Are the scripts on the site working for other users?

... Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /Users/raj/aws/key/raj.pem root@ec2-54-82-xx-xxx.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30

I am following the procedure at http://ampcamp.berkeley.edu/big-data-mini-course/launching-a-bdas-cluster-on-ec2.html

saurk commented 10 years ago

I just started with the AMP Camp training but am stuck at the SSH part. It seems the AMI isn't starting the SSH daemon, as I get this error even after increasing the timeout to 6 minutes:

skumar@ubuntu:~/trainingSpark/training-scripts$ ./spark-ec2 -i ~/demo-ohio.pem -k demo-ohio -w 360 --copy launch amplab-training
Setting up security groups...
Creating security group ampcamp3-master
Creating security group ampcamp3-slaves
Creating security group ampcamp3-zoo
Searching for existing cluster amplab-training...
Latest Spark AMI: ami-19474270
Launching instances...
Launched 5 slaves in us-east-1c, regid = r-2e5a8251
Launched master in us-east-1c, regid = r-e05b839f
Waiting for instances to start up...
Waiting 360 more seconds...
Copying SSH key /home/skumar/demo-ohio.pem to master...
ssh: connect to host ec2-54-205-20-241.compute-1.amazonaws.com port 22: Connection timed out
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/skumar/demo-ohio.pem root@ec2-54-205-20-241.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-54-205-20-241.compute-1.amazonaws.com port 22: Connection timed out
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/skumar/demo-ohio.pem root@ec2-54-205-20-241.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-54-205-20-241.compute-1.amazonaws.com port 22: Connection timed out
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/skumar/demo-ohio.pem root@ec2-54-205-20-241.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-54-205-20-241.compute-1.amazonaws.com port 22: Connection timed out
Traceback (most recent call last):
  File "./spark_ec2.py", line 925, in <module>
    main()
  File "./spark_ec2.py", line 766, in main
    setup_cluster(conn, master_nodes, slave_nodes, zoo_nodes, opts, True)
  File "./spark_ec2.py", line 406, in setup_cluster
    ssh(master, opts, 'mkdir -p ~/.ssh')
  File "./spark_ec2.py", line 712, in ssh
    raise e
subprocess.CalledProcessError: Command 'ssh -t -o StrictHostKeyChecking=no -i /home/skumar/demo-ohio.pem root@ec2-54-205-20-241.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255

xiejuncs commented 10 years ago

Two solutions based on the above answers and my own experience (I am in the west region):

  1. Set a larger wait interval by passing the -w parameter:

./spark-ec2 -i -k -w 540 --copy launch amplab-training

I tried 300 and it still failed, so I tried an even larger number.

  2. If it still fails after the first step, wait several minutes and try to SSH to the master machine. If that succeeds, resume the launch:

./spark-ec2 -i -k --copy --resume launch amplab-training

This command picks up the existing cluster and resumes the launch process.

Viewing the file spark_ec2.py can give you a taste of what the script does.

mrmcgrewx commented 7 years ago

To deal with the root issue and the region issue, I just created my own AMI, installed Spark, Scala, Ganglia, etc. (all the Spark dependencies needed for a cluster), and then modified the spark_ec2.py file; things should work fine afterwards.