amplab / training

Training materials for Strata, AMP Camp, etc
150 stars 121 forks

issue launching cluster for amplab tutorial #175

Open feffgroup opened 9 years ago

feffgroup commented 9 years ago

I'm following the recent amplab tutorial using my own AWS account. Cluster launch finishes with an error "ERROR: Cluster health check failed for spark_ec2". I'd be grateful for pointers on how to solve it or insight into what the error message means. Note that I added "-w" and "-z" flags to the launch command to avoid timeout and instance availability errors. I've cut and pasted the stdout lines that look like warnings or errors below. Please also take a look at the full stdout/stderr log here: https://gist.github.com/feffgroup/74a8c2789e582ada5150

```
bash-3.2$ ./spark-ec2 -i ~/aws/jorissen-account/jorissen/jorissen-us-east.pem -k jorissen-us-east --copy launch amplab-training -w 300 -z us-east-1c
Setting up security groups...
Searching for existing cluster amplab-training...
Latest Spark AMI: ami-19474270
Launching instances...
Launched 5 slaves in us-east-1c, regid = r-f775e51a
Launched master in us-east-1c, regid = r-2e74e4c3
Waiting for instances to start up...
Waiting 300 more seconds...
Copying SSH key /Users/jorissen/aws/jorissen-account/jorissen/jorissen-us-east.pem to master...
ssh: connect to host ec2-54-152-126-49.compute-1.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /Users/jorissen/aws/jorissen-account/jorissen/jorissen-us-east.pem root@ec2-54-152-126-49.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
```

[...]

```
Initializing ganglia
rmdir: failed to remove `/var/lib/ganglia/rrds': Not a directory
ln: creating symbolic link `/var/lib/ganglia/rrds': File exists
Connection to ec2-54-152-37-237.compute-1.amazonaws.com closed.
```

[...]

```
Setting up mesos
Pseudo-terminal will not be allocated because stdin is not a terminal.
```

[...]

Setting up training

[...]

```
Connection to ec2-54-152-86-73.compute-1.amazonaws.com closed.
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-54-152-104-184.compute-1.amazonaws.com closed.
ln: creating symbolic link `/var/lib/ganglia/conf/default.json': File exists
Shutting down GANGLIA gmetad: [FAILED]
Starting GANGLIA gmetad: [ OK ]
Stopping httpd: [FAILED]
Starting httpd: [ OK ]
Connection to ec2-54-152-126-49.compute-1.amazonaws.com closed.
Done!
Waiting for cluster to start...
Exception in opening the url http://ec2-54-152-126-49.compute-1.amazonaws.com:8080/json
ec2-54-152-105-0.compute-1.amazonaws.com: stopping org.apache.spark.deploy.worker.Worker
```

[...]

```
ERROR: Cluster health check failed for spark_ec2
bash-3.2$
```

Thanks very much.

feffgroup commented 9 years ago

(edited by OP to improve legibility)

anukoolrege commented 9 years ago

Running into the same issue with both east and west region AMIs.

```
Copying SSH key /home/anukool/Downloads/sparkwest1.pem to master...
ssh: connect to host ec2-54-67-93-194.us-west-1.compute.amazonaws.com port 22: Connection refused
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i /home/anukool/Downloads/sparkwest1.pem root@ec2-54-67-93-194.us-west-1.compute.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-54-67-93-194.us-west-1.compute.amazonaws.com port 22: Connection refused
```

Aerlinger commented 9 years ago

Getting this exact same error as well...

sharmadp commented 9 years ago

I am using Git on a Windows 7 machine. The file permission on the .pem file is -rw-r--r--.

I am getting the following error while connecting to the cluster:

```
Copying SSH key sparkstream.pem to master...
ssh: connect to host ec2-52-21-237-149.compute-1.amazonaws.com port 22: Bad file number
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i sparkstream.pem root@ec2-52-21-237-149.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-52-21-237-149.compute-1.amazonaws.com port 22: Bad file number
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i sparkstream.pem root@ec2-52-21-237-149.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-52-21-237-149.compute-1.amazonaws.com port 22: Bad file number
Error connecting to host Command 'ssh -t -o StrictHostKeyChecking=no -i sparkstream.pem root@ec2-52-21-237-149.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255, sleeping 30
ssh: connect to host ec2-52-21-237-149.compute-1.amazonaws.com port 22: Bad file number
Traceback (most recent call last):
  File "./spark_ec2.py", line 925, in <module>
    main()
  File "./spark_ec2.py", line 766, in main
    setup_cluster(conn, master_nodes, slave_nodes, zoo_nodes, opts, True)
  File "./spark_ec2.py", line 406, in setup_cluster
    ssh(master, opts, 'mkdir -p ~/.ssh')
  File "./spark_ec2.py", line 712, in ssh
    raise e
subprocess.CalledProcessError: Command 'ssh -t -o StrictHostKeyChecking=no -i sparkstream.pem root@ec2-52-21-237-149.compute-1.amazonaws.com 'mkdir -p ~/.ssh'' returned non-zero exit status 255
```

Skeftical commented 8 years ago

I was getting this error as well. It turned out that I was able to manually SSH into the master using its IP address, which you can find in the AWS dashboard under EC2 instances. Instead of "ssh -i yourkey.pem root@hostname", use "ssh -i yourkey.pem root@ipaddress". Once I did that, the host must have been automatically added to my known hosts list, and I was then able to rerun the setup with a --resume.
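For anyone scripting that workaround, here is a minimal Python sketch that rebuilds the same ssh invocation seen in the logs above, aimed at an explicit address instead of the public DNS name. The key path and IP below are placeholders, not values from this thread:

```python
# Rebuild the ssh command spark_ec2.py runs (flags copied from the error
# output in this thread), targeting an explicit address (IP or hostname).
def ssh_command(key_path, address, remote_cmd="mkdir -p ~/.ssh"):
    return ["ssh", "-t", "-o", "StrictHostKeyChecking=no",
            "-i", key_path, "root@" + address, remote_cmd]

# Example: point at the master's IP from the EC2 dashboard.
print(" ".join(ssh_command("yourkey.pem", "54.152.126.49")))
```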

alphago-au commented 8 years ago

You may need to increase the timeout in the function wait_for_spark_cluster: replace time.sleep(5) with time.sleep(120).

This error occurs when the Spark cluster has not finished starting; you may need to give it more time to come up.
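The suggestion amounts to letting the health-check loop wait longer between polls before giving up. A rough sketch of that pattern (the signature and retry count here are illustrative, not the actual spark_ec2.py code):

```python
import time

def wait_for_spark_cluster(is_cluster_up, delay=120, max_tries=10):
    """Poll the cluster until it reports healthy, sleeping `delay`
    seconds between attempts (the stock script sleeps only 5)."""
    for _ in range(max_tries):
        if is_cluster_up():
            return True
        time.sleep(delay)
    return False
```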

This works for me.

seufagner commented 8 years ago

Which Spark version are you using, @alphago-au?