amplab / training

Training materials for Strata, AMP Camp, etc

Cluster launch failed with connection timeout error during check_spark_cluster #143

Closed hardik-pandya closed 10 years ago

hardik-pandya commented 10 years ago

./spark-ec2 -k sparkamplab -i sparkamplab.pem --resume launch amplab-training
Searching for existing cluster amplab-training...
Found 1 master(s), 5 slaves, 0 ZooKeeper nodes
Copying SSH key sparkamplab.pem to master...
Connection to ec2-54-84-255-84.compute-1.amazonaws.com closed.
Connection to ec2-54-84-255-84.compute-1.amazonaws.com closed.
Cloning into 'spark-ec2'...
remote: Counting objects: 1328, done.
remote: Compressing objects: 100% (632/632), done.
remote: Total 1328 (delta 423), reused 1328 (delta 423)
Receiving objects: 100% (1328/1328), 207.13 KiB, done.
Resolving deltas: 100% (423/423), done.
Connection to ec2-54-84-255-84.compute-1.amazonaws.com closed.
Deploying files to master...
sending incremental file list
root/spark-ec2/ec2-variables.sh

sent 977 bytes  received 45 bytes  681.33 bytes/sec
total size is 842  speedup is 0.82
Running setup on master...
Connection to ec2-54-84-255-84.compute-1.amazonaws.com closed.
Setting up Spark on ip-172-31-30-223.ec2.internal...
cp: cannot create regular file `/root/mesos-ec2/': Is a directory
cp: cannot create regular file `/root/mesos-ec2/': Is a directory
Setting executable permissions on scripts...
Running setup-slave on master to mount filesystems, etc...
Setting up slave on ip-172-31-30-223.ec2.internal...
/mnt/swap already exists
SSH'ing to master machine(s) to approve key(s)...
ec2-54-84-255-84.compute-1.amazonaws.com
Warning: Permanently added 'ec2-54-84-255-84.compute-1.amazonaws.com,172.31.30.223' (RSA) to the list of known hosts.
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Warning: Permanently added 'ip-172-31-30-223.ec2.internal' (RSA) to the list of known hosts.
SSH'ing to other cluster nodes to approve keys...
ec2-54-84-255-85.compute-1.amazonaws.com
Warning: Permanently added 'ec2-54-84-255-85.compute-1.amazonaws.com,172.31.24.43' (RSA) to the list of known hosts.
ec2-54-84-255-86.compute-1.amazonaws.com
Warning: Permanently added 'ec2-54-84-255-86.compute-1.amazonaws.com,172.31.24.44' (RSA) to the list of known hosts.
ec2-54-84-255-87.compute-1.amazonaws.com
Warning: Permanently added 'ec2-54-84-255-87.compute-1.amazonaws.com,172.31.24.45' (RSA) to the list of known hosts.
ec2-54-84-255-69.compute-1.amazonaws.com
Warning: Permanently added 'ec2-54-84-255-69.compute-1.amazonaws.com,172.31.24.47' (RSA) to the list of known hosts.
ec2-54-84-255-88.compute-1.amazonaws.com
Warning: Permanently added 'ec2-54-84-255-88.compute-1.amazonaws.com,172.31.24.46' (RSA) to the list of known hosts.
RSYNC'ing /root/spark-ec2 to other cluster nodes...
ec2-54-84-255-85.compute-1.amazonaws.com
id_rsa 100% 1692 1.7KB/s 00:00
ec2-54-84-255-86.compute-1.amazonaws.com
id_rsa 100% 1692 1.7KB/s 00:00
ec2-54-84-255-87.compute-1.amazonaws.com
id_rsa 100% 1692 1.7KB/s 00:00
ec2-54-84-255-69.compute-1.amazonaws.com
id_rsa 100% 1692 1.7KB/s 00:00
ec2-54-84-255-88.compute-1.amazonaws.com
id_rsa 100% 1692 1.7KB/s 00:00
Running slave setup script on other cluster nodes...
ec2-54-84-255-85.compute-1.amazonaws.com
Setting up slave on ip-172-31-24-43.ec2.internal...
/mnt/swap already exists
Connection to ec2-54-84-255-85.compute-1.amazonaws.com closed.
ec2-54-84-255-86.compute-1.amazonaws.com
Setting up slave on ip-172-31-24-44.ec2.internal...
/mnt/swap already exists
Connection to ec2-54-84-255-86.compute-1.amazonaws.com closed.
ec2-54-84-255-87.compute-1.amazonaws.com
Setting up slave on ip-172-31-24-45.ec2.internal...
/mnt/swap already exists
Connection to ec2-54-84-255-87.compute-1.amazonaws.com closed.
ec2-54-84-255-69.compute-1.amazonaws.com
Setting up slave on ip-172-31-24-47.ec2.internal...
/mnt/swap already exists
Connection to ec2-54-84-255-69.compute-1.amazonaws.com closed.
ec2-54-84-255-88.compute-1.amazonaws.com
Setting up slave on ip-172-31-24-46.ec2.internal...
/mnt/swap already exists
Connection to ec2-54-84-255-88.compute-1.amazonaws.com closed.
cp: cannot create regular file `/root/mesos-ec2/': Is a directory
Initializing ephemeral-hdfs
Initializing persistent-hdfs
Initializing mesos
Initializing spark-standalone
Initializing training
Initializing ganglia
rmdir: failed to remove `/var/lib/ganglia/rrds': Not a directory
Connection to ec2-54-84-255-85.compute-1.amazonaws.com closed.
Connection to ec2-54-84-255-86.compute-1.amazonaws.com closed.
Connection to ec2-54-84-255-87.compute-1.amazonaws.com closed.
Connection to ec2-54-84-255-69.compute-1.amazonaws.com closed.
Connection to ec2-54-84-255-88.compute-1.amazonaws.com closed.
Creating local config files...
Connection to ec2-54-84-255-85.compute-1.amazonaws.com closed.
Configuring /etc/ganglia/gmond.conf
Configuring /etc/ganglia/gmetad.conf
Configuring /etc/httpd/conf/httpd.conf
Configuring /etc/httpd/conf.d/ganglia.conf
Configuring /root/spark/conf/spark-env.sh
Configuring /root/spark-ec2/mesos/hadoop-framework-conf/core-site.xml
Configuring /root/spark-ec2/mesos/hadoop-framework-conf/hadoop-env.sh
Configuring /root/spark-ec2/mesos/hadoop-framework-conf/mapred-site.xml
Configuring /root/spark-ec2/mesos/hypertable/Capfile
Configuring /root/spark-ec2/mesos/hypertable/hypertable.cfg
Configuring /root/spark-ec2/mesos/haproxy+apache/haproxy.config.template
Configuring /root/ephemeral-hdfs/conf/core-site.xml
Configuring /root/ephemeral-hdfs/conf/masters
Configuring /root/ephemeral-hdfs/conf/hadoop-env.sh
Configuring /root/ephemeral-hdfs/conf/hadoop-metrics2.properties
Configuring /root/ephemeral-hdfs/conf/slaves
Configuring /root/ephemeral-hdfs/conf/hdfs-site.xml
Configuring /root/ephemeral-hdfs/conf/mapred-site.xml
Configuring /root/zookeeper-3.4.5/conf/zoo.cfg
Configuring /root/persistent-hdfs/conf/core-site.xml
Configuring /root/persistent-hdfs/conf/masters
Configuring /root/persistent-hdfs/conf/hadoop-env.sh
Configuring /root/persistent-hdfs/conf/slaves
Configuring /root/persistent-hdfs/conf/hdfs-site.xml
Configuring /root/persistent-hdfs/conf/mapred-site.xml
Deploying Spark config files...
RSYNC'ing /root/spark/conf to slaves...
ec2-54-84-255-85.compute-1.amazonaws.com
ec2-54-84-255-86.compute-1.amazonaws.com
ec2-54-84-255-87.compute-1.amazonaws.com
ec2-54-84-255-69.compute-1.amazonaws.com
ec2-54-84-255-88.compute-1.amazonaws.com
Setting up ephemeral-hdfs
~/spark-ec2/ephemeral-hdfs ~/spark-ec2
ec2-54-84-255-85.compute-1.amazonaws.com
Connection to ec2-54-84-255-85.compute-1.amazonaws.com closed.
ec2-54-84-255-86.compute-1.amazonaws.com
Connection to ec2-54-84-255-86.compute-1.amazonaws.com closed.
ec2-54-84-255-87.compute-1.amazonaws.com
Connection to ec2-54-84-255-87.compute-1.amazonaws.com closed.
ec2-54-84-255-69.compute-1.amazonaws.com
Connection to ec2-54-84-255-69.compute-1.amazonaws.com closed.
ec2-54-84-255-88.compute-1.amazonaws.com
Connection to ec2-54-84-255-88.compute-1.amazonaws.com closed.
RSYNC'ing /root/ephemeral-hdfs/conf to slaves...
ec2-54-84-255-85.compute-1.amazonaws.com
ec2-54-84-255-86.compute-1.amazonaws.com
ec2-54-84-255-87.compute-1.amazonaws.com
ec2-54-84-255-69.compute-1.amazonaws.com
ec2-54-84-255-88.compute-1.amazonaws.com
Hadoop namenode appears to be formatted: skipping
Starting ephemeral HDFS...
namenode running as process 1958. Stop it first.
ec2-54-84-255-87.compute-1.amazonaws.com: datanode running as process 1718. Stop it first.
ec2-54-84-255-86.compute-1.amazonaws.com: datanode running as process 1717. Stop it first.
ec2-54-84-255-85.compute-1.amazonaws.com: datanode running as process 1720. Stop it first.
ec2-54-84-255-88.compute-1.amazonaws.com: datanode running as process 1715. Stop it first.
ec2-54-84-255-69.compute-1.amazonaws.com: datanode running as process 1723. Stop it first.
ec2-54-84-255-84.compute-1.amazonaws.com: secondarynamenode running as process 2120. Stop it first.
~/spark-ec2
Setting up persistent-hdfs
~/spark-ec2/persistent-hdfs ~/spark-ec2
Pseudo-terminal will not be allocated because stdin is not a terminal.
Pseudo-terminal will not be allocated because stdin is not a terminal.
Pseudo-terminal will not be allocated because stdin is not a terminal.
Pseudo-terminal will not be allocated because stdin is not a terminal.
Pseudo-terminal will not be allocated because stdin is not a terminal.
RSYNC'ing /root/persistent-hdfs/conf to slaves...
ec2-54-84-255-85.compute-1.amazonaws.com
ec2-54-84-255-86.compute-1.amazonaws.com
ec2-54-84-255-87.compute-1.amazonaws.com
ec2-54-84-255-69.compute-1.amazonaws.com
ec2-54-84-255-88.compute-1.amazonaws.com
Starting persistent HDFS...
namenode running as process 2150. Stop it first.
ec2-54-84-255-85.compute-1.amazonaws.com: datanode running as process 1809. Stop it first.
ec2-54-84-255-87.compute-1.amazonaws.com: datanode running as process 1807. Stop it first.
ec2-54-84-255-88.compute-1.amazonaws.com: datanode running as process 1804. Stop it first.
ec2-54-84-255-69.compute-1.amazonaws.com: datanode running as process 1812. Stop it first.
ec2-54-84-255-86.compute-1.amazonaws.com: datanode running as process 1807. Stop it first.
ec2-54-84-255-84.compute-1.amazonaws.com: secondarynamenode running as process 2376. Stop it first.
~/spark-ec2
Setting up mesos
Pseudo-terminal will not be allocated because stdin is not a terminal.
Pseudo-terminal will not be allocated because stdin is not a terminal.
Pseudo-terminal will not be allocated because stdin is not a terminal.
Pseudo-terminal will not be allocated because stdin is not a terminal.
Pseudo-terminal will not be allocated because stdin is not a terminal.
./mesos/setup.sh: line 12: /root/zookeeper-3.4.5/bin/zkServer.sh: No such file or directory
./mesos/setup.sh: line 15: /root/zookeeper-3.4.5/bin/zkServer.sh: No such file or directory
Starting Mesos master
Starting Mesos slaves
Pseudo-terminal will not be allocated because stdin is not a terminal.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    40  100    40    0     0  30143      0 --:--:-- --:--:-- --:--:-- 40000
Starting mesos slave on ip-172-31-24-43.ec2.internal
Pseudo-terminal will not be allocated because stdin is not a terminal.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    40  100    40    0     0  32388      0 --:--:-- --:--:-- --:--:-- 40000
Starting mesos slave on ip-172-31-24-44.ec2.internal
Pseudo-terminal will not be allocated because stdin is not a terminal.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    40  100    40    0     0  32414      0 --:--:-- --:--:-- --:--:-- 40000
Starting mesos slave on ip-172-31-24-45.ec2.internal
Pseudo-terminal will not be allocated because stdin is not a terminal.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    40  100    40    0     0  24009      0 --:--:-- --:--:-- --:--:-- 40000
Starting mesos slave on ip-172-31-24-47.ec2.internal
Pseudo-terminal will not be allocated because stdin is not a terminal.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    40  100    40    0     0  31897      0 --:--:-- --:--:-- --:--:-- 40000
Starting mesos slave on ip-172-31-24-46.ec2.internal
Setting up spark-standalone
RSYNC'ing /root/spark/conf to slaves...
ec2-54-84-255-85.compute-1.amazonaws.com
ec2-54-84-255-86.compute-1.amazonaws.com
ec2-54-84-255-87.compute-1.amazonaws.com
ec2-54-84-255-69.compute-1.amazonaws.com
ec2-54-84-255-88.compute-1.amazonaws.com
cp: cannot create regular file `/root/mesos-ec2/cluster-url': No such file or directory
RSYNC'ing /root/spark-ec2 to slaves...
ec2-54-84-255-85.compute-1.amazonaws.com
ec2-54-84-255-86.compute-1.amazonaws.com
ec2-54-84-255-87.compute-1.amazonaws.com
ec2-54-84-255-69.compute-1.amazonaws.com
ec2-54-84-255-88.compute-1.amazonaws.com
RSYNC'ing /root/mesos-ec2 to slaves...
ec2-54-84-255-85.compute-1.amazonaws.com
rsync: link_stat "/root/mesos-ec2" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
ec2-54-84-255-86.compute-1.amazonaws.com
rsync: link_stat "/root/mesos-ec2" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
ec2-54-84-255-87.compute-1.amazonaws.com
rsync: link_stat "/root/mesos-ec2" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
ec2-54-84-255-69.compute-1.amazonaws.com
rsync: link_stat "/root/mesos-ec2" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
ec2-54-84-255-88.compute-1.amazonaws.com
rsync: link_stat "/root/mesos-ec2" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
./spark-standalone/setup.sh: line 18: /root/spark/bin/stop-all.sh: No such file or directory
./spark-standalone/setup.sh: line 23: /root/spark/bin/start-master.sh: No such file or directory
./spark-standalone/setup.sh: line 29: /root/spark/bin/start-slaves.sh: No such file or directory
Setting up training
~ ~/spark-ec2
Loaded plugins: fastestmirror, priorities, security, update-motd, upgrade-helper
Loading mirror speeds from cached hostfile

If you wish to set tracking information for this branch you can do so with:

    git branch --set-upstream-to=origin/<branch> ampcamp3

fatal: A branch named 'ampcamp3' already exists.
Launching sbt with HIVE_HOME set to /root/blinkdb/hiveblinkdb/build/dist
[info] Loading project definition from /root/blinkdb/project/project
[info] Loading project definition from /root/blinkdb/project
[info] Set current project to shark (in build file:/root/blinkdb/)
[success] Total time: 0 s, completed 20-Feb-2014 1:54:15 PM
[info] Updating {file:/root/blinkdb/}root...
[info] Resolving org.scala-tools.testing#test-interface;0.5 ...
[info] Done updating.
[info] Compiling 96 Scala sources and 5 Java sources to /root/blinkdb/target/scala-2.9.3/classes...
[warn] /root/blinkdb/src/main/scala/shark/Utils.scala:93: method getObject in class S3Service is deprecated: see corresponding Javadoc for more information.
[warn]     val s3obj = s3Service.getObject(bucket, objectName)
[warn]                           ^
[warn] /root/blinkdb/src/main/scala/shark/execution/JoinOperator.scala:119: match is not exhaustive!
[warn] missing combination ReduceKeyMapSide *
[warn]     part.flatMap { case (k: ReduceKeyReduceSide, bufs: Array[]) =>
[warn]                    ^
[warn] two warnings found
[warn] Note: /root/blinkdb/src/main/java/shark/execution/ExplainTaskHelper.java uses unchecked or unsafe operations.
[warn] Note: Recompile with -Xlint:unchecked for details.
[info] Packaging /root/blinkdb/target/scala-2.9.3/shark_2.9.3-0.8.0-SNAPSHOT.jar ...
[info] Done packaging.
[success] Total time: 70 s, completed 20-Feb-2014 1:55:25 PM
RSYNC'ing /root/blinkdb to slaves...
ec2-54-84-255-85.compute-1.amazonaws.com
ec2-54-84-255-86.compute-1.amazonaws.com
ec2-54-84-255-87.compute-1.amazonaws.com
ec2-54-84-255-69.compute-1.amazonaws.com
ec2-54-84-255-88.compute-1.amazonaws.com
~/spark-ec2 ~ ~/spark-ec2
Copying Hadoop executor for Mesos
put: Target /ephemeral-hdfs.tar.gz already exists
Copying Spark executor for Mesos
put: Target /spark.tar.gz already exists
~/spark-ec2
./training/setup.sh: line 82: popd: directory stack empty
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...
ec2-54-84-255-85.compute-1.amazonaws.com
ec2-54-84-255-86.compute-1.amazonaws.com
ec2-54-84-255-87.compute-1.amazonaws.com
ec2-54-84-255-69.compute-1.amazonaws.com
ec2-54-84-255-88.compute-1.amazonaws.com
Shutting down GANGLIA gmond: [ OK ]
Starting GANGLIA gmond: [ OK ]
Shutting down GANGLIA gmond: [ OK ]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-54-84-255-85.compute-1.amazonaws.com closed.
Shutting down GANGLIA gmond: [ OK ]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-54-84-255-86.compute-1.amazonaws.com closed.
Shutting down GANGLIA gmond: [ OK ]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-54-84-255-87.compute-1.amazonaws.com closed.
Shutting down GANGLIA gmond: [ OK ]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-54-84-255-69.compute-1.amazonaws.com closed.
Shutting down GANGLIA gmond: [ OK ]
Starting GANGLIA gmond: [ OK ]
Connection to ec2-54-84-255-88.compute-1.amazonaws.com closed.
ln: creating symbolic link `/var/lib/ganglia/conf/default.json': File exists
Shutting down GANGLIA gmetad: [ OK ]
Starting GANGLIA gmetad: [ OK ]
Stopping httpd: [ OK ]
Starting httpd: [ OK ]
Connection to ec2-54-84-255-84.compute-1.amazonaws.com closed.
Done!
Waiting for cluster to start...
Traceback (most recent call last):
  File "./spark_ec2.py", line 916, in <module>
    main()
  File "./spark_ec2.py", line 759, in main
    err = wait_for_spark_cluster(master_nodes, opts)
  File "./spark_ec2.py", line 724, in wait_for_spark_cluster
    err = check_spark_cluster(master_nodes, opts)
  File "./spark_ec2.py", line 453, in check_spark_cluster
    response = urllib2.urlopen(url)
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 110] Connection timed out>

I tried the same script about 3 weeks ago and it worked. I can connect through an SSH terminal, but I can't access master:8080. Is this a bug, or am I doing something wrong?
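For reference, the failing step is just the launcher polling the standalone master's web UI: the traceback shows check_spark_cluster calling urllib2.urlopen on the master URL, and port 8080 is the standalone master UI I can't reach. A rough manual version of that check (the hostname below is the master from this log; substitute your own master's public DNS) is:

```bash
# Poll the Spark standalone master web UI by hand, roughly what
# check_spark_cluster does from the machine running spark-ec2.
# A timeout here usually points at the master process or the EC2
# security group rather than at the launch script itself.
curl -sf --max-time 10 http://ec2-54-84-255-84.compute-1.amazonaws.com:8080 >/dev/null \
  && echo "master web UI reachable" \
  || echo "master web UI not reachable (timed out or connection refused)"
```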

shivaram commented 10 years ago

My guess is that the scripts were updated for AMP Camp 4 and you can't resume an older AMP Camp 3 cluster with these scripts. If you want to use the cluster from AMP Camp 3, try using the ampcamp3 branch in https://github.com/amplab/training-scripts.
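An untested sketch of what that would look like (the exact layout and flags of the launcher in that repo may differ, so check its README before relying on this):

```bash
# Sketch only: fetch the ampcamp3 branch of the training scripts and re-run
# the resume with the same key pair and cluster name used in the log above.
# Assumes the spark-ec2 wrapper sits at the repo root, as in the original run.
git clone -b ampcamp3 https://github.com/amplab/training-scripts.git
cd training-scripts
./spark-ec2 -k sparkamplab -i sparkamplab.pem --resume launch amplab-training
```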

hardik-pandya commented 10 years ago

Yes, you are right. I just tried the ampcamp4 branch and it worked for me. Thanks Shivaram, closing the issue.