:+1: This was exactly the issue for me when running the blog-post examples with Docker 0.6.5 and 0.7 on Ubuntu 13.04 (Raring Ringtail).
Looks like the nameserver comes up now and is detected all right, but the script is stuck in an infinite loop:
Pulling repository amplab/shark-master
2013/12/03 08:20:08 Server error: 404 trying to fetch remote history for amplab/shark-master
started master container:
MASTER_IP:
waiting for master
Usage: docker logs CONTAINER
Fetch the logs of a container
.
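For what it's worth, that loop output is consistent with the failed pull leaving the script's container variable empty: nothing is printed after "started master container:" or "MASTER_IP:", and an unquoted empty variable passed to docker logs expands to no argument at all, which is exactly when Docker prints its usage text. A minimal sketch of a guard that would fail fast instead (variable names are my assumptions, not necessarily the script's):
#!/bin/bash
# Start the master and bail out if the container could not be started
# (e.g. the image pull 404'd), instead of looping on `docker logs` with an empty ID.
MASTER=$(docker run -d amplab/shark-master)
if [ -z "$MASTER" ]; then
    echo "master container failed to start (image pull failed?)" >&2
    exit 1
fi
echo "started master container: $MASTER"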
@alexander-bzz - I have the same issue with shark-master; I think it's caused by an actual 404. I haven't looked into it yet, but probably will, as I need to get a Shark cluster working in our Docker-based environment. If I find a fix, I'll submit a separate pull request with it. In the meantime, I had this problem and reverted to the amplab/spark-xxxx image, and it "works" (although not really: submitting a job as per the example/amplab blog post fails because no worker picks up one of the 2 tasks, though I suspect the lack of workers registering might be a DNS issue after all... so it might be remotely related to this :).
Regardless, let me know if you have any insights too, or if amplab/spark-xxxx doesn't fix your issue (so far).
Thanks guys for the bug report. There may be some differences between host systems in how the nameserver resolves its own IP, since it's the only one that uses Docker's own DNS. I'll also look into that. The PR may indeed break the script on other hosts, but I guess one could just extend the grep with an "OR" on the other IP. In any case, thanks @24601 for reporting and @alexander-bzz for the comment.
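For readers following along, a sketch of that "OR" idea (this is not the committed fix; the hostname "nameserver", the dig probe and NAMESERVER_IP are assumptions about how the script checks the nameserver):
#!/bin/bash
# Poll until the nameserver resolves its own hostname, accepting either
# 127.0.0.1 (the original assumption) or the container's own bridge IP.
NAMESERVER_IP="$1"
until dig +short nameserver @"$NAMESERVER_IP" | grep -qE "^(127\.0\.0\.1|$NAMESERVER_IP)$"; do
    echo "waiting for nameserver"
    sleep 1
done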
I've created a separate issue #18 for the 404 on docker pull (BTW @24601, you are right, pull works fine for the amplab/spark-xxxx images).
All this makes ./deploy.sh -i amplab/spark:0.8.0 -c
work, but then even a simple job from the tutorial cannot be submitted and results in an error:
scala> textFile.count()
13/12/04 03:17:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/12/04 03:17:23 WARN LoadSnappy: Snappy native library not loaded
13/12/04 03:17:23 INFO FileInputFormat: Total input paths to process : 1
13/12/04 03:17:23 INFO SparkContext: Starting job: count at <console>:15
13/12/04 03:17:23 INFO DAGScheduler: Got job 0 (count at <console>:15) with 2 output partitions (allowLocal=false)
13/12/04 03:17:23 INFO DAGScheduler: Final stage: Stage 0 (count at <console>:15)
13/12/04 03:17:23 INFO DAGScheduler: Parents of final stage: List()
13/12/04 03:17:23 INFO DAGScheduler: Missing parents: List()
13/12/04 03:17:23 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[1] at textFile at <console>:12), which has no missing parents
13/12/04 03:17:23 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[1] at textFile at <console>:12)
13/12/04 03:17:23 INFO ClusterScheduler: Adding task set 0.0 with 2 tasks
13/12/04 03:17:38 WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
...the same warning repeats many times...
And it is not a DNS issue, because I see 2 workers connected to the master:
$ lynx master:8080
Spark Master at spark://master:7077
* URL: spark://master:7077
* Workers: 2
* Cores: 2 Total, 0 Used
* Memory: 2.9 GB Total, 0.0 B Used
* Applications: 0 Running, 3 Completed
Workers
worker-20131203111316-worker2-50029 worker2:50029 ALIVE 1 (0 Used) 1500.0 MB (0.0 B Used)
worker-20131203111318-worker1-37815 worker1:37815 ALIVE 1 (0 Used) 1500.0 MB (0.0 B Used)
@AndreSchumacher where should we look now? Do you want me to create a separate issue for that?
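One generic way to dig further (a suggestion, not something taken from the deploy scripts): the "has not accepted any resources" warning often means executors are being launched but die right away, so their stdout/stderr on the workers is the next place to look. The container ID below is a placeholder:
$ docker ps                           # find the worker container IDs
$ docker logs <worker-container-id>   # worker log shows executors being launched and exiting
$ lynx worker1:8081                   # the Spark worker UI (default port 8081) links each executor's stdout/stderr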
After installing Docker 0.7.0 I see similar problems, but I also see errors in the stdout shown in the Spark GUI:
13/12/04 07:19:11 INFO StandaloneExecutorBackend: Connecting to driver: akka://spark@dd5453bf3ca7:38838/user/StandaloneScheduler
13/12/04 07:19:11 ERROR StandaloneExecutorBackend: error while creating actor
java.net.UnknownHostException: dd5453bf3ca7: Name or service not known
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:866)
When I ssh into the master and start it from there, it works. @alexander-bzz could you try the same?
It may be necessary to add the driver's hostname/IP to the dnsmasq file as well. I'll check whether that solves the problem.
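A sketch of that dnsmasq idea (the IP, hostname and file path are example values, not taken from the scripts): dnsmasq serves names from /etc/hosts plus any addn-hosts file and re-reads them on SIGHUP, so adding the driver's hostname inside the nameserver container should make it resolvable from the workers.
# run inside the nameserver container, however you get a shell there:
$ echo "172.17.0.10  dd5453bf3ca7" >> /etc/hosts    # driver's bridge IP + the hostname from the log above
$ kill -HUP "$(pidof dnsmasq)"                      # dnsmasq re-reads /etc/hosts on SIGHUP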
I confirm that ssh'ing to the master IP and running the sample code from spark-shell works:
$ ssh -i ./docker-scripts/deploy/apache-hadoop-hdfs-precise/files/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@<master-ip>
$ /opt/spark-0.8.0-incubating-bin-hadoop1/spark-shell
scala> val textFile = sc.textFile("hdfs://master:9000/user/hdfs/test.txt")
scala> textFile.count()
....
13/12/04 09:04:42 INFO DAGScheduler: Stage 0 (count at <console>:15) finished in 0.937 s
13/12/04 09:04:42 INFO SparkContext: Job finished: count at <console>:15, took 1.005900271 s
res0: Long = 3
I don't quite get what you mean by "add the driver's hostname/IP also to the dnsmasq file".
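As an aside, a hedged alternative that would avoid touching dnsmasq at all (not tried in this thread): tell Spark which address the driver should advertise, so executors don't have to resolve its auto-detected hostname (dd5453bf3ca7 above). The exact knobs depend on the Spark version; SPARK_LOCAL_IP and the spark.driver.host system property both exist in the 0.8 line:
$ export SPARK_LOCAL_IP=172.17.0.10   # example: an address the workers can reach
$ SPARK_JAVA_OPTS="-Dspark.driver.host=172.17.0.10" /opt/spark-0.8.0-incubating-bin-hadoop1/spark-shell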
Closing this now due to commit 3ff8925f489f08f0fb8e58b91aa95b450d2a2b56. Thanks, @24601
The nameserver would come up, but the mechanism the script relied on to detect it wasn't reliable on Ubuntu Precise LTS. This might break other platforms, I don't know; this fix made it work in my environment (vanilla Ubuntu Precise LTS + the latest Docker on said Ubuntu, via Vagrant on Mac OS).
Note: the original scripts assumed the nameserver resolves its own hostname to 127.0.0.1; see comments.