dcos / dcos-cli

The command line for DC/OS.
https://docs.d2iq.com/mesosphere/dcos/latest/cli/
Apache License 2.0

Spark service is not found in your DCOS cluster #613

Closed hoststreamsell closed 5 years ago

hoststreamsell commented 8 years ago

Please answer the following questions before submitting your issue. Thanks!

What version of the DCOS CLI are you using (dcos --version)?

dcos version 0.4.4

What version of DCOS are you using?

1.7

What operating system and version are you using?

Windows 10 with Vagrant/Virtualbox

What did you do?

If possible please provide a recipe for reproducing.

(dcos) PS C:\dcos> dcos package install spark
Installing Marathon app for package [spark] version [1.0.0-1.6.1-2]
Installing CLI subcommand for package [spark] version [1.0.0-1.6.1-2]
New command available: dcos spark.exe
DC/OS Spark is being installed!

    Documentation: https://docs.mesosphere.com/spark-1-7/
    Issues: https://docs.mesosphere.com/support/

(dcos) PS C:\dcos> dcos spark
Usage:
    dcos spark --help
    dcos spark --info

(dcos) PS C:\dcos> dcos spark run --submit-args='-Dspark.mesos.coarse=true --driver-cores 1 --driver-memory 1024M --class org.apache.spark.examples.SparkPi https://downloads.mesosphere.com/spark/assets/spark-examples_2.10-1.4.0-SNAPSHOT.jar 30'
Spark distribution spark-1.6.1-2 not found locally.
It looks like this is your first time running Spark!
Downloading https://downloads.mesosphere.com/spark/assets/spark-1.6.1-2.tgz...
Extracting spark distribution c:\users\gbyrne.dcos\subcommands\spark\env\lib\site-packages\dcos_spark\data\spark-1.6.1-2.tgz...
Successfully fetched spark distribution https://downloads.mesosphere.com/spark/assets/spark-1.6.1-2.tgz!
Spark service is not found in your DCOS cluster.
127.0.0.1 - - [17/May/2016 10:43:55] "POST /v1/submissions/create HTTP/1.1" 502 -

The Spark service is running and shown as Healthy in the GUI Services screen.

What did you expect to see?

The Spark service should run the task.

What did you see instead?

Spark service is not found in your DCOS cluster.

hoststreamsell commented 8 years ago

Not sure if this helps...

(dcos) PS C:\dcos> dcos spark run --help
Traceback (most recent call last):
  File "c:\python27\Lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)

(dcos) PS C:\dcos> python.exe --version
Python 2.7.11

jsancio commented 8 years ago

What do you get when you do dcos package list and dcos service?

cc @mgummelt

hoststreamsell commented 8 years ago

(dcos) PS C:\dcos> dcos package list
NAME   VERSION        APP     COMMAND  DESCRIPTION
spark  1.0.0-1.6.1-2  /spark  spark    Spark is a fast and general cluster computing system for Big Data. Documentation: https://docs.mesosphere.com/usage/managing-services/spark/

(dcos) PS C:\dcos> dcos service
NAME      HOST           ACTIVE  TASKS  CPU  MEM     DISK  ID
marathon  192.168.65.90  True    1      1.0  1024.0  0.0   8f3d1142-5bab-49b9-8eb9-12a3e033bf13-0000
spark     a1.dcos        True    0      0.0  0.0     0.0   8f3d1142-5bab-49b9-8eb9-12a3e033bf13-0004

Thanks

hoststreamsell commented 8 years ago

I tried shutting down the servers and restarting, and I get the same error. I'm currently running the following, and I don't see any errors while these boot:

vagrant up m1 a1 p1 boot

My PC has 16GB of RAM, and I haven't seen it use more than 12.5GB.

It looks like my issue may not be Spark-specific: I tried installing Kafka too, and that does not seem to install correctly either. Unlike Spark, I never see it running as a service in the GUI dashboard.

(dcos) PS C:\dcos> dcos package install kafka
Installing Marathon app for package [kafka] version [1.0.7-0.9.0.1]
Installing CLI subcommand for package [kafka] version [1.0.7-0.9.0.1]
New command available: dcos kafka.exe
DC/OS Kafka Service is being installed.

    Documentation: https://docs.mesosphere.com/kafka-1-7/
    Issues: https://docs.mesosphere.com/support/

(dcos) PS C:\dcos> dcos kafka connection
Failed to GET http://m1.dcos/service/kafka/v1/connection
HTTP 500: Internal Server Error
Content:

500 Internal Server Error
openresty/1.7.10.2

(dcos) PS C:\dcos> dcos package list
NAME   VERSION        APP     COMMAND  DESCRIPTION
kafka  1.0.7-0.9.0.1  /kafka  kafka    Apache Kafka running on DC/OS
spark  1.0.0-1.6.1-2  /spark  spark    Spark is a fast and general cluster computing system for Big Data. Documentation: https://docs.mesosphere.com/usage/managing-services/spark/

(dcos) PS C:\dcos> dcos service
NAME      HOST           ACTIVE  TASKS  CPU  MEM     DISK  ID
marathon  192.168.65.90  True    1      1.0  1024.0  0.0   ee402a3d-c614-468f-8865-9c50b4720348-0000
spark     a1.dcos        True    0      0.0  0.0     0.0   ee402a3d-c614-468f-8865-9c50b4720348-0001

hoststreamsell commented 8 years ago

I see this in the log viewer of the Spark process in the GUI:

I0517 13:07:03.041519 8759 exec.cpp:217] Executor registered on slave 473b8728-6599-4fd3-afd2-dd9b23d274d9-S0

jsancio commented 8 years ago

@hoststreamsell. Thanks a lot for the information. @gabrielhartmann is looking into it and should have some feedback soon.

gabrielhartmann commented 8 years ago

@hoststreamsell: Trying to repro right now.

gabrielhartmann commented 8 years ago

@hoststreamsell: The Kafka and Spark issues are unrelated. I had to export DCOS_SPARK_URL like this:

export DCOS_SPARK_URL=http://m1.dcos/service/spark

in order to get Spark working on 'real' clusters. I was unable to get Spark working at all on Vagrant. It looks like a network proxying problem associated with running in an Open DC/OS Vagrant environment. We will continue to look into this, but a quick fix or configuration change is unlikely to resolve this issue.

Requests are making it to the Spark nginx agent, but failures are encountered after that. It's possible this is a Vagrant networking setup problem.

2016/05/17 22:54:01 [error] 32#0: *128 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.65.90, server: , request: "POST /v1/submissions/create HTTP/1.1", upstream: "http://192.168.65.111:19756/v1/submissions/create", host: "localhost:64563"

Running

curl -v -H 'Authorization:token=<get your token from ~/.dcos/dcos.toml>' http://m1.dcos/service/spark/

provides the expected output, so that endpoint is accessible.

As a workaround, Spark submissions should be possible by communicating directly with this endpoint from within the cluster.
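A minimal sketch of that workaround as a shell session (the hostname m1.dcos matches this Vagrant setup; the commented submit command is illustrative, not a tested invocation):

```shell
# Point the Spark CLI at the proxied dispatcher endpoint explicitly
# before submitting (workaround sketch; m1.dcos is the Vagrant master).
export DCOS_SPARK_URL=http://m1.dcos/service/spark
echo "$DCOS_SPARK_URL"
# Then submit as usual, e.g.:
#   dcos spark run --submit-args='--class org.apache.spark.examples.SparkPi <jar-url> 30'
```

The variable only needs to be set in the shell that runs the submission; it does not change anything in the cluster itself.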

hoststreamsell commented 8 years ago

Thanks for looking into this so quickly and providing an update.

darabos commented 8 years ago

I'm getting this error message as well with the 1.7 early access DCOS.

$ git clone https://github.com/dcos/dcos-vagrant
$ cd dcos-vagrant
$ export DCOS_CONFIG_PATH=etc/config-1.7.yaml
$ curl -O https://downloads.dcos.io/dcos/EarlyAccess/dcos_generate_config.sh
$ vagrant up m1 a1 a2 a3 p1 boot
$ dcos auth login
$ dcos package install spark
Installing Marathon app for package [spark] version [1.0.0-1.6.1-2]
Installing CLI subcommand for package [spark] version [1.0.0-1.6.1-2]
New command available: dcos spark
DC/OS Spark is being installed!

    Documentation: https://docs.mesosphere.com/usage/services/spark/
    Issues: https://docs.mesosphere.com/support/
$ dcos service
NAME           HOST      ACTIVE  TASKS  CPU   MEM    DISK  ID                                         
marathon  192.168.65.90   True     1    1.0  1024.0  0.0   cdbc219b-d7a5-4ea7-9e6f-2b63e4c9cad8-0000  
spark        a2.dcos      True     0    0.0   0.0    0.0   cdbc219b-d7a5-4ea7-9e6f-2b63e4c9cad8-0001  
$ dcos spark run --verbose --submit-args="--class org.apache.spark.examples.SparkPi https://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.4.0-SNAPSHOT.jar 30"
Ran command: /home/darabos/.dcos/subcommands/spark/env/lib/python3.5/site-packages/dcos_spark/data/spark-1.6.1-2/bin/spark-submit --deploy-mode cluster --master mesos://localhost:43644 --conf spark.ssl.noCertVerification=true --class org.apache.spark.examples.SparkPi https://downloads.mesosphere.com.s3.amazonaws.com/assets/spark/spark-examples_2.10-1.4.0-SNAPSHOT.jar 30
With added env vars: {'SPARK_JAVA_OPTS': '-Dspark.mesos.executor.docker.image=mesosphere/spark:1.0.0-1.6.1-2'}
# Abridged error message:
java.net.ConnectException: Connection refused

I think localhost:43644 looks fairly suspicious. Shouldn't it be a2.dcos:<something> instead? The DCOS UI tells me Spark has 0 tasks running, but there is 1 Marathon task running called spark. The stderr for this task contains the following:

I0701 09:01:09.194223 31269 exec.cpp:143] Version: 0.28.1
I0701 09:01:09.196869 31279 exec.cpp:217] Executor registered on slave cdbc219b-d7a5-4ea7-9e6f-2b63e4c9cad8-S5
+ export DISPATCHER_PORT=18558
+ DISPATCHER_PORT=18558
+ export DISPATCHER_UI_PORT=18559
+ DISPATCHER_UI_PORT=18559
+ export HISTORY_SERVER_PORT=18560
+ HISTORY_SERVER_PORT=18560
+ export SPARK_PROXY_PORT=18561
+ SPARK_PROXY_PORT=18561
+ SCHEME=http
+ OTHER_SCHEME=https
+ '[' '' == true ']'
+ export HISTORY_SERVER_WEB_PROXY_BASE=/service/spark/history
+ HISTORY_SERVER_WEB_PROXY_BASE=/service/spark/history
+ export DISPATCHER_UI_WEB_PROXY_BASE=/service/spark
+ DISPATCHER_UI_WEB_PROXY_BASE=/service/spark
+ '[' false = true ']'
+ grep -v '#https#' /etc/nginx/conf.d/spark.conf.template
+ sed s,#http#,,
+ sed -i 's,<PORT>,18561,' /etc/nginx/conf.d/spark.conf
+ sed -i 's,<DISPATCHER_URL>,http://192.168.65.121:18558,' /etc/nginx/conf.d/spark.conf
+ sed -i 's,<DISPATCHER_UI_URL>,http://192.168.65.121:18559,' /etc/nginx/conf.d/spark.conf
+ sed -i 's,<HISTORY_SERVER_URL>,http://192.168.65.121:18560,' /etc/nginx/conf.d/spark.conf
+ sed -i 's,<PROTOCOL>,,' /etc/nginx/conf.d/spark.conf
+ '[' '' == true ']'
+ '[' -f hdfs-site.xml
/sbin/init.sh: line 69: [: missing `]'
+ '[' -n '' ']'
+ exec runsvdir -P /etc/service
+ mkdir -p /mnt/mesos/sandbox/nginx
+ mkdir -p /mnt/mesos/sandbox/spark
+ exec
+ exec svlogd /mnt/mesos/sandbox/spark
+ exec svlogd /mnt/mesos/sandbox/nginx

I tried running the spark-submit command with all the ports mentioned there, but it does not work.

jsancio commented 8 years ago

@mgummelt Can you take a look at the error above? Thanks!

mgummelt commented 8 years ago

This error occurs when /service/spark returns a 502 https://github.com/mesosphere/dcos-spark/blob/269873dbb9eadcef2aadc941e397c60d5e60aeb0/dcos_spark/spark_submit.py#L316

This can occur if you don't wait the ~30 seconds the Spark dispatcher needs to come up after you install it. If you've waited that long and it's still not up, check the DC/OS or Mesos UI to diagnose.
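For context, here is a hedged sketch of the kind of check the linked spark_submit.py line performs — the function name and non-502 branches are illustrative, not the actual dcos-spark source; only the 502-to-message mapping is what this thread describes:

```python
# Illustrative sketch (not the actual dcos-spark source): how a 502 from
# the /service/spark proxy becomes the CLI's "not found" error message.
def interpret_dispatcher_response(status_code):
    if status_code == 502:
        # nginx answers 502 Bad Gateway when it cannot reach the Spark
        # dispatcher upstream, e.g. before the dispatcher has finished
        # starting (~30s after install).
        return "Spark service is not found in your DCOS cluster."
    if status_code == 200:
        return "ok"
    return "HTTP {}: unexpected response".format(status_code)

print(interpret_dispatcher_response(502))
```

This is why the CLI message is misleading here: the service exists and is Healthy in Marathon, but the proxy in front of the dispatcher returned 502 because the dispatcher itself was unreachable.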

rwd5213 commented 8 years ago

Did anyone figure out this error? I am getting the same thing, where even running dcos spark run --help gives this. Running the same configuration as above.

Traceback (most recent call last):
  File "c:\python27\Lib\runpy.py", line 175, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "c:\python27\Lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "C:\Users\579206.dcos\subcommands\spark\env\Scripts\dcos-spark.exe\__main__.py", line 9, in <module>
  File "c:\users\579206.dcos\subcommands\spark\env\lib\site-packages\dcos_spark\cli.py", line 100, in main
    return show_spark_submit_help()
  File "c:\users\579206.dcos\subcommands\spark\env\lib\site-packages\dcos_spark\cli.py", line 48, in show_spark_submit_help

GuyWu commented 7 years ago

Have you fixed that problem? I have encountered the same problem as you and would like to hear how you resolved it. Thank you.

tamarrow-zz commented 7 years ago

@mgummelt ^

hantuzun commented 6 years ago

I face an error installing Spark on DC/OS, but it occurs after this /sbin/init.sh: line 69: [: missing `]' line. It seems like this issue has been resolved.

Mwea commented 6 years ago

Hi! I've encountered the exact same problem... Could my proxy be the reason for the failure?

hantuzun commented 6 years ago

@Mwea I was not using a proxy.