Hydrospheredata / mist

Serverless proxy for Spark cluster
http://hydrosphere.io/mist/
Apache License 2.0
326 stars 68 forks

Mist always launches jobs with spark.master=local[*] despite the function's default context #413

Closed. Volodymyr128 closed this issue 6 years ago

Volodymyr128 commented 6 years ago

Although the function was deployed with default context = volodymyr.bakhmatiuk_cluster_context, it is launched with the default local context. Please help me launch my job on my remote cluster!

To launch the HelloMist function on my cluster, I followed four steps from the documentation:

  1. I created a new configuration file:
    model = Context
    name = cluster_context
    data {
      spark-conf {
        spark.master = "spark://myhost.com:7077"
      }
    }
  2. I've set context=cluster_context in hello_mist/scala/conf/20_function.conf
  3. Re-packaged everything with mvn package
  4. Deployed changes to mist with mist-cli apply -f conf

Now I can check that the function's context is linked to my cluster:

curl -H 'Content-Type: application/json' -X GET http://localhost:2004/v2/api/functions

[{"name":"volodymyr.bakhmatiuk_hello-mist-java","execute":{"samples":{"type":"MInt"}},"path":"volodymyr.bakhmatiuk_hello-mist-java_0.0.1.jar","tags":[],"className":"HelloMist","defaultContext":"volodymyr.bakhmatiuk_cluster_context","lang":"java"}]
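The same check can be scripted instead of read by eye. The following is a minimal sketch, assuming the response body above has been saved into a string; `default_contexts` is a hypothetical helper name and not part of Mist.

```python
import json

# Response body copied from the GET /v2/api/functions call above.
response_body = (
    '[{"name":"volodymyr.bakhmatiuk_hello-mist-java",'
    '"execute":{"samples":{"type":"MInt"}},'
    '"path":"volodymyr.bakhmatiuk_hello-mist-java_0.0.1.jar",'
    '"tags":[],"className":"HelloMist",'
    '"defaultContext":"volodymyr.bakhmatiuk_cluster_context","lang":"java"}]'
)


def default_contexts(body: str) -> dict:
    """Map each function name to its defaultContext (hypothetical helper)."""
    return {fn["name"]: fn.get("defaultContext") for fn in json.loads(body)}


contexts = default_contexts(response_body)
print(contexts)
```

This only confirms what the API reports for the function; it says nothing about which Spark master the worker actually connects to.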

And I can check that the configuration has been deployed:

curl -H 'Content-Type: application/json' -X GET http://localhost:2004/v2/api/contexts/volodymyr.bakhmatiuk_cluster_context

{"name":"volodymyr.bakhmatiuk_cluster_context","maxJobs":20,"workerMode":"shared","precreated":false,"sparkConf":{"\"spark.master\"":"spark://myhost.com:7077"},"runOptions":"","downtime":"120s","streamingDuration":"1s"}
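One detail in this response is easy to miss: the `sparkConf` key is returned as `"\"spark.master\""`, that is, with literal quote characters inside the key. The thread does not establish whether this is related to the problem (the maintainer attributes the fix to #411), so the sketch below only shows how such keys could be spotted. `quoted_spark_keys` is a hypothetical helper name.

```python
import json

# Response body copied from the GET /v2/api/contexts/... call above.
# A raw string keeps the \" escapes exactly as the API returned them.
context_body = (
    r'{"name":"volodymyr.bakhmatiuk_cluster_context","maxJobs":20,'
    r'"workerMode":"shared","precreated":false,'
    r'"sparkConf":{"\"spark.master\"":"spark://myhost.com:7077"},'
    r'"runOptions":"","downtime":"120s","streamingDuration":"1s"}'
)


def quoted_spark_keys(body: str) -> list:
    """Return sparkConf keys that carry literal leading or trailing quotes."""
    conf = json.loads(body).get("sparkConf", {})
    return [key for key in conf if key != key.strip('"')]


quoted = quoted_spark_keys(context_body)
print(quoted)
```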

Now I launch the job through WebMist and it finishes successfully. But it looks like WebMist launches the job on a local[*] Spark cluster, because nothing has been launched on the myhost.com cluster! Logs:

18-02-09 17:11:42 [mist-akka.actor.default-dispatcher-16] INFO ere.mist.master.WorkersManager:107 Trying to start worker volodymyr.bakhmatiuk_cluster_context, for context: volodymyr.bakhmatiuk_cluster_context
18-02-09 17:11:47 [mist-akka.actor.default-dispatcher-3] INFO ere.mist.master.WorkersManager:107 Received worker registration - WorkerRegistration(volodymyr.bakhmatiuk_cluster_context,akka.tcp://mist@172.17.0.3:41197,Some(http://172.17.0.3:4040))
18-02-09 17:11:47 [mist-akka.actor.default-dispatcher-26] INFO ere.mist.master.WorkersManager:107 Worker resolved - WorkerResolved(volodymyr.bakhmatiuk_cluster_context,akka.tcp://mist@172.17.0.3:41197,Actor[akka.tcp://mist@172.17.0.3:41197/user/worker-volodymyr.bakhmatiuk_cluster_context#-1055101121],Some(http://172.17.0.3:4040))
18-02-09 17:11:47 [mist-akka.actor.default-dispatcher-16] INFO ere.mist.master.WorkersManager:107 Worker with volodymyr.bakhmatiuk_cluster_context is registered on akka.tcp://mist@172.17.0.3:41197
18-02-09 17:11:49 [mist-akka.actor.default-dispatcher-14] INFO ist.master.FrontendJobExecutor:107 Job has been started be02598b-f8ed-4f80-a583-255f478e610e
18-02-09 17:11:50 [mist-akka.actor.default-dispatcher-3] INFO ist.master.FrontendJobExecutor:107 Job RunJobRequest(be02598b-f8ed-4f80-a583-255f478e610e,JobParams(volodymyr.bakhmatiuk_hello-mist-java_0.0.1.jar,HelloMist,Map(samples -> 7),execute)) id done with result JobSuccess(be02598b-f8ed-4f80-a583-255f478e610e,3.4285714285714284)

P.S. My Spark cluster version is 2.1.1.

I launch mist this way:

docker run -p 2004:2004 -v /var/run/docker.sock:/var/run/docker.sock hydrosphere/mist:1.0.0-RC8-2.2.0 mist

Volodymyr128 commented 6 years ago

I also submitted a job request from the terminal, with the same result:

curl -d '{"samples": 10000}' -H 'Content-Type: application/json' -X POST 'http://localhost:2004/v2/api/functions/volodymyr.bakhmatiuk_hello-mist-java/jobs?context=volodymyr.bakhmatiuk_cluster_context'

{"id":"a3dbd90c-3ca7-4bc7-910f-5c3c5901fb28"}

dos65 commented 6 years ago

Thanks for the detailed description! I've just released v1.0.0-RC9, which includes a fix for this problem (#411).

Volodymyr128 commented 6 years ago

Thank you for the quick response! Now I have another issue: my jobs do not reach the remote cluster:

18-02-09 23:01:08 [mist-akka.actor.default-dispatcher-18] INFO ere.mist.master.WorkersManager:107 Trying to start worker volodymyr.bakhmatiuk_cluster_context, for context: volodymyr.bakhmatiuk_cluster_context
18-02-09 23:03:08 [mist-akka.actor.default-dispatcher-17] WARN ere.mist.master.WorkersManager:131 Worker volodymyr.bakhmatiuk_cluster_context initialization timeout: not being responsive for 2 minutes
18-02-09 23:03:08 [mist-akka.actor.default-dispatcher-17] INFO ere.mist.master.WorkersManager:107 Worker for volodymyr.bakhmatiuk_cluster_context is marked down
18-02-09 23:03:08 [mist-akka.actor.default-dispatcher-25] INFO ist.master.FrontendJobExecutor:107 Job RunJobRequest(6ea5caa8-cec1-47d7-bd44-725601137dd9,JobParams(volodymyr.bakhmatiuk_hello-mist-java_0.0.1.jar,HelloMist,Map(samples -> 10000),execute)) id done with result JobFailure(6ea5caa8-cec1-47d7-bd44-725601137dd9,Worker volodymyr.bakhmatiuk_cluster_context initialization timeout: not being responsive for 2 minutes)
18-02-09 23:06:16 [mist-akka.actor.default-dispatcher-25] WARN mote.PhiAccrualFailureDetector:131 heartbeat interval is growing too large: 2766 millis
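When scanning longer master logs, the job outcome can be pulled out of the FrontendJobExecutor lines mechanically. This is a rough sketch that assumes the line layout seen in this thread (`... done with result JobSuccess(id,detail)` or `JobFailure(id,detail)`); other Mist versions may log differently, and `job_result` is a hypothetical helper name.

```python
import re

# Matches the tail of a FrontendJobExecutor line as it appears in this thread.
RESULT_RE = re.compile(r"done with result (Job\w+)\(([0-9a-f-]+),(.*)\)\s*$")


def job_result(line: str):
    """Return (status, job_id, detail) or None if the line has no job result."""
    match = RESULT_RE.search(line)
    return match.groups() if match else None


# Shortened copies of two log lines from the thread.
success_line = (
    "INFO ist.master.FrontendJobExecutor:107 Job RunJobRequest(...) id done with "
    "result JobSuccess(be02598b-f8ed-4f80-a583-255f478e610e,3.4285714285714284)"
)
failure_line = (
    "INFO ist.master.FrontendJobExecutor:107 Job RunJobRequest(...) id done with "
    "result JobFailure(6ea5caa8-cec1-47d7-bd44-725601137dd9,Worker "
    "volodymyr.bakhmatiuk_cluster_context initialization timeout: "
    "not being responsive for 2 minutes)"
)

success = job_result(success_line)
failure = job_result(failure_line)
print(success)
print(failure)
```

Note that the master log only reports the timeout; the reason the worker never registered has to come from the worker's own logs.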

Is there any way to get more detailed logs to find out what is wrong?

dos65 commented 6 years ago

Try running Mist from the binaries: Docker mode has network limitations, and when run from the binaries all logs are collected in $MIST_HOME/logs.

Volodymyr128 commented 6 years ago

Thank you! That solved my issue.

dos65 commented 6 years ago

Glad to hear that.