databricks / learning-spark

Example code from Learning Spark book
MIT License

Difference between running Spark as local[*] vs. yarn-client vs. yarn-cluster in terms of performance #11

Closed prashanttct07 closed 9 years ago

prashanttct07 commented 9 years ago

Kindly treat this as an inquiry rather than an issue.

Hi, I am evaluating Spark for use at my work.

We have an existing Hortonworks HDP 2.3 install.

I am trying to work out whether I should submit Spark jobs in local, yarn-client, or yarn-cluster mode.

Suppose I run my job as:

sudo -u hdfs spark-submit --class "org.xyz.Spark_ES_Java_V4" --master "local[*]" target/xyz-1.1-jar-with-dependencies.jar 192.168.0.185 55555 > prashant.txt

This way the task completes in 14 seconds.

When I run the same job as:

sudo -u hdfs spark-submit --class "org.xyz.Spark_ES_Java_V4" --master "yarn-client" target/xyz-1.1-jar-with-dependencies.jar 192.168.0.185 55555 > prashant.txt

it takes 16 seconds.

And this one:

sudo -u hdfs spark-submit --class "org.xyz.Spark_ES_Java_V4" --master "yarn-cluster" target/xyz-1.1-jar-with-dependencies.jar 192.168.0.185 55555 > prashant.txt

takes 18 seconds.

In the first case I am running locally, which means the job runs on one machine and takes less time, whereas in the latter two cases I am submitting the job to a cluster with 4 nodes.

So can anyone tell me what the benefit of running the job on the cluster is, given that performance degrades there? Or is there a way to improve performance on the cluster?

I would love to hear from someone about this, as it is quite urgent.

~Prashant

holdenk commented 9 years ago

This is the repo for the examples from Learning Spark. Your question is most likely best suited to the Apache Spark users mailing list (very few people look at the issues here). Best of luck :)

pradeepshivapur commented 6 years ago

Although at first glance this looks like a performance degradation, running on the cluster is useful when Spark jobs are submitted from a gateway node, since the driver program uses a good amount of resources, which can become a bottleneck on that node in the future.
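
One possible reading of the timings in the question is that the YARN modes pay a fixed startup cost (requesting containers, launching the application master, shipping the jar) that a job lasting only a few seconds cannot amortize. Below is a minimal sketch of that idea in Python. It is a toy model only: the function names, core counts, and startup times are hypothetical placeholders, not measurements from this thread or from any real cluster.

```python
# Toy model: wall-clock time = fixed startup overhead + compute time / parallelism.
# All numbers here are hypothetical placeholders, not measurements from the thread.


def wall_time(compute_seconds, parallelism, startup_seconds):
    """Estimated wall-clock time of a job under the toy model."""
    return startup_seconds + compute_seconds / parallelism


def break_even_compute(local_par, local_startup, cluster_par, cluster_startup):
    """Amount of compute (in single-core seconds) at which both modes tie.

    Solves local_startup + w / local_par == cluster_startup + w / cluster_par
    for w. Below this value the local run wins; above it the cluster wins.
    """
    if cluster_par <= local_par:
        raise ValueError("cluster must offer more parallelism than local mode")
    gain_per_unit = 1.0 / local_par - 1.0 / cluster_par
    return (cluster_startup - local_startup) / gain_per_unit


# Assumed setup: 4 local cores with 2 s startup vs. 16 cluster cores with 8 s startup.
LOCAL = dict(parallelism=4, startup_seconds=2.0)
CLUSTER = dict(parallelism=16, startup_seconds=8.0)

if __name__ == "__main__":
    for work in (10, 100, 1000):
        t_local = wall_time(work, **LOCAL)
        t_cluster = wall_time(work, **CLUSTER)
        print(f"work={work:>5}s  local={t_local:7.1f}s  cluster={t_cluster:7.1f}s")
    print("break-even:", break_even_compute(4, 2.0, 16, 8.0), "single-core seconds")
```

Under these assumed numbers the small job finishes sooner locally, while the large job finishes sooner on the cluster. The model ignores shuffle, data locality, and I/O, so it only illustrates why a short job may not be a meaningful benchmark for cluster mode.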