CODAIT / spark-bench

Benchmark Suite for Apache Spark
https://codait.github.io/spark-bench/
Apache License 2.0

Compatibility issue with 2.0 version of Spark #172

Closed · justorez closed this 9 months ago

justorez commented 6 years ago
| key | value |
| --- | --- |
| spark-bench version | 2.3.0_0.4.0 |
| spark version | 2.0.0 |
| cluster setup | standalone |
| scala version | 2.11.8 |

My configuration file:

spark-bench = {
  spark-submit-config = [{
    spark-home = "/home/xxx/spark-2.0.0-bin-hadoop2.6"
    spark-args = {
      master = "spark://xxx:xxxx"
    }
    workload-suites = [
      {
        descr = "Kmeans"
        benchmark-output = "hdfs:///tmp/kmeans/result-kmeans.csv"
        save-mode = "overwrite"
        workloads = [
          {
            name = "kmeans"
            input = "/tmp/kmeans/kmeans-data.csv"
            k = 100
          }
        ]
      }
    ]
  }]
}

Exception info:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.mllib.clustering.KMeans$.train(Lorg/apache/spark/rdd/RDD;IILjava/lang/String;J)Lorg/apache/spark/mllib/clustering/KMeansModel;
    at com.ibm.sparktc.sparkbench.workload.ml.KMeansWorkload$$anonfun$train$1.apply(KMeansWorkload.scala:120)

`org.apache.spark.mllib.clustering.KMeans` in Spark 2.1:

@Since("2.1.0")
def train(
    data: RDD[Vector],
    k: Int,
    maxIterations: Int,
    initializationMode: String,
    seed: Long): KMeansModel = {
  new KMeans().setK(k)
    .setMaxIterations(maxIterations)
    .setInitializationMode(initializationMode)
    .setSeed(seed)
    .run(data)
}

The `KMeans` object in Spark 2.0 does not have this `train` overload — it was only added in 2.1.0, which is why the call fails at runtime with a `NoSuchMethodError`.
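One way to sidestep the missing static overload on Spark 2.0 is to build the model through the setter-based API, which predates 2.0 (`setSeed` exists since Spark 1.4). A sketch, not the actual spark-bench fix; `trainCompat` is a hypothetical helper name and the parameters mirror the 2.1 signature shown above:

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Spark 2.0-compatible stand-in for the 2.1-only
// KMeans.train(data, k, maxIterations, initializationMode, seed) overload:
// the builder API provides the same knobs, so linking against it avoids
// the NoSuchMethodError on a 2.0 cluster.
def trainCompat(
    data: RDD[Vector],
    k: Int,
    maxIterations: Int,
    initializationMode: String,
    seed: Long): KMeansModel =
  new KMeans()
    .setK(k)
    .setMaxIterations(maxIterations)
    .setInitializationMode(initializationMode)
    .setSeed(seed)
    .run(data)
```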

lovengulu commented 6 years ago

I'm playing with spark-bench and was able to run kmeans on Spark 2. I'm using HDP-2.6.5.0, which ships with Spark 2.3.0.

This is my conf file:

spark-bench = {
  spark-submit-config = [{
    spark-args = {
      master = "yarn" // FILL IN YOUR MASTER HERE
      num-executors = 4
      // executor-memory = "XXXXXXX" // FILL IN YOUR EXECUTOR MEMORY
    }
    conf = {
      // Any configuration you need for your setup goes here, like:
      "spark.executor.cores" = "4"
      "spark.executor.memory" = "5g"
      "spark.driver.memory"   = "5g"
      // "spark.dynamicAllocation.enabled" = "false"
    }

    workload-suites = [
      {
        descr = "kmeans Workloads"
        benchmark-output = "console"
        workloads = [
          {
            name = "kmeans"
            input = "hdfs:///tmp/csv-vs-parquet/kmeans-data.csv"
          }
        ]
      }
    ]
  }]
}
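For reference, a suite defined this way is run by pointing the spark-bench launcher script at the config file (the `.conf` path below is illustrative):

```shell
# Run from the spark-bench installation directory;
# kmeans.conf is an example path to a config like the one above.
./bin/spark-bench.sh ./examples/kmeans.conf
```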