CODAIT / spark-bench

Benchmark Suite for Apache Spark
https://codait.github.io/spark-bench/
Apache License 2.0
239 stars 123 forks

CSV-PARQUET CONFIG ISSUE WITH HDP 2.6/Spark 2.1.1 #163

Closed (bonibruno closed this issue 6 years ago)

bonibruno commented 6 years ago

Spark-Bench version

2.1.1_0.3.0

Details of your cluster setup (Spark version, Standalone/Yarn/Local/Etc)

Spark v 2.1.1, Yarn (HDP 2.6)

Scala version on your cluster

Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)

This configuration works, but note that I had to change the filter.

spark-bench = {
  spark-submit-config = [{
    spark-home = "/usr/hdp/current/spark2-client"
    spark-args = {
      master = "yarn"
      executor-memory = "24G"
      num-executors = 140
    }
    conf = {
      "spark.dynamicAllocation.enabled" = "false"
    }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generate a dataset, then take that same dataset and write it out to Parquet format"
        benchmark-output = "hdfs:///tmp/s2b/s2bresults-data-gen.csv"
        save-mode = "overwrite"
        // We need to generate the dataset first through the data generator, then we take that dataset and convert it to Parquet.
        parallel = false
        workloads = [
          {
            name = "data-generation-kmeans"
            rows = 10000000
            cols = 24
            output = "hdfs:///tmp/s2b/kmeans-data.csv"
          },
          {
            name = "sql"
            query = "select * from input"
            input = "hdfs:///tmp/s2b/kmeans-data.csv"
            output = "hdfs:///tmp/s2b/kmeans-data.parquet"
          }
        ]
      },
      {
        descr = "Run two different SQL queries over the dataset in two different formats"
        benchmark-output = "hdfs:///tmp/s2b/s2bresults-sql.csv"
        save-mode = "overwrite"
        parallel = false
        repeat = 3
        workloads = [
          {
            name = "sql"
            input = ["hdfs:///tmp/s2b/kmeans-data.csv", "hdfs:///tmp/s2b/kmeans-data.parquet"]
            query = ["select * from input", "select c0 from input where c0 < -0.9"]
            cache = false
          }
        ]
      }
    ]
  }]
}

Relevant stacktrace

Without the change, the error below is thrown using your published config:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '0' given input columns: [c17, c6, c9, c11, c3, c2, c18, c15, c22, c0, c16, c19, c13, c20, c12, c21, c14, c7, c23, c10, c1, c5, c4, c8]; line 1 pos 34; 'Project ['0, '22]

Relation[c0#2151,c1#2152,c2#2153,c3#2154,c4#2155,c5#2156,c6#2157,c7#2158,c8#2159,c9#2160,c10#2161,c11#2162,c12#2163,c13#2164,c14#2165,c15#2166,c16#2167,c17#2168,c18#2169,c19#2170,c20#2171,c21#2172,c22#2173,c23#2174] csv

Description of your problem and any other relevant info

The problem is with the provided example config file. Modifying the filter resolves the issue on HDP 2.6 platforms running Spark 2.1.1; a sketch of the specific change is below. I thought this feedback would be useful to the team.
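For anyone hitting the same error, the change boils down to the query strings in the sql workloads. Judging from the stacktrace (the unresolved attributes '0 and '22), the published example refers to columns by bare index, while the generated kmeans dataset carries header names c0 through c23, so the queries have to use those names. A minimal sketch of the affected workload block follows; the commented-out query line is an approximation of the published form reconstructed from the stacktrace, not the exact original:

workloads = [
  {
    name = "sql"
    input = ["hdfs:///tmp/s2b/kmeans-data.csv", "hdfs:///tmp/s2b/kmeans-data.parquet"]
    // Approximate published form (inferred from the stacktrace): bare-index column references
    // that cannot resolve against the generated header names c0..c23:
    // query = ["select * from input", "select `0` from input where `0` < -0.9"]
    // Working form on HDP 2.6 / Spark 2.1.1, using the c-prefixed column names:
    query = ["select * from input", "select c0 from input where c0 < -0.9"]
    cache = false
  }
]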

ecurtin commented 6 years ago

@bonibruno You're totally right, the documentation didn't get synced with https://github.com/SparkTC/spark-bench/pull/157. Thanks so much for your report!

bonibruno commented 6 years ago

@ecurtin You're welcome! Note: #157 was committed 18 days ago and included in release 91. Since I'm on release 91 and still ran into this problem using the provided example conf file, I believe you need to update the example conf in addition to the documentation; that is what ultimately corrected the problem for me on release 91.

ecurtin commented 6 years ago

Closed by #164