Closed: bonibruno closed this issue 6 years ago
@bonibruno You're totally right, the documentation didn't get synced with https://github.com/SparkTC/spark-bench/pull/157. Thanks so much for your report!
@ecurtin You're welcome! Note: #157 was committed 18 days ago and included in release 91. Since I'm on release 91 and still ran into this problem using the provided example conf file, I believe you need to update the example conf in addition to the documentation; that is what ultimately corrected the problem for me on release 91.
Closed by #164
Spark-Bench version
2.1.1_0.3.0
Details of your cluster setup (Spark version, Standalone/Yarn/Local/Etc)
Spark v 2.1.1, Yarn (HDP 2.6)
Scala version on your cluster
Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
This configuration works, but note that I had to change the filter.
spark-bench = {
  spark-submit-config = [{
    spark-home = "/usr/hdp/current/spark2-client"
    spark-args = {
      master = "yarn"
      executor-memory = "24G"
      num-executors = 140
    }
    conf = {
      "spark.dynamicAllocation.enabled" = "false"
    }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generate a dataset, then take that same dataset and write it out to Parquet format"
        benchmark-output = "hdfs:///tmp/s2b/s2bresults-data-gen.csv"
        save-mode = "overwrite"
        // We need to generate the dataset first through the data generator,
        // then we take that dataset and convert it to Parquet.
        parallel = false
        workloads = [
          {
            name = "data-generation-kmeans"
            rows = 10000000
            cols = 24
            output = "hdfs:///tmp/s2b/kmeans-data.csv"
          },
          {
            name = "sql"
            query = "select * from input"
            input = "hdfs:///tmp/s2b/kmeans-data.csv"
            output = "hdfs:///tmp/s2b/kmeans-data.parquet"
          }
        ]
      },
      {
        descr = "Run two different SQL queries over the dataset in two different formats"
        benchmark-output = "hdfs:///tmp/s2b/s2bresults-sql.csv"
        save-mode = "overwrite"
        parallel = false
        repeat = 3
        workloads = [
          {
            name = "sql"
            input = ["hdfs:///tmp/s2b/kmeans-data.csv", "hdfs:///tmp/s2b/kmeans-data.parquet"]
            query = ["select * from input", "select `c0` from input where `c0` < -0.9"]
            cache = false
          }
        ]
      }
    ]
  }]
}
Relevant stacktrace
Without the change, the error below is thrown using your published config:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`0`' given input columns: [c17, c6, c9, c11, c3, c2, c18, c15, c22, c0, c16, c19, c13, c20, c12, c21, c14, c7, c23, c10, c1, c5, c4, c8]; line 1 pos 34;
'Project ['0, '22]
Relation[c0#2151,c1#2152,c2#2153,c3#2154,c4#2155,c5#2156,c6#2157,c7#2158,c8#2159,c9#2160,c10#2161,c11#2162,c12#2163,c13#2164,c14#2165,c15#2166,c16#2167,c17#2168,c18#2169,c19#2170,c20#2171,c21#2172,c22#2173,c23#2174] csv
Description of your problem and any other relevant info
The problem is with the provided example config file: its filter query does not resolve against the columns of the generated dataset. Modifying the filter resolves the issue on HDP 2.6 platforms running Spark 2.1.1. I thought this feedback would be useful to the team.
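For anyone hitting the same error, a quick spark-shell check along these lines (illustrative only, not part of spark-bench; it assumes the kmeans generator wrote a header row, which the c0 ... c23 names in the stacktrace suggest it did) confirms what the generated dataset's columns are called before you write the SQL workload's query:

// Read the generated dataset the way the sql workload would see it.
val df = spark.read
  .option("header", "true")        // assumption: the generator wrote a header row (c0 ... c23)
  .option("inferSchema", "true")
  .csv("hdfs:///tmp/s2b/kmeans-data.csv")

df.printSchema()                   // should list c0 through c23
df.createOrReplaceTempView("input")

// The corrected filter from the config above resolves;
// referencing the column as `0` instead of `c0` reproduces the AnalysisException.
spark.sql("select `c0` from input where `c0` < -0.9").show(5)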