databricks / spark-sql-perf

Cannot create tables in cluster mode - Unable to infer schema for Parquet #131

Closed Panos-Bletsos closed 6 years ago

Panos-Bletsos commented 6 years ago

When I try to set up the TPC-DS dataset on a cluster, I get an error saying Spark is unable to infer the Parquet schema. This happens only in cluster mode; in local mode the setup finishes successfully.

I have installed the tpcds-kit on all nodes under the same path, and the data location is the same on every node as well.

Specifically, I run:

./bin/spark-shell --jars /root/spark-sql-perf/target/scala-2.11/spark-sql-perf-assembly-0.5.0-SNAPSHOT.jar --master spark://master:7077

scala> import spark.sqlContext.implicits._

scala> import com.databricks.spark.sql.perf.tpcds.TPCDSTables

scala> val tables = new TPCDSTables(spark.sqlContext, "/tmp/tpcds-kit-src/tools", "1", false, false)

scala> tables.genData("/tmp/tpcds-data", "parquet", true, true, true, false, "", 100)
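
For readers unfamiliar with the positional arguments, the two calls above map to the named parameters documented in the project README (names as given in the README at the time; worth verifying against your checkout):

  new TPCDSTables(spark.sqlContext,
    dsdgenDir = "/tmp/tpcds-kit-src/tools", // directory containing the dsdgen binary, needed on every worker
    scaleFactor = "1",                      // dataset size in GB, passed as a string
    useDoubleForDecimal = false,            // true replaces DECIMAL columns with DOUBLE
    useStringForDate = false)               // true replaces DATE columns with STRING

  tables.genData(
    location = "/tmp/tpcds-data",           // where the generated data is written
    format = "parquet",
    overwrite = true,
    partitionTables = true,                 // create the partitioned fact tables
    clusterByPartitionColumns = true,       // shuffle so each partition is written as one file
    filterOutNullPartitionValues = false,
    tableFilter = "",                       // "" generates all tables
    numPartitions = 100)                    // parallelism of the generation job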

scala> sql("create database tpcds")

scala> tables.createExternalTables("/tmp/tpcds-data", "parquet", "tpcds", true, true)
Creating external table catalog_sales in database tpcds using data stored in /tmp/tpcds-data/catalog_sales.
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:182)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:182)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:181)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
  at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:77)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:121)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:142)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:139)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:120)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:121)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
  at org.apache.spark.sql.internal.CatalogImpl.createTable(CatalogImpl.scala:352)
  at org.apache.spark.sql.internal.CatalogImpl.createTable(CatalogImpl.scala:319)
  at org.apache.spark.sql.internal.CatalogImpl.createTable(CatalogImpl.scala:302)
  at org.apache.spark.sql.SQLContext.createExternalTable(SQLContext.scala:544)
  at com.databricks.spark.sql.perf.Tables$Table.createExternalTable(Tables.scala:242)
  at com.databricks.spark.sql.perf.Tables$$anonfun$createExternalTables$1.apply(Tables.scala:311)
  at com.databricks.spark.sql.perf.Tables$$anonfun$createExternalTables$1.apply(Tables.scala:309)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at com.databricks.spark.sql.perf.Tables.createExternalTables(Tables.scala:309)
  ... 50 elided
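
(The telling frame is DataSource.getOrInferFileFormatSchema: Spark 2.x raises this AnalysisException when it finds no Parquet files at the given location to read a schema from, which suggests the path is empty as seen from the driver.)
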
juliuszsompolski commented 6 years ago

Hi @Panos-Bletsos, I suppose /tmp/tpcds-data is not an HDFS / S3 / similar path that is accessible from the whole cluster? If you're generating data on a cluster, you must generate it to a filesystem and path that every node can reach.
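
The underlying cause: in cluster mode the data-generation tasks run on the executors, so each worker writes its shard of /tmp/tpcds-data to its own local disk; the driver then finds no Parquet files at that path, and schema inference fails. A minimal sketch of the fix, assuming an HDFS namenode reachable at hdfs://master:9000 (the authority is illustrative; any cluster-wide store such as S3 works too):

scala> val dataDir = "hdfs://master:9000/tmp/tpcds-data" // a path every node can read and write

scala> tables.genData(dataDir, "parquet", true, true, true, false, "", 100)

scala> sql("create database tpcds")

scala> tables.createExternalTables(dataDir, "parquet", "tpcds", true, true)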

Panos-Bletsos commented 6 years ago

Thanks a lot @juliuszsompolski! I used an HDFS directory and everything worked as expected.