databricks / spark-sql-perf

Apache License 2.0
Use CHAR/VARCHAR types in TPCDSTables #198

Open maropu opened 3 years ago

maropu commented 3 years ago

The TPC-DS schemas differ between TPCDSTables in spark-sql-perf and TPCDSBase in Spark (master/branch-3.1): the former uses string where the latter uses char/varchar. For example:

// spark
    "reason" ->
      """
        |`r_reason_sk` INT,
        |`r_reason_id` CHAR(16),
        |`r_reason_desc` CHAR(100)
      """.stripMargin,

// spark-sql-perf
    Table("reason",
      partitionColumns = Nil,
      'r_reason_sk               .int,
      'r_reason_id               .string,
      'r_reason_desc             .string),

To generate TPC-DS table data for Spark (master/branch-3.1), it would be nice to use CHAR/VARCHAR types in TPCDSTables.
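As a rough illustration of the gap, here is a minimal, self-contained sketch that models the two variants of the `reason` table and renders their column lists. All names in it (`ColType`, `CharCol`, `VarcharCol`, `TableDef`, `ReasonSchemas`) are hypothetical and made up for this example; this is not the spark-sql-perf DSL or the Spark API, and the column lengths are simply copied from the Spark snippet above.

```scala
// Hypothetical model for illustration only; not the spark-sql-perf or Spark API.
sealed trait ColType { def sql: String }
case object IntCol extends ColType { val sql = "INT" }
case object StringCol extends ColType { val sql = "STRING" }
final case class CharCol(length: Int) extends ColType { def sql: String = s"CHAR($length)" }
final case class VarcharCol(length: Int) extends ColType { def sql: String = s"VARCHAR($length)" }

final case class Column(name: String, tpe: ColType)

final case class TableDef(name: String, columns: Seq[Column]) {
  // Render the column list in the backquoted style of the Spark snippet.
  def ddl: String = columns.map(c => s"`${c.name}` ${c.tpe.sql}").mkString(",\n")
}

object ReasonSchemas {
  // Mirrors the current spark-sql-perf definition: plain string columns.
  val withStrings: TableDef = TableDef("reason", Seq(
    Column("r_reason_sk", IntCol),
    Column("r_reason_id", StringCol),
    Column("r_reason_desc", StringCol)))

  // Mirrors the Spark definition: fixed-length CHAR columns.
  val withChars: TableDef = TableDef("reason", Seq(
    Column("r_reason_sk", IntCol),
    Column("r_reason_id", CharCol(16)),
    Column("r_reason_desc", CharCol(100))))
}
```

Calling `ReasonSchemas.withChars.ddl` yields a column list with `CHAR(16)` and `CHAR(100)`, while `ReasonSchemas.withStrings.ddl` yields `STRING` for the same two columns, which is the mismatch this ticket is about.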

NOTE: This ticket comes from https://github.com/apache/spark/pull/31886

maropu commented 3 years ago

https://github.com/databricks/spark-sql-perf/pull/201

zhaner08 commented 3 years ago

Is there a specific reason this schema was created in the first place, rather than using the schema given in the TPC documentation? http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.1.0.pdf