databricks / spark-sql-perf


Correct types of keys in schema #103

Closed juliuszsompolski closed 7 years ago

juliuszsompolski commented 7 years ago

Currently there is a mix of int and long types for the _sk columns, which causes unnecessary casts when joining. Moreover, at SF 10000 and above it causes overflows of e.g. the ss_ticket_number column, leading to incorrect query results.

The TPC-DS specification defines "Identifier" as:

Identifier means that the column shall be able to hold any key value generated for that column

so we could juggle int and long columns on a per-column basis depending on the scale factor.

On the other hand it defines "Integer" as:

Integer means that the column shall be able to exactly represent integer values (i.e., values in increments of 1) in the range of at least (-2^(n - 1)) to (2^(n - 1)-1), where n is 64.

So to be true to the specification, all other integer columns should be made long.

Instead, let's follow impala-tpcds-kit: define all keys as long, but keep sensible integers as int.

Also, change d_date and a few others to Date instead of String.
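A minimal sketch of the proposed typing, using plain Spark StructType for illustration rather than the actual Tables.scala column definitions (the column subset shown is illustrative):

```scala
import org.apache.spark.sql.types._

// Illustrative subset: surrogate keys and ticket/order numbers as long,
// small-range values kept as int, and d_date as a real date column.
val storeSales = StructType(Seq(
  StructField("ss_sold_date_sk", LongType),   // key columns: long regardless of SF
  StructField("ss_item_sk", LongType),
  StructField("ss_ticket_number", LongType),  // overflows a 32-bit int at large SF
  StructField("ss_quantity", IntegerType)     // "sensible" integer stays int
))

val dateDim = StructType(Seq(
  StructField("d_date_sk", LongType),
  StructField("d_date", DateType),            // Date instead of String
  StructField("d_year", IntegerType)
))
```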

sameeragarwal commented 7 years ago

LGTM

gregrahn commented 7 years ago

AFAIK the only columns that exceed INT32 precision are:

michal-databricks commented 7 years ago

This is what I previously wrote about the type issues with spark-sql-perf:

Some of the types defined in com.databricks.spark.sql.perf.tpcds.Tables do not match the specification. The differences are as follows:

column                 specification -> Tables    queries using column
d_date                 date -> string             22
i_rec_start/end_date   date -> string              0
s_rec_start/end_date   date -> string              0
web_gmt_offset         decimal(5,2) -> string      0
wr_*                   int -> long                11
sr_*                   int -> long                16

The last column above shows the number of queries that use the column.

Another issue is that while most of the tables use int to represent 'identifier' from the specification, int is not big enough for some keys: the overflow occurs in the ticket_number / order_number fields of the fact tables at SF > 3000. So either those individual keys or all of them should be declared long. While technically the specification requires the 'integer' datatype (which also maps to int in spark-sql-perf) to be at least 64-bit signed (so equivalent to long), the TPC-DS toolkit data generator uses 32-bit signed values for those columns. I have a PR with those changes ready, but since this code is also used to declare the tables, not just to generate the data, such a change would break compatibility with existing data sets.
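To make the overflow concrete, here is a minimal sketch (assuming a SparkSession named spark, e.g. in spark-shell; the sample value is made up) of how a generated value above Int.MaxValue behaves under an int schema versus a long schema:

```scala
import org.apache.spark.sql.types._
import spark.implicits._

// Hypothetical ticket number beyond Int.MaxValue, as can appear at large scale factors.
val raw = Seq("3000000000").toDS()

val asInt  = spark.read.schema(StructType(Seq(StructField("ss_ticket_number", IntegerType)))).csv(raw)
val asLong = spark.read.schema(StructType(Seq(StructField("ss_ticket_number", LongType)))).csv(raw)

asInt.show()   // value does not fit a 32-bit int; typically null / corrupt record (varies by Spark version)
asLong.show()  // 3000000000 parses correctly with a long column
```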

michal-databricks commented 7 years ago

You covered all differences besides web_gmt_offset. I would be hesitant to map all integers to longs because, in my experience, when Parquet-MR fails to apply dictionary encoding to a column it falls back to plain encoding, so after the change we may end up with significantly larger Parquet data sets.
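A rough way to gauge that size concern is to write the same key column as int and as long and compare the on-disk Parquet footprint. A minimal sketch, assuming a SparkSession named spark; the paths and row count are illustrative:

```scala
import org.apache.spark.sql.functions._

// Unique key-like values are the ones likely to fall out of dictionary encoding,
// so compare how much widening such a column to long actually costs on disk.
val ints  = spark.range(10000000L).select(col("id").cast("int").as("k"))
val longs = spark.range(10000000L).select(col("id").cast("long").as("k"))

ints.write.mode("overwrite").parquet("/tmp/k_int")
longs.write.mode("overwrite").parquet("/tmp/k_long")

// Compare directory sizes afterwards, e.g.:
//   hadoop fs -du -s -h /tmp/k_int /tmp/k_long
```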