databricks / spark-redshift

Redshift data source for Apache Spark
Apache License 2.0

Allow specification of PRIMARY KEY constraint #194

Open ssimeonov opened 8 years ago

ssimeonov commented 8 years ago

Primary key constraints, though not enforced by Redshift, are valuable to its query planner and can substantially improve query performance, especially for joins.

JoshRosen commented 8 years ago

If a primary key constraint can be expressed as a list of the columns comprising the PK, then I think we could use Spark SQL's column metadata APIs to mark columns as part of the primary key (similar to how we used column metadata to make string lengths configurable per column in #29).

I think that some databases let you use functions in the definition of indices or primary keys, such as a primary key defined over a suffix of a string column, but I imagine that's a less common use case and I don't think Redshift supports it. Therefore, a column metadata field called "primary_key" might be sufficient.
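
For illustration, a rough sketch of what the column-metadata approach could look like (the `primary_key` metadata key is only the proposal from this thread, not an implemented feature; the existing `maxlength` key from #29 works the same way):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.MetadataBuilder

val spark = SparkSession.builder().appName("pk-metadata-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "payload")

// Tag "id" with the proposed "primary_key" flag, the same way the existing
// "maxlength" metadata key configures per-column string lengths.
val pkMeta = new MetadataBuilder().putBoolean("primary_key", true).build()
val withPk = df.withColumn("id", $"id".as("id", pkMeta))

// On write, the connector could collect the flagged columns and append a
// `PRIMARY KEY (id)` clause to the CREATE TABLE statement it generates.
val pkColumns = withPk.schema.fields
  .filter(f => f.metadata.contains("primary_key") && f.metadata.getBoolean("primary_key"))
  .map(_.name)
```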

Alternatively, we could add a data source writer option that lets you specify the list of column names comprising the PK, which might be simpler than the column metadata approach.
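
A sketch of the writer-option variant, given a DataFrame `df` to save. The `url`, `dbtable`, and `tempdir` options are the connector's real ones; the `primarykeys` option is hypothetical and only illustrates the idea:

```scala
// Hypothetical "primarykeys" option: the writer would split the list on
// commas and emit `PRIMARY KEY (id, created_at)` in the generated DDL.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://examplecluster:5439/dev?user=user&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("primarykeys", "id, created_at")
  .mode("error")
  .save()
```

Either way, the change would only affect the CREATE TABLE statement the connector generates; Redshift itself treats the constraint as informational.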

ssimeonov commented 8 years ago

I prefer the simplicity of the writer option approach. Passing the constraint through to Redshift has concrete benefits today, whereas I'm not sure Spark SQL can exploit it in the near term. Therefore, IMO it makes sense to defer deciding how to express this in Spark SQL until we have more Spark-specific use cases.

JoshRosen commented 8 years ago

In that case, I think this feature should be pretty straightforward to implement. Here's how I'd do it: