databricks / spark-redshift

Redshift data source for Apache Spark
Apache License 2.0

Allow specification of PRIMARY KEY constraint #194

Open ssimeonov opened 8 years ago

ssimeonov commented 8 years ago

Primary key constraints, though not enforced by Redshift, are valuable to its query planner and can substantially improve query performance, especially for joins.

JoshRosen commented 8 years ago

If a primary key constraint can be expressed as a list of the columns comprising the PK, then I think we could use Spark SQL's column metadata APIs to mark columns as part of the primary key (similar to how we used column metadata to make string lengths configurable per column in #29).

I think that some databases let you use functions in the definition of indices or primary keys, such as a primary key defined over a suffix of a string column, but I imagine that's a less common use case and I don't think Redshift supports it. Therefore, a column metadata field called "primary_key" might be sufficient.
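
For illustration, a rough sketch of what the column-metadata approach could look like (the `primary_key` metadata key is only the proposal from this thread, not an implemented feature; the existing `maxlength` key from #29 works the same way):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.MetadataBuilder

val spark = SparkSession.builder().appName("pk-metadata-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "payload")

// Tag "id" with the proposed "primary_key" flag, the same way the existing
// "maxlength" metadata key configures per-column string lengths.
val pkMeta = new MetadataBuilder().putBoolean("primary_key", true).build()
val withPk = df.withColumn("id", $"id".as("id", pkMeta))

// On write, the connector could collect the flagged columns and append a
// `PRIMARY KEY (id)` clause to the CREATE TABLE statement it generates.
val pkColumns = withPk.schema.fields
  .filter(f => f.metadata.contains("primary_key") && f.metadata.getBoolean("primary_key"))
  .map(_.name)
```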

Alternatively, we could add a data source writer option that lets you specify the list of column names comprising the PK, which might be simpler than the column metadata approach.
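
A sketch of the writer-option variant, given a DataFrame `df` to save. The `url`, `dbtable`, and `tempdir` options are the connector's real ones; the `primarykeys` option is hypothetical and only illustrates the idea:

```scala
// Hypothetical "primarykeys" option: the writer would split the list on
// commas and emit `PRIMARY KEY (id, created_at)` in the generated DDL.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://examplecluster:5439/dev?user=user&password=pass")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("primarykeys", "id, created_at")
  .mode("error")
  .save()
```

Either way, the change would only affect the CREATE TABLE statement the connector generates; Redshift itself treats the constraint as informational.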

ssimeonov commented 8 years ago

I prefer the simplicity of the writer option approach. Passing the constraint through to Redshift has concrete benefits today, whereas I'm not sure Spark SQL can exploit it in the near term. Therefore, IMO it makes sense to defer deciding how to express this in Spark SQL until we have more Spark-specific use cases.

JoshRosen commented 8 years ago

In that case, I think this feature should be pretty straightforward to implement. Here's how I'd do it: