Open ssimeonov opened 8 years ago
If a primary key constraint can be expressed as a list of columns comprising the PK, then I think we might be able to use Spark SQL's column metadata APIs to let us mark columns as being part of the primary key (similar to how we used column metadata APIs to let string lengths be configured on a per-column basis in #29).
I think that some databases let you use functions in the definition of indices or primary keys, such as a primary key defined over a suffix of a string column, but I imagine that it's a less-common use-case and I don't think that Redshift supports it. Therefore, having a column metadata field called "primary_key" might be sufficient.
Alternatively, we might be able to add a datasource writer option which lets you specify a list of column names comprising the PK, which might be simpler than the column metadata approach.
I prefer the simplicity of the writer option approach. There are specific benefits to providing this to Redshift that I'm not sure exist in the near term with SparkSQL. Therefore, IMO it makes sense to delay deciding how to express this in SparkSQL until we have more Spark-specific use cases.
In that case, I think this feature should be pretty straightforward to implement. Here's how I'd do it:
Parameters
to add a new primaryKey
option, which is a comma-separated list of column names.
ColumnName
class to help with this, similar to TableName.primaryKey
which returns a Seq[String]
(or Seq[ColumnName]
if we want to be more type-safe). By default, this will return an empty Seq.RedshiftWriter
to use the new configuration option:
saveToRedshift
to make sure that all of the specified primary key columns match the DataFrame's column names.createTableSql
to add the PRIMARY KEY
clause when appropriate.RedshiftSourceSuite
to check that the expected CREATE TABLE
statement is emitted.RedshiftIntegrationSuite
to make sure that Redshift accepts the queries that we generate.
Primary key constraints, though not directly enforced during data loading, are extremely valuable to the query planner and hence can substantially improve query performance, especially during joins.