Open MPagel opened 2 years ago
Hi @MPagel, thank you for the question. We are platform agnostic, so we have to think about how our scripts will be translated to the other dialects we support. Your question about nonclustered vs. clustered is a good one and essentially stems from how we had to write the DDLs, keys, and indices so they could be put through SqlRender and applied to other databases. Do you know how or if SqlRender would work with your suggestions? https://github.com/ohdsi/SqlRender
I may start a series of pull requests later, but for now I will just add more code in these comments.
It looks like SqlRender starts solely with MS SQL and then converts that to the other SQL engine dialects. Given that, I understand MS SQL to be your base language, which you in turn convert with SqlRender to the following: bigquery, impala, netezza, oracle, pdw, postgresql, redshift, and spark.
I believe the solution lies in modifying createDDL.R and OMOP_CDM_indices_vX.Y.sql, along with SqlRender:inst/csv/replacementPatterns.csv. It is possible that some changes will need to be made to writeDDL.R, but I believe that is handled by manual modifications to OMOP_CDM_indices_vX.Y.sql.
Change createDDL.R L#113
to read sql_result <- c(sql_result, paste0("\nALTER TABLE @cdmDatabaseSchema.", subquery$cdmTableName, " ADD CONSTRAINT xpk_", subquery$cdmTableName, " PRIMARY KEY CLUSTERED (", subquery$cdmFieldName , ");\n"))
Change OMOP_CDM_indices_v5.4.sql
and manually delete all lines that begin CREATE CLUSTERED INDEX
(I couldn't find where in the pipeline, if anywhere, the CDM_indices SQL file is generated, so I can't point to one "magic bullet" change.)
In SqlRender:inst/csv/replacementPatterns.csv, add a line for each target DB engine:
&DBname,ALTER TABLE @table ADD CONSTRAINT @pkname PRIMARY KEY CLUSTERED (@pkcol),&substitutionTxt
Where DBname and substitutionTxt could be
oracle,ALTER TABLE @table ADD CONSTRAINT @pkname PRIMARY KEY(@pkcol)
bigquery,--PK on @pkcol cannot be added. bigQuery does not support PKs or unique constraints
impala,--PK on @pkcol cannot be added. impala requires PKs to be the first column of a CREATE TABLE expression. see https://stackoverflow.com/questions/56475209/alter-table-in-impala-make-a-column-a-primary-key
netezza,ALTER TABLE @table ADD CONSTRAINT @pkname PRIMARY KEY(@pkcol) INITIALLY IMMEDIATE
pdw,ALTER TABLE @table ADD CONSTRAINT @pkname PRIMARY KEY NONCLUSTERED (@pkcol) NOT ENFORCED -- no clustered primary keys or enforced unique constraints are allowed https://stackoverflow.com/questions/49941101/how-to-set-any-column-as-primary-key-in-azure-sql-data-warehouse
postgresql,ALTER TABLE @table ADD PRIMARY KEY (@pkcol) -- user-named PKs may not be possible?
redshift,ALTER TABLE @table ADD PRIMARY KEY (@pkcol) -- user-named PKs may not be possible?
spark,ALTER TABLE @table ADD CONSTRAINT @pkname PRIMARY KEY (@pkcol) -- clustering has to be in table creation?
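To make the proposed patterns concrete, here is a sketch of what a few of them would yield for the person table's key. The names xpk_person and person_id are assumed from the v5.4 scripts, @cdmDatabaseSchema is the SqlRender parameter left unrendered, and the outputs follow the proposed CSV lines rather than anything SqlRender does today. The last statement is a variation on the postgresql line: PostgreSQL's ALTER TABLE does accept ADD CONSTRAINT name PRIMARY KEY, so the user-chosen name could be kept there.

```sql
-- Source statement (SQL Server dialect), as the modified createDDL.R would emit it:
ALTER TABLE @cdmDatabaseSchema.person ADD CONSTRAINT xpk_person PRIMARY KEY CLUSTERED (person_id);

-- oracle, per the proposed pattern (CLUSTERED keyword dropped):
ALTER TABLE @cdmDatabaseSchema.person ADD CONSTRAINT xpk_person PRIMARY KEY(person_id);

-- pdw, per the proposed pattern (nonclustered and not enforced):
ALTER TABLE @cdmDatabaseSchema.person ADD CONSTRAINT xpk_person PRIMARY KEY NONCLUSTERED (person_id) NOT ENFORCED;

-- postgresql, named-constraint variant (not the line proposed above):
ALTER TABLE @cdmDatabaseSchema.person ADD CONSTRAINT xpk_person PRIMARY KEY (person_id);
```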
Thanks for this @MPagel. Would you be willing to make those changes in SqlRender?
Why doesn't the DDL for Microsoft SQL Server use a clustered primary key? Why does it instead use an index that is essentially a duplicate of the primary key, thereby using up more resources (storage and processing time)?
From Microsoft's documentation at https://docs.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described?view=sql-server-ver15 they mention that a primary key and other constraints are automatically indexed.
Perhaps I am missing a use-case where we may wish to change which fields the PK is issued on or temporarily disable uniqueness checks.
For example, with the Person table there is the following DDL:
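A sketch of the pattern in question, assuming the v5.4 SQL Server scripts. Only a few person columns are shown, and @cdmDatabaseSchema is the SqlRender parameter that is filled in at render time:

```sql
-- Abbreviated table definition: the real person table has more columns.
CREATE TABLE @cdmDatabaseSchema.person (
    person_id integer NOT NULL,
    gender_concept_id integer NOT NULL,
    year_of_birth integer NOT NULL
);

-- The primary key is declared NONCLUSTERED, which builds a nonclustered unique index ...
ALTER TABLE @cdmDatabaseSchema.person ADD CONSTRAINT xpk_person PRIMARY KEY NONCLUSTERED (person_id);

-- ... and a separate clustered index is then created on the same column.
CREATE CLUSTERED INDEX idx_person_id ON @cdmDatabaseSchema.person (person_id ASC);
```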
This could be replaced with the CREATE TABLE plus:
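A sketch of the single statement that would follow the unchanged CREATE TABLE, assuming the same xpk_person constraint name and person_id key column as in the v5.4 scripts:

```sql
-- The primary key itself becomes the clustered index, so no second index is needed.
ALTER TABLE @cdmDatabaseSchema.person ADD CONSTRAINT xpk_person PRIMARY KEY CLUSTERED (person_id);
```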
Note that idx_person_id is thereby not defined
This can be consolidated into fewer DDL steps (but no net resource impact otherwise) with
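One consolidated form, sketched with an abbreviated column list and the assumed v5.4 names, declares the key as a table-level constraint inside CREATE TABLE:

```sql
-- Abbreviated table definition: the real person table has more columns.
CREATE TABLE @cdmDatabaseSchema.person (
    person_id integer NOT NULL,
    gender_concept_id integer NOT NULL,
    year_of_birth integer NOT NULL,
    -- Table-level constraint: named primary key that is also the clustered index.
    CONSTRAINT xpk_person PRIMARY KEY CLUSTERED (person_id)
);
```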
or
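A consolidated form that attaches the key directly to the column, again sketched with an abbreviated column list and the assumed v5.4 names:

```sql
-- Abbreviated table definition: the real person table has more columns.
CREATE TABLE @cdmDatabaseSchema.person (
    -- Column-level constraint: named primary key that is also the clustered index.
    person_id integer NOT NULL CONSTRAINT xpk_person PRIMARY KEY CLUSTERED,
    gender_concept_id integer NOT NULL,
    year_of_birth integer NOT NULL
);
```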
related: https://github.com/OHDSI/CommonDataModel/issues/406