OHDSI / CommonDataModel

Definition and DDLs for the OMOP Common Data Model (CDM)
https://ohdsi.github.io/CommonDataModel
876 stars 448 forks source link

for MSSQL why do you use nonclustered primary keys and separate clustered indicies? (version CDM 5.4) #495

Open MPagel opened 2 years ago

MPagel commented 2 years ago

Why doesn't the DDL for Microsoft SQL Server use a clustered primary key? Why does it instead use an index that is essentially a duplicate of the primary key, thereby using up more resources (storage and processing time)?

From Microsoft's documentation at https://docs.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described?view=sql-server-ver15 they mention that a primary key and other constraints are automatically indexed.

Indexes and constraints

Indexes are automatically created when PRIMARY KEY and UNIQUE constraints are defined on table columns. For example, when you create a table with a UNIQUE constraint, Database Engine automatically creates a nonclustered index. If you configure a PRIMARY KEY, Database Engine automatically creates a clustered index, unless a clustered index already exists. When you try to enforce a PRIMARY KEY constraint on an existing table and a clustered index already exists on that table, SQL Server enforces the primary key using a nonclustered index.

For more information, see Create primary keys and Create unique constraints.

Perhaps I am missing a use-case where we may wish to change which fields the PK is issued on or temporarily disable uniqueness checks.

for example with the Person table there is the following DDL

CREATE TABLE @cdmDatabaseSchema.PERSON (
            person_id integer NOT NULL,
            gender_concept_id integer NOT NULL,
            year_of_birth integer NOT NULL,
            month_of_birth integer NULL,
            day_of_birth integer NULL,
            birth_datetime datetime NULL,
            race_concept_id integer NOT NULL,
            ethnicity_concept_id integer NOT NULL,
            location_id integer NULL,
            provider_id integer NULL,
            care_site_id integer NULL,
            person_source_value varchar(50) NULL,
            gender_source_value varchar(50) NULL,
            gender_source_concept_id integer NULL,
            race_source_value varchar(50) NULL,
            race_source_concept_id integer NULL,
            ethnicity_source_value varchar(50)
            ethnicity_source_concept_id integer NULL );

ALTER TABLE @cdmDatabaseSchema.PERSON ADD CONSTRAINT xpk_PERSON PRIMARY KEY NONCLUSTERED  (person_id);

CREATE CLUSTERED INDEX idx_person_id ON @cdmDatabaseSchema.person (person_id ASC);
CREATE INDEX idx_gender ON @cdmDatabaseSchema.person (gender_concept_id ASC);

this could be replaced with the CREATE TABLE plus

ALTER TABLE @cdmDatabaseSchema.PERSON ADD CONSTRAINT xpk_PERSON PRIMARY KEY CLUSTERED  (person_id);
-- fill table with data here (before secondary index creation)
CREATE INDEX idx_gender ON @cdmDatabaseSchema.person (gender_concept_id ASC);

Note that idx_person_id is thereby not defined

This can be consolidated into fewer DDL steps (but no net resource impact otherwise) with

CREATE TABLE @cdmDatabaseSchema.PERSON (
            person_id integer NOT NULL PRIMARY KEY CLUSTERED, --pk name is not defined with this method
            ...
            ethnicity_source_concept_id integer NULL );
-- fill table with data here (before secondary index creation)
CREATE INDEX idx_gender ON @cdmDatabaseSchema.person (gender_concept_id ASC);

or

CREATE TABLE @cdmDatabaseSchema.PERSON (
            person_id integer NOT NULL,
            ...
            ethnicity_source_concept_id integer NULL,
            CONSTRAINT xpk_PERSON PRIMARY KEY CLUSTERED (person_id );
-- fill table with data here (before secondary index creation)          
CREATE INDEX idx_gender ON @cdmDatabaseSchema.person (gender_concept_id ASC);

related: https://github.com/OHDSI/CommonDataModel/issues/406

clairblacketer commented 2 years ago

Hi @MPagel thank you for the question. We are platform agnostic so we have to think about how our scripts will be translated to the other dialects we support. Your question about nonclustered vs clustered is a good one and essentially stems from how we had to write the DDLs, keys and indices so they could be put through sqlRender and applied to other databases. Do you know how or if SqlRender would work with your suggestions? https://github.com/ohdsi/SqlRender

MPagel commented 2 years ago

I may start a series of pull requests later, but for now I will just add more code in these comments.

It looks like SQLRender starts solely with MS SQL and then converts that to other SQL engine languages. Given that, I understand the MS SQL to be your base language, and you are in turn using SQL Render to convert it to the following: bigquery, impala, netezza, oracle, pdw, postgresql, redshift, and spark

I believe the solution lies in modifying createDDL.R,and OMOP_CDM_indices_vX.Y.sql along with SqlRender:inst/csv/replacementPaterns.csv. It is possible that some changes will need to be made to writeDDL.R, but I believe that is handled by manual modifications to the OMOP_CDM_indices_vX.Y.sql

Change createDDL.R L#113

to read sql_result <- c(sql_result, paste0("\nALTER TABLE @cdmDatabaseSchema.", subquery$cdmTableName, " ADD CONSTRAINT xpk_", subquery$cdmTableName, " PRIMARY KEY CLUSTERED (", subquery$cdmFieldName , ");\n"))

Change OMOP_CDM_indices_v5.4.sql

and manually delete all lines that begin CREATE CLUSTERED INDEX (I couldn't find where in the pipeline where/if the CDM_indices SQL file is being generated to make one "magic bullet" change).

In SqlRender:inst/csv/replacementPaterns.csv add a line for each target DB engine &DBname,ALTER TABLE @table ADD CONSTRAINT @pkname PRIMARY KEY CLUSTERED (@pkcol),&substitutionTxt

Where DBname and substitutionTxt could be

oracle,ALTER TABLE @table ADD CONSTRAINT @pkname PRIMARY KEY(@pkcol)
bigquery,--PK on @pkcol cannot be added. bigQuery does not support PKs or unique constraints
impala,--PK on @pkcol cannot be added. impala requires PKs to be the first column of a CREATE TABLE expression. see https://stackoverflow.com/questions/56475209/alter-table-in-impala-make-a-column-a-primary-key
netezza,ALTER TABLE @table ADD CONSTRAINT @pkname PRIMARY KEY(@pkcol) INITIALLY IMMEDIATE
pdw,ALTER TABLE @table ADD CONSTRAINT @pkname PRIMARY KEY NONCLUSTERED (@pkcol) NOT ENFORCED -- no clustered primary keys or enforced unique constraints are allowed https://stackoverflow.com/questions/49941101/how-to-set-any-column-as-primary-key-in-azure-sql-data-warehouse
postgresql,ALTER TABLE @table ADD PRIMARY KEY (@pkcol) -- user-named PKs may not be possible?
redshift,ALTER TABLE @table ADD PRIMARY KEY (@pkcol) -- user-named PKs may not be possible?
spark,ALTER TABLE @table ADD CONSTRAINT @pkname PRIMARY KEY (@pkcol) -- clustering has to be in table creation?
clairblacketer commented 1 year ago

Thanks for this @MPagel. Would you be willing to make those changes in SqlRender?