RADar-AZDelta / Rabbit-in-a-Blender

An ETL pipeline to transform your EMP data to OMOP.
https://radar-azdelta.github.io/Rabbit-in-a-Blender/
GNU General Public License v3.0
11 stars 3 forks source link

SQL Server: Performance: Indices for work tables #86

Closed Ben-Goethuys closed 2 months ago

Ben-Goethuys commented 2 months ago

The "work"-zone tables corresponding to the main omop tables like for example [work].[vocabulary] don't have indices. Other "work"-zone tables do have indices.

Could indices on these tables help with the SQL performance?

Side note: Azure SQL could benefit from the compression of columnstore indices in batch loads

pjlammertyn commented 2 months ago

Ben, all the tables that have event columns, are first stored in the WORK zone. After all the other tables are done, then the second stage of the ETL, will fill in the event colums (based on the _swap tables in the work zone), while copying the date to the OMOP zone. This will always be a full table scan. So adding indexes won't improve performance.

A performance improvement can be, to split up the records that have a event record filled in. So the event column is not filled in, send that record directly to the OMOP zone, other wise send it to the WORK zone. This will minimize the copy around of data, but will break the atomic nature of the ETL. So there are trade offs.