HMIS / LSASampleCode

Longitudinal System Analysis (LSA) Sample Code and Documentation
22 stars 10 forks source link

Question about possible query optimization #1197

Closed bayboos closed 9 months ago

bayboos commented 9 months ago

Hello

Ever since we started testing this year's sample code, I've noticed that one particular SQL query - namely Step 3.4.4 - takes unusually long time to complete. Which was a bit surprising since it's a relatively simple one, updating just one value in the tlsa_Enrollment table. The issue seems to be caused by a way of incorporating a subquery, which does run for every enrollment record scanning the entire HealthAndDV table. The original query looks like this:

update n set n.DVStatus = (select min(case when dv.DomesticViolenceSurvivor = 1 and dv.CurrentlyFleeing = 1 then 1 when dv.DomesticViolenceSurvivor = 1 and dv.CurrentlyFleeing = 0 then 2 when dv.DomesticViolenceSurvivor = 1 then 3 when dv.DomesticViolenceSurvivor = 0 then 10 when dv.DomesticViolenceSurvivor in (8,9) then 98 else null end) from lsa_Report rpt inner join hmis_HealthAndDV dv on dv.EnrollmentID = n.EnrollmentID and dv.DateDeleted is null and dv.InformationDate <= rpt.ReportEnd and dv.InformationDate >= n.EntryDate and (dv.InformationDate <= n.ExitDate or n.ExitDate is null)) , n.Step = '3.4.4' from tlsa_Enrollment n;

Changing the order of actions just by shuffling parts of the query a bit seems to make the planner's job much easier and the execution more streamlined:

with source as ( select n.EnrollmentID, case when dv.DomesticViolenceSurvivor = 1 and dv.CurrentlyFleeing = 1 then 1 when dv.DomesticViolenceSurvivor = 1 and dv.CurrentlyFleeing = 0 then 2 when dv.DomesticViolenceSurvivor = 1 then 3 when dv.DomesticViolenceSurvivor = 0 then 10 when dv.DomesticViolenceSurvivor in (8,9) then 98 else null end as DVStatus_calculated from lsa_Report rpt inner join tlsa_Enrollment n on 1 = 1 inner join hmis_HealthAndDV dv on dv.EnrollmentID = n.EnrollmentID and dv.DateDeleted is null and dv.InformationDate <= rpt.ReportEnd and dv.InformationDate >= n.EntryDate and (dv.InformationDate <= n.ExitDate or n.ExitDate is null) ) update n1 set n1.DVStatus = (select min(s.DVStatus_calculated) from source s where s.EnrollmentID = n1.EnrollmentID) , n.Step = '3.4.4' from tlsa_Enrollment n1;

Several tests, both with sample data kit and live datasets, showed noticeable improvement, bringing this one query's execution time (the numbers come from sample data kit runs with monitoring enabled) from approximately 10 minutes to... 1 second.

Obviously there was no difference in the result of this change in terms of returned data, and that's exactly what I'd expect; however the subject is too important and sensitive to leave it to my expectations only. Therefore, a question: can anybody review the proposed change and find any flaw that can (even just potentially) return a value different from the original query? Any help will be highly appreciated.

Best regards

Mirosław Korwel

MollyMcEvilley commented 9 months ago

Hi there. I've looked at this in detail and I see no issue at all with your code.

I compared the execution plans for the original version, your version, and a couple of variations. Pretty much everything was better than the original. I'm updating the sample code with a statement that gives the same results and the lowest query cost of anything I looked at. Also adding an index to hmis_HealthAndDV in the Totally Optional HMIS Table Indexes file that SQL Server suggested.