CDCgov / prime-devops

Apache License 2.0
4 stars 2 forks source link

Run vacuum and analyze on tables that are insert only #89

Closed mkalish closed 1 year ago

mkalish commented 1 year ago

Issue

Many of the ReportStream tables are insert-only which means that they never have vacuum run on them and subsequently are never analyzed. This means the stats for tables are incorrect leading to non-optimal queries

Description

Run both vacuum and analyze against all of the report stream tables that have never been analyzed

select relname
from pg_stat_user_tables
where last_analyze is null;

In rough priority order:

report_lineage
item_lineage
covid_result_metadata
action
report_file
setting
flyway_schema_history
task
county_lat_long
email_schedule
jti_cache
lookup_table_version
lookup_table_row
action_log
temp_action
temp
temp_vt_resubmissions
elr_result_metadata
today_actions
sender_ips
october_actions
vw_livd_table
temp_patient_demographics
patient_gender
sender
state
JosiahSiegel commented 1 year ago

3 maintenance windows required to complete the following:

tables runtime minutes command window
✔️action 45 VACUUM VERBOSE ANALYZE "action"; 1
✔️report_file 41 VACUUM VERBOSE ANALYZE "report_file"; 1
✔️action_log 40 VACUUM VERBOSE ANALYZE "action_log"; 1
✔️item_lineage 120 VACUUM VERBOSE ANALYZE "item_lineage"; 2
✔️report_lineage 60 VACUUM VERBOSE ANALYZE "report_lineage"; 3
✔️covid_result_metadata 26 VACUUM VERBOSE ANALYZE "covid_result_metadata"; 3
✔️temp_patient_demographics 22 VACUUM VERBOSE ANALYZE "temp_patient_demographics"; 3
✔️temp_action 13 VACUUM VERBOSE ANALYZE "temp_action"; 3
✔️task 8 VACUUM VERBOSE ANALYZE "task"; 3
✔️sender_ips 1 VACUUM VERBOSE ANALYZE "sender_ips"; 3
✔️october_actions 1 VACUUM VERBOSE ANALYZE "october_actions"; 3
✔️setting 0 VACUUM VERBOSE ANALYZE "setting"; 3
✔️flyway_schema_history 0 VACUUM VERBOSE ANALYZE "flyway_schema_history"; 3
✔️county_lat_long 0 VACUUM VERBOSE ANALYZE "county_lat_long"; 3
✔️email_schedule 0 VACUUM VERBOSE ANALYZE "email_schedule"; 3
✔️jti_cache 0 VACUUM VERBOSE ANALYZE "jti_cache"; 3
✔️lookup_table_version 0 VACUUM VERBOSE ANALYZE "lookup_table_version"; 3
✔️lookup_table_row 0 VACUUM VERBOSE ANALYZE "lookup_table_row"; 3
✔️temp 0 VACUUM VERBOSE ANALYZE "temp"; 3
✔️temp_vt_resubmissions 0 VACUUM VERBOSE ANALYZE "temp_vt_resubmissions"; 3
✔️elr_result_metadata 0 VACUUM VERBOSE ANALYZE "elr_result_metadata"; 3
✔️today_actions 0 VACUUM VERBOSE ANALYZE "today_actions"; 3
✔️vw_livd_table 0 VACUUM VERBOSE ANALYZE "vw_livd_table"; 3
✔️patient_gender 0 VACUUM VERBOSE ANALYZE "patient_gender"; 3
✔️sender 0 VACUUM VERBOSE ANALYZE "sender"; 3
✔️state 0 VACUUM VERBOSE ANALYZE "state"; 3
377
mkalish commented 1 year ago

@JosiahSiegel As we chatted about earlier this week, I'd like to see if we could update the statistics threshold for some of the columns in report_lineage and item_lineage where the n_distinct count is way off. Here are the SQL commands I tested locally:

ALTER TABLE item_lineage  
    ALTER COLUMN child_report_id
    SET STATISTICS 1000;

ALTER TABLE item_lineage  
    ALTER COLUMN parent_report_id
    SET STATISTICS 1000;

ALTER TABLE report_lineage  
    ALTER COLUMN child_report_id
    SET STATISTICS 1000;

ALTER TABLE report_lineage  
    ALTER COLUMN parent_report_id
    SET STATISTICS 1000;

analyze verbose report_lineage;

analyze verbose item_lineage;

Currently, for the two tables the n_distinct values are:

tablename attname n_distinct
report_lineage parent_report_id 1216959.0
report_lineage child_report_id 84760.0
item_lineage parent_report_id 49342.0
item_lineage child_report_id 39858.0

query I used to get the numbers

select tablename, attname, n_distinct
from pg_stats
where (tablename = 'report_lineage'
or tablename = 'item_lineage')
and (attname = 'parent_report_id' or attname = 'child_report_id')

Maybe we could try this in a cloned DB to see the impact and I could a query that we could use to check performance too.

JosiahSiegel commented 1 year ago

@mkalish results after testing on clone db:

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

tablename | attname | n_distinct -- | -- | -- report_lineage | child_report_id | 757686 report_lineage | parent_report_id | -0.167848 item_lineage | parent_report_id | 384048 item_lineage | child_report_id | 276785

mkalish commented 1 year ago

@JosiahSiegel That looks a lot better, except is that second row right?

JosiahSiegel commented 1 year ago

@mkalish i'm not sure why it's returning that. select distinct of that column gives 44,094,945

mkalish commented 1 year ago

@JosiahSiegel Ah, I actually think this is because postgres switches to fractions at a certain distinct count. Negative implies that postgres thinks the number distinct values is likely to grow

https://www.postgresql.org/docs/current/view-pg-stats.html#:~:text=of%20column%27s%20entries-,n_distinct,-float4

JosiahSiegel commented 1 year ago

@mkalish i'm not sure how to do the math for that. total rows is 56,737,644, but sounds like the stats are working correctly there

JosiahSiegel commented 1 year ago

All requested maintenance complete!