Closed by arnejduranovic 1 month ago
I chose an arbitrary report_id where next_action = 'none' to generate explain plans.
In staging:
Nested Loop  (cost=752327.09..1732214.23 rows=411835 width=685)
  CTE lineage
    ->  Recursive Union  (cost=0.43..655548.21 rows=758026 width=48)
          ->  Index Scan using idx_report_lineage_child_report_id on report_lineage  (cost=0.43..98.06 rows=86 width=48)
                Index Cond: (child_report_id = 'cbc66ac1-c04b-43f9-9dd2-4a5453dc017e'::uuid)
          ->  Subquery Scan on "*SELECT* 2"  (cost=0.43..64786.90 rows=75794 width=48)
                ->  Nested Loop  (cost=0.43..64028.96 rows=75794 width=48)
                      ->  WorkTable Scan on lineage lineage_1  (cost=0.00..17.20 rows=860 width=48)
                      ->  Index Scan using idx_report_lineage_child_report_id on report_lineage report_lineage_1  (cost=0.43..72.89 rows=88 width=32)
                            Index Cond: (child_report_id = lineage_1.parent_report_id)
  ->  Merge Join  (cost=96778.45..650004.89 rows=758026 width=685)
        Merge Cond: (report_file.report_id = lineage.parent_report_id)
        ->  Index Scan using report_file_pkey on report_file  (cost=0.43..521923.80 rows=7215047 width=685)
        ->  Materialize  (cost=96778.01..100568.14 rows=758026 width=16)
              ->  Sort  (cost=96778.01..98673.08 rows=758026 width=16)
                    Sort Key: lineage.parent_report_id
                    ->  CTE Scan on lineage  (cost=0.00..15160.52 rows=758026 width=16)
  ->  Index Scan using action_pkey on action  (cost=0.43..0.56 rows=1 width=8)
        Index Cond: (action_id = report_file.action_id)
        Filter: (action_name = 'receive'::task_action)
In prod:
Nested Loop  (cost=608504.60..2028142.68 rows=344378 width=755)
  CTE lineage
    ->  Recursive Union  (cost=0.56..608503.60 rows=523451 width=48)
          ->  Index Scan using idx_report_lineage_child_report_id on report_lineage  (cost=0.56..81.36 rows=71 width=48)
                Index Cond: (child_report_id = 'ddd7e591-9866-4337-8585-65b8e919eb4b'::uuid)
          ->  Subquery Scan on "*SELECT* 2"  (cost=0.56..60318.70 rows=52338 width=48)
                ->  Nested Loop  (cost=0.56..59795.32 rows=52338 width=48)
                      ->  WorkTable Scan on lineage lineage_1  (cost=0.00..14.20 rows=710 width=48)
                      ->  Index Scan using idx_report_lineage_child_report_id on report_lineage report_lineage_1  (cost=0.56..82.91 rows=74 width=32)
                            Index Cond: (child_report_id = lineage_1.parent_report_id)
  ->  Nested Loop  (cost=0.56..1136833.93 rows=523451 width=755)
        ->  CTE Scan on lineage  (cost=0.00..10469.02 rows=523451 width=16)
        ->  Index Scan using report_file_pkey on report_file  (cost=0.56..2.15 rows=1 width=755)
              Index Cond: (report_id = lineage.parent_report_id)
  ->  Index Scan using action_pkey on action  (cost=0.44..0.54 rows=1 width=8)
        Index Cond: (action_id = report_file.action_id)
        Filter: (action_name = 'receive'::task_action)
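For reference, both plans are consistent with a recursive-CTE query of roughly this shape. This is a sketch reconstructed from the plan nodes only; the actual query text and column list in the codebase may differ:

```sql
-- Sketch reconstructed from the EXPLAIN output, not the actual source query.
-- Walks the report lineage upward from a given report, then fetches the
-- originating ('receive') report_file rows for every ancestor.
WITH RECURSIVE lineage AS (
    SELECT parent_report_id, child_report_id
    FROM report_lineage
    WHERE child_report_id = 'cbc66ac1-c04b-43f9-9dd2-4a5453dc017e'::uuid
    UNION ALL
    SELECT rl.parent_report_id, rl.child_report_id
    FROM report_lineage rl
    JOIN lineage l ON rl.child_report_id = l.parent_report_id
)
SELECT rf.*
FROM lineage
JOIN report_file rf ON rf.report_id = lineage.parent_report_id
JOIN action a ON a.action_id = rf.action_id
WHERE a.action_name = 'receive';
```

The key difference between the two plans is the join against report_file: prod probes report_file_pkey once per lineage row (a nested loop), while staging sorts the entire CTE output and merge-joins it against a full index scan of all ~7.2M report_file rows.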
VACUUM and ANALYZE for the tables involved in the expensive merge join on staging (report_file, report_lineage) appear to be up to date. However, there are more indexes on staging than on prod:
Extra indexes on staging report_file:
idx_report_file_created_week
CREATE INDEX idx_report_file_created_week ON public.report_file USING btree (date_trunc('week'::text, timezone('America/New_York'::text, created_at)))
report_file_receiving_org_action_id_index
CREATE INDEX report_file_receiving_org_action_id_index ON public.report_file USING btree (receiving_org, action_id)
Dropped the indexes but the explain plan did not change significantly.
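For the record, the drops would have been of the form below. The CONCURRENTLY option is my suggestion here (it avoids blocking writes on report_file), not necessarily how they were actually dropped:

```sql
-- Drop the staging-only indexes without taking an exclusive lock on report_file.
-- Note: DROP INDEX CONCURRENTLY cannot run inside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS idx_report_file_created_week;
DROP INDEX CONCURRENTLY IF EXISTS report_file_receiving_org_action_id_index;
```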
Spot-checking transactions in Azure App Insights suggests the performance difference is now resolved. I've confirmed that the dropped indexes are not created by the Flyway migrations, so I will leave them dropped; if they turn out to be needed, their definitions are documented in the previous comment. The explain plan not changing significantly may be because the parameters I chose don't reflect a realistic use of the query.
I reached out to devops to ask when the report_file and report_lineage tables are configured to be vacuumed/analyzed. This is currently done manually rather than via autovacuum:
Our scheduling is still in the works. It was supposed to go live last weekend, but citgo was performing maintenance, so it's now scheduled this weekend for a manual effort, then enabling of the workflow. We only need to do this because Azure is not running these actions correctly, but this will be resolved when we migrate to the new psql
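Until that scheduled workflow is live, the manual maintenance presumably amounts to something like the following (a sketch; the table list is assumed from this issue, and the actual maintenance job may differ):

```sql
-- Reclaim dead tuples and refresh planner statistics for the tables
-- appearing in the expensive plan nodes above.
VACUUM (ANALYZE, VERBOSE) report_file;
VACUUM (ANALYZE, VERBOSE) report_lineage;
```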
Closing this issue for now per @arnejduranovic as we reviewed several recent transactions together to check if this was still occurring.
User Story
As a sender of reports to ReportStream, I want my reports to make it through the pipeline in a timely fashion, so I can easily meet my reporting requirements.
Description/Use Case
It has been observed that on the STAGING environment the following query is taking ~15 seconds to complete:
Context and timing of the query in the DestinationFilter step:
The goal of this ticket is to figure out why this is happening and to implement or propose a solution.
Risks/Impacts/Considerations
Dev Notes
Starting point/thoughts:
This query seems to run in milliseconds on the Production environment, so a reasonable first step is to compare when the tables were last analyzed in each environment:
SELECT schemaname, relname, last_autoanalyze, last_analyze FROM pg_stat_all_tables WHERE relname = 'accounts';
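The same check, adapted to the tables actually involved in this query (run it in both environments and compare the timestamps):

```sql
-- Stale statistics on any of these tables could explain a bad plan on staging.
SELECT schemaname, relname,
       last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM pg_stat_all_tables
WHERE relname IN ('report_file', 'report_lineage', 'action');
```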
Acceptance Criteria