Investigate recursive report_lineage query speed

arnejduranovic commented 2 months ago

User Story

As a sender of reports to ReportStream, I want my reports to make it through the pipeline in a timely fashion, so I can easily meet my reporting requirements.

Description/Use Case

It has been observed that on the STAGING environment the following query is taking ~15 seconds to complete:

WITH recursive "lineage"("parent_report_id", "path") AS
(
       SELECT "public"."report_lineage"."parent_report_id",
              cast("public"."report_lineage"."child_report_id" AS varchar)
       FROM   "public"."report_lineage"
       WHERE  "public"."report_lineage"."child_report_id" IN (cast(? AS uuid))
       UNION ALL
       SELECT "public"."report_lineage"."parent_report_id",
              (lineage.path
                     || cast("public"."report_lineage"."parent_report_id" AS varchar))
       FROM   "public"."report_lineage"
       JOIN   "lineage"
       ON     "lineage"."parent_report_id" = "public"."report_lineage"."child_report_id")
SELECT "public"."report_file".*
FROM   "lineage"
JOIN   "public"."report_file"
ON     "public"."report_file"."report_id" = "lineage"."parent_report_id"
JOIN   "public"."action"
ON     "public"."action"."action_id" = "public"."report_file"."action_id"
WHERE  "public"."action"."action_name" = cast(? AS "public"."task_action")

Context and timing of the query in the DestinationFilter step:

Screenshot 2024-08-08 at 11 00 07 AM

The goal of this ticket is to try to figure out why this is happening and implement or propose a solution

Risks/Impacts/Considerations

Dev Notes

Starting point/thoughts:

This query seems to run in milliseconds on Production environment, so:

verify this is true
If it is, generate an execution plan on STAGING metabase and compare with execution plan on PRODUCTION metabase
If the execution plan is different and we think statistics are an issue, connect directly to DB using VPN and check if statistics are off: SELECT schemaname, relname, last_autoanalyze, last_analyze FROM pg_stat_all_tables WHERE relname = 'accounts';

Acceptance Criteria

[ ] determine root cause of slow recursive report_lineage query
[ ] Implement solution OR create ticket with solution suggestion

Andrey-Glazkv commented 2 months ago

Hey team! Please add your planning poker estimate with Zenhub @adegolier @arnejduranovic @brick-green @david-navapbc @jack-h-wang @jalbinson @JFisk42 @mkalish @thetaurean

jack-h-wang commented 2 months ago

I chose an arbitrary report_id where next_action = 'none' to generate explain plans.

In staging:

Nested Loop (cost=752327.09..1732214.23 rows=411835 width=685)
CTE lineage
-> Recursive Union (cost=0.43..655548.21 rows=758026 width=48)
-> Index Scan using idx_report_lineage_child_report_id on report_lineage (cost=0.43..98.06 rows=86 width=48)
Index Cond: (child_report_id = 'cbc66ac1-c04b-43f9-9dd2-4a5453dc017e'::uuid)
-> Subquery Scan on "*SELECT* 2" (cost=0.43..64786.90 rows=75794 width=48)
-> Nested Loop (cost=0.43..64028.96 rows=75794 width=48)
-> WorkTable Scan on lineage lineage_1 (cost=0.00..17.20 rows=860 width=48)
-> Index Scan using idx_report_lineage_child_report_id on report_lineage report_lineage_1 (cost=0.43..72.89 rows=88 width=32)
Index Cond: (child_report_id = lineage_1.parent_report_id)
-> Merge Join (cost=96778.45..650004.89 rows=758026 width=685)
Merge Cond: (report_file.report_id = lineage.parent_report_id)
-> Index Scan using report_file_pkey on report_file (cost=0.43..521923.80 rows=7215047 width=685)
-> Materialize (cost=96778.01..100568.14 rows=758026 width=16)
-> Sort (cost=96778.01..98673.08 rows=758026 width=16)
Sort Key: lineage.parent_report_id
-> CTE Scan on lineage (cost=0.00..15160.52 rows=758026 width=16)
-> Index Scan using action_pkey on action (cost=0.43..0.56 rows=1 width=8)
Index Cond: (action_id = report_file.action_id)
Filter: (action_name = 'receive'::task_action)

In prod:

Nested Loop (cost=608504.60..2028142.68 rows=344378 width=755)
CTE lineage
-> Recursive Union (cost=0.56..608503.60 rows=523451 width=48)
-> Index Scan using idx_report_lineage_child_report_id on report_lineage (cost=0.56..81.36 rows=71 width=48)
Index Cond: (child_report_id = 'ddd7e591-9866-4337-8585-65b8e919eb4b'::uuid)
-> Subquery Scan on "*SELECT* 2" (cost=0.56..60318.70 rows=52338 width=48)
-> Nested Loop (cost=0.56..59795.32 rows=52338 width=48)
-> WorkTable Scan on lineage lineage_1 (cost=0.00..14.20 rows=710 width=48)
-> Index Scan using idx_report_lineage_child_report_id on report_lineage report_lineage_1 (cost=0.56..82.91 rows=74 width=32)
Index Cond: (child_report_id = lineage_1.parent_report_id)
-> Nested Loop (cost=0.56..1136833.93 rows=523451 width=755)
-> CTE Scan on lineage (cost=0.00..10469.02 rows=523451 width=16)
-> Index Scan using report_file_pkey on report_file (cost=0.56..2.15 rows=1 width=755)
Index Cond: (report_id = lineage.parent_report_id)
-> Index Scan using action_pkey on action (cost=0.44..0.54 rows=1 width=8)
Index Cond: (action_id = report_file.action_id)
Filter: (action_name = 'receive'::task_action)

jack-h-wang commented 2 months ago

VACUUM and ANALYZE for the tables involved in the expensive merge join on staging (report_file, report_lineage) appears to be up to date. However, there are more indexes on staging than on prod:

Extra indexes on staging report_file:

idx_report_file_created_week
CREATE INDEX idx_report_file_created_week ON public.report_file USING btree (date_trunc('week'::text, timezone('America/New_York'::text, created_at)))

report_file_receiving_org_action_id_index
CREATE INDEX report_file_receiving_org_action_id_index ON public.report_file USING btree (receiving_org, action_id)

Dropped the indexes but the explain plan did not change significantly.

jack-h-wang commented 2 months ago

At this time it appears the performance difference is resolved via spot-checking transactions in Azure appinsights. I've confirmed the indexes dropped are currently not created via the flyway migrations and will be leaving them dropped; if they are needed, the definitions are documented in the previous comment. The explain plan not significantly changing may be due to the parameters I chose not reflecting a realistic use of the query.

I reached out to devops to ask about when the report_file and report_lineage are configured to vacuum/analyze. This is currently done manually rather than using autovacuum:

Our scheduling is still in the works. It was supposed to go live last weekend, but citgo was performing maintenance, so it's now scheduled this weekend for a manual effort, then enabling of the workflow. We only need to do this because Azure is not running these actions correctly, but this will be resolved when we migrate to the new psql

Closing this issue for now per @arnejduranovic as we reviewed several recent transactions together to check if this was still occurring.

CDCgov / prime-reportstream