grossir opened 7 months ago
Running this query and displaying it somewhere is the most basic scraper status page:
SELECT court_id, max(so.date_created) as latest_creation
FROM (SELECT * FROM search_docket WHERE court_id IN (SELECT id FROM search_court WHERE has_opinion_scraper and in_use)) sd
INNER JOIN search_opinioncluster soc ON soc.docket_id = sd.id
INNER JOIN search_opinion so ON so.cluster_id = soc.id
GROUP BY sd.court_id
ORDER BY 2 DESC
;
I am using it to find silently failing scrapers.
It could be stored as a materialized view, refreshed every couple of days, and it could be displayed in the admin using an abstract model, to avoid having to write a custom page.
We've never used materialized views and I haven't dug into them in a long time. What would we get out of that?
A materialized view basically stores the results of a query as a named "table", which you can update on demand with:
REFRESH MATERIALIZED VIEW {view name}
So it's just a way of naming / storing a commonly used query, usually for analytic purposes.
Then we could create an admin page for it, so we can check it without having DB access or having to rerun the query in a console. (We could do that for a plain query too, I think, but that would probably hit the DB each time something is done with the admin.)
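For illustration, here is a minimal sketch of what that could look like; the view name scraper_status is just a placeholder, and the body is simply the query from above wrapped in a view, not an actual implementation:

-- Hypothetical view name; the query is the one posted above
CREATE MATERIALIZED VIEW scraper_status AS
SELECT sd.court_id, max(so.date_created) AS latest_creation
FROM search_docket sd
INNER JOIN search_opinioncluster soc ON soc.docket_id = sd.id
INNER JOIN search_opinion so ON so.cluster_id = soc.id
WHERE sd.court_id IN (
    SELECT id FROM search_court WHERE has_opinion_scraper AND in_use
)
GROUP BY sd.court_id;

-- Rerun the stored query on whatever schedule we pick, e.g. every couple of days
REFRESH MATERIALIZED VIEW scraper_status;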
I think I'd suggest doing it the simple way first, and thinking about solutions like materialized views only if we need the performance they'd give. We could do it that way, right?
While testing an implementation, I ran into a problem: the query takes 3 minutes to execute, which would cause a timeout when loading the admin page. That would be another reason to use a materialized view. I tried some variations of the query for speed (a. taking the most recently modified docket for each court; b. taking the most recently created docket for each court), but they raise false positives, which would kill the usefulness of a status page.
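For context, variation (a) was essentially just this; the date_modified column name is an assumption on my part, not copied from the actual attempt:

-- Sketch of variation (a): latest docket modification per court, no opinion joins
SELECT court_id, max(date_modified) AS latest_modification
FROM search_docket
WHERE court_id IN (
    SELECT id FROM search_court WHERE has_opinion_scraper AND in_use
)
GROUP BY court_id;

It is much cheaper because it never touches the opinion tables, but dockets can presumably change for reasons unrelated to the opinion scraper, which is where the false positives come from.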
I have posted a draft PR with more implementation details and screenshots of how the admin page would look.
EXPLAIN ANALYZE (
SELECT
court_id,
max(so.date_created) as latest_creation_date,
now() - max(so.date_created) as time_since
FROM
(
SELECT id, court_id
FROM search_docket
WHERE court_id IN (
SELECT id
FROM search_court
/*
Only check courts with scrapers in use
*/
WHERE has_opinion_scraper AND in_use
)
) sd
INNER JOIN
(SELECT id, docket_id FROM search_opinioncluster) soc ON soc.docket_id = sd.id
INNER JOIN
search_opinion so ON so.cluster_id = soc.id
GROUP BY
sd.court_id
HAVING
/*
Only return results for courts with no updates in a week
*/
now() - max(so.date_created) > interval '7 days'
ORDER BY
2 DESC
);
Returned
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=4543310.49..4543310.81 rows=130 width=29) (actual time=189197.212..189197.288 rows=60 loops=1)
Sort Key: (max(so.date_created)) DESC
Sort Method: quicksort Memory: 29kB
-> Finalize GroupAggregate (cost=4542986.37..4543305.92 rows=130 width=29) (actual time=188485.971..189197.238 rows=60 loops=1)
Group Key: search_docket.court_id
Filter: ((now() - max(so.date_created)) > '7 days'::interval)
Rows Removed by Filter: 94
-> Gather Merge (cost=4542986.37..4543294.52 rows=782 width=13) (actual time=188485.815..189197.020 rows=461 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=4541986.34..4542204.23 rows=391 width=13) (actual time=188428.338..189076.311 rows=154 loops=3)
Group Key: search_docket.court_id
-> Sort (cost=4541986.34..4542057.67 rows=28531 width=13) (actual time=188428.165..188759.618 rows=2890413 loops=3)
Sort Key: search_docket.court_id
Sort Method: external merge Disk: 80200kB
Worker 0: Sort Method: external merge Disk: 81408kB
Worker 1: Sort Method: external merge Disk: 80408kB
-> Nested Loop (cost=3357026.57..4539875.01 rows=28531 width=13) (actual time=101245.747..186033.095 rows=2890413 loops=3)
-> Hash Join (cost=3357026.13..4408960.21 rows=25625 width=9) (actual time=101244.992..109646.335 rows=2671912 loops=3)
Hash Cond: (search_opinioncluster.docket_id = search_docket.id)
-> Parallel Seq Scan on search_opinioncluster (cost=0.00..1037210.24 rows=3858024 width=8) (actual time=0.773..4967.403 rows=3264428 loops=3)
-> Hash (cost=3351399.63..3351399.63 rows=450120 width=9) (actual time=101241.684..101241.686 rows=8376331 loops=3)
Buckets: 4194304 (originally 524288) Batches: 4 (originally 1) Memory Usage: 122591kB
-> Nested Loop (cost=0.57..3351399.63 rows=450120 width=9) (actual time=0.398..98754.097 rows=8376331 loops=3)
-> Seq Scan on search_court (cost=0.00..182.10 rows=18 width=12) (actual time=0.015..0.697 rows=154 loops=3)
Filter: (has_opinion_scraper AND in_use)
Rows Removed by Filter: 3199
-> Index Only Scan using search_dock_court_i_a043ae_idx on search_docket (cost=0.57..184445.55 rows=173320 width=9) (actual time=0.656..635.839 rows=54392 loops=462)
Index Cond: (court_id = (search_court.id)::text)
Heap Fetches: 12170486
-> Index Scan using search_opinion_a97b1c12 on search_opinion so (cost=0.43..5.08 rows=3 width=12) (actual time=0.026..0.028 rows=1 loops=8015735)
Index Cond: (cluster_id = search_opinioncluster.id)
Planning Time: 0.891 ms
Execution Time: 189208.672 ms
(34 rows)
For reference, the "full" equivalent query takes around 10 minutes:
explain analyze (SELECT
court_id,
max(so.date_created) as latest_creation_date,
now() - max(so.date_created) as time_since
FROM
search_docket sd
INNER JOIN
search_court sc ON sc.id = sd.court_id
INNER JOIN
search_opinioncluster soc ON soc.docket_id = sd.id
INNER JOIN
search_opinion so ON so.cluster_id = soc.id
WHERE
/*
Only check courts with scrapers in use
*/
has_opinion_scraper AND in_use
GROUP BY
sd.court_id
HAVING
/*
Only return results for courts with no updates in a week
*/
now() - max(so.date_created) > interval '7 days'
ORDER BY
2 DESC);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=4543310.49..4543310.81 rows=130 width=29) (actual time=580309.996..580310.072 rows=62 loops=1)
Sort Key: (max(so.date_created)) DESC
Sort Method: quicksort Memory: 29kB
-> Finalize GroupAggregate (cost=4542986.37..4543305.92 rows=130 width=29) (actual time=579626.856..580310.018 rows=62 loops=1)
Group Key: sd.court_id
Filter: ((now() - max(so.date_created)) > '7 days'::interval)
Rows Removed by Filter: 92
-> Gather Merge (cost=4542986.37..4543294.52 rows=782 width=13) (actual time=579626.554..580309.776 rows=461 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=4541986.34..4542204.23 rows=391 width=13) (actual time=579551.369..580195.880 rows=154 loops=3)
Group Key: sd.court_id
-> Sort (cost=4541986.34..4542057.67 rows=28531 width=13) (actual time=579551.193..579878.254 rows=2890447 loops=3)
Sort Key: sd.court_id
Sort Method: external merge Disk: 80880kB
Worker 0: Sort Method: external merge Disk: 80056kB
Worker 1: Sort Method: external merge Disk: 81088kB
-> Nested Loop (cost=3357026.57..4539875.01 rows=28531 width=13) (actual time=333910.236..577075.587 rows=2890447 loops=3)
-> Hash Join (cost=3357026.13..4408960.21 rows=25625 width=9) (actual time=333909.365..350828.814 rows=2671946 loops=3)
Hash Cond: (soc.docket_id = sd.id)
-> Parallel Seq Scan on search_opinioncluster soc (cost=0.00..1037210.24 rows=3858024 width=8) (actual time=0.025..13225.357 rows=3264463 loops=3)
-> Hash (cost=3351399.63..3351399.63 rows=450120 width=9) (actual time=333901.262..333901.264 rows=8376503 loops=3)
Buckets: 4194304 (originally 524288) Batches: 4 (originally 1) Memory Usage: 122593kB
-> Nested Loop (cost=0.57..3351399.63 rows=450120 width=9) (actual time=0.049..330968.288 rows=8376503 loops=3)
-> Seq Scan on search_court sc (cost=0.00..182.10 rows=18 width=12) (actual time=0.012..0.713 rows=154 loops=3)
Filter: (has_opinion_scraper AND in_use)
Rows Removed by Filter: 3199
-> Index Only Scan using search_dock_court_i_a043ae_idx on search_docket sd (cost=0.57..184445.55 rows=173320 width=9) (actual time=0.717..2143.368 rows=54393 loops=462)
Index Cond: (court_id = (sc.id)::text)
Heap Fetches: 12241143
-> Index Scan using search_opinion_a97b1c12 on search_opinion so (cost=0.43..5.08 rows=3 width=12) (actual time=0.079..0.084 rows=1 loops=8015838)
Index Cond: (cluster_id = soc.id)
Planning Time: 0.831 ms
Execution Time: 580347.608 ms
(34 rows)
Sorry, just seeing this. Have you analyzed for indexes that might help? It's not great to be running a three-minute query as a MV either — it's the kind of thing that you later regret as your DB slows down for a few minutes every night or whatever.
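For concreteness, the kind of thing I mean, purely as an untested example against the plan above: a composite index covering the aggregated date column, so the per-cluster max(date_created) could come from the index rather than the heap.

-- Hypothetical index; name and actual benefit are assumptions and would need testing
CREATE INDEX CONCURRENTLY search_opinion_cluster_id_date_created_idx
    ON search_opinion (cluster_id, date_created);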
Let's track here our ideas and requirements for a "scraper status page". Sometimes scrapers fail (for many reasons), and such a page would help us notice errors that won't surface on Sentry.
Background
In #3949 we deleted cl/scrapers/management/commands/cl_report_scrape_status.py, which depended on the ErrorLog model and created a useful HTML report that was sent via email and also to a Slack channel. However, it seems nobody was really monitoring that, and both the content and the access to the report could be improved.
Ideas
...