freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Create a scraper status page #3950

Open grossir opened 7 months ago

grossir commented 7 months ago

Let's track our ideas and requirements for a "scraper status page" here. Scrapers sometimes fail (for many reasons), and such a page would help us notice errors that don't surface in Sentry.

Background

In #3949 we deleted cl/scrapers/management/commands/cl_report_scrape_status.py, which depended on the ErrorLog model and created a useful HTML report that was sent via email and to a Slack channel.

However, it seems nobody was really monitoring that report, and both its content and its accessibility could be improved.

Ideas

...

grossir commented 1 month ago

Running this query and displaying it somewhere is the most basic scraper status page:

SELECT sd.court_id, max(so.date_created) AS latest_creation
FROM (
    SELECT * FROM search_docket
    WHERE court_id IN (SELECT id FROM search_court WHERE has_opinion_scraper AND in_use)
) sd
INNER JOIN search_opinioncluster soc ON soc.docket_id = sd.id
INNER JOIN search_opinion so ON so.cluster_id = soc.id
GROUP BY sd.court_id
ORDER BY 2 DESC;

I am using it to find silently failing scrapers.

It could be stored as a materialized view, refreshed every couple of days, and it could be displayed in the admin using an unmanaged model (managed = False), to avoid having to write a custom page.
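A minimal sketch of what that could look like (the view name, index name, and refresh cadence below are placeholders, not a settled design):

```sql
-- Hypothetical materialized view storing the per-court latest opinion date.
CREATE MATERIALIZED VIEW scraper_status AS
SELECT
    sd.court_id,
    max(so.date_created) AS latest_creation
FROM search_docket sd
INNER JOIN search_court sc ON sc.id = sd.court_id
INNER JOIN search_opinioncluster soc ON soc.docket_id = sd.id
INNER JOIN search_opinion so ON so.cluster_id = soc.id
WHERE sc.has_opinion_scraper AND sc.in_use
GROUP BY sd.court_id;

-- A unique index allows CONCURRENTLY refreshes, which don't lock out readers.
CREATE UNIQUE INDEX scraper_status_court_id_idx ON scraper_status (court_id);

-- Run this on a schedule (cron, a management command, etc.):
REFRESH MATERIALIZED VIEW CONCURRENTLY scraper_status;
```

Django can then read the view like any other table, e.g. via a model whose db_table points at the view.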

mlissner commented 1 month ago

We've never used materialized views and I haven't dug into them in a long time. What would we get out of that?

grossir commented 1 month ago

A materialized view basically stores the results of a query as a named "table".

So it's just a way of naming and storing a commonly used query, usually for analytic purposes.

Then we could create an admin page for it, so the results are accessible without DB access or rerunning the query in a console. (We could probably do that for a plain query too, but then the DB would be hit each time the admin page loads.)

mlissner commented 1 month ago

I think I'd suggest doing it the simple way first, and thinking about solutions like materialized views only if we need the performance they'd give. We could do it that way, right?

grossir commented 2 weeks ago

While testing an implementation, I ran into a problem: the query takes 3 minutes to execute, which would cause a timeout when loading the admin page. That would be another reason to use a materialized view. I tried some variations of the query for speed (a. taking the most recently modified docket for each court; b. taking the most recently created docket for each court), but they raise false positives, which would kill the usefulness of a status page.
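For illustration, variation a. would look roughly like this (a sketch; the date_modified column is assumed from the models' standard timestamp fields):

```sql
-- Sketch of variation a.: latest docket modification per court. Much
-- cheaper, since it never joins opinions, but a docket can be modified
-- without any new opinion being scraped, hence the false positives.
SELECT sd.court_id, max(sd.date_modified) AS latest_modification
FROM search_docket sd
INNER JOIN search_court sc ON sc.id = sd.court_id
WHERE sc.has_opinion_scraper AND sc.in_use
GROUP BY sd.court_id
ORDER BY 2 DESC;
```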

I have posted a draft PR with more implementation details and screenshots of how the admin page would look.

EXPLAIN ANALYZE (
    SELECT
        sd.court_id,
        max(so.date_created) AS latest_creation_date,
        now() - max(so.date_created) AS time_since
    FROM 
        (
            SELECT id, court_id
            FROM search_docket 
            WHERE court_id IN (
                SELECT id 
                FROM search_court
                /* 
                    Only check courts with scrapers in use 
                */
                WHERE has_opinion_scraper AND in_use
            )
        ) sd 
    INNER JOIN 
        (SELECT id, docket_id FROM search_opinioncluster) soc ON soc.docket_id = sd.id
    INNER JOIN 
        search_opinion so ON so.cluster_id = soc.id
    GROUP BY 
        sd.court_id
    HAVING 
        /* 
            Only return results for courts with no updates in a week 
        */
        now() - max(so.date_created) > interval '7 days'
    ORDER BY 
        2 DESC
);

Returned:

                                                                                                        QUERY PLAN                                                                                                 

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=4543310.49..4543310.81 rows=130 width=29) (actual time=189197.212..189197.288 rows=60 loops=1)
   Sort Key: (max(so.date_created)) DESC
   Sort Method: quicksort  Memory: 29kB
   ->  Finalize GroupAggregate  (cost=4542986.37..4543305.92 rows=130 width=29) (actual time=188485.971..189197.238 rows=60 loops=1)
         Group Key: search_docket.court_id
         Filter: ((now() - max(so.date_created)) > '7 days'::interval)
         Rows Removed by Filter: 94
         ->  Gather Merge  (cost=4542986.37..4543294.52 rows=782 width=13) (actual time=188485.815..189197.020 rows=461 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Partial GroupAggregate  (cost=4541986.34..4542204.23 rows=391 width=13) (actual time=188428.338..189076.311 rows=154 loops=3)
                     Group Key: search_docket.court_id
                     ->  Sort  (cost=4541986.34..4542057.67 rows=28531 width=13) (actual time=188428.165..188759.618 rows=2890413 loops=3)
                           Sort Key: search_docket.court_id
                           Sort Method: external merge  Disk: 80200kB
                           Worker 0:  Sort Method: external merge  Disk: 81408kB
                           Worker 1:  Sort Method: external merge  Disk: 80408kB
                           ->  Nested Loop  (cost=3357026.57..4539875.01 rows=28531 width=13) (actual time=101245.747..186033.095 rows=2890413 loops=3)
                                 ->  Hash Join  (cost=3357026.13..4408960.21 rows=25625 width=9) (actual time=101244.992..109646.335 rows=2671912 loops=3)
                                       Hash Cond: (search_opinioncluster.docket_id = search_docket.id)
                                       ->  Parallel Seq Scan on search_opinioncluster  (cost=0.00..1037210.24 rows=3858024 width=8) (actual time=0.773..4967.403 rows=3264428 loops=3)
                                       ->  Hash  (cost=3351399.63..3351399.63 rows=450120 width=9) (actual time=101241.684..101241.686 rows=8376331 loops=3)
                                             Buckets: 4194304 (originally 524288)  Batches: 4 (originally 1)  Memory Usage: 122591kB
                                             ->  Nested Loop  (cost=0.57..3351399.63 rows=450120 width=9) (actual time=0.398..98754.097 rows=8376331 loops=3)
                                                   ->  Seq Scan on search_court  (cost=0.00..182.10 rows=18 width=12) (actual time=0.015..0.697 rows=154 loops=3)
                                                         Filter: (has_opinion_scraper AND in_use)
                                                         Rows Removed by Filter: 3199
                                                   ->  Index Only Scan using search_dock_court_i_a043ae_idx on search_docket  (cost=0.57..184445.55 rows=173320 width=9) (actual time=0.656..635.839 rows=54392 loops=462)
                                                         Index Cond: (court_id = (search_court.id)::text)
                                                         Heap Fetches: 12170486
                                 ->  Index Scan using search_opinion_a97b1c12 on search_opinion so  (cost=0.43..5.08 rows=3 width=12) (actual time=0.026..0.028 rows=1 loops=8015735)
                                       Index Cond: (cluster_id = search_opinioncluster.id)
 Planning Time: 0.891 ms
 Execution Time: 189208.672 ms
(34 rows)

For reference, the "full" equivalent query takes around 10 minutes

EXPLAIN ANALYZE (
SELECT
    court_id,
    max(so.date_created) AS latest_creation_date,
    now() - max(so.date_created) AS time_since
FROM 
    search_docket sd
INNER JOIN
    search_court sc ON sc.id = sd.court_id
INNER JOIN 
    search_opinioncluster soc ON soc.docket_id = sd.id
INNER JOIN 
    search_opinion so ON so.cluster_id = soc.id
WHERE
    /* 
        Only check courts with scrapers in use 
    */
    has_opinion_scraper AND in_use
GROUP BY 
    sd.court_id
HAVING 
    /* 
        Only return results for courts with no updates in a week 
    */
    now() - max(so.date_created) > interval '7 days'
ORDER BY 
    2 DESC);
                                                                                                          QUERY PLAN                                                                                               

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=4543310.49..4543310.81 rows=130 width=29) (actual time=580309.996..580310.072 rows=62 loops=1)
   Sort Key: (max(so.date_created)) DESC
   Sort Method: quicksort  Memory: 29kB
   ->  Finalize GroupAggregate  (cost=4542986.37..4543305.92 rows=130 width=29) (actual time=579626.856..580310.018 rows=62 loops=1)
         Group Key: sd.court_id
         Filter: ((now() - max(so.date_created)) > '7 days'::interval)
         Rows Removed by Filter: 92
         ->  Gather Merge  (cost=4542986.37..4543294.52 rows=782 width=13) (actual time=579626.554..580309.776 rows=461 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Partial GroupAggregate  (cost=4541986.34..4542204.23 rows=391 width=13) (actual time=579551.369..580195.880 rows=154 loops=3)
                     Group Key: sd.court_id
                     ->  Sort  (cost=4541986.34..4542057.67 rows=28531 width=13) (actual time=579551.193..579878.254 rows=2890447 loops=3)
                           Sort Key: sd.court_id
                           Sort Method: external merge  Disk: 80880kB
                           Worker 0:  Sort Method: external merge  Disk: 80056kB
                           Worker 1:  Sort Method: external merge  Disk: 81088kB
                           ->  Nested Loop  (cost=3357026.57..4539875.01 rows=28531 width=13) (actual time=333910.236..577075.587 rows=2890447 loops=3)
                                 ->  Hash Join  (cost=3357026.13..4408960.21 rows=25625 width=9) (actual time=333909.365..350828.814 rows=2671946 loops=3)
                                       Hash Cond: (soc.docket_id = sd.id)
                                       ->  Parallel Seq Scan on search_opinioncluster soc  (cost=0.00..1037210.24 rows=3858024 width=8) (actual time=0.025..13225.357 rows=3264463 loops=3)
                                       ->  Hash  (cost=3351399.63..3351399.63 rows=450120 width=9) (actual time=333901.262..333901.264 rows=8376503 loops=3)
                                             Buckets: 4194304 (originally 524288)  Batches: 4 (originally 1)  Memory Usage: 122593kB
                                             ->  Nested Loop  (cost=0.57..3351399.63 rows=450120 width=9) (actual time=0.049..330968.288 rows=8376503 loops=3)
                                                   ->  Seq Scan on search_court sc  (cost=0.00..182.10 rows=18 width=12) (actual time=0.012..0.713 rows=154 loops=3)
                                                         Filter: (has_opinion_scraper AND in_use)
                                                         Rows Removed by Filter: 3199
                                                   ->  Index Only Scan using search_dock_court_i_a043ae_idx on search_docket sd  (cost=0.57..184445.55 rows=173320 width=9) (actual time=0.717..2143.368 rows=54393 loops=462)
                                                         Index Cond: (court_id = (sc.id)::text)
                                                         Heap Fetches: 12241143
                                 ->  Index Scan using search_opinion_a97b1c12 on search_opinion so  (cost=0.43..5.08 rows=3 width=12) (actual time=0.079..0.084 rows=1 loops=8015838)
                                       Index Cond: (cluster_id = soc.id)
 Planning Time: 0.831 ms
 Execution Time: 580347.608 ms
(34 rows)

mlissner commented 2 weeks ago

Sorry, just seeing this. Have you analyzed for indexes that might help? It's not great to be running a three-minute query for an MV refresh either; it's the kind of thing you later regret as your DB slows down for a few minutes every night or whatever.
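One untested possibility, going by the plan above (the index name here is a placeholder): the inner loop scans search_opinion by cluster_id roughly 8M times and then visits the heap to read date_created. A covering index might let Postgres compute max(date_created) from the index alone:

```sql
-- Untested idea: include date_created so the per-cluster max() can be
-- served by an index-only scan instead of millions of heap lookups.
CREATE INDEX CONCURRENTLY search_opinion_cluster_created_idx
    ON search_opinion (cluster_id, date_created);
```

The ~12M "Heap Fetches" on the search_docket index-only scan also suggest a stale visibility map there, so a VACUUM on search_docket might help independently of any new index.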