freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

[EPIC] Crawl PACER Mobile pages to power alerts in courts with bad or missing RSS feeds #1279

Open mlissner opened 4 years ago

mlissner commented 4 years ago

The PACER mobile page has some useful data for sending alerts. In https://github.com/freelawproject/juriscraper/issues/315, we'll make a parser for it. In https://github.com/freelawproject/courtlistener/issues/1277, we'll add some table fields to track scraping it.

Once those are done, this issue is to mirror most of the work done in https://github.com/freelawproject/courtlistener/pull/1267, but for the Mobile UI rather than the iquery page. Some things to note:

The general alerting approach is:

  1. For courts with complete RSS feeds (Court.pacer_rss_feed_details == 'all'), we don't crawl the mobile pages. No need.

  2. We continue crawling RSS feeds as we do at present, sending alerts whenever there is new content.

  3. For courts with partial or absent RSS feeds, we use the mobile site as a supplement to the RSS feeds, with the goal being to send all alerts necessary, but as few duplicate alerts as possible.
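
As a rough sketch of the decision in the list above, something like the following could gate the mobile crawl. Only the 'all' value comes from this issue; the function name and the other example values ('partial', empty string) are assumptions for illustration.

```python
def should_crawl_mobile(pacer_rss_feed_details: str) -> bool:
    """Decide whether a court's mobile pages should supplement its RSS feed.

    Courts whose RSS feed is complete ('all') are skipped. Any other value,
    including an empty one, is treated here as partial or absent RSS
    coverage. That treatment is an assumption of this sketch.
    """
    return pacer_rss_feed_details != "all"


# Hypothetical values, used only to exercise the function.
for details in ("all", "partial", ""):
    print(repr(details), should_crawl_mobile(details))
```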

When we crawl the mobile sites we do the following:

  1. For each docket alert, crawl the mobile page according to the schedule below.

  2. When we crawl, we'll learn how many entries a docket has. We compare that to the count from the previous mobile crawl and to the count found during the last RSS crawl. If new items were found since the last RSS crawl, we send an alert (this indicates an item that was missing from RSS). (If this is confusing, see the message in the #recap channel from today around 14:47 PDT.)

  3. Make a note when the mobile crawl is done so that the RSS crawler can compare against it.
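
One possible reading of steps 2 and 3, as a minimal sketch. The class, field, and function names are hypothetical, and taking the larger of the two stored counts is an assumption meant to avoid sending a second alert for entries a previous mobile crawl already reported.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DocketCrawlState:
    # Hypothetical bookkeeping fields; the real table fields are tracked in #1277.
    entries_at_last_rss_crawl: int
    entries_at_last_mobile_crawl: int


def check_mobile_crawl(state: DocketCrawlState, entries_now: int):
    """Return (send_alert, new_state) after a mobile crawl.

    An alert is warranted only when the mobile page shows more entries than
    either crawler has seen so far, which suggests RSS missed something.
    """
    already_seen = max(
        state.entries_at_last_rss_crawl,
        state.entries_at_last_mobile_crawl,
    )
    send_alert = entries_now > already_seen
    # Step 3: record the mobile crawl so the RSS crawler can compare against it.
    new_state = replace(state, entries_at_last_mobile_crawl=entries_now)
    return send_alert, new_state
```

For example, with both stored counts at 10, a mobile crawl that finds 12 entries would alert and store 12; crawling again and finding 12 would not alert.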

Schedule:

In #1267 we set up a schedule like this for checking the iquery page:

                                   Num alerts
                             +---+---+---+-----+------+
                             | 0 | 1 | 2 | 3-9 | >=10 |
          │                  +---+---+---+-----+------+
          │         30 after | N | N | N |  N  |  Y   |
       W  │                  +---+---+---+-----+------+
       H  │           hourly | N | N | N |  Y  |  Y   |
       E  │                  +---+---+---+-----+------+
       N  │  6am, 6pm & noon | N | N | Y |  Y  |  Y   |
       ?  │                  +---+---+---+-----+------+
          │         Midnight | N | Y | Y |  Y  |  Y   |
          │                  +---+---+---+-----+------+
          │         Midnight | N | Y | Y |  Y  |  Y   | <-- Also old term. cases
                      Sunday +---+---+---+-----+------+
                               │
                               └─never

I think that's a pretty good schedule and perhaps we should go with it.
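
To make the table concrete, here is a small sketch that encodes it as a lookup from alert count to crawl slots. The slot labels are copied from the table; the names SCHEDULE and slots_for are made up for illustration, and the "old term. cases" note on the Sunday row is not modeled.

```python
# Each row of the table, with the smallest number of alerts that turns it on.
SCHEDULE = [
    ("30 after", 10),
    ("hourly", 3),
    ("6am, 6pm & noon", 2),
    ("Midnight", 1),
    ("Midnight Sunday", 1),
]


def slots_for(num_alerts: int) -> list[str]:
    """Return the crawl slots that apply to a docket with this many alerts.

    A docket with zero alerts matches no row, i.e. it is never crawled.
    """
    return [name for name, minimum in SCHEDULE if num_alerts >= minimum]


for n in (0, 1, 2, 5, 10):
    print(n, slots_for(n))
```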

mlissner commented 1 year ago

For those in our Slack group, I summarized the status of this EPIC as best I could, here.