freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Start saving PACER RSS feeds when they change #1311

Closed: mlissner closed this issue 3 years ago

mlissner commented 4 years ago

This could probably eat up a fair bit of storage space, but we should do it anyway. Whenever we detect that an RSS feed has changed, we should store the original RSS feed file somewhere on disk.
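Concretely, the change check could be as simple as hashing the fetched feed and comparing it with the hash of the last feed we saved. A minimal sketch (the helper name is hypothetical; the crawler may detect changes differently):

import hashlib

def feed_changed(new_feed, last_hash):
    """Return True if the freshly fetched feed differs from the last one.

    Sketch only: hash the raw feed bytes and compare against the hash
    recorded for the previous fetch from this court.
    """
    return hashlib.sha256(new_feed).hexdigest() != last_hash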

mlissner commented 4 years ago

We should zip these to save tons of space.
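For what it's worth, the standard library makes the zipping part cheap. A minimal sketch (file names are just for illustration):

import gzip
from pathlib import Path

# Sketch: compress the raw feed bytes with gzip before they hit disk.
# RSS XML is highly repetitive, so this typically saves most of the space.
raw = Path("feed.xml").read_bytes()
Path("feed.xml.gz").write_bytes(gzip.compress(raw))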

mlissner commented 4 years ago

Here's the model that I'm planning. At a high level, this is a table with four columns:

  1. The datetime the row was ① created and ② last modified (these will probably always be the same, but I always include both). Indexed.

  2. A FK to the Court table so we know which court the RSS is from.

  3. A text field where we can store a path to the file itself. The text path is configured to be on disk at pacer-rss-feeds/year/month/day/UUID.

Compression isn't figured out yet, but we can cross that bridge later. Somewhat surprisingly, there aren't any particularly good ready-to-go packages for this.

from django.db import models

# Court, make_rss_feed_path, and UUIDFileSystemStorage are project-level
# imports, defined elsewhere in the codebase.
class RssFeed(models.Model):
    """Store all old RSS data to disk for future analysis."""

    date_created = models.DateTimeField(
        help_text="The time when this item was created",
        auto_now_add=True,
        db_index=True,
    )
    date_modified = models.DateTimeField(
        help_text="The last moment when the item was modified.",
        auto_now=True,
        db_index=True,
    )
    court = models.ForeignKey(
        Court,
        help_text="The court where the RSS feed was found",
        on_delete=models.CASCADE,
        related_name="rss_feeds",
    )
    filepath = models.FileField(
        help_text="The path of the file in the local storage area.",
        upload_to=make_rss_feed_path,
        storage=UUIDFileSystemStorage(),
        max_length=150,
    )

@johnhawkinson, this was a feature you suggested. Any other fields you'd like to see in here or tweaks to the model you'd like? This is very similar to how we store HTML currently, FWIW.

johnhawkinson commented 4 years ago

Seems fine. Honestly, storing the raw XML in datetime-named files in the filesystem would likely be sufficient; anything else is gravy.

That said, you might want to store the URL path, to address issues like readyDockets.pl vs. rss_external.pl; that could help resolve future confusion.

johnhawkinson commented 4 years ago

However, if you want to do more, you might also consider saving the "file" size.

Also, if you want to get into parsing the XML, one might imagine saving, and even parsing out, individual fields from the feed, which could plausibly be useful in some kinds of future searches.

mlissner commented 4 years ago

The database part of this is in place. Now we just need to actually...do something with the DB.

mlissner commented 4 years ago

As of 45ca3022694367913517b2f36f29ce950d270dde, we will be storing RSS feeds. That'll need careful deployment though, because it creates a new directory, and blah, blah, NFS, etc. But it only took one data model and two lines of code.
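
For the curious, the save side presumably looks something like this (a sketch, not the literal lines from the commit; court and feed_text stand in for the crawler's objects):

from django.core.files.base import ContentFile

# Sketch: persist the raw feed text through the model's FileField; the
# storage backend takes care of the path and the UUID file name.
feed = RssFeed(court=court)
feed.filepath.save("rss_feed.xml", ContentFile(feed_text), save=True)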

mlissner commented 4 years ago

This is creating more data than anybody thought it would. We do need to at least implement zipping, which should save about 90%. If we use bzip2, it saves even more (at a cost in CPU time). It looks like we'll need a custom file storage backend to do this.
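A minimal sketch of what such a backend could look like, assuming gzip and compression on write only (the class name is hypothetical, and a real version would also need to override _open to decompress on read):

import gzip
from django.core.files.base import ContentFile
from django.core.files.storage import FileSystemStorage

class CompressedFileSystemStorage(FileSystemStorage):
    """Hypothetical storage backend that gzips every file on save."""

    def _save(self, name, content):
        # Compress the incoming bytes and store them under a .gz name;
        # the name returned here is what gets recorded on the FileField.
        gzipped = ContentFile(gzip.compress(content.read()))
        return super()._save(name + ".gz", gzipped)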

mlissner commented 4 years ago

With the fix I just put in place (see the commit above), this should be fine. Looks like we had a nasty bug. Our server will be thanking us for this fix.

mlissner commented 4 years ago

Just checking in here: we're currently logging an average of 130MB/day over the last 23 days. Less on weekends, obviously. On regular days it's around 150MB.
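
At that rate we're looking at something like 47GB/year uncompressed, so the ~90% savings from zipping matters: it should bring this down to roughly 5GB/year.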