We should zip these to save tons of space.
Here's the model that I'm planning. At a high level, this is a table with four columns:
- The datetime the row was ① created and ② last modified (probably will always be the same, but I always just have both). Indexed.
- An FK to the Court table so we know which court the RSS feed is from.
- A text field where we can store a path to the file itself. The path is configured to be on disk at pacer-rss-feeds/year/month/day/UUID.
Compression isn't yet figured out, but we can cross that bridge later. Sort of surprisingly, there aren't any particularly good ready-to-go packages for this.
```python
from django.db import models

# Court, make_rss_feed_path, and UUIDFileSystemStorage are existing
# project-level names; their imports are elided here.


class RssFeed(models.Model):
    """Store all old RSS data to disk for future analysis."""
    date_created = models.DateTimeField(
        help_text="The time when this item was created",
        auto_now_add=True,
        db_index=True,
    )
    date_modified = models.DateTimeField(
        help_text="The last moment when the item was modified.",
        auto_now=True,
        db_index=True,
    )
    court = models.ForeignKey(
        Court,
        help_text="The court where the RSS feed was found",
        on_delete=models.CASCADE,
        related_name="rss_feeds",
    )
    filepath = models.FileField(
        help_text="The path of the file in the local storage area.",
        upload_to=make_rss_feed_path,
        storage=UUIDFileSystemStorage(),
        max_length=150,
    )
```
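For illustration, an upload_to callable that produces the pacer-rss-feeds/year/month/day/UUID layout described above could look roughly like the sketch below. This is not the actual make_rss_feed_path implementation, and in practice the UUID naming may live in UUIDFileSystemStorage rather than in the path helper.

```python
import uuid
from datetime import date


def make_rss_feed_path(instance, filename):
    """Sketch of an upload_to callable: shard files by date and use a
    UUID for the filename, e.g. pacer-rss-feeds/2018/10/09/<uuid>.xml
    """
    d = date.today()
    return "pacer-rss-feeds/%d/%02d/%02d/%s.xml" % (
        d.year, d.month, d.day, uuid.uuid4(),
    )
```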
@johnhawkinson, this was a feature you suggested. Any other fields you'd like to see in here, or tweaks you'd make to the model? This is very similar to how we store HTML currently, FWIW.
Seems fine. Honestly, storing the raw XML in datetime-named files in the filesystem would likely be sufficient; anything else is gravy.
That said, you might want to store the URL path to address issues like readyDockets.pl vs. rss_external.pl, which could help resolve future confusion.
However, if you want to do more, you might also consider saving the "file" size.
Also, if you want to get into parsing the XML, one might imagine saving, and even parsing out, pacer_case_id, which could plausibly be useful in some kinds of future searches.
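If one did go down the parsing road, pulling pacer_case_id out of a feed is fairly mechanical. A rough sketch using feedparser, assuming the entry links follow the usual CM/ECF DktRpt.pl?<pacer_case_id> pattern (the function name here is made up):

```python
import re

import feedparser  # third-party RSS parser, assumed available

CASE_ID_RE = re.compile(r"DktRpt\.pl\?(\d+)")


def pacer_case_ids(feed_xml):
    """Yield the pacer_case_id for each entry whose link matches the
    typical docket report URL pattern.
    """
    feed = feedparser.parse(feed_xml)
    for entry in feed.entries:
        match = CASE_ID_RE.search(entry.get("link", ""))
        if match:
            yield match.group(1)
```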
The database part of this is in place. Now we just need to actually...do something with the DB.
As of 45ca3022694367913517b2f36f29ce950d270dde, we will be storing RSS feeds. That'll need careful deployment though, because it creates a new directory, and blah, blah, NFS, etc. But it only took one data model and two lines of code.
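For a rough sense of what that looks like in practice (a sketch, not the actual commit; court and feed_xml are assumed to come from the scraper):

```python
from django.core.files.base import ContentFile

# Save the raw feed bytes through the FileField; by default this also
# writes the RssFeed row itself.
rss_feed = RssFeed(court=court)
rss_feed.filepath.save("rss.xml", ContentFile(feed_xml))
```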
This is creating more data than anybody thought it would. We do need to at least implement zipping, which should save about 90%. If we use bzip, it saves even more (at a cost to CPU). It looks like we'll need a custom file storage backend to do this.
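The custom backend doesn't have to be much. A minimal sketch of a gzip-on-save FileSystemStorage subclass (the class name is hypothetical, and reading files back would need a matching decompress step):

```python
import gzip

from django.core.files.base import ContentFile
from django.core.files.storage import FileSystemStorage


class GzipFileSystemStorage(FileSystemStorage):
    """Sketch: compress file contents with gzip before writing to disk."""

    def _save(self, name, content):
        content.seek(0)
        compressed = ContentFile(gzip.compress(content.read()))
        return super()._save(name, compressed)
```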
With the fix I just put in place (see commit above), this should be fine. Looks like we had a nasty bug. Our server will be thanking us for this fix.
Just checking in here, we're currently logging an average of 130MB/day over the last 23 days. Less on weekends, obviously. On regular days it's around 150MB, maybe.
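Back-of-the-envelope, assuming ~130MB/day holds: that's roughly 4GB/month, or on the order of 47GB/year uncompressed; at ~90% compression it would drop to something like 5GB/year.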
This could probably eat up a fair bit of storage space, but we should do it anyway. Whenever we detect that an RSS feed has changed, we should store the original RSS feed file somewhere on disk.
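One cheap way to implement the "has it changed?" check is to hash the raw feed bytes and compare against the hash of the last feed stored for that court. A sketch (the helper and the cache structure are made up for illustration):

```python
import hashlib


def feed_changed(court, feed_xml, last_hashes):
    """Return True if this court's feed differs from the last one we saved.

    feed_xml is the raw feed bytes; last_hashes is assumed to map
    court.pk -> sha1 hex digest of the previously stored feed (kept in
    the DB or a cache in practice).
    """
    new_hash = hashlib.sha1(feed_xml).hexdigest()
    if last_hashes.get(court.pk) == new_hash:
        return False
    last_hashes[court.pk] = new_hash
    return True
```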