calvinli / pacerrssscraper

PACER RSS feed scraper

Turn the program into a daemon #8

Closed calvinli closed 10 years ago

calvinli commented 10 years ago

As I wrote in #2,

Some research indicates that PACER usually updates roughly every half-hour, but it wobbles around unpredictably. Also, CACD bucks the trend and only updates hourly. I think. This is all from observation and not any documentation. (See #7.)

One way to take advantage of this would be to turn the program from a cron script into a daemon that checks the feed and then schedules its next check for 35 minutes after lastBuildDate. That way we would always get updates as soon as possible. Right now we can see a document up to an hour late: if a document is posted at, say, 15:10 and the feed is updated at 15:35, but we checked at 15:30, then under the current system we won't see the document until our next check at 16:00.
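A minimal sketch of the scheduling arithmetic, assuming lastBuildDate arrives as an RFC 822 date string (the usual RSS 2.0 format). The function name `next_check_time` and the constant `CHECK_INTERVAL` are made up for illustration and are not taken from the existing code.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumption: 35 minutes, the figure proposed in this issue.
CHECK_INTERVAL = timedelta(minutes=35)


def next_check_time(last_build_date: str) -> datetime:
    """Return when to check the feed next, given its lastBuildDate string."""
    # parsedate_to_datetime handles RFC 822 dates such as
    # "Wed, 01 Jan 2014 15:35:00 GMT" and returns an aware datetime.
    built_at = parsedate_to_datetime(last_build_date)
    return built_at + CHECK_INTERVAL


if __name__ == "__main__":
    # Feed rebuilt at 15:35, so the next check is scheduled for 16:10.
    print(next_check_time("Wed, 01 Jan 2014 15:35:00 GMT"))
```

With the 15:35 rebuild from the example above, the next check is scheduled for 16:10, shortly after the feed's expected half-hourly rebuild rather than at a fixed wall-clock time.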

Of course I don't mean a real UNIX daemon, just a program that keeps running in a loop.

Note that commit f08d03ae64625a73a801337167bfa4447403d7b0 makes this much easier because we now explicitly keep track of the lastBuildDate field.

One nuance: if the feed has not been updated 35 minutes after lastBuildDate, we need to check again 35 minutes (or perhaps less) from then, not 35 minutes from lastBuildDate. Otherwise the next check time would land in the past, and the program would stop checking altogether.
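Roughly, the loop with this nuance handled could look like the sketch below. It is only an illustration: `poll` is a hypothetical callable standing in for "fetch the feed, process new entries, and return lastBuildDate as a timezone-aware datetime", and `next_check` and `CHECK_INTERVAL` are likewise made-up names.

```python
import time
from datetime import datetime, timedelta, timezone

# Assumption: 35 minutes, as proposed in this issue.
CHECK_INTERVAL = timedelta(minutes=35)


def next_check(last_build, now, interval=CHECK_INTERVAL):
    """Pick the next check time, never returning a time in the past."""
    scheduled = last_build + interval
    if scheduled <= now:
        # The feed was not rebuilt when we expected it to be, so
        # count the interval from now instead of from lastBuildDate.
        return now + interval
    return scheduled


def run(poll, sleep=time.sleep):
    """Check the feed forever; poll() is a hypothetical helper (see above)."""
    while True:
        last_build = poll()
        now = datetime.now(timezone.utc)
        wake_at = next_check(last_build, now)
        # wake_at is always later than now, so the delay is positive.
        sleep((wake_at - now).total_seconds())
```

Passing a shorter `interval` for the stale case would cover the "or perhaps less" option; the sketch reuses the same 35 minutes to keep it simple.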