18F / fbopen

[DEPRECATED] An open API server, data import tools, and sample apps to help small businesses search for opportunities to work with the U.S. government.

Grants.gov attachment loader needs to scrape only new data #88

Closed arowla closed 10 years ago

arowla commented 10 years ago

Currently, the grants.gov XML dump is a full re-dump every night. We need a way to keep the attachments scraper from re-scraping every attachment each night as well. The first approach that comes to mind is to have the data loading step check the status code Elasticsearch returns for each record: a newly created record gets one code (201) and an update to an existing record gets another (200), so only new records would need scraping.
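A minimal sketch of that idea, assuming the loader indexes records one at a time over HTTP and can see each response's status code. Elasticsearch's index API answers 201 Created for a new document and 200 OK when an existing one is overwritten. The names `needs_attachment_scrape` and `select_new_records`, and the `(record_id, status_code)` pairs, are hypothetical and not taken from the fbopen loader.

```python
from typing import Iterable, List, Tuple


def needs_attachment_scrape(status_code: int) -> bool:
    """Return True when an index response says the record was newly created."""
    if status_code == 201:   # Created: first time this record was indexed
        return True
    if status_code == 200:   # OK: an existing record was overwritten
        return False
    # Anything else (4xx, 5xx) is a loading failure, not a new/updated signal.
    raise ValueError(f"unexpected index response status: {status_code}")


def select_new_records(results: Iterable[Tuple[str, int]]) -> List[str]:
    """Pick the record ids whose attachments still need to be scraped.

    `results` is assumed to be (record_id, status_code) pairs collected
    by the loading step.
    """
    return [rid for rid, status in results if needs_attachment_scrape(status)]
```

Note that because the nightly dump re-sends every record, unchanged records also come back as 200, so this check separates new records from existing ones but cannot tell a modified record from an untouched one.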

arowla commented 10 years ago

Another option: keep last night's grants file around and diff it against the new one when it comes in, then load only the changed lines. This could be the better approach, since it limits our overall data processing load and would also trigger a rescrape of old but modified records.

It would probably also be easier to tack this new step onto the front of the process, rather than inserting new steps in the middle.

arowla commented 10 years ago

One more thought: this should probably happen after conversion to JSON. At that point records will be one per line, so a line-by-line diff of the files will be easy.
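A rough sketch of the line-by-line diff, assuming both nightly files hold one JSON record per line and that the converter serializes an unchanged record identically from night to night (same key order and whitespace). The function name `changed_lines` and the file names in the usage note are hypothetical.

```python
from typing import Iterable, Iterator


def changed_lines(old_lines: Iterable[str], new_lines: Iterable[str]) -> Iterator[str]:
    """Yield records from the new file that were not in the old file verbatim.

    A record that is brand new or has been modified in any way will not
    match an old line exactly, so both cases are yielded.
    """
    seen = {line.rstrip("\n") for line in old_lines}
    for line in new_lines:
        record = line.rstrip("\n")
        if record and record not in seen:   # skip blank lines and exact repeats
            yield record
```

It could be used along these lines, passing only the surviving records on to the loader and the attachment scraper:

```python
with open("grants_yesterday.json") as old, open("grants_today.json") as new:
    for record in changed_lines(old, new):
        print(record)
```

Holding the old file's lines in a set costs memory proportional to the file size; comparing sorted files would be an alternative if that becomes a problem.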

arowla commented 10 years ago

This is not worth the work right now... the grants attachment scraping only takes ~5 min as it stands.