Closed by arowla 10 years ago
Another option: keep last night's grants file around and diff it against the new one when it comes in, then load only the changed lines. This could be the better approach, since it limits our overall data processing load and would also trigger a rescrape of old but modified records.
It would probably also be easier to tack this new step onto the front of the process, rather than inserting new steps in the middle.
One more thought: this should probably happen after conversion to JSON, since at that point records will be one per line and a line-by-line diff of the two files will be easy.
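A minimal sketch of that line-by-line diff, assuming each night's dump has already been converted to one JSON record per line and that the conversion serializes unchanged records identically from night to night. The function name `changed_records` and the file names in the usage note are hypothetical, not part of the existing pipeline.

```python
from typing import Iterable, Iterator


def changed_records(old_lines: Iterable[str], new_lines: Iterable[str]) -> Iterator[str]:
    """Yield records from tonight's dump that were not in last night's dump.

    A record counts as changed if its line does not appear verbatim in the
    old file, so both brand-new and modified records are yielded.
    """
    # Hold last night's lines in a set for constant-time membership checks.
    seen = {line.rstrip("\n") for line in old_lines}
    for line in new_lines:
        record = line.rstrip("\n")
        # Skip blank lines and anything identical to a record from last night.
        if record and record not in seen:
            yield record
```

It could be used along these lines (hypothetical file names):

```python
with open("grants_previous.json") as old, open("grants_latest.json") as new:
    for record in changed_records(old, new):
        print(record)  # hand off to the loader / attachment scraper instead
```

This holds last night's file in memory as a set, which should be fine for a nightly dump of modest size; a sorted merge would be the alternative if memory became a concern.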
This is not worth the work right now... the grants attachment scraping only takes ~5 min as it stands.
Currently, the grants.gov XML dump is a full re-dump every night. We need a way to keep the attachments scraper from re-scraping every attachment every night as well. The first way that comes to mind is to add something to the data loading step that monitors the status codes Elasticsearch sends back. If the record is new, it gets one code (201 Created), and if it is an update to an existing record, it gets another (200 OK).
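A rough sketch of the status-code idea, assuming the loading step can collect a `(document id, HTTP status)` pair for each index request. Elasticsearch's index API answers 201 Created when a document is new and 200 OK when it overwrites an existing one. The function `newly_created_ids` and the shape of its input are hypothetical; how the loader actually captures the responses is not shown here.

```python
from typing import Iterable, List, Tuple

HTTP_OK = 200        # existing document was overwritten
HTTP_CREATED = 201   # document did not exist before this request


def newly_created_ids(responses: Iterable[Tuple[str, int]]) -> List[str]:
    """Return the ids of records Elasticsearch reported as newly created.

    `responses` is an assumed list of (document id, HTTP status code) pairs
    gathered by the loader, one pair per index request.
    """
    return [doc_id for doc_id, status in responses if status == HTTP_CREATED]
```

Only the returned ids would then be handed to the attachments scraper. Note that a full re-dump re-indexes every existing record, so a 200 does not distinguish a modified record from an unchanged one.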