ecohealthalliance / wahis

Animal outbreak data scraped from the WAHIS database

Additional changes for use in deployed scripts #15

Closed emmamendelsohn closed 4 years ago

emmamendelsohn commented 5 years ago

This PR applies changes to the outbreak reports, adapted from the changes made to annual reports (PR #14).

I also made a few minor, misc changes to the annual report functions to clean up the code and fix a few dependencies.

noamross commented 5 years ago

Looks good! The one thing that I think might need some different handling is related reports and threads. To my understanding the immediate_report field basically holds all the information required to identify events in a common thread, correct? I think we should actually drop the "related events" field because it clearly is dynamic - it will be updated with new reports.

I also don't fully understand what the thread field is, but when I run the function on a single new outbreak report (transform_outbreak_reports(list(ingest_outbreak_report(new_report_url)))), its value is 1, which I don't think should be the case. It may be something we can drop, given that we can just query on immediate_report. It would be helpful, though, to extract the follow-up number from the report_type field into a column like follow_up, with immediate notifications being 0.
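A minimal sketch of that extraction, in base R. The helper name `extract_follow_up` is hypothetical, and the sketch assumes report_type values look like "Immediate notification" or "Follow-up report No. 3"; the actual strings in the scraped reports may differ, so the patterns would need checking against real data.

```r
# Hypothetical helper: derive an integer follow_up value from report_type.
# Assumes strings such as "Immediate notification" and
# "Follow-up report No. 3"; adjust the patterns to the real data.
extract_follow_up <- function(report_type) {
  # Position of the first run of digits in each string (-1 when none)
  m <- regexpr("[0-9]+", report_type)
  hit <- !is.na(m) & m > 0

  follow_up <- rep(NA_integer_, length(report_type))
  follow_up[hit] <- as.integer(regmatches(report_type, m))

  # Immediate notifications are coded as 0
  is_immediate <- grepl("immediate", report_type, ignore.case = TRUE)
  follow_up[is_immediate] <- 0L

  follow_up
}

extract_follow_up(c("Immediate notification", "Follow-up report No. 3"))
```

Report types with neither a number nor the word "immediate" come back as NA, which leaves them visible for a later QA check rather than silently coded.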

emmamendelsohn commented 4 years ago

That's right, the thread field will not make sense when running on only a subset of reports. We would need to query the full database to identify a thread (i.e., all related reports). So, yes, this routine for identifying related reports would need to be performed separately from the transform function.

We could just use immediate_report, but currently immediate_report is NA when the immediate notification is also the final notification (e.g., 10012.html). We might need to give these reports their own unique values.

Note that related_reports is updated for all reports in a given thread every time a new report in the thread is released. So the method we're currently using does seem like it would capture all threads when performed on the complete database, but it's more complicated and probably slower than the immediate_report approach.

noamross commented 4 years ago

Let's make it so we can append new tables to the database without updating old fields or querying it. I think the simplest thing is to drop related_reports and thread (keep related_reports in the ingest function, lose it in the transform function) and make immediate_report the same as id for initial reports, whether or not they are also final reports. We can then re-run on the old data and start updating with new reports.
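Roughly, the transform-side change could look like the sketch below. The function name `fill_immediate_report` is hypothetical, the columns id, report_type, immediate_report, related_reports, and thread are taken from this discussion, and detecting initial reports by matching "immediate" in report_type is an assumption about how the field is populated.

```r
# Hypothetical sketch: give initial reports their own id as immediate_report
# and drop the dynamic fields. Assumes initial reports can be recognised by
# "immediate" appearing in report_type.
fill_immediate_report <- function(reports) {
  is_initial <- grepl("immediate", reports$report_type, ignore.case = TRUE)

  # Initial reports point at themselves, whether or not they are also final
  reports$immediate_report[is_initial] <- reports$id[is_initial]

  # These fields change as new reports arrive, so they are not stored
  reports$related_reports <- NULL
  reports$thread <- NULL

  reports
}

reports <- data.frame(
  id = c("10012", "10500"),
  report_type = c("Immediate notification", "Follow-up report No. 1"),
  immediate_report = c(NA, "10400"),
  thread = c(1, 1),
  stringsAsFactors = FALSE
)
fill_immediate_report(reports)
```

With this in place, all reports in a thread share one immediate_report value, so a thread can be recovered with a plain filter or group-by on that column and new rows can be appended without touching old ones.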

emmamendelsohn commented 4 years ago

OK, made the change. There are a few follow-up cases (~5) that are missing immediate_report, so I will include that in the QA check.
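As a sketch of what that check might do (the function name `find_missing_immediate` is hypothetical, not the actual QA code): once initial reports carry their own id in immediate_report, any remaining NA should be a follow-up whose parent could not be identified.

```r
# Hypothetical QA helper: return the rows that still lack immediate_report.
# Assumes initial reports have already been given their own id, so every
# remaining NA is a follow-up with no identifiable immediate notification.
find_missing_immediate <- function(reports) {
  reports[is.na(reports$immediate_report), , drop = FALSE]
}

qa_input <- data.frame(
  id = c("10012", "10500", "10777"),
  immediate_report = c("10012", "10400", NA),
  stringsAsFactors = FALSE
)
find_missing_immediate(qa_input)
```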

emmamendelsohn commented 4 years ago

I made a few more changes to the annual reports processing so that it returns a log of the ingest status for each report, which I am adding to the db in rebel-infrastructure. @noamross Let me know if we can merge this PR (no rush).
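For illustration only, a status log of this kind could be built along the following lines. This is a sketch, not the code in this PR: `ingest_with_log`, its `ingest_fun` argument, and the log column names are all assumptions.

```r
# Hypothetical sketch: run an ingest function over several reports and
# record, per report, whether it succeeded and any error message.
ingest_with_log <- function(paths, ingest_fun) {
  rows <- lapply(paths, function(p) {
    # Capture the error object instead of stopping the whole run
    result <- tryCatch(ingest_fun(p), error = function(e) e)
    failed <- inherits(result, "error")
    data.frame(
      report = p,
      ingest_status = if (failed) "error" else "ok",
      message = if (failed) conditionMessage(result) else NA_character_,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Stand-in ingest function that fails on one file
fake_ingest <- function(path) {
  if (path == "bad.html") stop("could not parse report")
  list(path = path)
}
ingest_log <- ingest_with_log(c("good.html", "bad.html"), fake_ingest)
```

The resulting data frame has one row per report, which makes it straightforward to append to a database table alongside the ingested data.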