data-liberation-project / aphis-inspection-reports

Inspection data and PDFs from the USDA's Animal and Plant Health Inspection Service.

Fix deduping, add data dictionary #17

Closed jsvine closed 1 year ago

jsvine commented 1 year ago

Discovered that APHIS appears to occasionally bulk-change the legalName for a given customerNumber. Because the sort_key incorporated that legalName and the scripts merge cached results with newer ones, the previous logic was retaining multiple copies of the same inspection (i.e., inspections for which the legalName had changed in the fresh results but not yet in our historical cache). The good news is that each legalName seems to correspond directly to a customerNumber, so the sort key can use customerNumber instead. Also adjusted get_sort_key to account for other aspects observed in the data, e.g., that customerNumber is never blank.
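For anyone following along, here is a minimal sketch of the idea, not the actual code in this PR. It assumes rows are dicts; `sketch_sort_key` and `merge_and_dedupe` are hypothetical names, and `inspectionDate` and `certNumber` are placeholder fields standing in for whatever else the real `get_sort_key` uses. The point is only that `legalName` is left out of the key, so a bulk rename no longer produces a second copy of the same inspection.

```python
from typing import Any

Row = dict[str, Any]


def sketch_sort_key(row: Row) -> tuple[str, str, str]:
    # Hypothetical key: built from customerNumber, deliberately NOT legalName.
    # inspectionDate and certNumber are assumed field names for illustration.
    return (
        str(row["customerNumber"]),
        str(row["inspectionDate"]),
        str(row["certNumber"]),
    )


def merge_and_dedupe(cached: list[Row], fresh: list[Row]) -> list[Row]:
    # Later rows win, so a fresh row replaces the cached row with the same key.
    merged: dict[tuple[str, str, str], Row] = {}
    for row in [*cached, *fresh]:
        merged[sketch_sort_key(row)] = row
    # Return rows in a stable order, sorted by key.
    return [merged[key] for key in sorted(merged)]


cached = [
    {
        "customerNumber": "123",
        "legalName": "Old Name LLC",
        "inspectionDate": "2022-01-05",
        "certNumber": "A1",
    }
]
fresh = [
    {
        "customerNumber": "123",
        "legalName": "New Name LLC",
        "inspectionDate": "2022-01-05",
        "certNumber": "A1",
    }
]
deduped = merge_and_dedupe(cached, fresh)
```

With a key that included `legalName`, the two rows above would have been kept as separate inspections; keyed this way, only the fresh row survives.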

Also added a data dictionary for the APHIS portal data, with some things learned through figuring out the deduping approach.

... as well as a fix for add_hash_ids to prevent errors when adding the IDs to a result set that already has them.
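As a rough illustration of what "safe to run on rows that already have IDs" can look like, here is a sketch under stated assumptions. It is not the repo's `add_hash_ids`: the function name `add_hash_ids_sketch`, the `hash_id` field name, the SHA-1 choice, and the 16-character truncation are all hypothetical. The one idea it shows is excluding any existing ID from the hashed payload, so a second pass reproduces the same value instead of failing or drifting.

```python
import hashlib
import json
from typing import Any

Row = dict[str, Any]


def add_hash_ids_sketch(rows: list[Row], id_field: str = "hash_id") -> list[Row]:
    out: list[Row] = []
    for row in rows:
        # Drop any existing ID before hashing, so re-running is a no-op.
        payload = {k: v for k, v in row.items() if k != id_field}
        # Serialize with sorted keys so the hash does not depend on key order.
        serialized = json.dumps(payload, sort_keys=True, default=str)
        digest = hashlib.sha1(serialized.encode("utf-8")).hexdigest()[:16]
        out.append({**payload, id_field: digest})
    return out


rows = [{"customerNumber": "123", "inspectionDate": "2022-01-05"}]
once = add_hash_ids_sketch(rows)
twice = add_hash_ids_sketch(once)
```

Running the function on its own output returns identical rows, which is the property a cache-merging script needs when some rows arrive with IDs and some without.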

palewire commented 1 year ago

Might be a good opportunity to add unit tests for those utilities

jsvine commented 1 year ago

Amen, issue now created for that: https://github.com/data-liberation-project/aphis-inspection-reports/issues/20

palewire commented 1 year ago

Yep. Since you are mucking with a couple now, getting them covered in this merge would be good while the expectations and edge cases are fresh in your mind.

jsvine commented 1 year ago

Very reasonable! Now added.