data-liberation-project / aphis-inspection-reports

Inspection data and PDFs from the USDA's Animal and Plant Health Inspection Service.

Parse top section of inspection report PDFs #34

Closed by jsvine 1 year ago

jsvine commented 1 year ago

This addresses #31 in two commits: https://github.com/data-liberation-project/aphis-inspection-reports/commit/ed5a3d5c7f199281f45e10acf16b66339af371d8 adds the code to the parser, while https://github.com/data-liberation-project/aphis-inspection-reports/commit/2e391c79fdc9061c87313f4946fbea1daf93f3ba re-parses all the PDFs with the updated parser.

The parser is largely regex-based, using fairly explicit patterns. I was pleasantly surprised to see that very few edge cases needed handling; those that did mostly involved unconventional values in the "certificate" field, e.g., --.

Here's an example of the new parser output:

{
  "insp_id": "INS-0000804745",
  "layout": "b",
  "customer_id": "1652",
  "customer_name": "UNITED PARCEL SERVICE CO",
  "customer_addr": "LEGAL DEPARTMENT\n1400 N. HURSTBOURNE PKWY.\nLOUISVILLE, KY 40223",
  "certificate": "61-T-0005",
  "site_id": "PHL",
  "site_name": "PHILADELPHIA INTERNATIONAL AIRPORT",
  "insp_type": "ROUTINE INSPECTION",
  "date": "19-JUL-2022"
}
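
As a rough illustration of the regex-based approach described above (this is not the code from the linked commits), a minimal sketch might look like the following. The header labels in SAMPLE_HEADER, the individual patterns, and the names SAMPLE_HEADER, PATTERNS, and parse_header are all assumptions made for the example; the text actually extracted from the PDFs may be laid out quite differently.

```python
import re

# Hypothetical header text. The labels and layout are assumed for
# illustration and may not match what the real PDFs contain.
SAMPLE_HEADER = """\
Customer ID: 1652
Certificate: 61-T-0005
Site: PHL
Type: ROUTINE INSPECTION
Date: 19-JUL-2022
"""

# One explicit pattern per field (all patterns are illustrative assumptions).
PATTERNS = {
    "customer_id": re.compile(r"^Customer ID:[ \t]*(\d+)[ \t]*$", re.M),
    # Accept the form seen in the example output (61-T-0005), plus
    # placeholder values made only of dashes, such as "--".
    "certificate": re.compile(
        r"^Certificate:[ \t]*(\d{2}-[A-Z]-\d{4}|-+)[ \t]*$", re.M
    ),
    "site_id": re.compile(r"^Site:[ \t]*([A-Z0-9]+)[ \t]*$", re.M),
    "insp_type": re.compile(r"^Type:[ \t]*(.+?)[ \t]*$", re.M),
    "date": re.compile(r"^Date:[ \t]*(\d{2}-[A-Z]{3}-\d{4})[ \t]*$", re.M),
}


def parse_header(text):
    """Extract each field with its own pattern; fail loudly on a miss."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            # Raising here surfaces unexpected layouts or values early,
            # instead of silently emitting incomplete records.
            raise ValueError(f"Could not parse field: {field}")
        result[field] = match.group(1)
    return result


if __name__ == "__main__":
    print(parse_header(SAMPLE_HEADER))
```

Keeping the patterns strict and raising on any non-match is one way to discover edge cases such as a "--" certificate: the first report that fails to parse points directly at the field whose pattern needs loosening.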