The parser is largely regex-based, using fairly explicit patterns. I was pleasantly surprised that very few edge cases needed handling; most of them involved unconventional values in the "certificate" field, e.g., --.
Here's an example of the new parser output:
{
"insp_id": "INS-0000804745",
"layout": "b",
"customer_id": "1652",
"customer_name": "UNITED PARCEL SERVICE CO",
"customer_addr": "LEGAL DEPARTMENT\n1400 N. HURSTBOURNE PKWY.\nLOUISVILLE, KY 40223",
"certificate": "61-T-0005",
"site_id": "PHL",
"site_name": "PHILADELPHIA INTERNATIONAL AIRPORT",
"insp_type": "ROUTINE INSPECTION",
"date": "19-JUL-2022"
}
This addresses #31 in two commits: https://github.com/data-liberation-project/aphis-inspection-reports/commit/ed5a3d5c7f199281f45e10acf16b66339af371d8 adds the code to the parser, while https://github.com/data-liberation-project/aphis-inspection-reports/commit/2e391c79fdc9061c87313f4946fbea1daf93f3ba re-parses all the PDFs with the updated parser.
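To give a rough sense of what "fairly explicit patterns" can look like, here is a minimal sketch. It is not the code from the commit above: the patterns are inferred only from the single example output in this description, and the names `FIELD_PATTERNS` and `unconventional_fields` are hypothetical. It flags fields whose values don't fit the expected shape, which is one way a value like -- in "certificate" could be spotted.

```python
import re

# Assumed shapes, guessed from the one example record in this description.
# The real parser's patterns may differ.
FIELD_PATTERNS = {
    "insp_id": re.compile(r"INS-\d{10}"),            # e.g. INS-0000804745
    "certificate": re.compile(r"\d{2}-[A-Z]-\d{4}"),  # e.g. 61-T-0005
    "date": re.compile(r"\d{2}-[A-Z]{3}-\d{4}"),      # e.g. 19-JUL-2022
}


def unconventional_fields(record):
    """Return the names of fields whose values do not fully match their pattern."""
    return [
        name
        for name, pattern in FIELD_PATTERNS.items()
        if not pattern.fullmatch(record.get(name) or "")
    ]


example = {
    "insp_id": "INS-0000804745",
    "certificate": "61-T-0005",
    "date": "19-JUL-2022",
}

print(unconventional_fields(example))                           # []
print(unconventional_fields({**example, "certificate": "--"}))  # ['certificate']
```

Using `fullmatch` rather than `search` keeps the check strict, so a value with stray leading or trailing characters gets flagged instead of silently passing.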