data-liberation-project / aphis-inspection-reports

Inspection data and PDFs from the USDA's Animal and Plant Health Inspection Service.

Report body handling #56

Closed gcappaert closed 1 year ago

gcappaert commented 1 year ago

Starting a pull request. As of now, the code for handling the raw text is commented out, but once we figure out how we want to handle that on the combine side, it will be easy to re-enable. One option would be to chunk it up by violation (as that tends to be the most important info) and include it in the same pipeline that handles violation code/heading/status.

This would be a little tricky perhaps, as the narrative about each violation often crosses multiple pages, but I bet it's doable.
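A rough sketch of the chunk-by-violation idea, not the code in this pull request. It assumes each violation starts on a line beginning with a section code such as `3.1(a)`, optionally followed by a status word; that heading pattern, the function name `chunk_by_violation`, and the output keys are all hypothetical. Joining the pages before splitting is what lets a narrative that crosses a page break stay in one chunk.

```python
import re

# ASSUMPTION: a violation heading is a line that starts with a section code
# like "3.1(a)" or "2.40(b)(2)", optionally followed by a status word.
# Real reports may differ, and a narrative line that happens to start with
# a number like "12.5" would be misread as a heading.
HEADING = re.compile(
    r"^(?P<code>\d+\.\d+(?:\([a-z0-9]+\))*)[ \t]*(?P<status>.*)$",
    re.MULTILINE,
)


def chunk_by_violation(pages: list[str]) -> list[dict[str, str]]:
    """Split report text into one chunk per violation heading."""
    # Join pages first so text continuing onto the next page is not cut off.
    text = "\n".join(page.strip() for page in pages)
    matches = list(HEADING.finditer(text))
    chunks = []
    for i, match in enumerate(matches):
        # A narrative runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append(
            {
                "code": match.group("code"),
                "status": match.group("status").strip(),
                "narrative": text[match.end():end].strip(),
            }
        )
    return chunks
```

Each chunk could then be attached to the matching violation code/heading/status record on the combine side.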

What is implemented here:

I wasn't sure how to test the combine functionality without messing something up (would love to hear how you'd do it), so hopefully that works smoothly. I'm still getting the hang of how JSON and Python talk to each other.

Sorry my commit history is a little messy. I'm still getting a handle on git. This is actually a great way to learn the process.

jsvine commented 1 year ago

Thanks! Will dig into this. And no worries about the git history. We can do a squash-merge or rebase on the branch when the time comes.

jsvine commented 1 year ago

Update, now having taken a closer look: This seems like it's on the right track 🎉. I'm going to fiddle with it a bit, but the general approach makes sense to me. I'm hoping to share more detailed thoughts / tweaks soon.

jsvine commented 1 year ago

Howdy again! I've proposed some tweaks in the commits above; see the extended commit messages for a bit more explanation. But basically: A few refactors (of my earlier code and yours), plus trying to associate each violation with its summary. It seems to be working over here, but am going to rerun the full parsing pipeline again to check.

A few hitches I've noticed:

gcappaert commented 1 year ago

Everything makes sense to me here. While I was working on this, I was getting a weird smell from the way I was trying to handle layouts, so it's great that's fixed. The way you organized it could also apply to future projects.

I like your liberal use of assertions and type hints. I'm definitely starting to adopt those habits.

jsvine commented 1 year ago

Great! I'm going to merge this into the relevant branch, clean up the commits there, add the parsed data, and then merge into main.

Re. the bottom text: Thanks for the suggestion! Unfortunately, that text turns out to be a bit less common than expected. Seems like report-writers have a bit of latitude re. how to phrase that general idea. I'll open a separate issue re. finding a technique to identify the end-of-report language.