Closed by gcappaert 1 year ago
Thanks! Will dig into this. And no worries about the git history. We can do a squash-merge or rebase on the branch when the time comes.
Update, now having taken a closer look: This seems like it's on the right track 🎉. I'm going to fiddle with it a bit, but the general approach makes sense to me. I'm hoping to share more detailed thoughts / tweaks soon.
Howdy again! I've proposed some tweaks in the commits above; see the extended commit messages for a bit more explanation. But basically: A few refactors (of my earlier code and yours), plus trying to associate each violation with its summary. It seems to be working over here, but am going to rerun the full parsing pipeline again to check.
A few hitches I've noticed:
hash_id:00046e2c5d535550 / insp_id:253151614250871, which the online portal says has a critical citation, while the PDF does not explicitly list any (although it lists the correct number of citations overall). Might have to ask APHIS about this.
Everything makes sense to me here. While I was working on this, I was getting a weird smell from the way I was trying to handle layouts, so it's great that's fixed. The way you organized it could also apply to future projects.
I like your liberal use of assertions and type hints. I'm definitely starting to adopt those habits.
Not surprised there are some errors in the reports. Honestly, I'm surprised there are as few as there are. These inspectors seem to really take their jobs seriously (though I guess the bad ones would tend to find no violations).
Re: bottom text, you might be able to use some regex for this language: "This inspection and exit interview were conducted with the facility representative" which seems to end each report. You'd probably find an edge case or two that violated this rule though.
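A rough sketch of that regex idea, in case it helps. The pattern is built only from the phrase quoted above, and the function name `split_at_closing` is hypothetical (not something in the repo). Whitespace is matched loosely on the assumption that PDF text extraction may break the sentence across lines.

```python
import re
from typing import Optional, Tuple

# Assumed closing sentence, taken from the phrase quoted above. \s+ is used
# between words because extracted PDF text may wrap mid-sentence.
END_OF_REPORT = re.compile(
    r"This\s+inspection\s+and\s+exit\s+interview\s+were\s+conducted\s+with",
    re.IGNORECASE,
)


def split_at_closing(text: str) -> Tuple[str, Optional[str]]:
    """Return (body, closing); closing is None when the phrase is absent."""
    match = END_OF_REPORT.search(text)
    if match is None:
        # Reports that phrase the ending differently fall through untouched.
        return text, None
    return text[: match.start()].rstrip(), text[match.start():]
```

Returning `None` instead of raising keeps the edge cases visible, so reports that don't match can be counted and inspected rather than silently truncated.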
Great! I'm going to merge this into the relevant branch, clean up the commits there, add the parsed data, and then merge into main.
Re. the bottom text: Thanks for the suggestion! Unfortunately, that text turns out to be a bit less common than expected. Seems like report-writers have a bit of latitude re. how to phrase that general idea. I'll open a separate issue re. finding a technique to identify the end-of-report language.
Starting a pull request. As of now, code for handling the raw text is commented out, but once we figure out how we want to handle that on the combine side, will be easy to re-enable. One option would be to chunk it up by violation (as that tends to be the most important info) and include it in the same pipeline that handles violation code/heading/status.
This would be a little tricky perhaps, as the narrative about each violation often crosses multiple pages, but I bet it's doable.
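One possible way to handle the multi-page problem, as a minimal sketch: join the page texts first, then cut at each citation code. This assumes each violation narrative starts on a line beginning with a code shaped like `3.23(a)`; the function name `chunk_by_violation` and the `narrative` key are hypothetical, and real reports may need a stricter pattern.

```python
import re
from typing import Dict, List

# Assumption: a violation starts on a line beginning with a citation code
# such as "3.23(a)" or "2.40(b)(2)".
CODE_AT_LINE_START = re.compile(
    r"^(\d+\.\d+(?:\([0-9a-zA-Z]+\))*)", re.MULTILINE
)


def chunk_by_violation(pages: List[str]) -> List[Dict[str, str]]:
    """Join page texts, then cut the result at each citation code."""
    text = "\n".join(pages)  # joining first lets a narrative span pages
    starts = list(CODE_AT_LINE_START.finditer(text))
    chunks = []
    for i, match in enumerate(starts):
        # Each chunk runs until the next code, or to the end of the text.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        chunks.append(
            {"code": match.group(1), "narrative": text[match.end():end].strip()}
        )
    return chunks
```

A narrative line that happens to begin with a number like "3.5 inches" would trigger a false split, so the codes found this way would probably need checking against the ones already parsed.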
What is implemented here:
[{'code': '3.23(a)', 'heading': 'sanitation', 'status': 'non-critical'}, ...]
I wasn't sure how to test the combine functionality without messing something up (would love to hear how you'd do it), so hopefully that works smoothly. I'm still getting the hang of how JSON and Python talk to each other.
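One low-risk way to try this out, sketched under assumptions: write the violation list to a throwaway directory and read it back, so nothing in the real data folder is touched. The file name `violations.json` is hypothetical, and this only exercises the JSON round trip, not the project's actual combine step.

```python
import json
import tempfile
from pathlib import Path

# Same shape as the example output above: a list of dicts with string values.
violations = [
    {"code": "3.23(a)", "heading": "sanitation", "status": "non-critical"}
]

# TemporaryDirectory is deleted on exit, so the real data files stay untouched.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "violations.json"  # hypothetical file name
    out_path.write_text(json.dumps(violations, indent=2))
    reloaded = json.loads(out_path.read_text())

# Lists, dicts, and strings survive the round trip unchanged.
assert reloaded == violations
```

Python lists and dicts map directly onto JSON arrays and objects, which is why the reloaded value compares equal; types like tuples or sets would not round-trip the same way.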
Sorry my commit history is a little messy. I'm still getting a handle on git. This is actually a great way to learn the process.