Closed gcappaert closed 1 year ago
Thanks, @gcappaert! Looks pretty good to me. I'm running the results over a random selection of PDFs, and so far have found just one quirk: inspection 2016090000555647 has a blank report date, so I'll tweak the regex. Then I'll re-parse all the PDFs (takes a little while), and push to this branch.
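For anyone following along, here is a minimal sketch of the kind of tweak meant here: making the date capture group optional so a blank field yields `None` instead of a failed match. The `Date:` label, the `DD-Mon-YYYY` format, and the names `DATE_RE` and `parse_report_date` are all assumptions for illustration, not the actual regex in this branch.

```python
import re
from typing import Optional

# Hypothetical pattern: the real label and date format in the reports may differ.
# The trailing "?" makes the whole date group optional, so "Date:" followed by
# nothing still matches and simply leaves the group empty.
DATE_RE = re.compile(r"Date:[ \t]*(?P<date>\d{1,2}-[A-Za-z]{3}-\d{4})?")


def parse_report_date(text: str) -> Optional[str]:
    """Return the report date string, or None if the label or the date is missing."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    # group("date") is None when the label is present but the field is blank.
    return match.group("date")
```

Returning `None` for a blank field keeps the rest of the parse going, and the affected inspections can be found afterwards by filtering on the missing value.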
Re. the `layout` parameter: You're correct. There are two inspection report layouts. They look fairly similar, but have a few structural differences. In `def get_inspection_id_and_layout`, we identify which layout we're parsing, and pass it to the other methods. But in the case of `get_top_section` and `get_bottom_section`, the conditional statement at the top is sufficient to make that distinction. Maybe we'd want to refactor this approach in the future, but it's currently working and doesn't impose much overhead.
Great. Thanks for having a look! I'll check out your regex changes, so I know how to avoid that issue in the future.
Tweak added and PDFs reparsed! Thanks again for this. Merging now.
This should do it. I mostly tweaked/copied your approach from above for `get_top_section()`, which is used to parse the first page. I tested with both report layouts. It would be pretty easy to add the other information in the bottom section too if so desired.
I had a little trouble figuring out how edges and lines of a page are ordered. The way I do it works, but if there's a more intuitive approach (or one that's less likely to break if APHIS changes report formats), let me know. It took some trial-and-error to make sure I was selecting the correct `bottom_line`.
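One way to depend less on the order in which lines are returned is to sort them by vertical position and select by coordinate instead of by index. The sketch below assumes each line is a plain dictionary with `top` and `bottom` keys measured from the top of the page. That shape is an assumption about what the PDF library returns, and the helper names are hypothetical, not code from this branch.

```python
from typing import Dict, List, Optional

Line = Dict[str, float]  # assumed shape: {"x0", "x1", "top", "bottom"}


def horizontal_lines_top_to_bottom(lines: List[Line], tol: float = 1.0) -> List[Line]:
    """Keep only (near-)horizontal lines and sort them from the top of the page down."""
    horizontal = [ln for ln in lines if abs(ln["top"] - ln["bottom"]) < tol]
    return sorted(horizontal, key=lambda ln: ln["top"])


def first_line_below(lines: List[Line], y: float) -> Optional[Line]:
    """Return the first horizontal line strictly below vertical position y, if any."""
    for ln in horizontal_lines_top_to_bottom(lines):
        if ln["top"] > y:
            return ln
    return None


# Made-up coordinates, deliberately out of order, with one vertical line mixed in.
sample = [
    {"x0": 30, "x1": 580, "top": 400.0, "bottom": 400.0},
    {"x0": 30, "x1": 30, "top": 100.0, "bottom": 400.0},  # vertical, ignored
    {"x0": 30, "x1": 580, "top": 100.0, "bottom": 100.0},
    {"x0": 30, "x1": 580, "top": 250.0, "bottom": 250.0},
]
bottom_line = first_line_below(sample, y=120.0)
```

Selecting "the first line below the header" by coordinate should survive a format change that adds or reorders rules on the page, whereas a fixed list index would not.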
One other note: the function `get_top_section()` took a "layout" argument, so I added one too, but unless I'm missing something, I think it goes unused.
Obviously feel free to delete my comments and my change to the gitignore. Happy to keep working on this too if it's not what you're after.