data-liberation-project / aphis-inspection-reports

Inspection data and PDFs from the USDA's Animal and Plant Health Inspection Service.

Extract Report Date #55

Closed gcappaert closed 1 year ago

gcappaert commented 1 year ago

This should do it. I mostly copied and tweaked your approach from get_top_section() above, which parses the first page. I tested it with both report layouts. It would be fairly easy to extract the other information in the bottom section too, if desired.

I had a little trouble figuring out how the edges and lines of a page are ordered. The way I do it works, but if there's a more intuitive approach (or one that's less likely to break if APHIS changes report formats), let me know. It took some trial and error to make sure I was selecting the correct bottom_line.
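For illustration, here is a minimal sketch of one way to pick the bottom line without depending on the order in which lines are listed. It is not the code in this PR. It assumes each line is a plain dict with "x0", "x1", "top", and "bottom" keys, where "top" and "bottom" are measured from the top of the page; the helper names horizontal_lines and pick_bottom_line and the sample coordinates are hypothetical.

```python
# Hypothetical line objects: dicts with "x0"/"x1" (horizontal extent) and
# "top"/"bottom" (distance from the top of the page). This shape is an
# assumption made for the sketch, not taken from the PR.

def horizontal_lines(lines, tolerance=1.0):
    """Keep only lines whose top and bottom are (nearly) equal."""
    return [ln for ln in lines if abs(ln["bottom"] - ln["top"]) <= tolerance]


def pick_bottom_line(lines, min_width=0.0):
    """Return the lowest horizontal line on the page, or None if there is none."""
    candidates = [
        ln for ln in horizontal_lines(lines)
        if ln["x1"] - ln["x0"] >= min_width
    ]
    if not candidates:
        return None
    # Select by position, so the result does not depend on list order.
    return max(candidates, key=lambda ln: ln["top"])


# Made-up coordinates, deliberately listed out of vertical order.
sample_lines = [
    {"x0": 30, "x1": 580, "top": 700.0, "bottom": 700.0},
    {"x0": 30, "x1": 580, "top": 90.0, "bottom": 90.0},
    {"x0": 30, "x1": 30, "top": 90.0, "bottom": 700.0},  # vertical rule
    {"x0": 30, "x1": 580, "top": 640.0, "bottom": 640.0},
]
bottom_line = pick_bottom_line(sample_lines)
```

Selecting by coordinate rather than by index means a reordering of the lines would not change the result, although a layout change that adds a new rule lower on the page still would.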

One other note: get_top_section() takes a "layout" argument, so I added one too, but unless I'm missing something, I think it goes unused.

Obviously, feel free to delete my comments and my change to the gitignore. Happy to keep working on this too if it's not what you're after.

jsvine commented 1 year ago

Thanks, @gcappaert! Looks pretty good to me. I'm running the results over a random selection of PDFs, and so far have found just one quirk, which is that inspection 2016090000555647 has a blank report date, so I'll tweak the regex. Then I'll re-parse all the PDFs (takes a little while), and push to this branch.
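As a rough sketch of the kind of tweak meant here (not the actual pattern in the repository), the date can be made an optional capture group so that a blank value yields None instead of failing to match. The "Report Date:" label, the DD-MON-YYYY date format, and the names REPORT_DATE_PAT and extract_report_date are all assumptions for illustration.

```python
import re

# Hypothetical label and date format; the real report text may differ.
# The date group is optional, so a blank report date still matches.
# [ \t]* (rather than \s*) keeps the match from running onto the next line.
REPORT_DATE_PAT = re.compile(r"Report Date:[ \t]*(\d{1,2}-[A-Za-z]{3}-\d{4})?")


def extract_report_date(text):
    """Return the report date string, or None if the date field is blank."""
    match = REPORT_DATE_PAT.search(text)
    if match is None:
        # The label itself is missing, which is a different problem
        # from a blank date, so it is reported separately.
        raise ValueError("no report date label found")
    return match.group(1)
```

With this shape, a missing label and a blank date stay distinguishable: the first raises, the second returns None.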

Re. the layout parameter: You're correct. There are two inspection report layouts. They look fairly similar, but have a few structural differences. In get_inspection_id_and_layout, we identify which layout we're parsing and pass it to the other methods. But in the case of get_top_section and get_bottom_section, the conditional statement at the top is sufficient to make that distinction. Maybe we'd want to refactor this approach in the future, but it's currently working and doesn't impose much overhead.

gcappaert commented 1 year ago

Great. Thanks for having a look! I'll check out your regex changes, so I know how to avoid that issue in the future.

jsvine commented 1 year ago

Tweak added and PDFs reparsed! Thanks again for this. Merging now.