data-liberation-project / aphis-inspection-reports

Inspection data and PDFs from the USDA's Animal and Plant Health Inspection Service.

Parse core text from inspection report PDFs #33

Closed jsvine closed 1 year ago

jsvine commented 1 year ago

For inspections without violations, these are relatively simple — just a block of free-form text. But with violations, they have some structure, and I suspect that there are a lot of variations to this structure:

[Screenshot: excerpt from an inspection report showing how a violation entry is structured]

For now, though, I think it's sufficient just to pull the plain text. Next step might be to mark the bolded text and then, after that, to try identifying the structure of what we see.
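
As a rough sketch of both steps with pdfplumber (the file path is a placeholder, and treating any font whose name contains "Bold" as bolded is just a heuristic):

import pdfplumber

def extract_plain_and_bold(pdf_path: str) -> tuple[str, str]:
    # Pull the full plain text, plus only the text set in a bold font
    with pdfplumber.open(pdf_path) as pdf:
        plain = "\n".join(page.extract_text() or "" for page in pdf.pages)
        bold = "\n".join(
            page.filter(lambda obj: "Bold" in obj.get("fontname", "")).extract_text() or ""
            for page in pdf.pages
        )
    return plain, bold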

gcappaert commented 1 year ago

Not sure what your next priority is, but I'm thinking of jumping on this next. It seems like kind of a chewy problem which, if solved, could really increase the usefulness of the dataset. Extracting the text itself should be pretty straightforward, but if headings, relevant regulations and topic areas could be extracted... man that'd be useful. There are a lot of stories there.

There might be a role for some NLP techniques too (and some relevance for future projects that involve narrative reports), which would be fun learning for me. Would love to hear any ideas you've already had about making the report text usable.

Happy to go after something else if you've got other fish a fryin'.

jsvine commented 1 year ago

Thanks, @gcappaert 🎉 . I think it's absolutely worth working on this. Some structural thoughts:

gcappaert commented 1 year ago

Awesome. Thanks for the suggestions on structuring too. I hadn't thought of the potential size issues with storing the full text data and I'll definitely be considering that. Here's what I'm thinking, and I will of course do this step by step so incremental improvements can be rolled in as/if they're made.

My first step will be to put together a small sample of reports with and without violations to figure out how to tackle this problem. As far as I can tell, every violation is associated with a subsection of the Animal Welfare Act, so that's a hook for structure. Violation headers also appear to be in bold, with critical, direct and repeat violations being explicitly noted.

I think I'll try pulling the headings first, because the category of the violation (e.g. 3.25(a) - Facilities, general) has some useful information and could be folded into the tabular data.

Once I find a way to reliably pull the headings and subsections -- or if I can't find any way to do it -- I'll move on to scraping the unstructured text and consider a way to store it.
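
To make the heading idea concrete, a bolded header line such as "3.84(a) Critical Repeat CLEANING, SANITIZATION, HOUSEKEEPING, AND PEST CONTROL." might be split roughly like this (the status keywords and the pattern itself are assumptions that will need tuning against real reports):

import re

# Hypothetical pattern: code, then optional status words, then the heading
HEADER_RE = re.compile(
    r"^(?P<code>\d+\.\d+\S*)"                                       # e.g. 3.84(a)
    r"\s*(?P<status>(?:critical|non-critical|direct|repeat|\s)*)"   # optional status words
    r"\s*(?P<heading>.+)$",                                         # remainder of the line
    re.IGNORECASE,
)

def parse_header(line: str) -> dict:
    match = HEADER_RE.match(line.strip())
    if not match:
        return {}
    return {
        "code": match.group("code"),
        "status": match.group("status").strip().lower() or "non-critical",
        "heading": match.group("heading").strip().lower(),
    }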

Then, assuming I have not been dragged into the depths of despair by data monsters, I'll try to turn that unstructured text into structured data using sentiment analysis, word frequency, etc., to produce tags for each entry that can be incorporated into the parsed data. For example, this:

3.84(a) CLEANING, SANITIZATION, HOUSEKEEPING, AND PEST CONTROL. *** The ring tail lemurs night house had an accumulation of old feces in it and on the dirt floor there were piles of feces as well. The entire enclosures were not spot cleaned daily. Excreta must be removed from inside each indoor primary enclosure daily and from underneath them as often as necessary to prevent an excessive accumulation of feces, to prevent the nonhuman primates from becoming soiled, and to reduce disease hazards, insects, pests, and odors. Dirt floors, floors with absorbent bedding, and planted areas in primary enclosures must be spot-cleaned with sufficient frequency to ensure all animals the freedom to avoid contact with excreta, or as often as necessary to reduce disease hazards, insects, pests, and odors. Perches, bars, and shelves must be kept clean and replaced when worn. If the species of the nonhuman primates housed in the primary enclosure engages in scent marking, hard surfaces in the primary enclosure must be spot-cleaned daily. Correct by February 15, 2017

Could be tagged with "feces, primates, excreta." That way, all of the violations involving primate poop since 2014 could be queried by a member of the Primate Poop Investigation Unit of our well-funded journalism establishment, or by the powerful People Against Primate Poop PAC. JK, but I do think this would be useful if accomplishable.
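
As a toy illustration of that tagging idea (the keyword list here is invented, and a real pass would want stemming, stopwords, and a proper taxonomy), something like this would surface tags roughly like "feces, excreta, primates" for the passage above:

import re
from collections import Counter

# Illustrative keyword list only; a real taxonomy would be much larger
KEYWORDS = {"feces", "excreta", "primates", "sanitization", "enclosure", "odors"}

def tag_violation(text: str, top_n: int = 5) -> list[str]:
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word in KEYWORDS)
    return [word for word, _ in counts.most_common(top_n)]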

jsvine commented 1 year ago

Re. headings and unstructured text: Sounds great.

Re. NLP / tagging: Definitely sounds useful, but feels like part of a separate repo/pipeline. (General philosophy: Keeping this repo to strictly deterministic, factual information directly from the reports/portal.) Thoughts on that?

gcappaert commented 1 year ago

All good with me, Jeremy!

As for the headings and text, I wrote a script that extracts the text and consistently extracts the headings, though I'm not done testing it. I'll put in a pull when I have it ready to go.

An NLP analysis probably does constitute mission creep given the scope of this project, and it can introduce some inherent subjectivity depending on the tools used.

Using a library like nltk to enable straightforward and transparent stuff like word frequencies or a clickable index might help make the data more usable for the non-tech-savvy end user, so maybe that would make sense at some point in the DLP context. Lotta work though I think. If you ever end up wanting to add features that help digest the narrative reports/stories in these documents, I'm down to help.

gcappaert commented 1 year ago

I've tested this on several samples of the 80,000 or so PDFs, and it seems to consistently extract the violation code, heading, and status ('critical'/'non-critical', etc.) for each violation found.

I've commented out the bit that extracts the full content, because I'm not sure exactly how to handle the full text data yet. I went ahead and ran the text extraction and saved each report's content into a separate text file. This worked out to a 40 MB folder of text files, with the largest being 87 KB.

Honestly, I'm out of my depth trying to determine the most useful way to handle the raw text data. Thinking about how someone might actually use the data, my instinct is to make it easy to aggregate the text based on the type of violation (e.g. show me the text from all the violations that involve sanitation), which seems best suited to a SQL approach? Happy to try to implement whatever you think makes the most sense.
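
For what it's worth, a minimal sketch of that kind of query layer, assuming a SQLite table with one row per violation (the table and column names are hypothetical):

import sqlite3

# Hypothetical schema: one row per violation, with its code, heading, and narrative text
conn = sqlite3.connect("inspections.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS violations (
        inspection_id TEXT,
        code TEXT,
        heading TEXT,
        narrative TEXT
    )
""")

# e.g. "show me the text from all the violations that involve sanitation"
rows = conn.execute(
    "SELECT inspection_id, narrative FROM violations WHERE heading LIKE ?",
    ("%sanitization%",),
).fetchall()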

Do you want me to go ahead and submit a pull request that implements the header extraction as part of the parsing script and keep working on the content extraction bit?

Below is the code:

import re
import typing

import pdfplumber

# norm_ws (a whitespace-normalizing helper) is assumed to be defined elsewhere in the parsing script.


def get_report_body(pages: list[pdfplumber.page.Page], layout: str) -> dict[str, typing.Any]:
    # Exclude "Species Inspected" pages, crop each remaining page down to its body,
    # and pull the violation code, heading, and status from the bolded header text.
    # Note: the `layout` argument is currently unused; the body bounding box is
    # chosen below from the number of lines on the first page.

    def is_header_char(obj: dict[str, typing.Any], size: float = 11) -> bool:
        return "Bold" in obj.get("fontname", "") and obj.get("size", 0) > size

    def is_species_page(page: pdfplumber.page.Page) -> bool:
        return page.filter(is_header_char).extract_text().strip() == "Species Inspected"

    def extract_violation_codes(text: str) -> list:
        # Each bolded violation header starts with a code (e.g. 3.84(a)), followed by
        # an optional status (critical/direct/repeat) and the heading itself.
        violations = []
        for match in re.finditer(r"(\d\.\d\S*)(.*)\s+(.*)", text):
            code, status, heading = match.group(1, 2, 3)

            # Extract violation status ('non-critical' if blank), code, and heading
            status = norm_ws(status.lower()) if len(status.strip()) > 2 else "non-critical"
            heading = norm_ws(heading.lower())

            # This may be clearer as a named tuple, but that would require an import
            violations.append((code, heading, status))

        return violations

    pages = [page for page in pages if not is_species_page(page)]

    # Body bounding boxes (left, top, right, bottom) for the two report layouts
    a_body_bbox = {
        "first_page_body": (0, 232, pages[0].width, 708),
        "other_page_body": (0, 92, pages[0].width, 708),
    }
    b_body_bbox = {
        "first_page_body": (0, 237, pages[0].width, 636),
        "other_page_body": (0, 103, pages[0].width, 636),
    }

    bbox = b_body_bbox if len(pages[0].lines) > 2 else a_body_bbox

    # content = ""
    violations = []

    for i, page in enumerate(pages):
        page = page.crop(bbox["first_page_body"] if i == 0 else bbox["other_page_body"])

        # page_content = page.extract_text()
        # content = "".join((content, page_content))

        # A low size threshold here picks up all bold text within the cropped body
        headers = page.filter(lambda x: is_header_char(x, size=2)).extract_text()
        violations.extend(extract_violation_codes(headers))

    return {
        # "content": content,
        "violations": tuple(violations),
    }
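
For reference, a rough usage sketch (the file path and the layout value are placeholders):

with pdfplumber.open("pdfs/example-inspection.pdf") as pdf:
    report = get_report_body(pdf.pages, layout="a")
    print(report["violations"])
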
jsvine commented 1 year ago

Wonderful! Thank you for this. Couldn't hurt to start a PR, and then we can try some different approaches to the big-picture strategy on your fork/branch. I think the first step is just to decide on the representation of the headings and full text in the individual-report parse files in data/parsed/inspections/*.json.

From there, we can (a) inspect the results, and (b) figure out how we want to represent this info in (pre-existing or new) aggregate files.
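
One possible shape for that per-report representation, sketched as a Python dict (the field names here are just a starting point for discussion, not a settled schema):

# Hypothetical additions to each parsed-inspection JSON record
report_extras = {
    "violations": [
        {
            "code": "3.84(a)",
            "status": "critical",
            "heading": "cleaning, sanitization, housekeeping, and pest control",
        },
    ],
    # The full narrative text could live here, or in a separate per-report file
    "narrative": "...",
}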