allenai / pawls

Software that makes labeling PDFs easy.
https://pawls.apps.allenai.org
Apache License 2.0
380 stars 74 forks

Fix tesseract preprocessor for blank pages #202

Open JSv4 opened 1 year ago

JSv4 commented 1 year ago

Fix for Issue #201. When the processed PDF is empty, there appears to be a single token returned for the page, and its text is na. This becomes a problem in extract_page_tokens in the tesseract preprocessor. At the start of the call, tokens whose text is na are filtered out of the token df:

res[~res.text.isna()]

For a blank page, this leaves you with an empty dataframe. For pages with at least one token whose text is not na, the dataframe is not empty. When the dataframe is not empty and you apply groupby():

.groupby(["page_num", "block_num", "par_num", "line_num", "word_num"], group_keys=False)
.apply(
    lambda gp: pd.Series(
        [
            gp["left"].min(),
            gp["top"].min(),
            gp["width"].max(),
            gp["height"].max(),
            gp["conf"].mean(),
            gp["text"].astype(str).str.cat(sep=" "),
        ]
    )
)

the resulting dataframe has columns of RangeIndex(start=0, stop=6, step=1). So, when you call rename like this:

    .rename(
        columns={
            0: "x",
            1: "y",
            2: "width",
            3: "height",
            4: "score",
            5: "text",
            "index": "id",
        }
    )

the cols with "names" of 0, 1, 2, 3, 4, and 5 ARE renamed. This doesn't happen with empty dataframes, however. The grouping step doesn't change the df, so the column names remain unchanged. You have an empty df with col names of

[id, level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text]

Thus, the renaming above silently makes no changes because there are no cols named 0, 1, 2, 3, 4, or 5. As a result there is no col named "score", and when you call .drop(columns=["score", "id"]), you get KeyError: "['score'] not found in axis"
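
The failure described above can be reproduced without Tesseract. The snippet below is a minimal sketch, not the preprocessor's code: it builds an empty dataframe by hand with the column names listed above (rather than going through groupby, whose behavior on empty input may differ across pandas versions) and shows that rename ignores missing keys while drop does not. The variable names are made up for illustration.

```python
import pandas as pd

# Column names of the unchanged (empty) token df, as listed above.
cols = ["id", "level", "page_num", "block_num", "par_num", "line_num",
        "word_num", "left", "top", "width", "height", "conf", "text"]
empty = pd.DataFrame(columns=cols)

# rename() skips keys that are not present in the columns,
# so nothing is renamed and no error is raised here.
renamed = empty.rename(
    columns={0: "x", 1: "y", 2: "width", 3: "height",
             4: "score", 5: "text", "index": "id"}
)
rename_was_noop = list(renamed.columns) == cols

# drop() does raise when a requested label is missing: "score" never existed.
try:
    renamed.drop(columns=["score", "id"])
    drop_failed = False
except KeyError as err:
    drop_failed = True
    print("KeyError:", err)
```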

My suggested fix is to change extract_page_tokens() to test if the page's token df is empty when stripped of all tokens where text is na. If False, proceed with the preprocessor as usual. If True, however, return an empty array.
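
Roughly, the guard looks like the sketch below. This is an illustration under assumptions, not the actual diff: extract_tokens_sketch is a hypothetical stand-in for extract_page_tokens(), it takes the Tesseract dataframe directly, and the groupby/rename/drop pipeline is replaced by a simple placeholder so the example stays self-contained.

```python
import pandas as pd


def extract_tokens_sketch(res: pd.DataFrame) -> list:
    """Hypothetical stand-in for extract_page_tokens(), showing only the guard."""
    # Same filter as the preprocessor: drop tokens whose text is na.
    res = res[~res.text.isna()]

    # Guard: a blank page leaves nothing to group, so return early.
    if res.empty:
        return []

    # Placeholder for the real groupby/rename/drop pipeline (an assumption,
    # simplified here to one record per remaining row).
    return res[["left", "top", "width", "height", "text"]].to_dict(orient="records")


# A blank page: one token whose text is na, as described above.
blank_page = pd.DataFrame(
    {"left": [0], "top": [0], "width": [100], "height": [100],
     "conf": [-1], "text": [float("nan")]}
)
# A page with one real word.
word_page = pd.DataFrame(
    {"left": [10], "top": [20], "width": [30], "height": [12],
     "conf": [95], "text": ["hello"]}
)

blank_tokens = extract_tokens_sketch(blank_page)
word_tokens = extract_tokens_sketch(word_page)
```

With the early return in place, the blank page never reaches the rename and drop calls, so the KeyError cannot occur.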

FYI, I also changed

.groupby(["page_num", "block_num", "par_num", "line_num", "word_num"]) to

.groupby(["page_num", "block_num", "par_num", "line_num", "word_num"], group_keys=False) as I noticed a deprecation warning when the group_keys keyword arg is left out.

I've attached two sample PDFs, one blank and one not. Both process successfully now, whereas the blank one failed before: 00075cb9-9428-4270-baac-93ed12d284ef.pdf 0d953016-c4c1-4d0f-8745-dc59bef8351f.pdf

Fixes #201

JSv4 commented 1 year ago

You guys open to merging this? I use your pre-processor in another project, and it'd be great to use your repo as a dependency instead of my fork.