allenai / pawls

Software that makes labeling PDFs easy.
https://pawls.apps.allenai.org
Apache License 2.0

BUG: Tesseract Preprocessor Fails When PDF Page Data Only Contains a Decimal Number #200

JSv4 closed this issue 1 year ago

JSv4 commented 1 year ago

I have a test.pdf in which some pages contain only numbers. When calling process_tesseract from pawls.commands.preprocess, I got an error I'd not seen before (despite processing hundreds of docs with your preprocessor):

/home/jman/PycharmProjects/test_pawls_parser/venv/bin/python /home/jman/PycharmProjects/test_pawls_parser/main.py 
Traceback (most recent call last):
  File "/home/jman/PycharmProjects/test_pawls_parser/main.py", line 10, in <module>
    annotations: list = process_tesseract(file.name)
  File "/home/jman/PycharmProjects/test_pawls_parser/venv/lib/python3.10/site-packages/pawls/preprocessors/tesseract.py", line 101, in process_tesseract
    annotations = parse_annotations(pdf_file)
  File "/home/jman/PycharmProjects/test_pawls_parser/venv/lib/python3.10/site-packages/pawls/preprocessors/tesseract.py", line 80, in parse_annotations
    tokens = extract_page_tokens(pdf_image, pdf_size)
  File "/home/jman/PycharmProjects/test_pawls_parser/venv/lib/python3.10/site-packages/pawls/preprocessors/tesseract.py", line 35, in extract_page_tokens
    .apply(
  File "/home/jman/PycharmProjects/test_pawls_parser/venv/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1567, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "/home/jman/PycharmProjects/test_pawls_parser/venv/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1629, in _python_apply_general
    values, mutated = self.grouper.apply(f, data, self.axis)
  File "/home/jman/PycharmProjects/test_pawls_parser/venv/lib/python3.10/site-packages/pandas/core/groupby/ops.py", line 839, in apply
    res = f(group)
  File "/home/jman/PycharmProjects/test_pawls_parser/venv/lib/python3.10/site-packages/pawls/preprocessors/tesseract.py", line 43, in <lambda>
    gp["text"].str.cat(sep=" "),
  File "/home/jman/PycharmProjects/test_pawls_parser/venv/lib/python3.10/site-packages/pandas/core/generic.py", line 5902, in __getattr__
    return object.__getattribute__(self, name)
  File "/home/jman/PycharmProjects/test_pawls_parser/venv/lib/python3.10/site-packages/pandas/core/accessor.py", line 182, in __get__
    accessor_obj = self._accessor(obj)
  File "/home/jman/PycharmProjects/test_pawls_parser/venv/lib/python3.10/site-packages/pandas/core/strings/accessor.py", line 181, in __init__
    self._inferred_dtype = self._validate(data)
  File "/home/jman/PycharmProjects/test_pawls_parser/venv/lib/python3.10/site-packages/pandas/core/strings/accessor.py", line 235, in _validate
    raise AttributeError("Can only use .str accessor with string values!")
AttributeError: Can only use .str accessor with string values!. Did you mean: 'std'?

Looking at my PDF and at the pd.Series that trips the error at line 36 of cli.pawls.preprocessors.tesseract in extract_page_tokens(), I'm pretty sure I know what's happening. Some of the tokens extracted from the attached PDF are stored as floats in the dataframe, so the text column for that group has dtype float64. Line 43 of extract_page_tokens() then fails when it tries to concatenate these tokens, because .str.cat can only be called on a Series of strings.
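
A minimal sketch of how this can happen, assuming the OCR output reaches pandas as tab-separated text (the sample data and the line_num column are made up for illustration and are not taken from test.pdf). When every value in a column parses as a number, pandas infers a float dtype, and the .str accessor then refuses to work on it:

```python
import io

import pandas as pd

# Hypothetical tab-separated OCR output for a page whose only tokens are numbers.
tsv = "line_num\ttext\n1\t1.5\n1\t2.0\n2\t3.25\n"
page = pd.read_csv(io.StringIO(tsv), sep="\t")

# With nothing but numbers in the column, pandas infers a float dtype.
text_dtype = str(page["text"].dtype)

try:
    # Same call pattern as in the traceback: .str on a non-string Series.
    page["text"].str.cat(sep=" ")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(text_dtype)
print(error_message)
```

A page with at least one non-numeric token would make the whole column a string column, which may be why the error only shows up on number-only pages.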

See the offending, dataframe-related code here in tesseract.py:

https://github.com/allenai/pawls/blob/1225660ccb4f3b9877bf45c04baecc2798d183ee/cli/pawls/preprocessors/tesseract.py#L20-L46

I assume my issue, where the extracted token data type is not a string, is rare, but it is bound to happen in large PDF collections, particularly as the PDF formatting depends on the tool used to author the original PDF.

I propose adding .astype(str) to line 43 to force conversion of the text cell contents to string type, which should cover corner cases like this one, where the data can be converted to a string.
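
A rough sketch of the proposed cast, using a hypothetical helper (join_tokens) and made-up page data rather than the actual code in tesseract.py. Casting the column with .astype(str) before calling .str.cat lets the join succeed even when every token is numeric:

```python
import pandas as pd

# Hypothetical page data: every token on the page is numeric, so the
# "text" column is float64 rather than a column of strings.
page = pd.DataFrame({"line_num": [1, 1, 2], "text": [1.5, 2.0, 3.25]})


def join_tokens(text: pd.Series) -> str:
    """Cast every cell to str, then join the tokens with single spaces."""
    return text.astype(str).str.cat(sep=" ")


# Join the tokens of each line, mirroring a per-group concatenation.
joined = {
    line: join_tokens(group["text"])
    for line, group in page.groupby("line_num")
}

print(joined)
```

One caveat: depending on the pandas version, .astype(str) may turn missing values into the literal string "nan", so rows without text may need to be dropped before the cast.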

I've opened PR #199 with a fix.

soldni commented 1 year ago

Thank you for the PR! Just merged it in.