allenai / pawls

Software that makes labeling PDFs easy.
https://pawls.apps.allenai.org
Apache License 2.0

PAWLS Tesseract Preprocessor Throws Error With Blank PDF #201

Open JSv4 opened 1 year ago

JSv4 commented 1 year ago

So, I know what you're probably saying: don't process blank PDFs. There's a compelling reason to do so, however, particularly with headless processing. I have a pipeline I've developed to asynchronously process huge document corpuses using your Tesseract preprocessor. The lengths of the documents were too variable to simply pass the whole document to the preprocessor. I have documents of 3,000-plus pages that cause timeouts, over-consumption of CPU, etc. when run via Celery workers. I found that, by combining Celery parameter tuning with splitting each PDF into single pages and putting the individual pages into a queue for processing, I have much better control over my Celery workers and can fully saturate my system's compute resources without locking it up.

While doing this page-wise headless processing, however, my worker failed to process a single page (attached) and crashed my pipeline:
0d953016-c4c1-4d0f-8745-dc59bef8351f.pdf.

I got the following error message:

opencontracts-celeryworker-2  | [2023-02-24 18:12:01,776: INFO/ForkPoolWorker-1] process_pdf_page() - Process page 4 of 5 from path user_3/fragments/0d953016-c4c1-4d0f-8745-dc59bef8351f.pdf
opencontracts-celeryworker-2  | [2023-02-24 18:12:01,777: INFO/ForkPoolWorker-1] process_pdf_page() - Load obj from s3
opencontracts-celeryworker-2  | [2023-02-24 18:12:02,572: WARNING/ForkPoolWorker-1] /usr/local/lib/python3.10/site-packages/pawls/preprocessors/tesseract.py:35: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
opencontracts-celeryworker-2  | To preserve the previous behavior, use
opencontracts-celeryworker-2  | 
opencontracts-celeryworker-2  |         >>> .groupby(..., group_keys=False)
opencontracts-celeryworker-2  | 
opencontracts-celeryworker-2  | To adopt the future behavior and silence this warning, use 
opencontracts-celeryworker-2  | 
opencontracts-celeryworker-2  |         >>> .groupby(..., group_keys=True)
opencontracts-celeryworker-2  |   .apply(
opencontracts-celeryworker-2  | 
opencontracts-celeryworker-2  | [2023-02-24 18:12:02,578: ERROR/ForkPoolWorker-1] Chord 'da6612ee-b0e8-43cc-8d67-44b21c818373' raised: ChordError('Dependency 2c197e9f-9225-4b6b-9801-1f6184af2ace raised KeyError("[\'score\'] not found in axis")')
opencontracts-celeryworker-2  | Traceback (most recent call last):
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 451, in trace_task
opencontracts-celeryworker-2  |     R = retval = fun(*args, **kwargs)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 734, in __protected_call__
opencontracts-celeryworker-2  |     return self.run(*args, **kwargs)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/app/autoretry.py", line 54, in run
opencontracts-celeryworker-2  |     ret = task.retry(exc=exc, **retry_kwargs)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/app/task.py", line 717, in retry
opencontracts-celeryworker-2  |     raise_with_context(exc)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/app/autoretry.py", line 34, in run
opencontracts-celeryworker-2  |     return task._orig_run(*args, **kwargs)
opencontracts-celeryworker-2  |   File "/app/opencontractserver/tasks/doc_tasks.py", line 66, in process_pdf_page
opencontracts-celeryworker-2  |     annotations = extract_pawls_from_pdfs_bytes(pdf_bytes=page_data)
opencontracts-celeryworker-2  |   File "/app/opencontractserver/utils/pdf.py", line 111, in extract_pawls_from_pdfs_bytes
opencontracts-celeryworker-2  |     annotations: list = process_tesseract(page_path)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pawls/preprocessors/tesseract.py", line 101, in process_tesseract
opencontracts-celeryworker-2  |     annotations = parse_annotations(pdf_file)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pawls/preprocessors/tesseract.py", line 80, in parse_annotations
opencontracts-celeryworker-2  |     tokens = extract_page_tokens(pdf_image, pdf_size)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pawls/preprocessors/tesseract.py", line 33, in extract_page_tokens
opencontracts-celeryworker-2  |     res[~res.text.isna()]
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
opencontracts-celeryworker-2  |     return func(*args, **kwargs)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/frame.py", line 5399, in drop
opencontracts-celeryworker-2  |     return super().drop(
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
opencontracts-celeryworker-2  |     return func(*args, **kwargs)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/generic.py", line 4505, in drop
opencontracts-celeryworker-2  |     obj = obj._drop_axis(labels, axis, level=level, errors=errors)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/generic.py", line 4546, in _drop_axis
opencontracts-celeryworker-2  |     new_axis = axis.drop(labels, errors=errors)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6934, in drop
opencontracts-celeryworker-2  |     raise KeyError(f"{list(labels[mask])} not found in axis")
opencontracts-celeryworker-2  | KeyError: "['score'] not found in axis"
opencontracts-celeryworker-2  | 
opencontracts-celeryworker-2  | During handling of the above exception, another exception occurred:
opencontracts-celeryworker-2  | 
opencontracts-celeryworker-2  | Traceback (most recent call last):
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/backends/redis.py", line 520, in on_chord_part_return
opencontracts-celeryworker-2  |     resl = [unpack(tup, decode) for tup in resl]
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/backends/redis.py", line 520, in <listcomp>
opencontracts-celeryworker-2  |     resl = [unpack(tup, decode) for tup in resl]
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/backends/redis.py", line 426, in _unpack_chord_result
opencontracts-celeryworker-2  |     raise ChordError(f'Dependency {tid} raised {retval!r}')
opencontracts-celeryworker-2  | celery.exceptions.ChordError: Dependency 2c197e9f-9225-4b6b-9801-1f6184af2ace raised KeyError("['score'] not found in axis")
opencontracts-celeryworker-2  | [2023-02-24 18:12:02,580: ERROR/ForkPoolWorker-1] Task opencontractserver.tasks.doc_tasks.process_pdf_page[2c197e9f-9225-4b6b-9801-1f6184af2ace] raised unexpected: KeyError("['score'] not found in axis")
opencontracts-celeryworker-2  | Traceback (most recent call last):
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 451, in trace_task
opencontracts-celeryworker-2  |     R = retval = fun(*args, **kwargs)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/app/trace.py", line 734, in __protected_call__
opencontracts-celeryworker-2  |     return self.run(*args, **kwargs)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/app/autoretry.py", line 54, in run
opencontracts-celeryworker-2  |     ret = task.retry(exc=exc, **retry_kwargs)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/app/task.py", line 717, in retry
opencontracts-celeryworker-2  |     raise_with_context(exc)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/celery/app/autoretry.py", line 34, in run
opencontracts-celeryworker-2  |     return task._orig_run(*args, **kwargs)
opencontracts-celeryworker-2  |   File "/app/opencontractserver/tasks/doc_tasks.py", line 66, in process_pdf_page
opencontracts-celeryworker-2  |     annotations = extract_pawls_from_pdfs_bytes(pdf_bytes=page_data)
opencontracts-celeryworker-2  |   File "/app/opencontractserver/utils/pdf.py", line 111, in extract_pawls_from_pdfs_bytes
opencontracts-celeryworker-2  |     annotations: list = process_tesseract(page_path)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pawls/preprocessors/tesseract.py", line 101, in process_tesseract
opencontracts-celeryworker-2  |     annotations = parse_annotations(pdf_file)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pawls/preprocessors/tesseract.py", line 80, in parse_annotations
opencontracts-celeryworker-2  |     tokens = extract_page_tokens(pdf_image, pdf_size)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pawls/preprocessors/tesseract.py", line 33, in extract_page_tokens
opencontracts-celeryworker-2  |     res[~res.text.isna()]
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
opencontracts-celeryworker-2  |     return func(*args, **kwargs)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/frame.py", line 5399, in drop
opencontracts-celeryworker-2  |     return super().drop(
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
opencontracts-celeryworker-2  |     return func(*args, **kwargs)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/generic.py", line 4505, in drop
opencontracts-celeryworker-2  |     obj = obj._drop_axis(labels, axis, level=level, errors=errors)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/generic.py", line 4546, in _drop_axis
opencontracts-celeryworker-2  |     new_axis = axis.drop(labels, errors=errors)
opencontracts-celeryworker-2  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6934, in drop
opencontracts-celeryworker-2  |     raise KeyError(f"{list(labels[mask])} not found in axis")
opencontracts-celeryworker-2  | KeyError: "['score'] not found in axis"

Looking at the PDF, I think the root cause is pretty clear: there is no text on the page, so no tokens can be generated. We just need to modify the preprocessor to handle this situation gracefully. I will try to propose a fix when I can take a closer look this weekend, but I wanted to open an issue now in case a fix jumps out at you guys.
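
In the meantime, a caller-side guard along these lines might keep the workers alive. This is only a rough sketch, not a fix to the preprocessor itself: `tokens_or_empty` is a hypothetical helper name, and it assumes that an empty token list is an acceptable result for a blank page. It treats the specific `KeyError` from the traceback above as "no tokens" and re-raises anything else.

```python
from typing import Callable, List


def tokens_or_empty(extract: Callable[[], List[dict]]) -> List[dict]:
    """Run a token extractor and map the blank-page failure to an empty list.

    `extract` is any zero-argument callable that returns a list of token
    dicts. This helper is hypothetical and is not part of PAWLS.
    """
    try:
        return extract()
    except KeyError as exc:
        # Only swallow the failure seen in the log above, i.e.
        # KeyError("['score'] not found in axis"); re-raise anything else.
        if "score" in str(exc):
            return []
        raise
```

In my task this would wrap the failing call, for example `tokens_or_empty(lambda: extract_page_tokens(pdf_image, pdf_size))`. Matching on the error message is brittle, so a proper fix should probably check for an empty OCR result inside the preprocessor instead.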