allenai / pawls

Software that makes labeling PDFs easy.
https://pawls.apps.allenai.org
Apache License 2.0
380 stars 74 forks

BUG FIX: Ensure cli.pawls.preprocessors.tesseract.extract_page_tokens() casts df value types to string before calling .str.cat(...) #199

Closed JSv4 closed 1 year ago

JSv4 commented 1 year ago

Line 43 of cli.pawls.preprocessors.tesseract, in extract_page_tokens(), fails when the underlying text column is not actually of string type, because .str.cat(...) can only be called on cells containing strings.

See here:

https://github.com/allenai/pawls/blob/1225660ccb4f3b9877bf45c04baecc2798d183ee/cli/pawls/preprocessors/tesseract.py#L20-L46

I assume this is rare, but it depends on the original source PDF authoring tool. I have a test.pdf where some of the pages contain only numbers, and it appears that the text PAWLS / tesseract extracted and stored in the pandas data frame has type float64. When this happens, extract_page_tokens() as written fails, since .str.cat(...) can only be called on cells containing strings. I added .astype(str) to line 43 to force conversion of the text cell contents to string type, which should cover these kinds of corner cases where the data can be converted to a string. It is working for me, at least on the PDF that was crashing the parser (attached).
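For anyone who wants to reproduce this without the PDF, here is a minimal sketch of the failure and the fix using pandas alone. The data frame and its `text` column are hypothetical stand-ins for what tesseract returns on a numbers-only page; this is not the actual code from tesseract.py.

```python
import pandas as pd

# Hypothetical stand-in for a numbers-only page: pandas infers float64
# for the "text" column instead of a string dtype.
df = pd.DataFrame({"text": [12.0, 2021.0, 7.0]})
assert df["text"].dtype == "float64"

# Without the cast, the .str accessor refuses to work on a float column.
try:
    df["text"].str.cat(sep=" ")
    failure = None
except AttributeError as err:
    failure = str(err)

# With the cast, every cell becomes a string first, so .str.cat(...) succeeds.
joined = df["text"].astype(str).str.cat(sep=" ")
print(failure)
print(joined)
```

One side effect worth knowing about: because the values were already parsed as floats, the cast produces text such as "12.0" rather than "12".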