BUG FIX: Ensure cli.pawls.preprocessors.tesseract.extract_page_tokens() casts df value types to string before calling .str.cat(...)

Line 43 of cli.pawls.preprocessors.tesseract, in extract_page_tokens(), fails when the text column's data type is not actually string, because .str.cat(...) can only be called on cells containing strings.

See here:

https://github.com/allenai/pawls/blob/1225660ccb4f3b9877bf45c04baecc2798d183ee/cli/pawls/preprocessors/tesseract.py#L20-L46

I assume this is rare, but it depends on the tool used to author the original source PDF. I have a test.pdf where some of the pages contain only numbers, and it appears that the text PAWLS / tesseract extracted and then stored in the pandas data frame has type float64. When this happens, extract_page_tokens() as written fails, because .str.cat(...) can only be called on cells containing strings. I added .astype(str) to line 43 to force conversion of the text cell contents to string type, which should cover corner cases like this one where the data can be converted to a string. It works for me, at least on the PDF that was crashing the parser (attached).
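For reviewers who want to see the failure mode in isolation, here is a minimal sketch that does not touch the PAWLS code. The Series below is a hypothetical stand-in for the text column of a numbers-only page, and the names `text`, `raised`, and `joined` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the "text" column of a page that contains
# only numbers: pandas stores these values as float64, not as strings.
text = pd.Series([12.0, 7.0, 2021.0], name="text")

# Without a cast, the .str accessor refuses to work on a numeric column.
try:
    text.str.cat(sep=" ")
    raised = False
except AttributeError:
    raised = True

# With the cast, every cell becomes a string first, so .str.cat works.
# Note that floats keep their decimal part when converted ("12.0").
joined = text.astype(str).str.cat(sep=" ")

print(raised)
print(joined)
```

The cast only changes how the values are represented as text; whether "12.0" or "12" is the desired token text for such pages is a separate question that this sketch does not address.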

BUG FIX: Ensure cli.pawls.preprocessors.tesseract.extract_page_tokens() casts df value types to string before calling .str.cat(...) #199