Open LukasWallrich opened 1 year ago
In a larger set of files, this issue occurred another 8 times (e.g., 10.1016/j.jbusres.2018.11.029 and 10.1002/smj.2976), so it is reasonably common and resolves by using pdftools. Another issue that occurred is that xpdf inserts "A"s into two tests in the attached PDF, which are then no longer found.
10.1016--j.obhdp.2013.04.003.pdf
However, pdftools failed to extract some tests (without causing errors) due to problems in reading multi-column layouts. Another tool that I tried (pdfminer.six) resolves that, but also fails with some chisq ... so for now, I would no longer recommend changing the default PDF engine, only clearer error messages. Also, it might be worth recommending that users re-OCR PDFs that fail (or, if striving for completeness, all PDFs). For instance, statcheck does not find results in the following file due to issues with the =, but works after running ocrmypdf --force-ocr
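The re-OCR step could be scripted from R roughly as follows. This is a minimal sketch, assuming the ocrmypdf command-line tool is installed and on the PATH; build_ocr_args() and reocr_pdf() are hypothetical helper names used for illustration and are not part of statcheck.

```r
# Hypothetical helper: assemble the arguments for `ocrmypdf --force-ocr <in> <out>`
build_ocr_args <- function(input, output) {
  c("--force-ocr", shQuote(input), shQuote(output))
}

# Hypothetical helper: re-OCR one PDF and return the path of the new file
reocr_pdf <- function(input, output = sub("\\.pdf$", "_ocr.pdf", input)) {
  if (!nzchar(Sys.which("ocrmypdf"))) {
    stop("ocrmypdf was not found on the PATH")
  }
  # system2() returns the exit status when output is not captured
  status <- system2("ocrmypdf", build_ocr_args(input, output))
  if (status != 0) {
    stop("ocrmypdf exited with status ", status, " for ", input)
  }
  output
}
```

The returned path could then be passed to statcheck again (e.g., via statcheck::checkPDF()), so that only files that fail on the first pass need the slower OCR step.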
This PDF file 10.1111:apps.12362.pdf fails with
This is because the chisq tests get read as follows:
This is really odd xpdf behaviour because I can copy-paste them from the PDF without trouble, so they seem to be embedded as characters rather than images. So, two questions here:
could not process "(2 (199) = 627.73"
then trouble-shooting would be much easier? (Getting this to work requires two minor pre-processing steps:
pdftools::pdf_text(f) |> paste(collapse = "") |> gsub("\n", "", x = _) |> statcheck:::extract_stats("chisq")
)
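To illustrate why the two pre-processing steps matter, here is a small base-R sketch. It assumes a mock character vector standing in for the one-string-per-page output of pdftools::pdf_text(), and it uses a hypothetical, simplified pattern rather than statcheck's actual regular expressions.

```r
# Mock of per-page text (an assumption, not taken from any real PDF)
pages <- c(
  "The effect was significant, t(28) =\n2.20, p = .036.\n",
  "A second test gave F(1, 30) = 4.50,\np = .042.\n"
)

# Step 1: collapse the per-page vector into a single string
txt <- paste(pages, collapse = "")

# Step 2: remove line breaks so a statistic split across lines is contiguous
flat <- gsub("\n", "", txt, fixed = TRUE)

# Hypothetical, simplified pattern for one t-test (not statcheck's own regex)
pattern <- "t\\(28\\) ?= ?2\\.20"

found_before <- grepl(pattern, txt)   # line break sits between "=" and "2.20"
found_after  <- grepl(pattern, flat)  # line break removed
```

In this mock, the pattern only matches once the line break between "=" and the test value has been removed, which is the kind of split that the two steps are meant to undo.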