MicheleNuijten / statcheck

A spellchecker for statistics

PDF with chisq fails with unclear error message due to extraction issue (improve message / use pdftools?) #84

Open LukasWallrich opened 1 year ago

LukasWallrich commented 1 year ago

This PDF file (10.1111:apps.12362.pdf) fails with:

Error in if (grepl(pattern = RGX_Q, x = test_raw)) { : 
  the condition has length > 1

This is because the χ is dropped during extraction, so the chisq tests get read as follows:

a good model fit (2 (199) = 627.73, p < .001, CFI = .94, RMSEA = .07, SRMR = .05), and [...] loading on one factor (2 (206) = 2533.69, p < .001, CFI = .67, RMSEA = .15, SRMR = .15) and the one-factor model with all items loading on one common factor (2 (209) = 4489.05, p < .001, CFI = .40, RMSEA = .20, SRMR = .17).

This is really odd xpdf behaviour, because I can copy-paste the χ characters from the PDF without trouble, so they seem to be embedded as characters rather than images.

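So, two questions here: could the error message be made clearer, and would it be worth using pdftools for extraction? pdftools reads the same passage as follows: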

a good model fit (χ 2 (199) = 627.73, p < .001, CFI = .94,\nRMSEA = .07, SRMR = .05), and [...] loading on one factor (χ 2 (206) = 2533.69, p < .001, CFI = .67, RMSEA = .15, SRMR = .15) and\nthe one-factor model with all items loading on one common factor (χ 2 (209) = 4489.05,\np < .001, CFI = .40, RMSEA = .20, SRMR = .17).

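(Getting this to work requires two minor pre-processing steps, collapsing the pages into one string and removing the line breaks: pdftools::pdf_text(f) |> paste(collapse = "") |> gsub("\n", "", x = _) |> statcheck:::extract_stats("chisq"). Note that the _ placeholder of the native pipe has to be passed as a named argument, hence x = _.)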
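As a minimal sketch of those two pre-processing steps, the helper below collapses the per-page output into a single string and strips the line breaks. The name collapse_pdf_text is hypothetical, and the pages vector is only a stand-in for what pdftools::pdf_text(f) would return, built from the text quoted above; the split into two "pages" is an assumption made for illustration.

```r
# Hypothetical helper: collapse the per-page character vector returned by
# pdftools::pdf_text() into one string and remove all line breaks.
collapse_pdf_text <- function(pages) {
  txt <- paste(pages, collapse = "")
  gsub("\n", "", txt, fixed = TRUE)
}

# Stand-in for pdftools::pdf_text(f), using text quoted in this thread.
# "\u03c7" is the Greek letter chi.
pages <- c(
  "a good model fit (\u03c7 2 (199) = 627.73, p < .001, CFI = .94,\nRMSEA = .07, SRMR = .05), and ",
  "loading on one factor (\u03c7 2 (206) = 2533.69, p < .001, CFI = .67, RMSEA = .15, SRMR = .15)"
)

txt <- collapse_pdf_text(pages)

# With a real file and statcheck installed, the call from this thread would be:
# f   <- "path/to/file.pdf"                          # hypothetical path
# txt <- collapse_pdf_text(pdftools::pdf_text(f))
# statcheck:::extract_stats(txt, "chisq")            # internal, unexported function
```

Because statcheck:::extract_stats is an internal function, its behaviour may change between statcheck versions, so this is only meant for diagnosing a failing file.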

LukasWallrich commented 8 months ago

In a larger set of files, this issue occurred another 8 times (e.g., 10.1016/j.jbusres.2018.11.029 and 10.1002/smj.2976), so it is reasonably common and is resolved by using pdftools. Another issue that occurred is that xpdf inserts As into two tests in the attached PDF (10.1016--j.obhdp.2013.04.003.pdf), which are then no longer found.

However, pdftools failed to extract some tests (without causing errors) due to problems in reading multi-column layouts. Another tool that I tried (pdfminer.six) resolves that, but also fails with some chisq ... so for now, I would no longer recommend changing the default PDF engine, only clearer error messages. Also, it might be worth recommending that users re-OCR PDFs that fail (or, if striving for completeness, all PDFs). For instance, statcheck does not find results in the following file due to issues with the = sign, but works after running ocrmypdf --force-ocr:

10.1002--job.220.pdf
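To illustrate what a clearer error message could look like, here is a rough sketch of a wrapper that catches the failure and re-raises it with the file name and a hint about text extraction. The names check_with_hint and failing_checker are hypothetical and not part of statcheck; the stand-in checker only mimics the error reported above.

```r
# Hypothetical wrapper: run a checker on one file and, if it fails,
# re-raise the error with the file name and a hint about text extraction.
check_with_hint <- function(file, checker) {
  tryCatch(
    checker(file),
    error = function(e) {
      stop(
        "Could not process '", file, "': ", conditionMessage(e),
        "\nThis may be a text-extraction problem (e.g. a dropped chi symbol). ",
        "Consider another extraction tool or re-running OCR on the PDF.",
        call. = FALSE
      )
    }
  )
}

# Stand-in checker that fails with the message reported in this issue.
failing_checker <- function(file) stop("the condition has length > 1")

# Capture the improved message instead of aborting the script.
msg <- tryCatch(
  check_with_hint("example.pdf", failing_checker),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")
```

In practice, the checker argument would presumably be something like statcheck::checkPDF, but a proper fix would belong inside statcheck itself rather than in a user-side wrapper.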