PolMine / trickypdf

Turn pdf document into simple annotated XML for further processing in a corpus preparation pipeline.

Install Error due to Rpoppler #9

Open hbaggen opened 4 years ago

hbaggen commented 4 years ago

Hello,

I installed trickypdf through both install_github() and githubinstall() on separate occasions to make sure it installed properly and that my issue does not arise from the installation method. Both times, when a prompt asked about updating packages, I accepted updates for all packages. I also tried selecting none, and several other combinations, to confirm that this is the main issue.

My problem was that after the updates took place, one of two errors showed up. The first said that Rpoppler was not supported by R version 3.6.0. The other simply stated that there was an error downloading the package.

Perhaps there is something that I am doing incorrectly, but I wanted to post here to see if other people have had a similar problem. My review covers a topic similar to the research performed by the PolMine group, so I would very much like to take advantage of its packages.

Cheers

PolMine commented 4 years ago

Thanks for your feedback and my apologies for being so slow to respond.

If I am not mistaken, the problem with Rpoppler is its dependency on Glib. Usually, it is only on macOS systems that this is somewhat difficult to solve. For the occurrence on the CRAN macOS systems, see this.

Am I correct to assume that you are on macOS? If so, installing glib first using Homebrew may be the solution. In a terminal, run this:

brew install glib

Then it should be possible to compile Rpoppler (using install.packages("Rpoppler")).

For the further development of trickypdf, dropping the dependency on Rpoppler should really be the aim. The package does not use Rpoppler for the crucial step, which is to turn pdf into XML (that's done via a system call). It is just used to get information on page sizes in this line:

sizes <- stri_extract_all(str = Rpoppler::PDF_info(.self$filename_pdf)$Sizes, regex = "\\d+(\\.\\d+|)")

There must be another way to do this. Maybe the pdftools package, which is portable in a way that Rpoppler is not, has something on offer? I have not yet managed to look this up.
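To see what the regular expression in the `sizes` line picks out, here is a rough shell analogue using `grep -oE`. The function name `extract_sizes` and the sample string are assumptions made for illustration; the actual format of the `Sizes` field returned by Rpoppler may differ.

```shell
#!/bin/sh
# Hypothetical helper: print every integer or decimal number found in the
# string passed as $1, one per line. For ASCII digits, the ERE below matches
# the same numbers as the R pattern "\\d+(\\.\\d+|)".
extract_sizes() {
  printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)?'
}

# Assumed sample input; a real Sizes value may look different.
extract_sizes "595.276 x 841.89"
```

Each number lands on its own line, which is roughly what the R call returns as a character vector of widths and heights.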

PolMine commented 4 years ago

Trying to find an alternative to Rpoppler::PDF_info(), I first installed Rpoppler. What it really takes (on macOS) is an installation of poppler. In a terminal:

brew install poppler
brew upgrade poppler # was necessary in my case

You can check as follows whether poppler-glib is available. In a terminal:

pkg-config --libs poppler-glib

You should see something like "-L/usr/local/Cellar/poppler/0.81.0/lib -L/usr/local/Cellar/glib/2.62.2/lib -L/usr/local/opt/gettext/lib -L/usr/local/Cellar/cairo/1.16.0_2/lib -lpoppler-glib -lgobject-2.0 -lglib-2.0 -lintl -lcairo".
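The same check can be wrapped in a small script to run before attempting to compile Rpoppler. This is only a sketch: the function name `check_poppler_glib` and its return codes are made up for illustration, while `--exists` and `--modversion` are standard pkg-config options.

```shell
#!/bin/sh
# Hypothetical helper: report whether pkg-config can locate poppler-glib.
# Return codes (chosen for this sketch): 0 = found, 1 = missing,
# 2 = pkg-config itself is not installed.
check_poppler_glib() {
  if ! command -v pkg-config >/dev/null 2>&1; then
    echo "pkg-config not found"
    return 2
  fi
  if pkg-config --exists poppler-glib; then
    echo "poppler-glib found, version $(pkg-config --modversion poppler-glib)"
    return 0
  fi
  echo "poppler-glib not found"
  return 1
}

# Print a hint instead of failing when the library cannot be located.
check_poppler_glib || echo "Install poppler first, for example: brew install poppler"
```

If the function reports that poppler-glib is missing even after installing poppler, the PKG_CONFIG_PATH environment variable may need to point at the directory holding the poppler-glib.pc file.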

I also checked whether pdftools has an equivalent function to Rpoppler::PDF_info(). Indeed, there is pdftools::pdf_pagesize(). I can't do this right now, but it should not be difficult to use this function and to get rid of the Rpoppler dependency, which will make the package more portable.

PolMine commented 4 years ago

So I managed to implement the switch from Rpoppler to pdftools. To install things on a Linux machine, the poppler development library is still needed; see the README of pdftools.

I took care that switching from Rpoppler to pdftools does not change the behaviour of the package at all by introducing unit tests before the refactoring exercise. So I hope it works!

But as you work on Windows, it will still be necessary for you to install the poppler utils. Following advice on Stack Overflow, you might check here for poppler binaries.

An out-of-the-box installation for Windows is a bigger exercise. It would require modifying a package that relies on the poppler library (libpoppler) so that it returns XML from a pdf document instead of plain text; producing that XML is what trickypdf does. It is feasible, but nothing I can do immediately.

hbaggen commented 4 years ago

Hello,

Thank you very much for your help. Converting the software was immensely helpful. To clarify for users of this package (though perhaps this is self-evident): when installing trickypdf, make sure to uninstall pdftools and xfun. Deleting Rcpp may not be necessary, but the usethis package is required for Rcpp.

I am continuing to input my PDF files into trickypdf and will update when I have worked out the issues that I am having on my side.

Cheers