Science-for-Nature-and-People / text-mining-r

Testing the quanteda R package for text mining

Not able to read in papers with Corpus() #1

Open swood-ecology opened 7 years ago

swood-ecology commented 7 years ago

When executing the following command in R from your tutorial

papers <- Corpus(URISource(pdf), readerControl = list(reader=pdfRead))

I get the following error

sh: pdfinfo: command not found
sh: pdftotext: command not found
Error in system2("pdftotext", c(control$text, shQuote(x), "-"), stdout = TRUE) :
  error in running command

My pdf object looks like this:

pdf
[1] "arees-informatics-2006-reprint.pdf"
[2] "Borer et al 2009 Bull ESA_Effective Data Management.pdf"
[3] "Fegraus-esa_bulletin_eml_ms_07_2005.pdf"
[4] "Harris_2017_Environ._Res._Lett._12_024012.pdf"
[5] "Heidorn_2008_Shedding Light on the Dark Data in the Long Tail of Science.pdf"
[6] "MORTON_et_al-2008-Global_Change_Biology.pdf"
[7] "Ohara et al 2016_Aligning marine species range data to better serve science and conservation.pdf"
[8] "peerj-preprints-549.pdf"

and my readPDF function object looks like this:

pdfRead
function (elem, language, id)
{
    uri <- processURI(elem$uri)
    meta <- pdf_info(uri)
    content <- pdf_text(uri)
    PlainTextDocument(content, meta$Author, meta$CreationDate, meta$Subject,
        meta$Title, basename(elem$uri), language, meta$Creator)
}
<environment: 0x110bd22a0>

brunj7 commented 7 years ago

Hi Steve,

It seems that pdftotext is not installed on your machine. The best way to check is to type the following in a terminal:

pdftotext

If you get an error, the tool is not installed. You can try installing it from here: http://www.foolabs.com/xpdf/download.html

If you get a description of the tool instead, then the problem is something else. Let me know.
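
The same check can also be run from inside R, which helps because R may see a different PATH than your terminal does. This is only a minimal sketch using base R's Sys.which(); the helper name tool_available is made up for illustration and is not part of tm or the tutorial.

```r
# Return TRUE for each command that R can find on its PATH.
# Sys.which() returns "" for commands it cannot locate.
tool_available <- function(cmds) unname(nzchar(Sys.which(cmds)))

# The two command-line tools named in the error message above.
tools <- c("pdfinfo", "pdftotext")
status <- tool_available(tools)
names(status) <- tools
print(status)

# Report anything that is missing.
not_found <- tools[!status]
if (length(not_found) > 0) {
  message("Not found on PATH: ", paste(not_found, collapse = ", "))
}
```

If both entries print TRUE and the error persists, the cause is probably something other than a missing installation.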

swood-ecology commented 7 years ago

Thanks, Julien. It seems like this package isn't available for the most current releases of R. I've been writing code that bypasses it, using the pdftools package rather than the tm package to create the object needed for analysis. It seems to work well this way.
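
The workaround code itself is not shown in this thread. As a rough sketch of what such a bypass could look like, the snippet below reads each PDF with pdftools::pdf_text() and builds a quanteda corpus directly, without going through tm. It assumes the pdftools and quanteda packages are installed and that the PDFs sit in the current working directory; the helper names collapse_pages and read_pdf_text are hypothetical.

```r
# pdftools::pdf_text() returns one string per page; join them into one document.
collapse_pages <- function(pages) paste(pages, collapse = "\n")

# Read a single PDF into one character string (requires the pdftools package).
read_pdf_text <- function(path) collapse_pages(pdftools::pdf_text(path))

# Assumption: the PDFs are in the current working directory.
pdf_files <- list.files(pattern = "\\.pdf$", ignore.case = TRUE)

if (length(pdf_files) > 0) {
  # vapply keeps the file names as names(texts).
  texts <- vapply(pdf_files, read_pdf_text, character(1))
  # quanteda uses the names of a character vector as document names.
  papers <- quanteda::corpus(texts)
  print(summary(papers))
}
```

Because pdftools extracts text inside R, this route does not call the external pdfinfo or pdftotext commands.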
