desmarais-lab / govWebsites


impediments to creating an R package #7

Open markusneumann opened 6 years ago

markusneumann commented 6 years ago

This issue lists the reasons we might have difficulties turning this project into an R package. These aren't urgent problems that need to be responded to immediately; I just want to maintain a list of concerns that need to be kept in mind. Consequently, I will update this post rather than adding additional posts if anything else comes up.

Unix-only libraries:

- wget
- libmagic (determining file types; works on Windows according to the GitHub page)

Python packages:

- spaCy (lemmatization; spacyr works well enough now, see the sketch after this list)
- Selenium (web scraping on interactive websites; in theory there is an R version, but it is terrible)

R packages not on CRAN:

- SpeedReader (fightin' words; not used any more)
- wand (R interface to libmagic)
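For illustration, a minimal sketch of the spacyr route for lemmatization, assuming spaCy and the en_core_web_sm model are already installed (the example text is a placeholder):

```r
library(spacyr)

# Start the spaCy backend; spacy_install() can set it up if needed.
spacy_initialize(model = "en_core_web_sm")

# spacy_parse() returns a data frame with one row per token,
# including a 'lemma' column.
parsed <- spacy_parse("Municipal governments publish many documents.",
                      lemma = TRUE)
parsed$lemma

spacy_finalize()
```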

bdesmarais commented 6 years ago

If we make an R package for this project, I think the most valuable group of methods to implement would be those for downloading files from websites and processing them into text for text analysis; other packages can take over at that point. Would the download.file() function in R serve as a viable substitute for wget? I have never tried to integrate Python code into an R package, but here is a vignette on passing CRAN checks with Python module dependencies: https://cran.r-project.org/web/packages/reticulate/vignettes/package.html
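For reference, the core pattern from that vignette, adapted here for spaCy (an untested sketch of what the package code would look like):

```r
# Global reference to spaCy, initialized in .onLoad()
spacy <- NULL

.onLoad <- function(libname, pkgname) {
  # delay_load lets the package install and load without spaCy present;
  # the module is only imported when first used.
  spacy <<- reticulate::import("spacy", delay_load = TRUE)
}
```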



markusneumann commented 6 years ago

The download.file() function doesn't have the ability to download recursively, so it only gets whatever URL you feed it. That is, unless you set 'method' to 'wget', but then it just uses the system's version of wget. I tried this out already, as well as other packages (RCurl, httr, downloader, Rcrawler), but none of them did what we needed them to do (Rcrawler came the closest).
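To make the limitation concrete, a rough sketch (the URL and wget flags are illustrative, not the exact call we use):

```r
# download.file() retrieves exactly one URL; there is no recursion.
download.file("https://www.example.gov/index.html",
              destfile = "index.html")

# Even method = "wget" just delegates that single request to the system
# wget binary, so a recursive mirror means calling wget directly:
system2("wget", c("--recursive", "--level=3", "--no-parent",
                  "--accept", "html,pdf,doc,docx,txt",
                  "https://www.example.gov/"))
```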

bdesmarais commented 6 years ago

In what ways did Rcrawler fall short of what you have done with wget?

markusneumann commented 6 years ago

It is designed to only download HTML files, not pdf/doc/docx/txt. In theory, it has the following option, which should enable it to download other file types as well (if I understand it correctly):

urlExtfilter | character's vector, by default the crawler avoid irrelevant files for data scraping such us xml,js,css,pdf,zip ...etc, it's not recommanded to change the default value until you can provide all the list of filetypes to be escaped.

In practice, I've tried shortening that character vector (and emptying it entirely) so that it also gets the other file types we want, but that didn't work: it still only downloads HTML files. There is no other documentation on that option, and given how its description reads, and what the rest of the package's documentation covers, downloading non-HTML files doesn't seem to be intended as core functionality of the package.
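For the record, this is roughly the kind of call I tried; the website and the shortened filter are illustrative (Rcrawler's main entry point is the Rcrawler() function):

```r
library(Rcrawler)

# Shorten the default extension blacklist so pdf/doc/docx/txt are no
# longer filtered out; in practice, only HTML files were saved anyway.
Rcrawler(Website = "https://www.example.gov/",
         urlExtfilter = c("js", "css", "xml", "zip"))
```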