benmarwick / rrtools

rrtools: Tools for Writing Reproducible Research in R

custom knit to update DESCRIPTION Imports during knitting #18

Closed benmarwick closed 7 years ago

benmarwick commented 7 years ago

In https://github.com/benmarwick/rrtools/commit/e235c44798ddf4e911b1c72c4c74016501d1db72 I have tried to override the knit button to trigger a function that scans the Rmd for library() etc. calls and updates the DESCRIPTION Imports with the names of all the packages used in the Rmd. I often get tripped up because I forget to add those, so I'm looking for a way to automate this.

I have added a line to the YAML in paper.Rmd, and a new function knit_and_update_desc_imports to try and do this. It works, but only if we import packrat, which I don't want to do.

The problem is that I cannot exactly duplicate the key function for scanning the Rmd, packrat:::fileDependencies. I have tried to copy in all the relevant code from packrat, to avoid the dependency (because that gives warnings with devtools). But still I cannot make my local fileDependencies work, only packrat:::fileDependencies works.

I would like your feedback on (1) if this is a good idea and (2) how to make the local fileDependencies work without using packrat.

I have also directly copied in a lot of code from devtools so we don't have to use ::: at all anymore. That gets rid of a bunch of warnings.

dakni commented 7 years ago

In general I think it is a good idea to have the Imports in the DESCRIPTION file added automatically.

I am not sure to what degree R is able to use regular expressions, but in my understanding the most straightforward approach would be something like

\(library\(\)\([a-z]+\)\)

Searching with this would make it possible to collect the different matched instances (at least in theory, and in base Emacs)... but as long as base R can handle regular expressions, this should be the easiest way without requiring additional packages.
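
Base R can indeed handle this with gregexpr() and regmatches(). A minimal sketch; the sample lines and the character class for package names are illustrative assumptions, not rrtools code:

```r
# Sketch: pull package names out of library() calls with base R.
# The sample lines are made up for illustration.
lines <- c("library(ggplot2)", "x <- 1 + 1", "library(dplyr)")

# match library(<name>), where <name> is letters, digits, or dots
hits <- regmatches(lines, gregexpr("library\\(([A-Za-z0-9.]+)\\)", lines))

# strip the surrounding "library(" and ")" to keep only the names
pkgs <- gsub("library\\(|\\)", "", unlist(hits))
pkgs
#> [1] "ggplot2" "dplyr"
```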

nevrome commented 7 years ago

@dakni regexes are available in R.

I suggest pulling off this DESCRIPTION update magic in a separate (!) function. Maybe with a small RStudio addin. No direct connection between this function and the knitting process! I really dislike the idea of a modified knit button.
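
For context, an RStudio addin is registered through a small DCF file at inst/rstudio/addins.dcf, whose Binding field names an exported R function to run. A minimal sketch; the function name update_imports is purely hypothetical and does not exist in rrtools:

```
Name: Update DESCRIPTION Imports
Description: (hypothetical) Scan the compendium for packages used and rewrite the Imports field
Binding: update_imports
Interactive: false
```

Because the addin would just call a plain exported function, the same function stays usable from the console for non-RStudio users.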

benmarwick commented 7 years ago

Thanks for your feedback. Yes, both the packrat and checkpoint pkgs already have this kind of function to scan a file or a project and identify the packages used (looking for library, require, requireNamespace, ::, etc.). No need to reinvent that wheel. But since I can't get it to work properly, and we don't agree that this is a good idea, I've removed it in f11031d. I'll look into an add-in for this functionality.

benmarwick commented 7 years ago

But I still get this warning:

checking for missing documentation entries ... WARNING
Undocumented code objects:
  'use_analysis'

You guys seeing that too?

nevrome commented 7 years ago

Aha - I thought this was just a problem with my setup. I have no idea where this is coming from. I guess you have already seen this thread? If I change the function name to "use_analysi" instead of "use_analysis", the warning vanishes. Interesting...
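
For the record, R CMD check emits that WARNING when an exported object has no matching .Rd help file; adding a roxygen2 block above the function and re-running devtools::document() normally generates one. A hypothetical stub for illustration (the real use_analysis signature lives in rrtools):

```r
# Hypothetical stub: the real use_analysis lives in rrtools.
# The point is the roxygen2 header, which devtools::document()
# turns into man/use_analysis.Rd, silencing the check WARNING.

#' Set up the analysis directory of a research compendium
#'
#' @param path path to the compendium root
#'
#' @export
use_analysis <- function(path = ".") {
  invisible(path)
}
```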

dakni commented 7 years ago

sorry for still hanging around in this issue, but regexp (especially regexp replace) is not that straightforward in R.

I still think such a function can be useful.

the example code:

library(magrittr)
a <- grep(pattern = "library", x = readLines("analysis/Test.Rmd"), value = TRUE) %>%
  strsplit(split = "library\\(") %>%  
  lapply(function(x){x[2]}) %>%
  lapply(function(x){strsplit(x, split = "\\)")}) %>%
  unlist
a

tmp <- readLines("DESCRIPTION")
tmp[grep("Imports", tmp)] <- paste0("Imports: ", paste(a, collapse = ", "))
writeLines(text = tmp, con = "DESCRIPTION")

Now the question is: is it worth implementing?

benmarwick commented 7 years ago

We would also want to get pkgs used via :: and :::, right? I'm not sure why the equivalent fns in packrat and checkpoint are so much more complicated than this. This is what packrat does to get the libraries: https://github.com/benmarwick/rrtools/blob/e235c44798ddf4e911b1c72c4c74016501d1db72/R/dependencies.R

At the bottom of https://github.com/benmarwick/rrtools/blob/e235c44798ddf4e911b1c72c4c74016501d1db72/R/hello.R I used add_desc_package from devtools to insert the package names into the DESCRIPTION.

How do you think this should fit into the workflow?

My previous suggestion was that it should happen when the source document is rendered. We can override the knit button in RStudio to do this. The override is only valid for the Rmd being rendered; it is not a persistent override. The advantage of this is that the user doesn't have to think about this boring task.

Aside from that option, the user must deliberately run this kind of function repeatedly. I find this option undesirable because it adds another task that has to be done over and over. That seems silly.

Perhaps we could have it as an argument to use_analysis, to override the knit button or not, and export the function also. Then those who prefer to run the function manually can do so, leaving their knit button intact. What do you think?

dakni commented 7 years ago

I think you're right. However, I prefer plain solutions. Any "overriding" of the knit button also has to work for non-RStudio users (like me ;) ). Therefore, I like the idea of putting it in as an argument to use_analysis.

Since the proposed function would rewrite the "Imports" line every time, there should also be no problem with old left-over entries.

Getting :: and ::: to work is easy: grep for these, strsplit at the colons, lapply over it, and take the first element.
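
That recipe can be sketched in a few lines of base R; the input lines here are made up, and real code would also need to cope with :: inside strings and comments:

```r
# Sketch of the grep/strsplit recipe for :: and ::: calls.
lines <- c("dplyr::mutate(df)", "x <- pkg:::hidden()", "y <- 1")

hits <- grep("::", lines, value = TRUE)   # keep lines containing ::
parts <- strsplit(hits, ":{2,3}")         # split at :: or :::

# the package name is the trailing identifier of the first piece
pkgs <- vapply(parts, function(x) sub(".*[^A-Za-z0-9.]", "", x[1]), character(1))
unique(pkgs)
#> [1] "dplyr" "pkg"
```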

We also need a kind of loop structure that scans through the compendium directory for R and Rmd files and then runs the grep part on them.

nevrome commented 7 years ago

@benmarwick

Aside from that option, the user must deliberately run this kind of function repeatedly. I find this option undesirable because it adds another task that has to be done over and over. That seems silly.

Writing an R package is a complex task with many hurdles. rrtools simplifies and accelerates the creation of a special kind of package, but it doesn't relieve the user of the need to know more or less what she/he's doing. We're walking the narrow line between R user and R developer. I would prefer to make this dependency magic a conscious thing. Otherwise the user eventually ends up in a situation where they don't know anymore what exactly is going on. Debugging hell. Also, we're losing flexibility with such highly specialized setups: if I remember correctly, it was your suggestion to adjust the package exactly to the needs of the individual project (context: small, medium, large).

@dakni

We also need a kind of loop structure that scans through the compendium directory for R and Rmd files and then runs the grep part on them.

dirDependencies() in the packrat code seems to be almost there:

  pattern <- "\\.[rR]$|\\.[rR]md$|\\.[rR]nw$|\\.[rR]pres$"
  pkgs <- character()
  R_files <- list.files(dir,
                        pattern = pattern,
                        ignore.case = TRUE,
                        recursive = TRUE
  )

dakni commented 7 years ago

@nevrome, great thanks.

pattern <- "\\.[rR]$|\\.[rR]md$|\\.[rR]nw$|\\.[rR]pres$"
R_files <- list.files(dir,
                        pattern = pattern,
                        ignore.case = TRUE,
                        recursive = TRUE
  )

pkgs <- list()
i <- 1

while (i <= length(R_files)) {
  pkgs[[i]] <- grep(pattern = "library", x = readLines(R_files[i]), value = TRUE) %>%
    strsplit(split = "library\\(") %>%
    lapply(function(x){x[2]}) %>%
    lapply(function(x){strsplit(x, split = "\\)")}) %>%
    unlist
  i <- i + 1
}

pkgs

I scanned a quite large folder with a lot of R files. At some points that function does not work as expected since the usage of strsplit is not general enough. Some example outputs that show the issues:

rgdal, lib.loc=list.dirs("   "
## parallel"                "rgrass7, lib.loc=list.dirs(
" ## parallel"
"; "
\"spatgraphs\"
\"sp\", \"spatgraphs\"
"foreach, lib.loc=list.dirs("
" %>%"

The rest works great.

The easiest way to arrive at a vector might be:

unique(unlist(pkgs))
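
The edge cases above come from splitting raw text lines. A more robust route, and roughly the idea behind packrat:::fileDependencies, is to let R's own parser find the calls, which sidesteps strings and comments automatically. A minimal sketch, not rrtools code:

```r
# Sketch of a parser-based scan. A production version would also
# need to cope with empty call arguments (e.g. x[, 1]).
find_pkgs <- function(code) {
  pkgs <- character()
  walk <- function(e) {
    if (is.call(e)) {
      fn <- e[[1]]
      if (is.name(fn) &&
          as.character(fn) %in% c("library", "require", "requireNamespace")) {
        # first argument of library()/require() is the package
        pkgs <<- c(pkgs, as.character(e[[2]]))
      } else if (is.name(fn) && as.character(fn) %in% c("::", ":::")) {
        # left-hand side of pkg::fun is the package
        pkgs <<- c(pkgs, as.character(e[[2]]))
      }
      for (i in seq_along(e)) walk(e[[i]])
    }
  }
  for (e in parse(text = code)) walk(e)
  unique(pkgs)
}

find_pkgs("library(rgdal, lib.loc = list.dirs('x'))\ny <- sp::bbox(z)")
#> [1] "rgdal" "sp"
```
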
dakni commented 7 years ago

of course one should just pass the working directory/compendium directory to scan for files:

R_files <- list.files(getwd(),
                        pattern = pattern,
                        ignore.case = TRUE,
                        recursive = TRUE
  )

nevrome commented 7 years ago

Maybe it's a nice addition to also check if the package is available on CRAN. Otherwise it doesn't make much sense to add it to the DESCRIPTION file.

Something like this could do the trick:

x <- "ggplot2"
x %in% available.packages()[,1]

dakni commented 7 years ago

@nevrome nice! This solves the strsplit issue, since it mostly stems from libraries not installed from CRAN.

So, the entire thing would look like this:

pattern <- "\\.[rR]$|\\.[rR]md$|\\.[rR]nw$|\\.[rR]pres$"
R_files <- list.files(dir,
                        pattern = pattern,
                        ignore.case = TRUE,
                        recursive = TRUE
  )

pkgs <- list()
i <- 1

while (i <= length(R_files)) {
  pkgs[[i]] <- grep(pattern = "library", x = readLines(R_files[i]), value = TRUE) %>%
    strsplit(split = "library\\(") %>%
    lapply(function(x){x[2]}) %>%
    lapply(function(x){strsplit(x, split = "\\)")}) %>%
    unlist
  i <- i + 1
}

pkgs

a <- unique(unlist(pkgs))
a <- a[a %in% available.packages(repos = "https://cran.r-project.org")[, 1]]

tmp <- readLines("DESCRIPTION")
tmp[grep("Imports", tmp)] <- paste0("Imports: ", paste(a, collapse = ", "))
writeLines(text = tmp, con = "DESCRIPTION")

nevrome commented 7 years ago

The call to available.packages() assumes an internet connection. We could again use curl::has_internet() here. Or, even better, a check whether https://cran.r-project.org is reachable.

dakni commented 7 years ago

ok, something like this

if (curl::has_internet() && RCurl::url.exists("https://cran.r-project.org")) {
  a <- a[a %in% available.packages(repos = "https://cran.r-project.org")[, 1]]
}

though now, without internet, the messy a (the grepped list of package names) cannot be cleaned. Hence, throw an error at this stage?

nevrome commented 7 years ago

Hey @dakni, did you give it up already? ;-) I decided that I want this function.

guacamole <- function(path = getwd()) {

  # check if directory or single file
  if (utils::file_test("-d", path)) {
    # if directory, search for all possible R or Rmd files
    pattern <- "\\.[rR]$|\\.[rR]md$|\\.[rR]nw$|\\.[rR]pres$"
    R_files <- list.files(
      path,
      pattern = pattern,
      ignore.case = TRUE,
      recursive = TRUE
    )
  } else if (utils::file_test("-f", path)) {
    # if file, just copy it into the files list
    R_files <- path
  } else {
    # stop if path doesn't exist at all
    stop("File or directory doesn't exist.")
  }

  # loop over every file
  pkgs <- lapply(
    R_files,
    function(y) {
      # read files line by line
      current_file <- readLines(y)
      # get libraries explicitly called via library()
      library_lines <- grep(pattern = "library", x = current_file, value = TRUE)
      l_libs <- strsplit(library_lines, split = "library\\(") 
      l_libs <- lapply(l_libs, function(x){strsplit(x[2], split = "\\)")}) 
      l_libs <- unlist(l_libs)
      # get libraries explicitly called via require()
      require_lines <- grep(pattern = "require", x = current_file, value = TRUE)
      r_libs <- strsplit(require_lines, split = "require\\(")
      r_libs <- lapply(r_libs, function(x){strsplit(x[2], split = "\\)")})
      r_libs <- unlist(r_libs)
      # get libraries implicitly called via ::
      point_lines <- grep(pattern = "::", x = current_file, value = TRUE)
      # one of the most complex regular expressions I ever managed to figure out:
      # search for all runs of alphanumeric characters between the
      # line start/a non-alphanumeric character and ::
      p_libs <- regmatches(
        point_lines, 
        gregexpr("(?<=^|[^a-zA-Z0-9])[a-zA-Z0-9]*?(?=::)", point_lines, perl = TRUE)
      )
      p_libs <- unlist(p_libs)
      # merge results for current file
      res <- c(l_libs, r_libs, p_libs)
      return(unique(res))
    }
  )

  # merge results of every file
  pkgs <- unique(unlist(pkgs))
  # remove NA and empty string
  pkgs <- pkgs[pkgs != "" & !is.na(pkgs)]

  # remove packages that are not on CRAN
  # TODO: keep an eye on that one: https://github.com/ropenscilabs/available
  if (curl::has_internet() && RCurl::url.exists("https://cran.r-project.org")) {
    pkgs <- pkgs[pkgs %in% utils::available.packages(repos = "https://cran.r-project.org")[, 1]]
  }

  # needs massive amounts of love
  tmp <- readLines("DESCRIPTION")
  tmp[grep("Imports", tmp)] <- paste0("Imports: ", paste(pkgs, collapse = ", "))
  writeLines(text = tmp, con = "DESCRIPTION")
}

Please give it a try -- regex is a bitch. I'm tired and have a headache now, so I stopped for today. The last important part missing is a better implementation of the writing process to the DESCRIPTION file.

@benmarwick Does this still fit into the package? You invested a lot of time into it since my last review and I don't understand the directory structure anymore...

nevrome commented 7 years ago

Thanks for merging, Ben. Please change name and position of the function to your liking. Of course such a function always has shortcomings concerning special cases. I tested it with 5-6 of my projects and fixed all the occurring problems, but I'm sure there's more to come. I added a lot of code comments to make it easy to improve it in the future.

benmarwick commented 7 years ago

This has been working well for me, thanks again! Closing this for now.