CodeDepends as a backend for reproducible build systems

wlandau-lilly commented 7 years ago

I just learned about CodeDepends, and I am glad to find such an active effort in static code analysis. R-focused reproducible build systems like drake and remake (and of course, knitr) could seriously benefit (richfitz/remake#172, wlandau-lilly/drake#41). Drake and remake rely on dependency graphs of the relationships among the "targets" of a data analysis project, and the dependencies for these graphs are resolved using code analysis. For drake in particular, I am considering switching the backend from codetools to CodeDepends, and I am wondering if you think the transition would be worth the trouble. In general, what would you say are the main advantages of CodeDepends relative to codetools?

clarkfitzg commented 7 years ago

Drake looks like an interesting system, especially the parallel stuff. Can you point me to a small example analysis that uses it?

Here's an example where I used CodeDepends to generate a graph representing the dependencies between top level expressions in an R script. I use it to map this simulation script into this graph.

You may be able to do something like this to automatically generate the drake code from a simple R script.

From what I've seen CodeDepends has many more tools to resolve dependencies for the graphs you're interested in relative to codetools.

wlandau-lilly commented 7 years ago

Thanks, @clarkfitzg! Drake has a small built-in canonical example analysis in the quickstart vignette on CRAN, and you can generate the underlying R script with drake::example_drake("basic").

I will look into your CodeDepends example. From your description, it sounds like that might even help with wlandau-lilly/drake#25.

There is plenty of room to improve how drake tracks dependencies (also at wlandau-lilly/drake#11), and I hope some of those extra tools in CodeDepends can help.

gmbecker commented 7 years ago

Another thing to be aware of, depending on exactly how you intend to use codetools, is that there are corner cases that it currently gets wrong:

> f = function() {x = x + y; x}
> findGlobals(f)
[1] "{" "+" "=" "y"

(x is a global symbol in the above definition of f) This can wreak havoc when you're doing careful computing on dependencies.

I think you may also be interested in some work I did during my thesis on input-aware caching via RCacheSuite (disclaimer, I haven't looked at that code in a while, though now that CodeDepends is on CRAN I intend to get back to it and push it out as well). Not exactly the same thing as make-like build systems, but related, I think. At least it does the same type of dependency calculation you probably want, using CodeDepends.

wlandau-lilly commented 7 years ago

@gmbecker interesting. Both drake and remake use storr to cache both code and targets, and RCacheSuite looks like it makes the code part easier.

Incidentally, for drake, I would actually prefer findGlobals(f) to ignore x and include y. (For local variables like x, I can use codetools::findFuncLocals(formals(f), body(f))). But regarding your general point about the havoc of codetools, I totally agree. I am mainly concerned about corner cases like the following.

f <- function(){
  b = get("x", envir = globalenv())
  digest::digest(1234)
}
codetools::findGlobals(f) # Incorrectly ignores x and digest.

And I see that CodeDepends::getInputs(body(f)) does detect digest.

wlandau-lilly commented 7 years ago

After digging into the package some more, I have decided against replacing codetools with CodeDepends in the preexisting internals of drake. However, I think CodeDepends will put the otherwise difficult issues wlandau-lilly/drake#9 and wlandau-lilly/drake#25 within reach, and I look forward to getting started on these new features.

gmbecker commented 7 years ago

Do you mind if I ask what made you come to that decision? It is your prerogative, of course, but using two entirely different static code frameworks in the same project can't be ideal. Are we missing functionality that code tools gives you?

Best, ~G

On Wed, Jun 14, 2017 at 8:07 AM, Will Landau notifications@github.com wrote:

After digging into the package some more, I have decided against replacing codetools with CodeDepends in the preexisting internals of drake https://github.com/wlandau-lilly/drake/issues/41#issuecomment-308456705. However, I think CodeDepends will put the otherwise difficult issues wlandau-lilly/drake#9 https://github.com/wlandau-lilly/drake/issues/9 and wlandau-lilly/drake#25 https://github.com/wlandau-lilly/drake/issues/25 within reach, and I look forward to getting started on these new features.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/duncantl/CodeDepends/issues/14#issuecomment-308461434, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3dsX0xT7hs-UCCxbkAOdc43AQu3AxTks5sD_dKgaJpZM4N0XG- .

-- Gabriel Becker, PhD Associate Scientist (Bioinformatics) Genentech Research

wlandau-lilly commented 7 years ago

I definitely think CodeDepends is more powerful and feature-rich than codetools, and I am still excited to make use of it. But for the specific purpose of building a network of targets in drake, I was originally looking for solutions to the edge cases in wlandau-lilly/drake#11. CodeDepends solves at least one of these by detecting functions referenced with :: and :::, but others remain unsolved:

In expressions like get("x", envir = globalenv()), the CodeDepends detects the string "x", but it does not recognize that x is actually a global variable. There are other sneaky uses of strings as in eval(parse()) that I may want to consider.
I like CodeDepends' string detection feature, but it I am not sure it smooths out my current custom code. In drake commands (but not imported functions), single-quoted strings are treated as file names, and double-quoted strings are treated as ordinary string literals. While convenient and flexible for users, this convention makes it hard to construct dependency graphs. CodeDepends replaces single quotes with double quotes in expressions like readRDS('my-file.rds') (which maybe cannot be helped).

I also thought CodeDepends might be trying to replace the entire existence of codetools, but I did see codetools listed as an import in the DESCRIPTION and calls to findGlobals() in the source.

In any case, depending on how CodeDepends changes, and depending on the landscape of code analysis tools in the future, I am open to the possibility of making the switch at some point. But at the moment, I do not think it is worth the time, effort, and real-world testing required to overhaul the internals of drake, and the slightly-altered dependency graphs may require some of my users to rerun their long projects from scratch.

The features in wlandau-lilly/drake#11 and wlandau-lilly/drake#25 just deal with analyzing knitr reports and importing new projects into drake from R scripts, and I think I can maintain behavior consistent with the current version of drake that is powered by codetools. Users will be able to enable or disable these new features freely, and the functionality will be modular and transparent. I really like that the analysis of scripts and reports is a major focus of CodeDepends.

wlandau-lilly commented 7 years ago

...I didn't explain that very well, so maybe I should add a tl;dr and a clarification: drake already has mostly satisfactory functionality for tracking dependencies and graphing targets, and after looking at CodeDepends, I think the gains of reimplementing that specific piece would be small. And in my case, any loss of back compatibility is unusually risky. My users have computationally intense projects with extremely long runtimes, and I hesitate to make them re-execute their work from scratch for the sake of incremental improvements.

But I think CodeDepends still fills a large unmet need in drake. Those new features I mentioned could significantly improve the tool, and until now, they seemed like moonshots.

wlandau-lilly commented 7 years ago

FYI: this SO post elaborates on a piece of what I am looking for.

wlandau-lilly commented 6 years ago

Just found a nifty use for CodeDepends in ropensci/drake#232. I occasionally need to find the globals of expressions. With codetools::findGlobals(), I would first need to coerce each expression into a function, which is trouble if those expressions have multiple language objects.

expr <- parse(text = c("a <- x + y", "b <- v + w"), keep.source = FALSE)
expr

## expression(a <- x + y, b <- v + w)

f <- function(){}
body(f) <- expr

## Warning in `body<-`(`*tmp*`, value = expression(a <- x + y, b <- v + w)) :
##   using the first element of 'value' of type "expression"

codetools::findGlobals(f) # Does not include "a" or "b".

## [1] "<-" "+"  "x"  "y"

With CodeDepends, this seems easier and safer.

inputs <- CodeDepends::getInputs(expr)
union(
  inputs@inputs,
  names(inputs@functions)
)

## [1] "x" "y" "v" "w" "+"

With the changes happening to drake right now, this is a good time to make the switch.

wlandau commented 6 years ago

Another thing I really like: CodeDepends separates out NSE symbols such as the arguments to ggplot2::aes().

gmbecker commented 6 years ago

Thanks.

Glad you're finding that useful. For the record that is fully customizable via the function handlers (that's actually how it is implemented, with default handlers for the functions I knew of off the top of my head that would have NSE).

As new functions with NSE mechanics come online and become popular, they'll need to be handled with custom handlers at first and then eventually rolled into the default handlers once they are mature. Happy to consider pull requests for that (and anything else, of course!)

On Sat, Feb 17, 2018 at 10:47 AM, Will Landau notifications@github.com wrote:

Another thing I really like: CodeDepends separates out NSE symbols such as the arguments to ggplot2::aes().

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/duncantl/CodeDepends/issues/14#issuecomment-366462656, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3dsZNbcibNxXWPJqpUcXMX2j53fkwPks5tVx7JgaJpZM4N0XG- .

-- Gabriel Becker, PhD Scientist (Bioinformatics) Genentech Research

duncantl / CodeDepends

CodeDepends as a backend for reproducible build systems #14