KentonWhite / ProjectTemplate

A template utility for R projects that provides a skeletal project.
http://projecttemplate.net
GNU General Public License v3.0
622 stars 159 forks source link

changed parent of 'new.env()' to '.TargetEnv' so that 'CODE' supports… #303

Closed gisler closed 4 years ago

gisler commented 4 years ago

data.tables

Types of change

Pull request checklist


Proposed changes

Using data.tables inside the CODE argument of the cache function can cause errors most likely related to the package namespace and non-standard evaluation:

library(ProjectTemplate)
library(data.table)

cache("foo", {
    bla <- data.table(bar = 1:12)
    bla[bar <= 6, ]
})

Loading required namespace: formatR Creating cache entry from CODE: foo Error in [.data.frame(x, i, j) : object 'bar' not found

Changing the parent of the new.env() (where the CODE argument is evaluated in) to .TargetEnv, which seems to be an internal reference to the global environment, fixes this problem:

library(ProjectTemplate)
library(data.table)

cache("foo", {
    bla <- data.table(bar = 1:12)
    bla[bar <= 6, ]
})

Loading required namespace: formatR Creating cache entry from CODE: foo Loading required namespace: digest

Hugovdberg commented 4 years ago

I don't think this is the best solution, I think your problem should be fixed by passing "bar" in the depends argument to cache. Your change might break existing code, and by not explicitly listing the dependencies the cache function does not know which variables to watch to determine whether re-evaluation of the code is necessary.

Op za 11 jul. 2020 22:41 schreef Gerold Hepp notifications@github.com:

… data.tables Types of change

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Other (documentation etc)

Pull request checklist

  • Add functionality
  • Add tests
  • Update documentation in man
  • Update website documentation

Proposed changes

Using data.tables inside the CODE argument of the cache function can cause errors most likely related to the package namespace and non-standard evaluation:

library(ProjectTemplate)

library(data.table)

cache("foo", {

bla <- data.table(bar = 1:12)

bla[bar <= 6, ]

})

Loading required namespace: formatR Creating cache entry from CODE: foo Error in [.data.frame(x, i, j) : object 'bar' not found

Changing the parent of the new.env() (where the CODE argument is evaluated in) to .TargetEnv, which seems to be an internal reference to the global environment, fixes this problem:

library(ProjectTemplate)

library(data.table)

cache("foo", {

bla <- data.table(bar = 1:12)

bla[bar <= 6, ]

})

Loading required namespace: formatR Creating cache entry from CODE: foo Loading required namespace: digest


You can view, comment on, or merge this pull request online at:

https://github.com/KentonWhite/ProjectTemplate/pull/303 Commit Summary

  • changed parent of 'new.env()' to '.TargetEnv' so that 'CODE' supports 'data.table's

File Changes

Patch Links:

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/KentonWhite/ProjectTemplate/pull/303, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABU7VX63MHH6DP2IRBYF2YTR3DE75ANCNFSM4OXOZNIA .

gisler commented 4 years ago

Hi,

Thanks for the response. Well, I don't really get which existing code it might break. The documentation of cache clearly states that "variables" passed on to the variable or depends argument have to exist in the global environment. This, for example, already doesn't work without the change:

library(ProjectTemplate)

fun <- function() {
    DF <- data.frame(bar = 1:12)
    cache("DF")
}

fun()

Loading required namespace: digest Cannot cache DF: Does not exist in global environment and no code to create it

Furthermore, as you point out, CODE-blocks should be either self-sufficient or only depend on "variables" stated in depends, which are supposed to exist in the global environment. This also doesn't work:

library(ProjectTemplate)

fun <- function(x) {
    dependency <- x
    cache("DF", depends = "dependency", CODE = {data.frame(bar = 1:12)})
}

fun(1)

Loading required namespace: formatR Creating cache entry from CODE: DF Loading required namespace: digest Error in if (nrow(depends.hash) >= 1) { : argument is of length zero

So everything seems to be about the global environment and fine as far as I can see. One last example, which also already doesn't work without the change:

library(ProjectTemplate)

fun <- function() {
    DF <- data.frame(bar = 1:12)
    cache("foo", CODE = {DF})
}

fun()

Loading required namespace: formatR Creating cache entry from CODE: DF Error in eval(parse(text = CODE), envir = new.env()) : object 'DF' not found

My problem, however, is of a different nature. data.table is an enhanced data.frame and therefore inherits from data.frame:

library(data.table)

class(data.table())

[1] "data.table" "data.frame"

data.table in contrast to data.frame treats columns as if they are variables. So bar in my proposed change is actually a column and not a variable that should go into depends. And this is where the namespace thing kicks in.

Simplified, I guess, one can say that the namespace of the new.env(), where CODE is evaluated in, is about equal to the namespace of the package. data.table, however, is not imported into ProjectTemplate's namespace. Therefore, when evaluating CODE-blocks using data.tables, no methods for data.table, but methods for data.frame objects can be found and are utilised instead (as you can see in my proposed change [.data.frame(x, i, j) actually causes the error). This is due to the fact that data.table inherits from data.frame.

So my solution simply "adjusts" the namespace of the new.env() by changing its parent environment from one that lives inside the package to the global environment, which lives outside the package. This makes sure, data.frame methods cannot be found first.

Hugovdberg commented 4 years ago

You're right that variables inside the code block should exist globally. However, your change might lead temporary variables from the code block into the global namespace. That could break existing code, especially if a temporary variable name is chosen that's also used globally.

Wouldn't this also be fixed by loading the data.table package inside the code block? Then we might import the configured package explicitly in the cache block.

Op ma 13 jul. 2020 00:09 schreef Gerold Hepp notifications@github.com:

Hi,

Thanks for the response. Well, I don't really get which existing code it might break. The documentation of cache clearly states that "variables" passed on to the variable or depends argument have to exist in the global environment. This, for example, already doesn't work without the change:

library(ProjectTemplate)

fun <- function() { DF <- data.frame(bar = 1:12) cache("DF") }

fun()

Loading required namespace: digest Cannot cache DF: Does not exist in global environment and no code to create it

Furthermore, as you point out, CODE-blocks should be either self-sufficient or only depend on "variables" stated in depends, which are supposed to exist in the global environment. This also doesn't work:

library(ProjectTemplate)

fun <- function(x) { dependency <- x cache("DF", depends = "dependency", CODE = {data.frame(bar = 1:12)}) }

fun(1)

Loading required namespace: formatR Creating cache entry from CODE: DF Loading required namespace: digest Error in if (nrow(depends.hash) >= 1) { : argument is of length zero

So everything seems to be about the global environment and fine as far as I can see. One last example, which also already doesn't work without the change:

library(ProjectTemplate)

fun <- function() { DF <- data.frame(bar = 1:12) cache("foo", CODE = {DF}) }

fun()

Loading required namespace: formatR Creating cache entry from CODE: DF Error in eval(parse(text = CODE), envir = new.env()) : object 'DF' not found

My problem, however, is of a different nature. data.table is an enhanced data.frame and therefore inherits from data.frame:

library(data.table)

class(data.table())

[1] "data.table" "data.frame"

data.table in contrast to data.frame treats columns as if they are variables. So bar in my proposed change is actually a column and not a variable that should go into depends. And this is where the namespace thing kicks in.

Simplified, I guess one can say, that the namespace of the new.env(), where CODE is evaluated in, is about equal to the namespace of the package. data.table, however, is not imported into ProjectTemplate's namespace. Therefore, when evaluating CODE-blocks using data.tables, no methods for data.table, but methods for data.frame objects can be found and are utilised instead (as you can see in my proposed change [.data.frame(x, i, j) actually causes the error). This is due to the fact that data.table inherits from data.frame.

So my solution simply "adjusts" the namespace of the new.env() by changing its parent environment from one that lives inside the package to the global environment, which lives outside the package. This makes sure, data.frame methods cannot be found first.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/KentonWhite/ProjectTemplate/pull/303#issuecomment-657282536, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABU7VXZJ3JQKEFXHFOBKYULR3IX73ANCNFSM4OXOZNIA .

gisler commented 4 years ago

No, this is not the case. It's still an environment on it's own, just with another parent environment:

library(ProjectTemplate) # loaded with the proposed change

ls(envir = .GlobalEnv)

character(0)


cache("foo", {
a <- 2
b <- a * 2
data.frame(bar = 1:12)
})

ls(envir = .GlobalEnv)

> [1] "foo"

And I'm afraid loading the package inside the `CODE`-block doesn't help as well. As long as the new environment's namespace is the one from inside the package, it cannot access functions from packages not imported there (EDIT: This is of course not quite true. R will eventually look outside the package, but only if it doesn't find a function or suitable method inside the package namespace. Unfortunately, `data.frame` methods are always accessible and "suitable" for `data.table`s from R's point of view.). Imagine what could happen if changing the namespace of a package would be possible. If someone else loaded another package, which masked one or more functions used by your package, your package could then wreak havoc. Here is the example:

library(ProjectTemplate) # loaded without the proposed change

cache("foo", { library(data.table) require(data.table) bla <- data.table(bar = 1:12) bla[bar <= 6, ] })


> Loading required namespace: formatR
> Loading required namespace: digest
> Updating existing cache entry from CODE: foo
> data.table 1.12.2 using 2 threads (see ?getDTthreads). Latest news: r-datatable.com
> Error in [.data.frame(x, i, j) : object 'bar' not found
gisler commented 4 years ago

I have one more example, which might be capable of illustrating the namespace thing. Without the proposed change, CODE-blocks are able to access internal functions:

library(ProjectTemplate) # loaded without the proposed change

cache("foo", CODE = {
    .require.package("feather")
    data.frame(bar = 1:12)
})

Loading required namespace: formatR Loading required namespace: digest Updating existing cache entry from CODE: foo Loading required namespace: feather

The same with the proposed change leads to an error:

library(ProjectTemplate) # loaded with the proposed change

cache("foo", CODE = {
    .require.package("feather")
    data.frame(bar = 1:12)
})

Loading required namespace: formatR Loading required namespace: digest Updating existing cache entry from CODE: foo Error in .require.package("feather") : could not find function ".require.package"

P.S.: Hadley explains the relationship of namespaces and environments way better than I ever could: http://adv-r.had.co.nz/Environments.html

gisler commented 4 years ago

Hi again,

I investigated this further because I suspected that tibbles should be affected as well. This, however, is not the case.

So I then tried to find out why only data.tables are affected and discovered that the data.table method for the [ operator was indeed found outside the package namespace. This is where I was wrong. data.table has a mechanism to check if the calling namespace is data.table-aware though. And since the calling namespace is the package namespace, which doesn't import data.table, this is not the case and it redirects to the data.frame method.

I don't know why data.table shows such a behaviour, but I still think the proposed change is appropriate since it clearly separates the package internals from the evaluation of the CODE-block. With the proposed change the evaluation takes place in a direct child environment of the global environment, which makes more sense to me.

KentonWhite commented 4 years ago

Jumping in with an bservation and a question:

Observation: I know that travis is reporting on the page the build status. I checked in manually and the build is passing. This should protect existing code from breaking.

Question: Since ProjectTemplate should be supporting data.table, what happens if data.table is added to the package namespace using import(data.table)?

If importing into the package namespace works, we can add this into load.project if tables_type is set to "data_table". Then the change only happens if someone is using data.table in the project.

gisler commented 4 years ago

Thank you for jumping in and confirming that passing the tests should protect existing code from breaking.

Regarding your question: Yes, this is also a possibility. I already tried it out before and it worked. It's of course up to you, but I would like to point out once more that internal functions can be accessed from CODE-blocks as it is right now.

Today I also learnt that data.table offers the possibility to "whitelist" packages. It's method for the [ operator and maybe other methods as well won't redirect to the data.frame method in case the calling namespace isn't data.table-aware then. It also offers another option: adding .datatable.aware = TRUE to a package's namespace (https://github.com/Rdatatable/data.table/blob/master/R/cedta.R). Furthermore, it explains how a user like me can temporarily "whitelist" a package on its own. It doesn't really look like something I want in production code though.

KentonWhite commented 4 years ago

So if a user has selected tables_type as "data_table", we could call .datatable.aware = TRUE and your code would work?

gisler commented 4 years ago

Unfortunately, it's not that easy because you cannot simply call it from inside a function. It has to be part of the namespace, i.e. it has to be a top-level object like, for example, the cache function. If you don't want to treat data.tables like data.frames somewhere else in the package though, you can just add it to the package source and it will be there in case someone asks for it. If not, it's only a harmless internal boolean variable nobody usually takes notice of (for the following example I added it as first line in "cache.R", but I guess it could go about anywhere):

library(ProjectTemplate)
library(data.table)

print(.datatable.aware)

Error in print(.datatable.aware) : object '.datatable.aware' not found

cache("foo", {
print(.datatable.aware)
bla <- data.table(bar = 1:12)
bla[bar <= 6, ]
})

Loading required namespace: formatR Loading required namespace: digest Updating existing cache entry from CODE: foo [1] TRUE

Changing namespace objects at runtime using the <<- operator is usually not possible by the way. So initialising the variable as NULL or so and changing it to TRUE at runtime doesn't work because bindings inside packages are locked. Theoretically, it is possible to unlock package environments, change e.g. its class definitions and lock them again at runtime, but who would want to do this?

KentonWhite commented 4 years ago

Ok Thanks. @Hugovdberg you have more insight into the caching module. If the tests are passing we should have good test coverage?

gisler commented 4 years ago

So initialising the variable as ǸULL or so and changing it to TRUE at runtime doesn't work because bindings inside packages are locked.

Just found out there is another way for this using the assignInMyNamespace function (the documentation considers such a use case as "perhaps" okay for internal objects in production code). So doing something like this is could work (even though I wouldn't do it):

.datatable.aware <- NULL

cache <- function(variable=NULL, CODE=NULL, depends=NULL,  ...) {
  project_name <- .stopifnotproject("Change to a valid ProjectTemplate directory and run cache() again.")

  if (.load.config()$tables_type == "data_table") {
    assignInMyNamespace(".datatable.aware", TRUE)
  }

  # rest of cache function

I hope I didn't bore you with this, but the endless possibilities of R keep fascinating me ...

KentonWhite commented 4 years ago

Not boring me at all. There are always multiple ways of doing something in R. My challenge is exploring all the options and picking the best one. As a package maintainer, if I can find a solution that works for data.table and that I can isolate from the rest of the code (that is, the change only happens if a person is using data.table) then the overall experience for every user is improved. Evaluating all of these options helps me make an informed decision and even if we go with the first solution, if that breaks the package for someone else I can say we chose what was the best option at the time.

Hugovdberg commented 4 years ago

Sorry for the late reply, and for the misunderstanding, I think this change should be fine indeed. I initially misread it as if the code was evaluated in .TargetEnv but it's still encapsulated in its own environment. If all tests still work and there should be no problem to merge this. It might be nice though to add some tests that only work with this change so we have some documentation of the reasoning behind this change in code, and get a warning from the test when someone reverts it.

gisler commented 4 years ago

Hi again,

No worries. I learnt some very interesting details in the process of our discussion after all.

Tests for sure are a good idea. Maybe the two examples of the "Proposed changes" section of the pull request can serve as a basis. If you like, I can compile and open another pull request. Since the data.table package already is a suggested package, they would merely require a skip condition in case the package is not installed.

gisler commented 4 years ago

Or rather I can add another commit to this pull request instead if opening a new one.

Hugovdberg commented 4 years ago

I would prefer a new commit to this PR to keep the changes together in the changelogs. The examples would serve fine indeed.

Op di 1 sep. 2020 13:01 schreef Gerold Hepp notifications@github.com:

Or rather I can add another commit to this pull request instead if opening a new one.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/KentonWhite/ProjectTemplate/pull/303#issuecomment-684769847, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABU7VX667T4DOCVIZZBEQKLSDTIBLANCNFSM4OXOZNIA .

KentonWhite commented 4 years ago

Thanks @Hugovdberg @gisler If you can add the new tests to the PR I'll merge them in when everything is done.

gisler commented 4 years ago

@Hugovdberg Added two tests that should do the trick. I hope they fit in nicely.

gisler commented 4 years ago

@Hugovdberg Travis CI seems to have a problem installing the rJava package. On my local Windows machine all tests were fine though.

KentonWhite commented 4 years ago

That sometimes happens. Give it some time and I'll restart the test.

KentonWhite commented 4 years ago

Might be a bigger issue: https://travis-ci.community/t/cannot-connect-to-java-from-r/9754