Closed gisler closed 4 years ago
I don't think this is the best solution, I think your problem should be fixed by passing "bar" in the depends argument to cache. Your change might break existing code, and by not explicitly listing the dependencies the cache function does not know which variables to watch to determine whether re-evaluation of the code is necessary.
Op za 11 jul. 2020 22:41 schreef Gerold Hepp notifications@github.com:
… data.tables Types of change
- Bugfix (non-breaking change which fixes an issue)
- New feature (non-breaking change which adds functionality)
- Breaking change (fix or feature that would cause existing functionality to not work as expected)
- Other (documentation etc)
Pull request checklist
- Add functionality
- Add tests
- Update documentation in man
- Update website documentation
Proposed changes
Using data.tables inside the CODE argument of the cache function can cause errors most likely related to the package namespace and non-standard evaluation:
library(ProjectTemplate)
library(data.table)
cache("foo", {
bla <- data.table(bar = 1:12) bla[bar <= 6, ]
})
Loading required namespace: formatR Creating cache entry from CODE: foo Error in [.data.frame(x, i, j) : object 'bar' not found
Changing the parent of the new.env() (where the CODE argument is evaluated in) to .TargetEnv, which seems to be an internal reference to the global environment, fixes this problem:
library(ProjectTemplate)
library(data.table)
cache("foo", {
bla <- data.table(bar = 1:12) bla[bar <= 6, ]
})
Loading required namespace: formatR Creating cache entry from CODE: foo Loading required namespace: digest
You can view, comment on, or merge this pull request online at:
https://github.com/KentonWhite/ProjectTemplate/pull/303 Commit Summary
- changed parent of 'new.env()' to '.TargetEnv' so that 'CODE' supports 'data.table's
File Changes
- M R/cache.R https://github.com/KentonWhite/ProjectTemplate/pull/303/files#diff-86fc2143e0466338b0d2f252e515041f (2)
Patch Links:
- https://github.com/KentonWhite/ProjectTemplate/pull/303.patch
- https://github.com/KentonWhite/ProjectTemplate/pull/303.diff
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/KentonWhite/ProjectTemplate/pull/303, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABU7VX63MHH6DP2IRBYF2YTR3DE75ANCNFSM4OXOZNIA .
Hi,
Thanks for the response. Well, I don't really get which existing code it might break. The documentation of cache
clearly states that "variables" passed on to the variable
or depends
argument have to exist in the global environment. This, for example, already doesn't work without the change:
library(ProjectTemplate)
fun <- function() {
DF <- data.frame(bar = 1:12)
cache("DF")
}
fun()
Loading required namespace: digest Cannot cache DF: Does not exist in global environment and no code to create it
Furthermore, as you point out, CODE
-blocks should be either self-sufficient or only depend on "variables" stated in depends
, which are supposed to exist in the global environment. This also doesn't work:
library(ProjectTemplate)
fun <- function(x) {
dependency <- x
cache("DF", depends = "dependency", CODE = {data.frame(bar = 1:12)})
}
fun(1)
Loading required namespace: formatR Creating cache entry from CODE: DF Loading required namespace: digest Error in if (nrow(depends.hash) >= 1) { : argument is of length zero
So everything seems to be about the global environment and fine as far as I can see. One last example, which also already doesn't work without the change:
library(ProjectTemplate)
fun <- function() {
DF <- data.frame(bar = 1:12)
cache("foo", CODE = {DF})
}
fun()
Loading required namespace: formatR Creating cache entry from CODE: DF Error in eval(parse(text = CODE), envir = new.env()) : object 'DF' not found
My problem, however, is of a different nature. data.table
is an enhanced data.frame
and therefore inherits from data.frame
:
library(data.table)
class(data.table())
[1] "data.table" "data.frame"
data.table
in contrast to data.frame
treats columns as if they are variables. So bar
in my proposed change is actually a column and not a variable that should go into depends
. And this is where the namespace thing kicks in.
Simplified, I guess, one can say that the namespace of the new.env()
, where CODE
is evaluated in, is about equal to the namespace of the package. data.table
, however, is not imported into ProjectTemplate
's namespace. Therefore, when evaluating CODE
-blocks using data.table
s, no methods for data.table
, but methods for data.frame
objects can be found and are utilised instead (as you can see in my proposed change [.data.frame(x, i, j)
actually causes the error). This is due to the fact that data.table
inherits from data.frame
.
So my solution simply "adjusts" the namespace of the new.env()
by changing its parent environment from one that lives inside the package to the global environment, which lives outside the package. This makes sure, data.frame
methods cannot be found first.
You're right that variables inside the code block should exist globally. However, your change might lead temporary variables from the code block into the global namespace. That could break existing code, especially if a temporary variable name is chosen that's also used globally.
Wouldn't this also be fixed by loading the data.table package inside the code block? Then we might import the configured package explicitly in the cache block.
Op ma 13 jul. 2020 00:09 schreef Gerold Hepp notifications@github.com:
Hi,
Thanks for the response. Well, I don't really get which existing code it might break. The documentation of cache clearly states that "variables" passed on to the variable or depends argument have to exist in the global environment. This, for example, already doesn't work without the change:
library(ProjectTemplate)
fun <- function() { DF <- data.frame(bar = 1:12) cache("DF") }
fun()
Loading required namespace: digest Cannot cache DF: Does not exist in global environment and no code to create it
Furthermore, as you point out, CODE-blocks should be either self-sufficient or only depend on "variables" stated in depends, which are supposed to exist in the global environment. This also doesn't work:
library(ProjectTemplate)
fun <- function(x) { dependency <- x cache("DF", depends = "dependency", CODE = {data.frame(bar = 1:12)}) }
fun(1)
Loading required namespace: formatR Creating cache entry from CODE: DF Loading required namespace: digest Error in if (nrow(depends.hash) >= 1) { : argument is of length zero
So everything seems to be about the global environment and fine as far as I can see. One last example, which also already doesn't work without the change:
library(ProjectTemplate)
fun <- function() { DF <- data.frame(bar = 1:12) cache("foo", CODE = {DF}) }
fun()
Loading required namespace: formatR Creating cache entry from CODE: DF Error in eval(parse(text = CODE), envir = new.env()) : object 'DF' not found
My problem, however, is of a different nature. data.table is an enhanced data.frame and therefore inherits from data.frame:
library(data.table)
class(data.table())
[1] "data.table" "data.frame"
data.table in contrast to data.frame treats columns as if they are variables. So bar in my proposed change is actually a column and not a variable that should go into depends. And this is where the namespace thing kicks in.
Simplified, I guess one can say, that the namespace of the new.env(), where CODE is evaluated in, is about equal to the namespace of the package. data.table, however, is not imported into ProjectTemplate's namespace. Therefore, when evaluating CODE-blocks using data.tables, no methods for data.table, but methods for data.frame objects can be found and are utilised instead (as you can see in my proposed change [.data.frame(x, i, j) actually causes the error). This is due to the fact that data.table inherits from data.frame.
So my solution simply "adjusts" the namespace of the new.env() by changing its parent environment from one that lives inside the package to the global environment, which lives outside the package. This makes sure, data.frame methods cannot be found first.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/KentonWhite/ProjectTemplate/pull/303#issuecomment-657282536, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABU7VXZJ3JQKEFXHFOBKYULR3IX73ANCNFSM4OXOZNIA .
No, this is not the case. It's still an environment on it's own, just with another parent environment:
library(ProjectTemplate) # loaded with the proposed change
ls(envir = .GlobalEnv)
character(0)
cache("foo", { a <- 2 b <- a * 2 data.frame(bar = 1:12) })
ls(envir = .GlobalEnv)
> [1] "foo"
And I'm afraid loading the package inside the `CODE`-block doesn't help as well. As long as the new environment's namespace is the one from inside the package, it cannot access functions from packages not imported there (EDIT: This is of course not quite true. R will eventually look outside the package, but only if it doesn't find a function or suitable method inside the package namespace. Unfortunately, `data.frame` methods are always accessible and "suitable" for `data.table`s from R's point of view.). Imagine what could happen if changing the namespace of a package would be possible. If someone else loaded another package, which masked one or more functions used by your package, your package could then wreak havoc. Here is the example:
library(ProjectTemplate) # loaded without the proposed change
cache("foo", { library(data.table) require(data.table) bla <- data.table(bar = 1:12) bla[bar <= 6, ] })
> Loading required namespace: formatR
> Loading required namespace: digest
> Updating existing cache entry from CODE: foo
> data.table 1.12.2 using 2 threads (see ?getDTthreads). Latest news: r-datatable.com
> Error in [.data.frame(x, i, j) : object 'bar' not found
I have one more example, which might be capable of illustrating the namespace thing. Without the proposed change, CODE
-blocks are able to access internal functions:
library(ProjectTemplate) # loaded without the proposed change
cache("foo", CODE = {
.require.package("feather")
data.frame(bar = 1:12)
})
Loading required namespace: formatR Loading required namespace: digest Updating existing cache entry from CODE: foo Loading required namespace: feather
The same with the proposed change leads to an error:
library(ProjectTemplate) # loaded with the proposed change
cache("foo", CODE = {
.require.package("feather")
data.frame(bar = 1:12)
})
Loading required namespace: formatR Loading required namespace: digest Updating existing cache entry from CODE: foo Error in .require.package("feather") : could not find function ".require.package"
P.S.: Hadley explains the relationship of namespaces and environments way better than I ever could: http://adv-r.had.co.nz/Environments.html
Hi again,
I investigated this further because I suspected that tibble
s should be affected as well. This, however, is not the case.
So I then tried to find out why only data.table
s are affected and discovered that the data.table
method for the [
operator was indeed found outside the package namespace. This is where I was wrong. data.table
has a mechanism to check if the calling namespace is data.table
-aware though. And since the calling namespace is the package namespace, which doesn't import data.table
, this is not the case and it redirects to the data.frame
method.
I don't know why data.table
shows such a behaviour, but I still think the proposed change is appropriate since it clearly separates the package internals from the evaluation of the CODE
-block. With the proposed change the evaluation takes place in a direct child environment of the global environment, which makes more sense to me.
Jumping in with an bservation and a question:
Observation: I know that travis is reporting on the page the build status. I checked in manually and the build is passing. This should protect existing code from breaking.
Question: Since ProjectTemplate should be supporting data.table, what happens if data.table is added to the package namespace using import(data.table)?
If importing into the package namespace works, we can add this into load.project if tables_type is set to "data_table". Then the change only happens if someone is using data.table in the project.
Thank you for jumping in and confirming that passing the tests should protect existing code from breaking.
Regarding your question: Yes, this is also a possibility. I already tried it out before and it worked. It's of course up to you, but I would like to point out once more that internal functions can be accessed from CODE
-blocks as it is right now.
Today I also learnt that data.table
offers the possibility to "whitelist" packages. It's method for the [
operator and maybe other methods as well won't redirect to the data.frame
method in case the calling namespace isn't data.table
-aware then. It also offers another option: adding .datatable.aware = TRUE
to a package's namespace (https://github.com/Rdatatable/data.table/blob/master/R/cedta.R). Furthermore, it explains how a user like me can temporarily "whitelist" a package on its own. It doesn't really look like something I want in production code though.
So if a user has selected tables_type as "data_table", we could call .datatable.aware = TRUE and your code would work?
Unfortunately, it's not that easy because you cannot simply call it from inside a function. It has to be part of the namespace, i.e. it has to be a top-level object like, for example, the cache
function. If you don't want to treat data.table
s like data.frame
s somewhere else in the package though, you can just add it to the package source and it will be there in case someone asks for it. If not, it's only a harmless internal boolean variable nobody usually takes notice of (for the following example I added it as first line in "cache.R", but I guess it could go about anywhere):
library(ProjectTemplate)
library(data.table)
print(.datatable.aware)
Error in print(.datatable.aware) : object '.datatable.aware' not found
cache("foo", { print(.datatable.aware) bla <- data.table(bar = 1:12) bla[bar <= 6, ] })
Loading required namespace: formatR Loading required namespace: digest Updating existing cache entry from CODE: foo [1] TRUE
Changing namespace objects at runtime using the <<-
operator is usually not possible by the way. So initialising the variable as NULL
or so and changing it to TRUE
at runtime doesn't work because bindings inside packages are locked. Theoretically, it is possible to unlock package environments, change e.g. its class definitions and lock them again at runtime, but who would want to do this?
Ok Thanks. @Hugovdberg you have more insight into the caching module. If the tests are passing we should have good test coverage?
So initialising the variable as
ǸULL
or so and changing it toTRUE
at runtime doesn't work because bindings inside packages are locked.
Just found out there is another way for this using the assignInMyNamespace
function (the documentation considers such a use case as "perhaps" okay for internal objects in production code). So doing something like this is could work (even though I wouldn't do it):
.datatable.aware <- NULL
cache <- function(variable=NULL, CODE=NULL, depends=NULL, ...) {
project_name <- .stopifnotproject("Change to a valid ProjectTemplate directory and run cache() again.")
if (.load.config()$tables_type == "data_table") {
assignInMyNamespace(".datatable.aware", TRUE)
}
# rest of cache function
I hope I didn't bore you with this, but the endless possibilities of R keep fascinating me ...
Not boring me at all. There are always multiple ways of doing something in R. My challenge is exploring all the options and picking the best one. As a package maintainer, if I can find a solution that works for data.table and that I can isolate from the rest of the code (that is, the change only happens if a person is using data.table) then the overall experience for every user is improved. Evaluating all of these options helps me make an informed decision and even if we go with the first solution, if that breaks the package for someone else I can say we chose what was the best option at the time.
Sorry for the late reply, and for the misunderstanding, I think this change should be fine indeed. I initially misread it as if the code was evaluated in .TargetEnv
but it's still encapsulated in its own environment. If all tests still work and there should be no problem to merge this. It might be nice though to add some tests that only work with this change so we have some documentation of the reasoning behind this change in code, and get a warning from the test when someone reverts it.
Hi again,
No worries. I learnt some very interesting details in the process of our discussion after all.
Tests for sure are a good idea. Maybe the two examples of the "Proposed changes" section of the pull request can serve as a basis. If you like, I can compile and open another pull request. Since the data.table
package already is a suggested package, they would merely require a skip condition in case the package is not installed.
Or rather I can add another commit to this pull request instead if opening a new one.
I would prefer a new commit to this PR to keep the changes together in the changelogs. The examples would serve fine indeed.
Op di 1 sep. 2020 13:01 schreef Gerold Hepp notifications@github.com:
Or rather I can add another commit to this pull request instead if opening a new one.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/KentonWhite/ProjectTemplate/pull/303#issuecomment-684769847, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABU7VX667T4DOCVIZZBEQKLSDTIBLANCNFSM4OXOZNIA .
Thanks @Hugovdberg @gisler If you can add the new tests to the PR I'll merge them in when everything is done.
@Hugovdberg Added two tests that should do the trick. I hope they fit in nicely.
@Hugovdberg Travis CI seems to have a problem installing the rJava
package. On my local Windows machine all tests were fine though.
That sometimes happens. Give it some time and I'll restart the test.
Might be a bigger issue: https://travis-ci.community/t/cannot-connect-to-java-from-r/9754
…
data.table
sTypes of change
Pull request checklist
man
Proposed changes
Using
data.table
s inside theCODE
argument of thecache
function can cause errors most likely related to the package namespace and non-standard evaluation:Changing the parent of the
new.env()
(where theCODE
argument is evaluated in) to.TargetEnv
, which seems to be an internal reference to the global environment, fixes this problem: