MichaelChirico / r-bugs

A ⚠️read-only⚠️mirror of https://bugs.r-project.org/
20 stars 0 forks source link

[BUGZILLA #17700] AssignOps chain assignment (a<-b<-1) goes to same memory address, warn users in documentation #6874

Closed MichaelChirico closed 4 years ago

MichaelChirico commented 4 years ago

Help for '<-' (assignOps) notes in the Value section: "Thus one can use a <- b <- c <- 6." but doesn't inform the user that this means that both a and b and c are all occupying the same memory address and are therefore functionally the same thing. While this fact is often irrelevant when users edit those objects with R's base assignment, if one unthinkingly edits them without reference (e.g. with data.table's :=, per https://stackoverflow.com/questions/59315030/r-chain-assignment-creates-quantum-entanglement-of-newly-created-columns-beha) unexpected problems can rear up.

While arguably this is no packages's 'fault', the practical reality of the issue is that users can crash into this feature unless they happen to have learnt the deep minutiae of R's memory addressing behaviour, which many won't have done.

It would be nice if this sharp edge was flagged in the Details of assignOps, explicitly warning users that chain assignment 'a <- b <- 1' places both a and b in the same space in the memory, such that they are the same thing, and assignment by reference (e.g. data.table's :=) will act on them both identically. Chain assignment can therefore be used for (temporary) memory conservation but defensive coding practices should see this avoided, especially if using data.table or other packages which use assignment by reference. (or better text, that was just an off the cuff draft).

Cheers all.


METADATA

MichaelChirico commented 4 years ago

There is nothing special about chained assignment: it is functionally equivalent to the sequential form:

c <- 6
b <- c
a <- b
data.table::address(c)

[1] "0x557025bdde68"

data.table::address(b)

[1] "0x557025bdde68"

data.table::address(a)

[1] "0x557025bdde68"

and as you are probably aware, "c <- 6" by itself just creates a promise to do an assignment: the promise isn't evaluated until such time that c is actually needed (accessed).

c <- 6
f <- function(c) function() c
f1 <- f(c)
c <- 4
f1()

[1] 4

c <- 2
f1()

[1] 4

and note that eval()uation just creates more promises:

c <- 6
b <- eval(c)
a <- eval(b)
data.table::address(c)

[1] "0x55702409a510"

data.table::address(b)

[1] "0x55702409a510"

data.table::address(a)

[1] "0x55702409a510"

but typical users of data.table will know the way to circumvent that:

c <- 6
b <- data.table::copy(c)
a <- data.table::copy(b)
data.table::address(c)

[1] "0x557023689e78"

data.table::address(b)

[1] "0x557023f0c908"

data.table::address(a)

[1] "0x557022a935f8"


METADATA

MichaelChirico commented 4 years ago

(In reply to Benjamin Tyner from comment #1)

There is nothing special about chained assignment: it is functionally
equivalent to the sequential form:

> c <- 6
> b <- c
> a <- b
> data.table::address(c)
[1] "0x557025bdde68"
> data.table::address(b)
[1] "0x557025bdde68"
> data.table::address(a)
[1] "0x557025bdde68"

and as you are probably aware, "c <- 6" by itself just creates a promise to
do an assignment: the promise isn't evaluated until such time that c is
actually needed (accessed).

> c <- 6
> f <- function(c) function() c
> f1 <- f(c)
> c <- 4
> f1()
[1] 4
> c <- 2
> f1()
[1] 4

and note that eval()uation just creates more promises:

> c <- 6
> b <- eval(c)
> a <- eval(b)
> data.table::address(c)
[1] "0x55702409a510"
> data.table::address(b)
[1] "0x55702409a510"
> data.table::address(a)
[1] "0x55702409a510"

but typical users of data.table will know the way to circumvent that:

> c <- 6
> b <- data.table::copy(c)
> a <- data.table::copy(b)
> data.table::address(c)
[1] "0x557023689e78"
> data.table::address(b)
[1] "0x557023f0c908"
> data.table::address(a)
[1] "0x557022a935f8"

There are no promises involved in assignment -- in x < 1 + 2 the sum is evvaluated immediately and then bound to the variable.


METADATA

MichaelChirico commented 4 years ago

This is behaving as intended. If a value is bound to two variables then it has a reference count of at least two. If a package does not respect this and mutates in its C code when it should not, then that is a bug in the package and you should report it to the package maintainer.


METADATA

MichaelChirico commented 4 years ago

I don't doubt that this is behaving as intended - hence my filing under documentation rather than as a bug. I'm simply raising the point that the underlying behaviour may trip up non-expert users who aren't fluent in the fullness of the technicalities of how R promises evaluations and how objects behave in RAM.

As top-tier experts yourselves, you might well say "well just read the manual, R inferno, etc", and that's fair. But might you concede that there's a gulf between the skill and knowledge level of the core package maintainers, and most users? And that - while not expected to be exhaustive - adding the occasional small note in situations where unexpected behaviour can cause bizarre behaviour, couldn't hurt?


METADATA

MichaelChirico commented 4 years ago

(In reply to Simon Dedman from comment #4)

I don't doubt that this is behaving as intended - hence my filing under
documentation rather than as a bug. I'm simply raising the point that the
underlying behaviour may trip up non-expert users who aren't fluent in the
fullness of the technicalities of how R promises evaluations and how objects
behave in RAM.

As top-tier experts yourselves, you might well say "well just read the
manual, R inferno, etc", and that's fair. But might you concede that there's
a gulf between the skill and knowledge level of the core package
maintainers, and most users? And that - while not expected to be exhaustive
- adding the occasional small note in situations where unexpected behaviour
can cause bizarre behaviour, couldn't hurt?

If you have assignments like

a <- 6 b <- a

then modifying "b" does not modify "a". If it did, e.g. via native code in some package, that would be because that package violated rules about modifying memory, which are stated in WRE. Such packages have to be fixed. This is a problem of the packages and their authors, not base R, this has to be discussed with authors of those packages. Mentioning in R documentation that some packages are broken and violate these rules won't really help end users.

What could RCore/CRAN do would be stricter automated checking and enforcing of these rules. CRAN already runs some checks related to this: it detects when the corrupted values are also in the compiler's constant pools. More could be checked in the future, but checks will always be able to detect only some instances of the problem.


METADATA

MichaelChirico commented 4 years ago

Simon I expect you'd get more traction by posting (starting with r-help rather than bugzilla) an example of behavior that tripped you up, especially if said example pertains to a package that isn't maintained by R-core.


METADATA

MichaelChirico commented 4 years ago

Thanks for the input both.

I guess I should clarify that I don't think this is a bug in R-Core nor, indeed, in data.table or anything else - it's simply that there are sequences of operations one can perform in R, each of which seem logical and commonsense, but which result in a catastrophic failure due to background technicalities of one element clashing with background technicalities of another. Again, I don't think anyone/package is necessarily to 'blame' in these instances (sometimes they are, of course!), this is just a natural outcome of having an open and flexible language.

@Ben, I initially raised this in stackoverflow (see starting post). I'm not necessarily looking for 'traction' insofar as I want to campaign for this change. If you folks, as the arbiters of these docs, agree that mentioning this possibility in the docs could help others avoid such a situation in future, and/or diagnose the source of the problem, that's great. We're talking about adding a single line of text after all.


METADATA

MichaelChirico commented 4 years ago

(In reply to Simon Dedman from comment #7)

Thanks for the input both.

I guess I should clarify that I don't think this is a bug in R-Core nor,
indeed, in data.table or anything else - it's simply that there are
sequences of operations one can perform in R, each of which seem logical and
commonsense, but which result in a catastrophic failure due to background
technicalities of one element clashing with background technicalities of
another. Again, I don't think anyone/package is necessarily to 'blame' in
these instances (sometimes they are, of course!), this is just a natural
outcome of having an open and flexible language.

In a high level language such catastrophic failures should NEVER happen. That is a big part of why such languages exist. If such failures do happen then there is a bug somewhere. If you can reproduce the failure using base R alone, without any additional packages, then there is a bug in R. File a bug report with a reproducible example and we will look into it.

If you can only reproduce the problem when using package X, then the bug is most likely in package X. The appropriate course then is to file an issue with the maintainers for package X. In the SO reference it certainly looks like this is a case where data.table needs to be more careful. Their maintainers may have seen your SO report but maybe not. File an issue with them -- I'm sure they would like to know about it.

@Ben, I initially raised this in stackoverflow (see starting post). I'm not
necessarily looking for 'traction' insofar as I want to campaign for this
change. If you folks, as the arbiters of these docs, agree that mentioning
this possibility in the docs could help others avoid such a situation in
future, and/or diagnose the source of the problem, that's great. We're
talking about adding a single line of text after all.

METADATA