dgkf / R

An experimental reimagining of R
https://dgkf.github.io/R
GNU General Public License v3.0
136 stars 5 forks source link

Allowed Vector Subsets during assignment #129

Open sebffischer opened 7 months ago

sebffischer commented 7 months ago

This is just a copy of this comment https://github.com/dgkf/R/pull/112#issuecomment-2054041873 as I don't want it to get lost in case the PR gets merged.

R has some edge-cases where subset-assignment has weird (and I image undefined) behavior:

> x[c(1, 1)][c(1, 2)] <- c(99, 100)
> x
 [1] 100   2   3   4   5   6   7   8   9  10

I think we should only allow vector subsets in assignment calls where the behavior is well-defined. From the currently supported subsets, I think Subset::Mask and Subset::Range have this property, i.e. they can never produce a view where the same data is referenced via different indices of the resulting view.

dgkf commented 7 months ago

I agree with the sentiment, but from R's perspective, this is totally within "well-defined" behavior.

R works quite differently from this repo's implementation when it comes to subset assignment. It's horrendously inefficient with many intermediate vectors. For R, this is perfectly fine and I imagine intentional. It's elegant in how generalized its fn<- syntax is, but comes at a large cost for allocation. Let's break it down (slightly tweaking the example so the process is more evident):

trace(`[<-`, quote(print(sys.call(-4L))))
x[c(1, 1)][c(1, 2, 3)] <- c(99, 100)

The following is messy and hard to read, but effectively each [ produces an intermediate temporary values with the appropriate size, then assigns it to the next vector (which might be the next temporary vector).

  1. *tmp1* <- vector(<length of subset>, <mode of x>)
  2. [<-(*tmp1*, c(1, 2, 3), c(99, 100))
    1. *tmp1* <- c(99, 100)[<length of subset>] (= c(99, 100, 99))
  3. *tmp2* <- vector(<length of subset>, <mode of *tmp1*>)
  4. [<-(*tmp2*, c(1, 1), *tmp1*)
    1. *tmp2* <- *tmp1*[1:length(<subset>)] (= c(99, 100))
  5. x[[1]] <- *tmp2*[[1]]
  6. x[[1]] <- *tmp2*[[2]]

This is also exacerbated in list assignment in R. Assigning to list$a$b$c$d creates (I think) n^2 intermediate lists that copy their respective subset.


All of that is to say. I agree, we should be well defined, but "well defined" may include repeated indices at times - We can focus on that within the individual alt-rep assignment semantics.

sebffischer commented 5 months ago

Uff, that seems expensive, but I agree! Checking for such cases would probably also be quite the overhead! Should I close this issue?