Heuristics for materializing vectors

dgkf / R

An experimental reimagining of R

GNU General Public License v3.0

136 stars 6 forks source link

The feature of this language to have altreps deeply embedded into the language will allow all sorts of optimizations and avoidance of memory allocation, which is great. However, I think we also need to be aware that it can sometimes be hurtful.

Let's say we have the (admittedly nonsensical) code below, where the size of the underlying vector on which we have a view far exceeds the size of the view.

f <- function() {
  x <- rnorm(1000000)
  x[1:10]
}

a <- f()

If we would represent a with an altrep view onto the vector generated by rnorm(1_000_000), this would keep a very large vector in memory even though we only need the first 10 elements.

Besides memory, altreps can also make access to vector elements more expensive. Consider the (also nonsensical) example below:

b <- rnorm(1000000)
b <- b[|b| > 1][1:100] # Creates a RepType::Subset(vec![...], Subsets(vec![Subset::Mask, Subset::Range]))

Accessing elements of the (altrep) b is more expensive than accessing an materialized version of b. To really know when we should materialize altreps, we would need some sort of compilation, where we can look ahead and see how (often) the altrep is used to determine whether we should materialize it.

So for now, I think we need some reasonably heuristic to determine when we materialize an altrep. My suggestion would be that we always materialize an altrep when it is bound to a variable. The idea behind this heuristic is that:

The materialized view is (in most cases) the most efficient view (in terms of memory and access-cost) on a vector.
Binding an altrep to a variable is an indicator that it is data that will be accessed multiple times.

Probably, we also want a way for the user to opt-out of materialization on assignment, e.g. a primitive view()

x <- rnorm(1000)
x <- view(x[1:100]) # x is an altrep
x <- x[1:100]# x is materialized

Yeah, this is my thinking also. For most user-facing variable interaction, users will be interacting with materialized vectors in almost all cases by default.

This would still let us re-use alt-reps for things like:

x <- 1:10
y <- x[1:3] + x[2:4] + x[3:5] + x[4:6]

Where each x[..] is never materialized, continuing to reference the data in x directly without copying data. In an ideal world, each + would then produce an iterator over the elements of these pairwise subsets so that even the + doesn't force the materialization. It would only be on assignment to y that the iterators, and therefore the subsets, are resolved into a new vector.

We're still a long ways off from that goal, but I think that's what we should be aiming for. If we can achieve that, then we can start making judgement calls about when it's preferred to resolve an intermediate value.

I like the view() as an opt-out - especially during development. I'm undecided about whether the language should provide this (similar to julia's @view macro). Unlike julia, my goal is not necessarily to solve the "two language problem" and I think when your abstraction starts leaking and requiring knowledge about the internals, then I'm okay with deferring to the second language (rust, in this case) for people that need that level of control.

To start, I think we should introduce at least one alt-rep that is not materialized on assignment to start playing with these semantics. Similar to R, I think the first candidate is a range alt-rep. From there, we can then think about what else should maintain its alt-rep status through assignment. I think iterators could be one interesting path to explore for this also.

dgkf / R

Heuristics for materializing vectors #130