dgkf / R

An experimental reimagining of R
https://dgkf.github.io/R
GNU General Public License v3.0
136 stars 5 forks source link

Proposal: new vector representation `RepType::Closure` #165

Open sebffischer opened 3 months ago

sebffischer commented 3 months ago

NOTE: This proposal is superseded by the first comment in this Issue.

Sometimes, there are cases where we want to apply a function lazily to a vector. One example is in the internal code-base, where we would like to cast an (Obj, Obj) tuple to their minimally numeric type. Currently, this requires us to allocate a new vector.

My suggestion here is to allow for lazy computation on vectors by adding a new variant to RepType, the Map variant:

enum RepType<T> {
  Subset(CowObj<Vec<T>>, Subsets),
  Map(RepType, Box<dyn Fn(T) -> T>)
}

If we wanted to cast a tupe (Obj::Vector(Vector::Integer(_)), Obj::Vector(Vector::Double(_)) to their minimally numeric type, we could lazily transform the first element of the tuple to a Obj::Vector(Vector::Double(_)), by converting the underlying vector v (a RepType::Subset) to a RepType::Map(v, |i| i.map(i as f64)).

Of course, this RepType::Map would also be generally useful to avoid any intermediate vector allocations, such as in the example below, where we calculate the variance of a random vector. In R, this would first allocate the vector tmp1 = x - mean(x), then allocate the vector tmp2 = tmp1^2, only to then calculate mean(tmp2). If we were to add the RepType::Map, we could avoid both allocations.

x = rnorm(100)
mean((x - mean(x))^2) / length(x)

Further, this would also make the need for "fused" functions such as anyNA unnecessary. In R, calling any(is.na(x)) is slow, because is.na(x) allocates a new logical vector. But if is.na(x) produces a RepType::Map, this should not be such a big problem.


Where this Suggestion falls short

One disadvantage of the RepType::Map is that it does not allow to combine different vectors.

Let's say, we want to calculate the covariance between two random vectors:

x = rnorm(100)
y = x * 0.2 + rnorm(100) *0.3
mean((x - mean(x))(y - mean(x))) / 100

Here, we first calculate tmp1 = x - mean(x) and tmp2 = y - mean(x) which can be represented both as RepType::Maps. But if we then multiply them, we need to allocate a vector tmp3, just so we then calculate sum(tmp3) / 100.

This means that we might want to add another variant (e.g. calling it RepType::Merge) to represent such operations:

enum RepType<T> {
  ..,
  Merge(RepType<T>, RepType<T>, Box<dyn Fn(T, T) -> T>)
}
sebffischer commented 3 months ago

I think I came up with a better idea that does not require to introduction of two new RepTypes. If we just allow for a RepType::Closure(Fn(usize) -> T, Some(usize)), we have a much more flexible representation that now only covers RepType::Map and RepType::Merge as suggested initially, but also allows for efficient representation of sequences, such as 1:n or even 1:Inf.

In the suggested RepType::Closure, the Fn(usize) -> T closure represents the mapping from index to element, and the Some(usize) the length of the vector, which can be None to allow for an Infinite number of elements.

For example, the sequence 1:Inf can simple be represented as RepType::Closure(|i: usize| i as i32, None). If we want to add two vectors, we can also easily do this, which is illustrated in the example below in a simplified situation:

  let x = vec![1, 2, 3];
  let y = vec![1, 2, 3];

  let x_plus_y = {
      let x = x.clone();
      let y = y.clone();
      move |i: usize| -> i32 { x[i] + y[i] }
  };

Because Closures can capture their environment, we don't need to store any data explicitly, because the Closure can already handle this for us.

dgkf commented 3 months ago

The closure style is definitely more in the spirit of what I was imagining.

I had been thinking that this might be best solved through a RepType::Iterator, which covers both the cases of things like infinite sequences, but also handles situations when an iterator encapsulates an element-wise closure. In rust pseudo-code:

sum_iter = x.into_iter().cycle().zip(y.into_iter().cycle())
  .take(<longest length hint>)
  .map(|(xi, yi)| xi + yi)

RepType::Iterator::from(sum_iter)

This is partially why the current vectorized operations focus on their iterator implementations - so that some day I could return the un-collected iterator as a lazy operation.

If we do the iterator style, it would also mean that we could lazily compute things like:

x = [1, -1]
y = 1:Inf
head(x + y, 10)
# [1] 2 1 4 3 6 5 8 7 10 9
dgkf commented 3 months ago

Digging up an old issue, but there's a mention of this in issue for the vector re-write that added the first alt-reps.

sebffischer commented 3 months ago

Ah, i remember this issue. I initially wanted to suggest an Iterator, but I think it is a problem that one cannot access random elements, e.g. 1:Inf[1000000] would become an unreasonably expensive operation I guess.

The closure instead allows for arbitrary and also repeated access of its elements as it does not advance any state.

dgkf commented 3 months ago

Yeah - that's also a very good point. This might be a good case for doing some benchmarking. I think both styles have merits.

Just to put your explicit example in context, this particular operation would be a simple (1:Inf).into_iter().nth(1e8). The internal rust range iterator implementation should know that skipping n elements in a range can be done without iteration.

Whether these optimizations hold for enough cases that an iterator approach is justified is definitely still debatable.

My intuition is that having a closure that is called for each index would be far slower than the iterator approach, but I really have no basis for that intuition. Let's start by either finding some resources that might help evaluate the performance impact or by doing some benchmarking using some simple pure-rust examples.

Ultimately both solutions boil down to storing closures somewhere - whether it tags along with the iterator or whether it lives as a lazy representation of the AST.

sebffischer commented 3 months ago

So I conducted a simple benchmark for the range, and the results are basically identical. Maybe the compiler optimizes away the call into the closure? If this is the case, the question is whether these optimizations will also apply in the R implementation when things get more complicated. Maybe this benchmark is just too simplified. In any case, I will try to extend this later

I also think it makes sense to include these benchmarks in the main repository, let me know whether you agree.

use criterion::{criterion_group, criterion_main, Criterion};
use std::time::Duration;

fn range_closure(f: Box<dyn Fn(usize) -> i32>, n: usize) -> i32 {
    let mut sum = 0;
    for i in 1..n {
        sum += f(i);
    }
    sum
}

fn range_iter(n: usize) -> i32 {
    let mut sum = 0;

    for i in 1..n {
        sum += i as i32;
    }
    sum
}

fn benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("Range Iteration");

    let f = |i: usize| -> i32 { i as i32 };
    // let test_values = vec![10, 100, 1000, 10_000, 100_000, 1_000_000];
    let test_values = vec![10_000];

    for &n in &test_values {
        group.bench_with_input(format!("Closure Range {}", n), &n, |b, &n| {
            b.iter(|| range_closure(Box::new(f), n));
        });

        group.bench_with_input(format!("Simple Range {}", n), &n, |b, &n| {
            b.iter(|| range_iter(n));
        });
    }

    group.finish();
}

criterion_group! {
    name = benches;
    config = Criterion::default().measurement_time(Duration::new(10, 0));
    targets = benchmark
}
criterion_main!(benches);
Screenshot 2024-08-03 at 10 04 49
sebffischer commented 3 months ago

There is also this discussion.

Yeah, I don't think that that's supported today, though I'm not sure. The compiler will already pretty aggressively inline, the attributes just let you nudge it further. Closures, especially ones that don't capture any environment or do so by ownership, are great candidates for inlining

The good news here is that we will have closures that either don't capture anything (such as sequences) or that capture by ownership (when they represent some computational graph).

sebffischer commented 3 months ago

I now ran a slightly more realistic benchmark, where an Rc<Vec<i32>> is used. I also added the time that it takes for the vector creation to the benchmark for comparison. Again, the approaches are basically identical.

fn range_closure(n: usize) -> i32 {
    let v: Rc<Vec<i32>> = Rc::new((0..n).map(|i| i as i32).collect());
    let f = |i| v[i];
    let mut sum = 0;
    for i in 0..n {
        sum += f(i);
    }
    sum
}

fn range_iter(n: usize) -> i32 {
    let v: Rc<Vec<i32>> = Rc::new((0..n).map(|i| i as i32).collect());
    let mut sum = 0;
    for x in v.iter() {
        sum += x;
    }
    sum
}

fn vec_creation(n: usize) -> i32 {
    let v: Vec<i32> = (0..n).map(|i| i as i32).collect();
    v[0]
}
fn benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("Vector Sum");

    let test_values = vec![100_000];

    for &n in &test_values {
        group.bench_function(format!("Closure{}", n), |b| {
            b.iter(|| range_closure(n));
        });

        group.bench_function(format!("Iterator{}", n), |b| {
            b.iter(|| range_iter(n));
        });
        group.bench_function(format!("Vector Creation {}", n), |b| {
            b.iter(|| vec_creation(n));
        });
    }

    group.finish();
}
Screenshot 2024-08-03 at 16 53 02
sebffischer commented 3 months ago

I also checked whether this holds when we use the dedicated .sum() method on the Iterator and the result is also the same:

fn range_closure(n: usize) -> i32 {
    let v: Rc<Vec<i32>> = Rc::new((0..n).map(|i| i as i32).collect());
    (0..n).map(|i| v[i]).sum()
}

fn range_iter(n: usize) -> i32 {
    let v: Rc<Vec<i32>> = Rc::new((0..n).map(|i| i as i32).collect());
    v.iter().sum()
}
fn benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("Vector Sum");

    let test_values = vec![100_000_00];

    for &n in &test_values {
        group.bench_function(format!("Closure{}", n), |b| {
            b.iter(|| range_closure(n));
        });

        group.bench_function(format!("Iterator{}", n), |b| {
            b.iter(|| range_iter(n));
        });
    }

    group.finish();
}
Screenshot 2024-08-03 at 18 38 43
sebffischer commented 1 month ago

one disadvantage of using trait objects here is the lack of inlining possibility. We should do some benchmarks on how much this diminishes the gains yielded by the closure / iterator altrep, e.g. by benchmarking any(is.na(x)) (with the standard R approaches, as well as the closure approach) and anyNA(x).

sebffischer commented 1 month ago

we could also hardcode some important closures in an enum for things like numeric operations and is.na so they can be inlined.