Bug in `len()` implementation / Implementing the `length()` primitive

sebffischer commented 6 months ago

The len() implementation of the Rep<T> enum is not correct in all cases I believe:

If there is no subset, it is correct
If there is a single Mask subset, the length of the resulting vector is not the length of last subset but the sum of the recycled logical vector
For single Indices subset, taking the minimum over the "actual vector" and the elements in the last subset does not work. Even if we assume that all indices are within bounds, it still fails in cases such as (1:2)[c(1, 1, 1)].
The names mechanism I don't yet understand, so I am not sure
For a single range it should be correct
If there are multiple subsets, things might go wrong as well

Context: I am just trying to see whether I can implement the length() primitive. My current idea would be to add a Option<usize> field to the Subset variant of Rep<T> here. Then, whenever a new subset is added to Subsets by calling push(), we check whether the length of the new vector is known. If it is, we set the field to Option::Some(length), otherwise to None. The len() method of the Rep<T> then simply reads the field.

If we define len() like this for Rep<T>, we can then define len() on Vector (https://github.com/dgkf/R/blob/2ef9780a2a7a155a22bd42cbad73d59a3b084324/src/object/vector/core.rs#L199) by calling len() on the Rep<T>. If Some(usize) is returned, we know the length. Otherwise - if None is returned - we could modify the vector in-place, i.e. call materialize on the subset and replace the existing subset-representation of the vector with the materialized one (so we avoid materializing the vector more than once).

Alternatively we can also generalize this to use a size hint for lower and upper bounds as you suggested in this issue: https://github.com/dgkf/R/issues/98

What I don't like about this solution is that it is very tailored to the length. However, if the lazy vector representation is a core feature of the language, I expect that there will be more cases like this. With "like this" I mean that functions can in some cases be calculated on the lazy representation, whereas they otherwise might need to materialize the vector.

sebffischer commented 6 months ago

Note: adding the length field to Rep is not necessary and should just be calculated on the fly.

dgkf commented 6 months ago

Vectors in R today store a bit of metadata as well. I don't remember exactly what they track, but a number of algorithms could be made faster if we always know whether a vector has missingness, is sorted, has a known length - and I'm sure many more.

Although I think it will happen eventually, I've been trying to avoid storing redundant metadata for now. Ithink we have to be really thoughtful about where it goes - whether it is specific to an alt-rep, should be universal to all vectors, or all objects. So far I've just opted for the minimum data and haven't worried too much about the performance costs of re-calculating things.

My current idea would be to add a Option field to the Subset variant of Rep here.

I'd recommend trying to store minimal state in the object. If you can get the behavior you're looking for without more data in the struct, then that simplicity is preferred for now. If it's unavoidable, then it's a pretty minimal change, so go for it.

However, if the lazy vector representation is a core feature of the language, I expect that there will be more cases like this. With "like this" I mean that functions can in some cases be calculated on the lazy representation, whereas they otherwise might need to materialize the vector.

This will be inevitable. My thinking for now is that the laziness is just hidden in the background. As soon as something needs a materialized vector it materializes. In this case, if an iterator doesn't have a conclusive size hint, then the vector is materialized and the length is returned.

sebffischer commented 6 months ago

I am stumbling upon another problem here: There are currently methods that require immutable references (such as the Display implementation for Rep). However, these methods call the .len() method of Rep. If we want to implement .len() on Rep correctly, this might require materialization and hence a mutable reference.

Therefore, either all those methods have to be rewritten, and trait implementations such as of Display discarded, or the in-place materialization of vector representations uses unsafe rust. Maybe there are also other options which I am not aware of.

My gut-feeling tells me that this might be a situation for unsafe rust (because lazy representations and materialized representations should really behave identically when accessed through their public API), but as I have very little experience in the language I am not sure...

dgkf commented 6 months ago

Yeah - I imagine things might need to change. This is good, though. This is exactly how the rest of the code has developed so far. We've identified a place where the current model of the problem doesn't work, and we've found a good, relatively minimal interface to test a redesign.

In this case, the fundamental behavior is "how do we materialize a vector?". This effectively means instantiating a new RefCell in some cases and replacing the current reference with a new one.

I'm not proficient enough to know exactly what the solution will look like at the start, so I'll just speculate and maybe some ideas spur some new directions:

It's totally fine if the .len() signature changes. Maybe it should return (&'a Rep, usize) or something that allows us to accept a reference to a Rep, but return a different Rep (with a materialized view of the original).
It might be easier to break this out into a separate as_materialized method so that specific operations require first running x.as_materialized().len() - though ideally when to materialize is implicit to the methods used, not something we have to manage on a case-by-case basis.

Therefore, either all those methods have to be rewritten, and trait implementations such as of Display discarded

Yeah... that's certainly possible! I wouldn't see it so much as a failure of the interface, just that it was originally written without the full feature set in mind. As our understanding of the problem grows, some implementations will have to adapt as well. In most cases when I initially feel overwhelmed with a rewrite, I end up pleasantly surprised with how little has to change. Most likely the complexity will move into .len() and Display will be able to stay mostly untouched.

might be a situation for unsafe rust

I don't think this will require unsafe. It might require that we re-model a lot of the language, but fundamentally what we're trying to do is not so esoteric or magical.

sebffischer commented 6 months ago

Thanks for your input!

Maybe one solution would be to introduce a VarRep (Variable Represenation) struct, which just wraps Rep<T> in a RefCell. The purpose of this struct would be for Obj::Vectors to allow them to materialize themselves even with mutable references. (I think you are right that unsafe is not needed for this, but I think interior mutability is).

pub struct VarRep<T>(RefCell<Rep<T>>);

This type can then be used as the data in the Vector enum:

#[derive(Debug, Clone, PartialEq)]
pub enum Vector {
    Double(VarRep<Double>),
    Integer(VarRep<Integer>),
    Logical(VarRep<Logical>),
    Character(VarRep<Character>),
}

For example, the Display implementation for Vector then calls the Display method for VarRep and VarRep can materialize itself if necessary and then redirect the len() call to Rep<T>.

To achieve this, we need to implement all the methods that are implemented for Rep also for VarRep. Unfortunately I think that Deref coercion does not play nicely with RefCell, so there would be a lot of boilerplate code that redirects the calls on VarRep<T> to Rep<T>, see illustrated below.

The most important method below is the materialize_inplace method, which operates on an immutable reference. Because VarRep contains a RefCell, we can swich one Rep<T> for another.

impl <T> VarRep<T>     
  // the underlying Rep<T> should not be exposed through the public API
  fn borrow(&self) -> Ref<Rep<T>> {
      self.0.borrow()
  }
  pub fn new() -> Self {
      Rep::new().into()
  }

  fn materialize_inplace(&self) -> &Self {
    self.0.replace(RefCell::new(self.borrow().materialize()))
    &self
  }
  pub fn len(&self) -> usize {
    if self.borrow().len_requires_materialization() {
      self.materialize_inplace()
    }

    self.borrow().len()
  }

  pub fn get(&self, index: usize) -> Option<Self> {
    let x = self.borrow().get(index);
    match x {
        Some(x) => Some(x.into()),
        None => None,
     }
  }

  // other methods 
}

impl<T> From<Rep<T>> for VarRep<T>
where
    T: AtomicMode + Clone + Default,
{
    fn from(rep: Rep<T>) -> Self {
        VarRep(RefCell::new(rep))
    }
}

dgkf commented 2 months ago

Testing this, I think just operating on lists is left before this is in really good shape.

dgkf / R

Bug in `len()` implementation / Implementing the `length()` primitive #105