Open sebffischer opened 6 months ago
Note: adding the length field to Rep
Vectors in R today store a bit of metadata as well. I don't remember exactly what they track, but a number of algorithms could be made faster if we always know whether a vector has missingness, is sorted, has a known length - and I'm sure many more.
Although I think it will happen eventually, I've been trying to avoid storing redundant metadata for now. Ithink we have to be really thoughtful about where it goes - whether it is specific to an alt-rep, should be universal to all vectors, or all objects. So far I've just opted for the minimum data and haven't worried too much about the performance costs of re-calculating things.
My current idea would be to add a Option
field to the Subset variant of Rep here.
I'd recommend trying to store minimal state in the object. If you can get the behavior you're looking for without more data in the struct, then that simplicity is preferred for now. If it's unavoidable, then it's a pretty minimal change, so go for it.
However, if the lazy vector representation is a core feature of the language, I expect that there will be more cases like this. With "like this" I mean that functions can in some cases be calculated on the lazy representation, whereas they otherwise might need to materialize the vector.
This will be inevitable. My thinking for now is that the laziness is just hidden in the background. As soon as something needs a materialized vector it materializes. In this case, if an iterator doesn't have a conclusive size hint, then the vector is materialized and the length is returned.
I am stumbling upon another problem here:
There are currently methods that require immutable references (such as the Display implementation for Rep
).
However, these methods call the .len()
method of Rep
.
If we want to implement .len()
on Rep
correctly, this might require materialization and hence a mutable reference.
Therefore, either all those methods have to be rewritten, and trait implementations such as of Display discarded, or the in-place materialization of vector representations uses unsafe rust. Maybe there are also other options which I am not aware of.
My gut-feeling tells me that this might be a situation for unsafe rust (because lazy representations and materialized representations should really behave identically when accessed through their public API), but as I have very little experience in the language I am not sure...
Yeah - I imagine things might need to change. This is good, though. This is exactly how the rest of the code has developed so far. We've identified a place where the current model of the problem doesn't work, and we've found a good, relatively minimal interface to test a redesign.
In this case, the fundamental behavior is "how do we materialize a vector?". This effectively means instantiating a new RefCell
in some cases and replacing the current reference with a new one.
I'm not proficient enough to know exactly what the solution will look like at the start, so I'll just speculate and maybe some ideas spur some new directions:
.len()
signature changes. Maybe it should return (&'a Rep, usize)
or something that allows us to accept a reference to a Rep, but return a different Rep (with a materialized view of the original). as_materialized
method so that specific operations require first running x.as_materialized().len()
- though ideally when to materialize is implicit to the methods used, not something we have to manage on a case-by-case basis. Therefore, either all those methods have to be rewritten, and trait implementations such as of Display discarded
Yeah... that's certainly possible! I wouldn't see it so much as a failure of the interface, just that it was originally written without the full feature set in mind. As our understanding of the problem grows, some implementations will have to adapt as well. In most cases when I initially feel overwhelmed with a rewrite, I end up pleasantly surprised with how little has to change. Most likely the complexity will move into .len()
and Display will be able to stay mostly untouched.
might be a situation for unsafe rust
I don't think this will require unsafe. It might require that we re-model a lot of the language, but fundamentally what we're trying to do is not so esoteric or magical.
Thanks for your input!
Maybe one solution would be to introduce a VarRep
(Variable Represenation) struct, which just wraps Rep<T>
in a RefCell
.
The purpose of this struct would be for Obj::Vector
s to allow them to materialize themselves even with mutable references. (I think you are right that unsafe is not needed for this, but I think interior mutability is).
pub struct VarRep<T>(RefCell<Rep<T>>);
This type can then be used as the data in the Vector
enum:
#[derive(Debug, Clone, PartialEq)]
pub enum Vector {
Double(VarRep<Double>),
Integer(VarRep<Integer>),
Logical(VarRep<Logical>),
Character(VarRep<Character>),
}
For example, the Display
implementation for Vector
then calls the Display
method for VarRep
and VarRep
can materialize itself if necessary and then redirect the len()
call to Rep<T>
.
To achieve this, we need to implement all the methods that are implemented for Rep
also for VarRep
.
Unfortunately I think that Deref
coercion does not play nicely with RefCell
, so there would be a lot of boilerplate code that redirects the calls on VarRep<T>
to Rep<T>
, see illustrated below.
The most important method below is the materialize_inplace
method, which operates on an immutable reference.
Because VarRep
contains a RefCell
, we can swich one Rep<T>
for another.
impl <T> VarRep<T>
// the underlying Rep<T> should not be exposed through the public API
fn borrow(&self) -> Ref<Rep<T>> {
self.0.borrow()
}
pub fn new() -> Self {
Rep::new().into()
}
fn materialize_inplace(&self) -> &Self {
self.0.replace(RefCell::new(self.borrow().materialize()))
&self
}
pub fn len(&self) -> usize {
if self.borrow().len_requires_materialization() {
self.materialize_inplace()
}
self.borrow().len()
}
pub fn get(&self, index: usize) -> Option<Self> {
let x = self.borrow().get(index);
match x {
Some(x) => Some(x.into()),
None => None,
}
}
// other methods
}
impl<T> From<Rep<T>> for VarRep<T>
where
T: AtomicMode + Clone + Default,
{
fn from(rep: Rep<T>) -> Self {
VarRep(RefCell::new(rep))
}
}
Testing this, I think just operating on lists is left before this is in really good shape.
The
len()
implementation of theRep<T>
enum is not correct in all cases I believe:Mask
subset, the length of the resulting vector is not the length oflast
subset but the sum of the recycled logical vectorIndices
subset, taking the minimum over the "actual vector" and the elements in the last subset does not work. Even if we assume that all indices are within bounds, it still fails in cases such as(1:2)[c(1, 1, 1)]
.Context: I am just trying to see whether I can implement the
length()
primitive. My current idea would be to add aOption<usize>
field to theSubset
variant ofRep<T>
here. Then, whenever a new subset is added toSubsets
by callingpush()
, we check whether the length of the new vector is known. If it is, we set the field toOption::Some(length)
, otherwise toNone
. Thelen()
method of theRep<T>
then simply reads the field.If we define
len()
like this forRep<T>
, we can then definelen()
onVector
(https://github.com/dgkf/R/blob/2ef9780a2a7a155a22bd42cbad73d59a3b084324/src/object/vector/core.rs#L199) by callinglen()
on theRep<T>
. IfSome(usize)
is returned, we know the length. Otherwise - ifNone
is returned - we could modify the vector in-place, i.e. call materialize on the subset and replace the existing subset-representation of the vector with the materialized one (so we avoid materializing the vector more than once).Alternatively we can also generalize this to use a size hint for lower and upper bounds as you suggested in this issue: https://github.com/dgkf/R/issues/98
What I don't like about this solution is that it is very tailored to the length. However, if the lazy vector representation is a core feature of the language, I expect that there will be more cases like this. With "like this" I mean that functions can in some cases be calculated on the lazy representation, whereas they otherwise might need to materialize the vector.