WIP: Refactor `List` to use `Rep<Obj>` internally

sebffischer commented 3 months ago

TODOs:

(maybe) properly implement .len()

sebffischer commented 3 months ago

@dgkf To allow for this refactor I think the Rep<T> should then also include the names. We would thereby also get named vectors.

I am wondering though whether there was a specific reason why the Names subset was implemented via NamedSubsets (https://github.com/dgkf/R/blob/ee9b4a43a606b492d5949f96e0406f85ac91decd/src/object/vector/subsets.rs#L9) and not using the Subset::Names (https://github.com/dgkf/R/blob/ee9b4a43a606b492d5949f96e0406f85ac91decd/src/object/vector/subset.rs#L19)

dgkf commented 3 months ago

I am wondering though whether there was a specific reason why the Names subset was implemented via NamedSubsets

Yeah, this part isn't as clean as I'd like. It could definitely benefit from a rework.

If a subset only operates on indices, then we can collapse multiple subsets into a set of indices pretty easily - x[1:10][c(2, 4, 6)][c(1, 3)] can be determined to be equivalent to x[c(2, 6)]. We don't need any other information about x. But when a vector is named we need to know its names if we want to subset using them.

These two structures serve different purposes.

Names as a subset just lists which names have been selected. When calling x[c("a", "b")], we'd have a Names subset of vec!["a", "b"].
NamedSubsets is entirely different - it provides a mechanism for handing off a vector's names to the Subset object so that the names of each element can be used to apply a Names subset. Without it, we'd have no idea which position corresponds to each name being subset.

These are poorly named and conceptually I'm not crazy about how I modeled the problem. It exists this way because I built the subset altrep while completely neglecting named subsets and only realized after the rest of the proof-of-concept was put together :grimacing:. If the feature looks tacked on, that's because it totally was.

sebffischer commented 2 months ago

Just fyi: My vacation is starting on Friday and I am planning to resume working on this!

sebffischer commented 1 month ago

Uff, so I think I started something here before properly understanding it.

The changes here basically allow Rep<Obj> to represent an unnamed list. Because I think we eventually want to distinguish between a list and an unnamed list for efficiency, I think it is worth to keep those minor changes.

One idea I had is whether we maybe want to add a NamedSubsets variant to the RepType enum. This NamedSubsets type would basically be what is now the List, with the different that it is generic in the type that it holds.

This would:

allow to unify the API of the list representation and the vector representation
give us named vectors (which I assume we want ?)
allow a more efficient representation for lists without names

I am not sure whether this is fully necessary now and maybe there is a better way to approach this. What do you think?

dgkf commented 1 month ago

One idea I had is whether we maybe want to add a NamedSubsets variant to the RepType enum.

Let me make sure I'm understanding you correctly here. The proposal is to store names in representation instead of in the Obj type?

Just from the perspective of how this abstracts the problem, I feel like this sort of conflates a couple ideas. If we still want a named object to be able to behave like an unnamed object, then having two separate variants here that have largely overlapping behaviors feels like it might not be modelling the problem as effectively as it could be.

What about something like:

RepType<T> {
  Subset(CowObj<Vec<T>>, Option<..>, Subsets)
}

Where Option<..> is optional name information. This does introduce some overhead for unnamed vectors, but I feel like it models what we want a bit more directly.

sebffischer commented 1 month ago

Yeah, I think this is better, I will try it out.

sebffischer commented 1 month ago

Another positive side effect of refactoring this is that it forces us to implement the behavior of lists and vectors consistently.

Assigning NULL to list-subsets behaves rather weirdly in R I think:

l = list(1, 2, 3, 4)
l[1:2] = NULL
l
#> [[1]]
#> [1] 3
#> 
#> [[2]]
#> [1] 4

l[[1]] = NULL
l
#> [[1]]
#> [1] 4

While I find it somewhat reasonable that assigning NULL to a slice removes the elements (l[1:2] = NULL), I find the behavior in the second assignment (l[[1]] = NULL), where we set an individual element to NULL quite weird. Instead of removing the element by assigning it to NULL, I think it should replace the value of 3 with NULL.

Furthermore, I would even argue against the semantics of the slice-assignment to NULL (l[1:2] = NULL) , as it is inconsistent with how (atomic) vectors behave:

v = 1:3
v[1:2] = NULL
#> Error in v[1:2] = NULL: replacement has length zero

For the scenario where one wants to remove specific elements from a list/vector, we can (in the future, see https://github.com/dgkf/R/issues/169) achieve this with negative indices.

However, for named lists / vectors there is an open question on how we want to handle removing a named element:

ln = list(a = 1, b = 2)
ln$a = NULL
ln
#> $b
#> [1] 2

One way to support this operation would be to somehow allow to replace line 2 from above with ln = ln[-"a"]. It is not totally clear to me how to achieve this in a streamlined way. The second (and my preferred) solution would be to implement a remove function so that line 2 could be replaced with remove(ln, "a").

dgkf commented 1 month ago

Assigning NULL as a way of removing elements is definitely odd.

One way to address this would be to implement a version of [.list that operates on functions. Then you have a lot of freedom to implement selectors such as:

l <- list(a = 1, b = 2, c = 3)
l[not("b")]

where

not <- function(not_names) {
  function(x) x[!names(x) %in% names(not_names)]
}

Then

`[.list` <- function(x, subset) {
  UseMethod("[.list", subset)
}

`[.list.function` <- function(x, subset) {
  subset(x)
}

I generally like this style, where things that are generic over higher-order functions are used to provide functionality as opposed to bespoke syntax. They open up a lot of opportunities for more composable and expressive selectors.

So maybe just to tie it back to this work, I think it's important to first and foremost implement what is most reasonable to the underlying data representation in the internals. It's the role of the standard library or other packages to add convenience syntax.

sebffischer commented 1 month ago

superseded by: https://github.com/dgkf/R/pull/180

dgkf / R

WIP: Refactor `List` to use `Rep<Obj>` internally #172