Closed sebffischer closed 1 month ago
@dgkf To allow for this refactor I think the Rep<T>
should then also include the names.
We would thereby also get named vectors.
I am wondering though whether there was a specific reason why the Names
subset was implemented via NamedSubsets
(https://github.com/dgkf/R/blob/ee9b4a43a606b492d5949f96e0406f85ac91decd/src/object/vector/subsets.rs#L9) and not using the Subset::Names
(https://github.com/dgkf/R/blob/ee9b4a43a606b492d5949f96e0406f85ac91decd/src/object/vector/subset.rs#L19)
I am wondering though whether there was a specific reason why the
Names
subset was implemented viaNamedSubsets
Yeah, this part isn't as clean as I'd like. It could definitely benefit from a rework.
If a subset only operates on indices, then we can collapse multiple subsets into a set of indices pretty easily - x[1:10][c(2, 4, 6)][c(1, 3)]
can be determined to be equivalent to x[c(2, 6)]
. We don't need any other information about x. But when a vector is named we need to know its names if we want to subset using them.
These two structures serve different purposes.
Names
as a subset just lists which names have been selected. When calling x[c("a", "b")]
, we'd have a Names
subset of vec!["a", "b"]
.NamedSubsets
is entirely different - it provides a mechanism for handing off a vector's names to the Subset object so that the names of each element can be used to apply a Names
subset. Without it, we'd have no idea which position corresponds to each name being subset. These are poorly named and conceptually I'm not crazy about how I modeled the problem. It exists this way because I built the subset altrep while completely neglecting named subsets and only realized after the rest of the proof-of-concept was put together :grimacing:. If the feature looks tacked on, that's because it totally was.
Just fyi: My vacation is starting on Friday and I am planning to resume working on this!
Uff, so I think I started something here before properly understanding it.
The changes here basically allow Rep<Obj>
to represent an unnamed list.
Because I think we eventually want to distinguish between a list and an unnamed list for efficiency, I think it is worth to keep those minor changes.
One idea I had is whether we maybe want to add a NamedSubsets
variant to the RepType
enum.
This NamedSubsets
type would basically be what is now the List
, with the different that it is generic in the type that it holds.
This would:
I am not sure whether this is fully necessary now and maybe there is a better way to approach this. What do you think?
One idea I had is whether we maybe want to add a
NamedSubsets
variant to theRepType
enum.
Let me make sure I'm understanding you correctly here. The proposal is to store names in representation instead of in the Obj
type?
Just from the perspective of how this abstracts the problem, I feel like this sort of conflates a couple ideas. If we still want a named object to be able to behave like an unnamed object, then having two separate variants here that have largely overlapping behaviors feels like it might not be modelling the problem as effectively as it could be.
What about something like:
RepType<T> {
Subset(CowObj<Vec<T>>, Option<..>, Subsets)
}
Where Option<..>
is optional name information. This does introduce some overhead for unnamed vectors, but I feel like it models what we want a bit more directly.
Yeah, I think this is better, I will try it out.
Another positive side effect of refactoring this is that it forces us to implement the behavior of lists and vectors consistently.
Assigning NULL
to list-subsets behaves rather weirdly in R I think:
l = list(1, 2, 3, 4)
l[1:2] = NULL
l
#> [[1]]
#> [1] 3
#>
#> [[2]]
#> [1] 4
l[[1]] = NULL
l
#> [[1]]
#> [1] 4
While I find it somewhat reasonable that assigning NULL
to a slice removes the elements (l[1:2] = NULL
), I find the behavior in the second assignment (l[[1]] = NULL
), where we set an individual element to NULL
quite weird. Instead of removing the element by assigning it to NULL
, I think it should replace the value of 3
with NULL
.
Furthermore, I would even argue against the semantics of the slice-assignment to NULL (l[1:2] = NULL
) , as it is inconsistent with how (atomic) vectors behave:
v = 1:3
v[1:2] = NULL
#> Error in v[1:2] = NULL: replacement has length zero
For the scenario where one wants to remove specific elements from a list/vector, we can (in the future, see https://github.com/dgkf/R/issues/169) achieve this with negative indices.
However, for named lists / vectors there is an open question on how we want to handle removing a named element:
ln = list(a = 1, b = 2)
ln$a = NULL
ln
#> $b
#> [1] 2
One way to support this operation would be to somehow allow to replace line 2 from above with ln = ln[-"a"]
.
It is not totally clear to me how to achieve this in a streamlined way.
The second (and my preferred) solution would be to implement a remove
function so that line 2 could be replaced with remove(ln, "a")
.
Assigning NULL
as a way of removing elements is definitely odd.
One way to address this would be to implement a version of [.list
that operates on functions. Then you have a lot of freedom to implement selectors such as:
l <- list(a = 1, b = 2, c = 3)
l[not("b")]
where
not <- function(not_names) {
function(x) x[!names(x) %in% names(not_names)]
}
Then
`[.list` <- function(x, subset) {
UseMethod("[.list", subset)
}
`[.list.function` <- function(x, subset) {
subset(x)
}
I generally like this style, where things that are generic over higher-order functions are used to provide functionality as opposed to bespoke syntax. They open up a lot of opportunities for more composable and expressive selectors.
So maybe just to tie it back to this work, I think it's important to first and foremost implement what is most reasonable to the underlying data representation in the internals. It's the role of the standard library or other packages to add convenience syntax.
superseded by: https://github.com/dgkf/R/pull/180
TODOs:
.len()