dgkf / R

An experimental reimagining of R
https://dgkf.github.io/R
GNU General Public License v3.0

Suggestion: Abandon named Lists in favor of Hashmaps with value semantics #137

Open sebffischer opened 3 months ago

sebffischer commented 3 months ago

I have always felt that it is somewhat odd that names are not necessarily unique while it is still possible to subset named objects by their name. I would suggest that it should no longer be possible to assign names to vectors / lists. Instead, I would suggest adding something like a Dict data type that is a hashmap with value semantics.
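As a rough illustration (not how dgkf/R is implemented today), a minimal Rust sketch of what "hashmap with value semantics" could mean: two bindings share the same map until one of them is written to, at which point the map is copied, so mutation through one binding never leaks into the other.

use std::collections::HashMap;
use std::rc::Rc;

fn main() {
    // A dict value; the Rc lets two bindings share the same map cheaply.
    let mut a: Rc<HashMap<String, f64>> = Rc::new(HashMap::new());
    Rc::make_mut(&mut a).insert("x".into(), 1.0);

    // Binding the dict to a second name shares storage instead of copying.
    let mut b = Rc::clone(&a);

    // Writing through `b` clones the map first (copy-on-write),
    // so the binding `a` is unaffected -- value semantics.
    Rc::make_mut(&mut b).insert("x".into(), 2.0);

    assert_eq!(a.get("x"), Some(&1.0));
    assert_eq!(b.get("x"), Some(&2.0));
}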

sebffischer commented 3 months ago

I realized that a HashMap is already used for the names, so uniqueness is not a problem.

sebffischer commented 3 months ago

At least one situation where it is absolutely necessary to be able to index both by name and by position is a data.frame, but I would argue that this should then maybe be a specific feature of the data.frame type, instead of supporting named lists as a core data type.

Some arguments against named lists are:

dgkf commented 3 months ago

I do think that lists, like data frames, should preserve order. As lists are often used for passing arguments, whether as named lists or as captured arguments, the order here is very important for supporting metaprogramming.

We could consider providing a dict type, with list eventually provided as a stdlib type, but I'd still say that for now we should just keep the list.

sebffischer commented 3 months ago

I found the indexmap data structure that might be handy here:

https://docs.rs/indexmap/latest/indexmap/map/struct.IndexMap.html

I have not read about it in detail, but it tracks the order in which elements were inserted into the map.
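A small sketch of how that behaves, assuming the indexmap crate as a dependency: iteration follows insertion order, and entries are reachable both by key and by position, which is roughly the combination a named list (or a data.frame) needs.

use indexmap::IndexMap;

fn main() {
    let mut m = IndexMap::new();
    m.insert("b", 2); // keys deliberately inserted out of alphabetical order
    m.insert("a", 1);
    m.insert("c", 3);

    // Iteration preserves insertion order, not key order.
    let keys: Vec<_> = m.keys().copied().collect();
    assert_eq!(keys, vec!["b", "a", "c"]);

    // Lookup works both by name (hashed) and by position.
    assert_eq!(m.get("a"), Some(&1));
    assert_eq!(m.get_index(2), Some((&"c", &3)));
}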

sebffischer commented 2 months ago

Ok, following up on that: What do you think about making the names in a List unique? I feel that being able to do things like c(list(x = 1), list(x = 2))$x is kind of a footgun.

sebffischer commented 2 months ago

I think I would go for c(list(x = 1), list(x = 2)) to result in list(x = 2), i.e. later names overwrite earlier ones.

sebffischer commented 2 months ago

Maybe we also want a specialization for unnamed lists, i.e. something like:


enum List {
  Named(...)
  Unnamed(...)
}
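Filled in purely for concreteness, one way that sketch could look, where Value is a placeholder for whatever element type the interpreter actually uses and the Named variant assumes the unique-names proposal above:

use indexmap::IndexMap;

// Placeholder for the interpreter's element type.
type Value = f64;

enum List {
    // Unique names, insertion-ordered, O(1) expected lookup by name.
    Named(IndexMap<String, Value>),
    // No names at all, so no per-element name bookkeeping.
    Unnamed(Vec<Value>),
}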

dgkf commented 2 months ago

One nice extension of lists with duplicate names is for metaprogramming of function calls.

Consider the function:

f <- fn(..rest) {
  rest
}

If called by unpacking two lists into its arguments, I like the semantics where the last named argument "wins":

x <- list(a = 1, b = 2, c = 3)
y <- list(b = 4, d = 5)
f(..x, ..y)
# (a = 1, c = 3, b = 4, d = 5)

But I think that the duplicate parameters should still be available for non-standard evaluation:

f <- fn(..rest) substitute(..rest())
f(..x, ..y)
# (a = 1, b = 2, c = 3, b = 4, d = 5)
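At the implementation level (again assuming the indexmap crate), those two views could coexist: the raw argument pairs keep duplicates in call order for substitute(), while argument matching collapses them so the last occurrence wins and takes the later position. A rough sketch:

use indexmap::IndexMap;

fn main() {
    // Raw view, as substitute(..rest) would see it: duplicates preserved.
    let pairs = vec![
        ("a", 1), ("b", 2), ("c", 3), // ..x
        ("b", 4), ("d", 5),           // ..y
    ];

    // Matched view: the last occurrence wins and moves to its later position.
    let mut matched: IndexMap<&str, i32> = IndexMap::new();
    for (name, value) in pairs.iter().copied() {
        // shift_remove keeps the order of the remaining entries intact,
        // so re-inserting places `name` at the position of its last occurrence.
        matched.shift_remove(name);
        matched.insert(name, value);
    }

    let merged: Vec<_> = matched.into_iter().collect();
    assert_eq!(merged, vec![("a", 1), ("c", 3), ("b", 4), ("d", 5)]);
}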

The ability to define multiple names as arguments can be a powerful language feature. For example, dplyr::mutate can take multiple arguments of the same name, iteratively deriving a new column.

Just some general ideas

sebffischer commented 2 months ago

If called by unpacking two lists into its arguments, I like the semantics where the last named argument "wins":

I can see the appeal of this, but I would prefer this to be opt-in behavior rather than the default. Jon Gjengset, in his book "Rust for Rustaceans", talks about API design and mentions that APIs should not be surprising. However, I want to argue that this API can be surprising; see the example below:

x = list(a = 1, b = 1, c = 1, d = 1)
f = fn(a, ..rest) {
  a * sum(unlist(rest))
}
f(a = 10, ..x) # later arguments have priority, so this is effectively f(a = 1, b = 1, c = 1, d = 1)
#> 3

Here, I would argue that one clearly intended to set a = 10, but because of the "last name wins" rule, the explicit a = 10 is silently overridden. I think I would prefer the above function call to return an error.

Or another example would be:

x1 = list(a = 100)
g = fn(..rest) {
  reduce(`+`, 0, rest)
}
g(..x1, ..x) # I would expect: 104
#> 4

The ability to define multiple names as arguments can be a powerful language feature. For example, dplyr::mutate can take multiple arguments of the same name, iteratively deriving a new column.

But I think that the code below can be rewritten without the language feature,

dat |> 
  mutate(x = x + 1, x = x * 10)

e.g. into

dat |> 
  mutate(x = x + 1) |>
  mutate(x = x * 10)

and I think that if slightly more verbose code is the cost of avoiding the headaches of always having to deal with possibly duplicated names in non-standard evaluation, it is worth it.

An alternative to lists that allow duplicate names would be something like Julia's Pair, so that instead of a named list we end up with a vector of pairs. The first element of each pair would be a symbol and could be duplicated.

Instinctively, using such a data structure to represent objects with duplicate names would feel more natural.
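A hedged sketch of that representation in Rust: the arguments become a plain vector of (symbol, value) pairs, duplicates and order are preserved by construction, and a name lookup can mean the first match, the last match, or all matches, depending on what the caller asks for.

fn main() {
    // A named-argument list as a vector of pairs, like a vector of Julia Pairs.
    let args: Vec<(&str, i32)> = vec![("x", 1), ("y", 2), ("x", 3)];

    // Duplicate names are representable; the lookup policy is up to the caller.
    let first_x = args.iter().find(|(k, _)| *k == "x").map(|(_, v)| *v);
    let last_x = args.iter().rev().find(|(k, _)| *k == "x").map(|(_, v)| *v);
    let all_x: Vec<i32> = args
        .iter()
        .filter(|(k, _)| *k == "x")
        .map(|(_, v)| *v)
        .collect();

    assert_eq!(first_x, Some(1));
    assert_eq!(last_x, Some(3));
    assert_eq!(all_x, vec![1, 3]);
}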

Regarding Named and Unnamed lists, I'd prefer to focus just on named lists until we get the semantics figured out. Then we can work on optimizations.

agreed!

sebffischer commented 2 months ago

The ability to define multiple names as arguments can be a powerful language feature. For example, dplyr::mutate can take multiple arguments of the same name, iteratively deriving a new column.

What basically happens with the non-standard evaluation in dplyr is that the arguments passed to the function are really just interpreted as code, i.e. when we call mutate(x = x + 1), it should maybe not be interpreted as binding the expression x + 1 as a promise to the parameter x (the lhs), but as passing the expression {x = x + 1} as an unnamed argument. We could then do mutate({x = x + 1; x = x * 10}), which does not look so nice. Maybe we could introduce special syntax to make this cleaner.

I am not confident in the exact details, but I am imagining something like:

f = fn( (..rest) ) {
  substitute(rest)
}
f(x = x + 1, x = x * 10)
#> (quote(x = x + 1), quote(x = x * 10))

g = fn( (..rest) ) rest
g(x = 1, x = 2)
#> (1, 2)

This would:

  1. make it opt-in, rather than the default behavior, meaning that for those functions where this is not required, people don't have to worry about duplicate names
  2. allow us to have unique parameter names

dgkf commented 2 months ago

On substitution of rest args

I'd like to avoid requiring really bespoke syntax to skirt the argument matching (f({ x = val }), etc).

I really like your proposal of having substitute(<rest arg>) produce a list of expressions that includes the =, and of course there would still be the sys.call() mechanism for grabbing the explicit calling code regardless of how the arguments are interpreted.

f <- fn(..rest) {
  substitute(rest)
}
f(a = 1, b = 2, c = 3)
# list(quote(a = 1), quote(b = 2), quote(c = 3))

One quirk of this is that you can no longer do something like lapply(rest, eval, envir = parent.frame()), because evaluating the quoted name = value expressions would perform assignment in the environment, but I think this is a low-level enough interface that we can expect someone using it to be able to work around its quirks.

On duplicate arg last name wins

I think both styles can be unintuitive in their own ways. I've seen many instances of R developers not sanitizing ... arguments, which means you can surface a low level "duplicate argument" error even if you're using the function as documented. In my experience, most newer developers avoid this type of argument unpacking, so it's probably okay to prioritize developer-friendliness over user-facing semantics.

Just for some neighboring-language inspiration, Julia uses the last-name-wins style, which is quite convenient as a developer. The order of arguments allows you to accept a list of arguments while still easily setting default values (before user-facing kwargs...) or fixing argument values (after user-facing kwargs...).

Julia will still throw an error if the same named argument is specified twice literally (i.e. not via a kwargs... expansion):

fn(a = 1, a = 2)  # error
fn(; (:a => 1,)..., (:a => 2,)...)  # fine

I think this is a nice guardrail, while also leaving space for this really convenient syntax. If it's really problematic, having a global option to warn when top-level calls perform this type of masking might be enough to mitigate any new-user issues.

sebffischer commented 2 months ago

I think I would be on board with this suggestion. I was a bit over-anxious regarding the last-name-wins semantics, but as long as we make fn(a = 1, a = 2) throw an error, I am on board!

If we implement substitute() for .. arguments as discussed, does this then also mean that we can go with unique names in List? Another argument for this is that if we want Lists (and the things we later build on top of them, e.g. data.frames) to work as evaluation environments (as you suggested here: https://github.com/dgkf/R/issues/57), we really want to treat the names of the list as variable names, which I think should definitely be unique.

dgkf commented 2 months ago

If we implement substitute() for .. arguments as discussed, does this then also mean that we can go with unique names in List?

Yeah - if we had the separate substitute(...) logic that produced full quote(param = value) expressions, then duplicate names could be handled in ways that allow list to retain unique names. Having an argpairs(substitute(...)) helper could help to provide more R-like behaviors.

I'm not sure what data structure would be appropriate here - maybe a c("pair_df", "data_frame") that has columns c("key", "value") and methods for many of the indexing behaviors, to make it feature-comparable to R's lists.

I see it as something that can be provided by a standard library, not something that needs to be baked into the language, but maybe some performance considerations could change my mind.

JosiahParry commented 1 month ago

I found the indexmap data structure that might be handy here...

I have used and quite liked indexmap in practice. The issue with all of these maps is that ordering is problematic.

indexmap uses the order of insertion. HashMap does not have order. BTreeMap requires that the index has partialord. In R, you can order a list however you want and that order persists. I like that about them quite a bit. It is one of the nice parts of the flexibility that R provides.