dgkf / R

An experimental reimagining of R
https://dgkf.github.io/R
GNU General Public License v3.0

Should `NA` be a valid value for subsetting #167

Open sebffischer opened 3 months ago

sebffischer commented 3 months ago

Currently we have

> [1, 2, 3][c(1, NA, 2)]
[1]  1 NA  2
> 

which agrees with R, but I am not sure this should be possible. It does not seem ideal for developers in particular, and I am also not sure it is something one wants when working interactively in the REPL.

For these kinds of questions I think it is also nice to just ask people, so I started a poll on Mastodon. The only problem is that I have no followers 😅, but let's see.

sebffischer commented 3 months ago

So the poll basically ended 52:48 in favor of the feature.

sebffischer commented 3 months ago

Quoting myself from the discussion in the thread:

I think ideally, indexing beyond bounds should already err, i.e. (1:10)[11] should not return NA but an error.

If one agrees with that, I think the problem with indexing with NA is that it is unclear whether the NA is a value within the range of indices of a vector. I.e. if I do (1:10)[NA], the NA could be a value > 10 which should result in an error.
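For comparison, here is a small sketch of how GNU R (not this project's interpreter) treats these cases today. Note that a bare `NA` is a logical value in GNU R, so it is recycled over the whole vector, while `NA_integer_` acts as a single unknown position.

```r
# GNU R behaviour, shown for comparison only (not output from dgkf/R).
x <- 1:10

out_of_bounds <- x[11]           # a single NA rather than an error
na_integer    <- x[NA_integer_]  # a single NA: one index whose position is unknown
na_logical    <- x[NA]           # bare NA is logical, so it is recycled to length(x)

length(out_of_bounds)  # 1
length(na_integer)     # 1
length(na_logical)     # 10
```

In all three cases GNU R returns `NA` silently, which is the behaviour being questioned here.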

fmarotta commented 3 months ago

Hi, just to add something to the discussion: I think R's behaviour has two advantages. First, it makes the length of the subset predictable (it will always be the same length as the subsetting vector, regardless of how many NAs there are). Second, it avoids dropping data when subsetting a vector with a function of itself, e.g.

chicken_weights <- c(1, 4, NA, 2, 3, 10)
heavy_chickens <- chicken_weights[chicken_weights >= 3]

If NAs weren't kept, heavy_chickens would have no missing data, which is misleading. I'm not on Mastodon, so I'm not sure whether this point has been raised before.
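To make the difference concrete, this sketch runs the example in GNU R (assumed here as the reference behaviour) and contrasts it with `which()`, which discards the `NA` comparison before subsetting.

```r
chicken_weights <- c(1, 4, NA, 2, 3, 10)

# Logical subsetting keeps a placeholder for the comparison that is unknown.
heavy_chickens <- chicken_weights[chicken_weights >= 3]
heavy_chickens  # 4 NA 3 10

# which() returns only the positions that are TRUE, so the chicken with
# the unknown weight silently disappears from the result.
heavy_known <- chicken_weights[which(chicken_weights >= 3)]
heavy_known     # 4 3 10
```

The second result looks complete even though one chicken's weight was never known.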

sebffischer commented 3 months ago

Your example is an important use case to keep in mind and was not yet mentioned on Mastodon, thanks for raising it! I agree that NAs should probably not be dropped, but I think it might make sense to throw an error if they are present in a subset.
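As a rough illustration of that idea, the sketch below emulates it in GNU R. `strict_subset` is a hypothetical helper written for this example; it is not part of GNU R or of this project, and the error message is made up.

```r
# Hypothetical helper (an assumption for illustration): refuse NA in the index.
strict_subset <- function(x, i) {
  if (anyNA(i)) {
    stop("subscript contains NA; handle missing values explicitly")
  }
  x[i]
}

chicken_weights <- c(1, 4, NA, 2, 3, 10)

# The comparison is NA for the missing weight, so this call raises an error.
result <- tryCatch(
  strict_subset(chicken_weights, chicken_weights >= 3),
  error = function(e) conditionMessage(e)
)
result  # the error message, as a string

# Stating the intent explicitly succeeds, here by dropping unknown weights.
known_heavy <- strict_subset(
  chicken_weights,
  !is.na(chicken_weights) & chicken_weights >= 3
)
known_heavy  # 4 3 10
```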

This would mean that the user has to specify explicitly how to handle the NAs, which I think might be a benefit of this model and encourage more careful handling of missing values. R-like behavior can still be achieved by including a check for missingness in the subset.

chicken_weights <- c(1, 4, NA, 2, 3, 10)
heavy_chickens <- chicken_weights[is.na(chicken_weights) | chicken_weights >= 3]

fmarotta commented 3 months ago

I see, erroring out would make sense :+1: