Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.57k stars 975 forks source link

Allow coercion for NA input to fifelse #4277

Closed MichaelChirico closed 3 years ago

MichaelChirico commented 4 years ago
fifelse(c(TRUE, FALSE), NA, 1L)
# Error in fifelse(c(TRUE, FALSE), NA, 1L) : 
#   'yes' is of type logical but 'no' is of type integer. Please make sure that both arguments have the same type.

that typeof(NA) is logical I think is an implementaton detail we can obscure from users. we should coerce NA to NA_integer_ in this case.

shrektan commented 4 years ago

I'm happy to file a PR if you haven't started yet...

MichaelChirico commented 4 years ago

I haven't! please do

2005m commented 4 years ago

Just my view, but I don't think it is a good idea to i/obscure things like this to users ii/break consistency however small it is iii/introduce exception. User must be fully aware of what is going on when they use a function.

shrektan commented 4 years ago

The thing is, in my opinion, we need to provide an interface that's easy for ordinary R users.

Unfortunately, many R users just don't know much about programming, don't use GitHub, don't know there's a difference between NA and NAreal/integer_. Yes, they should be aware of this differences. However, the base R does this coercion everywhere, if 'data.table' do not, they may just think it's a bug of data.table and stop using it - and this is the least I want.

In conclusion, I think we'd better support it as well, especially considering there's no penalty for having this (the performance impact I believe is minimal and the maintenance should be fine).

2005m commented 4 years ago

So you would rather let them be ignorant instead of educating them via useful, clear and helpful error messages? Just to give you a counter example, dplyr::if_else does not do it. Just quoting it because a lot of people (around here included) consider that package as a reference. (Just to be clear it is not my case)

jangorecki commented 4 years ago

NA is quite different thing, I would rather prefer to warn on using 1 where 1L is meant to be used than NAs. For example in Julia lang, missing value does not have a type, this kind behavior is partially used in base, where NA type is automatically adjusted. I think it is good, to be fair, NA does not need to have a type from the logical perspective, it is only an R implementation that made it relevant.

2005m commented 4 years ago

I agree on the 1 and 1L but not on NA as it is clearly defined in the R documentation as "a logical constant of length 1" and you clearly do not have a NAlogical . Saying that NA type is sometimes automatically adjusted in base R is not a valid argument because a lot of things in base R completely lack of consistency. ifelse is a shocking example which also explains why there are various implementations...

shrektan commented 4 years ago

Yes, I agree that base R is lack of consistency so I have to admit my argument doesn't stand. 😢

However, despite the documentation that NA is "a logical constant of length 1", I'm afraid NA is indeed treated as a sort of generic missing value in many cases... For example, fcase() has default=NA where NA is treated as a generic... I believe this argument stands considering the fact that NA is notNA_logical_. It has no suffix like NA_real_...

jangorecki commented 4 years ago

In data.table logical NA is coerced to required type in majority of our features/functions. So having a policy about educating users on NA type sensitive usage would not be consistent. If there is no performance cost of it, then for me it is clearly better to make it easier for users. The way it could be, as for example in julia.

MichaelChirico commented 4 years ago

My 2 cents...

We're package devs for a high-level language. The point of a high-level language is for end users to be able to write simpler code & not need to know every last detail of the programs they're writing (if they do, more power to them, of course!). The burden of keeping track of these finer points is firmly in the hands of the package devs who can do a bunch of stuff "under the rug" to keep the user-facing house in order.

Not really a scientific poll but a point of reference nevertheless:

https://twitter.com/michael_chirico/status/1235175842034020352

Less than half of users following #rstats who responded knew off the bat that NA is logical. And I think that's fine! Yes, we could catch this and point this out to users, but honestly I don't see how this makes them any better off for all but a tiny set of "power users" -- that I know NA is logical in R by heart will not help me stop an epidemic, or design a new machine translation tool, or empower a charity.

The code we could write to say "hey, don't you meal NA_real_?" could just as easily be used to just silently convert NA to NA_real_.

Somewhat oblique case in point: Instagram auto-lints PRs for many common syntax inconsistencies:

https://instagram-engineering.com/static-analysis-at-scale-an-instagram-story-8f498ab71a0c

They've decided that rather than blocking new code because of style inconsistencies or other such minor things, they'd rather automatically fix them with a bot and allow engineers to focus their head space on less tedious things.

ColeMiller1 commented 4 years ago

Looks like dplyr is going this route, too:

https://github.com/tidyverse/dplyr/issues/5106#issue-598966452

TBH, when I was a R newbie I was always surprised by this type of error. SQL just has NULL - why wouldn't NA coerce to whatever it needs to be.

matthewgson commented 4 years ago

I think including some tips on the error would be nice for end-users like me. This would be enough to prevent confusion at the least.

For example,

Error in fifelse(condition, NA, 'Nah') : 
  'yes' is of type logical but 'no' is of type of character. Please make sure that both arguments have the same type. 
Do you mean fifelse(condition, NA_character_, 'Nah')?