Closed MichaelChirico closed 3 years ago
I'm happy to file a PR if you haven't started yet...
I haven't! please do
Just my view, but I don't think it is a good idea to i/obscure things like this to users ii/break consistency however small it is iii/introduce exception. User must be fully aware of what is going on when they use a function.
The thing is, in my opinion, we need to provide an interface that's easy for ordinary R users.
Unfortunately, many R users just don't know much about programming, don't use GitHub, don't know there's a difference between NA and NAreal/integer_. Yes, they should be aware of this differences. However, the base R does this coercion everywhere, if 'data.table' do not, they may just think it's a bug of data.table and stop using it - and this is the least I want.
In conclusion, I think we'd better support it as well, especially considering there's no penalty for having this (the performance impact I believe is minimal and the maintenance should be fine).
So you would rather let them be ignorant instead of educating them via useful, clear and helpful error messages?
Just to give you a counter example, dplyr::if_else
does not do it. Just quoting it because a lot of people (around here included) consider that package as a reference. (Just to be clear it is not my case)
NA is quite different thing, I would rather prefer to warn on using 1
where 1L
is meant to be used than NAs. For example in Julia lang, missing value does not have a type, this kind behavior is partially used in base, where NA type is automatically adjusted. I think it is good, to be fair, NA does not need to have a type from the logical perspective, it is only an R implementation that made it relevant.
I agree on the 1
and 1L
but not on NA
as it is clearly defined in the R documentation as "a logical constant of length 1" and you clearly do not have a NAlogical . Saying that NA type is sometimes automatically adjusted in base R is not a valid argument because a lot of things in base R completely lack of consistency. ifelse
is a shocking example which also explains why there are various implementations...
Yes, I agree that base R is lack of consistency so I have to admit my argument doesn't stand. 😢
However, despite the documentation that NA
is "a logical constant of length 1", I'm afraid NA
is indeed treated as a sort of generic missing value in many cases... For example, fcase()
has default=NA
where NA
is treated as a generic... I believe this argument stands considering the fact that NA
is notNA_logical_
. It has no suffix like NA_real_
...
In data.table logical NA is coerced to required type in majority of our features/functions. So having a policy about educating users on NA type sensitive usage would not be consistent. If there is no performance cost of it, then for me it is clearly better to make it easier for users. The way it could be, as for example in julia.
My 2 cents...
We're package devs for a high-level language. The point of a high-level language is for end users to be able to write simpler code & not need to know every last detail of the programs they're writing (if they do, more power to them, of course!). The burden of keeping track of these finer points is firmly in the hands of the package devs who can do a bunch of stuff "under the rug" to keep the user-facing house in order.
Not really a scientific poll but a point of reference nevertheless:
https://twitter.com/michael_chirico/status/1235175842034020352
Less than half of users following #rstats who responded knew off the bat that NA
is logical. And I think that's fine! Yes, we could catch this and point this out to users, but honestly I don't see how this makes them any better off for all but a tiny set of "power users" -- that I know NA
is logical in R by heart will not help me stop an epidemic, or design a new machine translation tool, or empower a charity.
The code we could write to say "hey, don't you meal NA_real_
?" could just as easily be used to just silently convert NA
to NA_real_
.
Somewhat oblique case in point: Instagram auto-lints PRs for many common syntax inconsistencies:
https://instagram-engineering.com/static-analysis-at-scale-an-instagram-story-8f498ab71a0c
They've decided that rather than blocking new code because of style inconsistencies or other such minor things, they'd rather automatically fix them with a bot and allow engineers to focus their head space on less tedious things.
Looks like dplyr
is going this route, too:
https://github.com/tidyverse/dplyr/issues/5106#issue-598966452
TBH, when I was a R newbie I was always surprised by this type of error. SQL just has NULL - why wouldn't NA
coerce to whatever it needs to be.
I think including some tips on the error would be nice for end-users like me. This would be enough to prevent confusion at the least.
For example,
Error in fifelse(condition, NA, 'Nah') :
'yes' is of type logical but 'no' is of type of character. Please make sure that both arguments have the same type.
Do you mean fifelse(condition, NA_character_, 'Nah')?
that
typeof(NA)
islogical
I think is an implementaton detail we can obscure from users. we should coerceNA
toNA_integer_
in this case.