Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Should logical01 be TRUE by default? #5856

Closed MichaelChirico closed 9 months ago

MichaelChirico commented 10 months ago

Follow-up to #5840.

When logical01 was originally developed, the intention was to eventually turn this argument on by default.

The benefit is getting the storage type right by default. Columns with values 0/1/NA should be "logical" storage in R. R is also very good about converting back to numeric 0/1 when appropriate.
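As a rough illustration of the "converting back to numeric 0/1" point, here is a minimal base R sketch. The vector `flag` and the result names are made up for illustration and are not taken from the issue.

```r
# Hypothetical logical column with a missing value.
flag <- c(TRUE, FALSE, NA, TRUE)

# Arithmetic and summaries treat TRUE as 1 and FALSE as 0.
n_true <- sum(flag, na.rm = TRUE)    # number of TRUE values
share  <- mean(flag, na.rm = TRUE)   # proportion TRUE among non-missing values

# An explicit conversion recovers the 0/1/NA integer coding.
as_int <- as.integer(flag)
```

So a column stored as logical can still be summed, averaged, or converted back to integer when a numeric 0/1 coding is needed.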

There are, however, a few downsides:

That's a dump of my thoughts on this issue for now. Opening this for discussion as we head into the next release.

markseeto commented 9 months ago

@MichaelChirico Thanks for opening this discussion.

From my point of view as a user, I'm not in favour of changing the default to logical01=TRUE.

I could have a count variable where values of 0 and 1 are common, and values greater than 1 are possible but relatively uncommon. My data might only have values 0 and 1, and I wouldn't want this changed to FALSE/TRUE. Similarly, I could have a categorical variable with 3 categories coded as 0/1/2, and I might be looking at interim data in which the value 2 hasn't appeared yet. In Michael's example with split files, some files might only contain 0/1 for a particular variable, while other files might contain 0/1/2 for that variable.

Even if I have a 0/1 variable that really is a logical variable, my preference would be to read it into R as 0/1 by default, rather than having it changed to FALSE/TRUE. If I want it to be FALSE/TRUE, I'll change it. My preference is to read data into R with minimal changes. But it's understandable that not everyone shares that preference.

Sometimes I have to send a modified data set back to whoever sent the raw data to me, and if they sent a variable as 0/1 then I would want to send it back as 0/1. If the default is logical01=TRUE, then if I'm not careful, I might not even realise that a FALSE/TRUE variable in my R data was actually 0/1 in the raw data.

However, if the default was changed to logical01=TRUE, I don't think it would bother me because I could just specify logical01=FALSE.
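The split-file concern above can be sketched roughly as follows. The inline strings `part_a` and `part_b` are hypothetical stand-ins for two files holding the same variable `x`; only `fread()` and its `logical01` argument come from data.table.

```r
library(data.table)

# Hypothetical file contents: the same variable `x` in two partial files.
part_a <- "id,x\n1,0\n2,1\n3,1\n"   # only 0 and 1 observed so far
part_b <- "id,x\n4,0\n5,1\n6,2\n"   # the value 2 appears here

# With logical01 = TRUE, the type of `x` depends on which values happen to occur.
cls_a_true <- class(fread(text = part_a, logical01 = TRUE)$x)
cls_b_true <- class(fread(text = part_b, logical01 = TRUE)$x)

# With logical01 = FALSE, `x` is read as integer in both parts.
cls_a_false <- class(fread(text = part_a, logical01 = FALSE)$x)
cls_b_false <- class(fread(text = part_b, logical01 = FALSE)$x)

print(c(cls_a_true, cls_b_true, cls_a_false, cls_b_false))
```

Passing `logical01 = FALSE` explicitly, as suggested above, keeps the column type the same across parts regardless of what the package default is.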

MichaelChirico commented 9 months ago

Indeed the problem is more general than I thought. I'm reminded of how split() can drop factor levels and doing analysis on data like this can be a PITA in general.

jangorecki commented 9 months ago

Defaulting to TRUE makes the schema less predictable, so IMO a default of FALSE makes more sense.