Should logical01 be TRUE by default?

MichaelChirico commented 10 months ago

Follow-up to #5840.

When it was originally developed, the intention was to eventually turn on this argument by default.

The benefit is getting the storage type right by default. Columns with values 0/1/NA should be "logical" storage in R. R is also very good about converting back to numeric 0/1 when appropriate.

There are, however, a few downsides:

We can pretty easily get it wrong. In particular, the current logic just says basically "if all(is.na(x) | x %in% c(0, 1)) as.logical(x) else x i.e., any column with only values among 0/1/NA will be read as type logical by default. Where that rule comes across as funny to me is in cases where there is only one unique value (or perhaps 2, if some entries are missing). The vector [1 1 1 1 1 1 1 1 1] might be logical, but it could just be a constant numeric column, too. Same for a vector of zeroes.

That hints at a potential improvement to the logic here, namely if (all((is.na(x) | x %in% c(0, 1)) && any(x == 0) & any(x == 1)) as.logical(x) else x, i.e., any column with only values 0/1/NA that takes both 0 and 1 as values at least once will be read as type logical. E.g. we could call this behavior logical01 = "auto".

But that opens us up to a new problem -- reading directories where "auto" guesses differently for different files, and therefore the resulting tables can't be trivially rbind()'d. For example, consider one process doing lapply(split(x, by = col), fwrite, ...), i.e., sharding x by some column col. There is a column v in x that only takes values 0/1, but in some shards it's constant, e.g. in the first file all(v == 0), in the second file there's a mix, in the third file all(v == 1). Then logical01 = "auto" will read v in the first & third files as numeric but as logical in the second file. Supplying logical01 = TRUE/FALSE explicitly is a workaround for this case, but it seems unpleasant if this comes up with any frequency. Auto-promoting logical->(integer|double) in rbindlist() would also work, but should be considered more carefully within rbindlist() itself whether this behavior makes sense in general.

Back-compatibility issue. There are (as of this writing) 16 CRAN packages broken by changing logical01=TRUE by default, namely: NMdata, PeakSegDisk, Rdiagnosislist, antaresEditObject, antaresRead, bigsnpr, chicane, damr, googleCloudVisionR, growthcleanr, gtfsrouter, pepr, prioriactions, skynet, spatsoc, targets.

It could be that this is a simple issue in tests, e.g. as happened in #5843 where our own tests had examples of one-row inputs requiring a simple 1->TRUE change. One-row minimal examples for tests are not representative of end-user data, so the problem could be minimal. But it requires a bit more investigation to understand the severity of each individual breakage. On the flipside, this was always planned & stated in the NEWS for 6 years as the intention, and the initial phase would offer a simple global R option() to retain logical01=FALSE if desired, i.e., this is part of a normal deprecation cycle.

That's a dump of my thoughts on this issue for now. Opening this for discussion as we head into the next release.

markseeto commented 9 months ago

@MichaelChirico Thanks for opening this discussion.

From my point of view as a user, I'm not in favour of changing the default to logical01=TRUE.

I could have a count variable where values of 0 and 1 are common, and values greater than 1 are possible but relatively uncommon. My data might only have values 0 and 1, and I wouldn't want this changed to FALSE/TRUE. Similarly, I could have a categorical variable with 3 categories coded as 0/1/2, and I might be looking at interim data in which the value 2 hasn't appeared yet. In Michael's example with split files, some files might only contain 0/1 for a particular variable, while other files might contain 0/1/2 for that variable.

Even if I have a 0/1 variable that really is a logical variable, my preference would be to read it into R as 0/1 by default, rather than having it changed to FALSE/TRUE. If I want it to be FALSE/TRUE, I'll change it. My preference is to read data into R with minimal changes. But it's understandable that not everyone shares that preference.

Sometimes I have to send a modified data set back to whoever sent the raw data to me, and if they sent a variable as 0/1 then I would want to send it back as 0/1. If the default is logical01=TRUE, then if I'm not careful, I might not even realise that a FALSE/TRUE variable in my R data was actually 0/1 in the raw data.

However, if the default was changed to logical01=TRUE, I don't think it would bother me because I could just specify logical01=FALSE.

MichaelChirico commented 9 months ago

Indeed the problem is more general than I thought. I'm reminded of how split() can drop factor levels and doing analysis on data like this can be a PITA in general.

jangorecki commented 9 months ago

Default to TRUE means less predictability about schema, so IMO default FALSE make more sense.

Rdatatable / data.table

Should logical01 be TRUE by default? #5856