KentonWhite / ProjectTemplate

A template utility for R projects that provides a skeletal project.
http://projecttemplate.net
GNU General Public License v3.0
622 stars 159 forks source link

Make tibble optional #271

Closed jeromyanglim closed 5 years ago

jeromyanglim commented 5 years ago

Report an Issue / Request a Feature

I'm submitting a (Check one with "x") :


Issue Severity Classification -

(Check one with "x") :

Expected Behavior

In version 0.8.2, data.frames are automatically converted to tibbles. Tibbles are not always compatible with functions that expect data.frames. While some people fully embrace the tidyverse, many others do not. It seems inappropriate to impose tibbles on users, when they break existing projects. A really important feature of ProjectTemplate is that it should be stable because it a foundational component of data analysis projects.

For example, if you have code that requires row names, then tibbles will break the code. So I think it's really important to give the user the choice of adopting tibbles.

Possible Solution

Make as_tibble an option in global.dcf. as_tibble: TRUE could be the default.

Presumably, then you'd just have to update the code:

 if (config$data_tables) {
    .convert.to.data.table(vars.new)
  } else {
    .convert.to.tibble(vars.new)
  }

to something like:

  if (config$data_tables) {
    .convert.to.data.table(vars.new)
  } else if (config$as_tibble) {
    .convert.to.tibble(vars.new)
  }
jeromyanglim commented 5 years ago

In the interim or if this suggestion is not implemented, I found putting the following code in an rscript in the lib directory disables the tibble conversion.

.convert.to.tibble <- function (data.sets)  NULL
assignInNamespace(".convert.to.tibble", .convert.to.tibble, 
                  "ProjectTemplate")

Obviously, it's a bit of a hack, but it seems to work. So for people that are currently experiencing issues with tibbles, this code might help.

Hugovdberg commented 5 years ago

Could you please provide some specific test cases that break when using a tibble instead of a data.frame? Currently none of our tests fail, which lead us to believe no problems were introduced by the change.

jeromyanglim commented 5 years ago

Tibbles do not support row names, or at least row names are deprecated. So any code that uses row names is going to break, if not now, then very soon. Tibbles are not designed to behave identically to data.frames. That's why they exist. Tidyverse people believe they provide a cleaner interface to data. But they are not indented to be backwards compatible with data.frames.

Tibbles wont support row.names

I'm not quite sure how the test system works, but basically assigning row names would be a test that will fail soon.

E.g.,

xdf <- data.frame(a = 1:10, b = 1:10)
xtibble <- as_tibble(xdf)
row.names(xtibble) <- letters[1:10]

This gives you the following:

Warning message:
Setting row names on a tibble is deprecated. 

Tibbles have different behaviour when selecting a single column

Specifically, tibbles return a tibble/data.frame with one column, whereas data.frames return a vector.

length(xdf[, "b"]) == 10
length(xtibble[, "b"]) == 10

length(xdf[, "b"]) == 10 [1] TRUE length(xtibble[, "b"]) == 10 [1] FALSE

In summary

Tibbles are not backwards compatible with data.frames. There are reasons for these incompatibilities.

That said, people build projects with ProjectTemplate over the years. You do a data analysis project say for a journal article or whatever. You archive it. You'd like for people to be able to take that code years later and be able to run the code and for it not to break. I imagine tibbles will break most of my existing projects.

More generally, some people just prefer straight data.frames. I.e., the way they print, the robustness and long term stability and backwards compatibility of base R.

There is an extensive thread here about issues that arise when a function expects a data.frame but gets a tibble:

https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896.html

KentonWhite commented 5 years ago

Easiest way forward could be a flag set in global.dcf for tibbles. Default can be tibbles. Would this help your case @jeromyanglim? Thoughts from others?

Some background — ProjectTemplate is meant to be somewhat opinionated. Recently we've started moving towards the tidyverse as an opinionated way of doing things.

But.... Opinionated doesn't mean "on rails". Our opinion is meant to guide the novice in making choices, not handcuffing the seasoned practitioner. Allowing others to override our choices is part of being opinionated!

jeromyanglim commented 5 years ago

A tibble option in global.dcf sounds like a great resolution to me.

A few other more minor considerations to think about:

Backwards compatability

I also think backwards compatibility is a consideration. So for example, my suggestion would be, assuming global.dcf gets something like as_tibble

The basic assumption is that when you introduce a default option in a later release of ProjectTemplate that is likely to break old projects, then make the migrated or implied option that which would be most consistent with behaviour of old projects.

Hugovdberg commented 5 years ago

It is indeed a lengthy discussion, but apart from dropping dimensions, but I haven't seen a concrete example of a function that breaks when passed a tibble instead of a data.frame, besides the (in my opinion mistaken) dropping of dimensions. I'm still not convinced that tibble shouldn't be the default, even for existing projects. But a separate option to disable tibble is probably a good way to keep some backward compatibility without adding too much complexity.

KentonWhite commented 5 years ago

What looks to be a better solution is replace the config variable data_tables with a tables_type variable that can take the arguments data_frame, data_table, or tibble. Default would be tibble. Not only is this cleaner, will allow a way to support backwards compatibility when the new hotness is introduced.

Thoughts?

jeromyanglim commented 5 years ago

Providing three options to such an argument makes a lot of sense to me. I presume, datasets can't be data tables and tibbles at the same time.

It seems a bit like stringsAsFactors. It's something that you want the config file to make consistent and remove ambiguity about.

My vote would be that when migrating old config files or running old config files, it should try to stay consistent with previous config: e.g., if config$data_table is true then data_table, if false, then data_frame. That way it will be least likely to break old code.

I think the idea of making tibbles default for new projects makes sense. I personally find tibbles irritating because of the disruption and backwards compatibility issues they are causing. But I agree with the point that ProjectTemplate is a tool that provides hints and good workflow, and if the majority of people here are embracing the tidyverse dialect of R, then making tibbles default makes sense.

KentonWhite commented 5 years ago

Doh! I made tibbels default when migrating old projects. Yes it makes better sense to make data frames default when migrating old projects. Will update the PR. Thanks for the catch!