STAT545-UBC / Discussion

tibble vs dataframe? #350

Open RosieRedfield opened 7 years ago

RosieRedfield commented 7 years ago

I'm still struggling with very fundamental concepts:

Is it right to say that all tibbles are dataframes, but not all dataframes are tibbles?

Is tibble-ness just a special way (the tidyverse way, within R) of viewing and interacting with data that's in a dataframe?

Or is tibble-ness a property of the data itself? Do some sets of data have properties that prevent them from being treated as tibbles? If so, what are these properties?

(I've looked through R for Data Science, and searched Google, but nothing I've found explains this.)

dtavern commented 7 years ago

Hey @RosieRedfield, tibbles are a special type of dataframe. If you look at their attributes, their class includes data.frame. A tibble has all the properties of a dataframe plus some extra features, such as printing only the first 10 rows by default (among others). For more clarification, check out R for Data Science section 20.7 (specifically 20.7.3) as well as this RStudio blog post. If you're still confused, I imagine the TAs/Jenny can explain further.

Edit: I recommend checking out R for Data Science section 20.6 on attributes, as it might help you understand what's going on behind the scenes in R when dealing with different data types.
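The point about class and attributes can be checked directly. This is a minimal sketch, assuming the tibble package is installed; the object name `tb` and its columns are made up for illustration.

```r
library(tibble)

# A small tibble built from two vectors (illustrative data only)
tb <- tibble(x = 1:3, y = c("a", "b", "c"))

class(tb)                    # the class vector; "data.frame" is one of its entries
inherits(tb, "data.frame")   # TRUE: a tibble is a data frame
is.data.frame(tb)            # TRUE
names(attributes(tb))        # the attributes R stores on the object, including "class"
```

Anything written to accept a data frame checks for the `data.frame` class, which is why a tibble is accepted too.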

RosieRedfield commented 7 years ago

@dtavern: But surely the ability to print the first ten rows isn't an attribute of the data. I can print the first n rows of any dataframe using head, if I specify n after the filename, right?

My Google search did find the RStudio blog post you link to, but it didn't seem to answer my question. This 'vignette' is a bit better. It describes things that don't happen when a tibble is created from one or more 'vectors', so I infer that these things can happen when an ordinary dataframe is created. But even it is clearly written for semi-experts, not beginners.

Sections 20.6 and 20.7 of R for Data Science are a bit over my head. Sec. 20.7.3 tells me a difference between a tibble and a 'list', but that's not what I'm asking.

My fundamental problem is that all the information I can find explains tibbles in terms whose meanings I don't know (e.g. 'vector', 'list'). So I have to make guesses... For example, I think a 'vector' is an ordered list of items that are all the same type (character strings or integers or...). But this explanation shouldn't use the word 'list' because apparently a 'list' is something else.
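For readers following along, the guess about vectors and lists can be tested in base R. This is only a sketch with made-up objects (`v`, `l`, `df`); note that R also treats lists as a kind of vector, so "vector" here means an atomic vector.

```r
v <- c(1.5, 2, 3)            # atomic vector: every element has the same type
l <- list(1.5, "a", TRUE)    # list: elements may have different types

is.atomic(v)                 # TRUE
is.list(l)                   # TRUE
c(1, "a")                    # mixing types in c() coerces everything to character

df <- data.frame(n = 1:2, s = c("a", "b"))
is.list(df)                  # TRUE: a data frame is a list of equal-length columns
is.atomic(df$n)              # TRUE: this column is an atomic vector
```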

jennybc commented 7 years ago

> Is it right to say that all tibbles are dataframes, but not all dataframes are tibbles?

Yes.

> Is tibble-ness just a special way (the tidyverse way, within R) of viewing and interacting with data that's in a data frame?

No. Or rather, I would state it differently. Tibble-ness allows the tidyverse to override certain default behaviours for data frames with behaviour that is nicer and/or safer. It is not about what's possible but about what happens by default.
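Two commonly cited examples of such defaults are single-column subsetting and partial name matching. The sketch below assumes the tibble package is installed and uses made-up data; it illustrates the idea and is not a complete list of differences.

```r
library(tibble)

df <- data.frame(value = 1:3, label = c("a", "b", "c"))
tb <- as_tibble(df)

# Selecting one column with [ : a data frame drops to a bare vector by default,
# while a tibble stays a tibble.
class(df[, "value"])                 # "integer"
class(tb[, "value"])                 # includes "tbl_df"

# Partial name matching with $ : a data frame matches "val" to "value",
# while a tibble returns NULL and issues a warning.
df$val
suppressWarnings(tb$val)

# The other behaviour is still available for a data frame on request.
class(df[, "value", drop = FALSE])   # "data.frame"
```

Nothing here is impossible with a plain data frame; the tibble class only changes what happens when you don't ask for anything special.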

> Or is tibble-ness a property of the data itself?

No, any data frame can be made into a tibble. Literally, a class is being added to the object, so that instead of just being of class data.frame, it is also of class tbl_df.

> Do some sets of data have properties that prevent them from being treated as tibbles? If so, what are these properties?

No. If it can be made into a data frame, it can be made into a tibble.
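A round trip makes the "class is being added" point concrete. This is a minimal sketch, assuming the tibble package is installed; `df`, `tb`, and `back` are illustrative names.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

tb   <- as_tibble(df)        # data frame -> tibble
back <- as.data.frame(tb)    # tibble -> plain data frame

class(df)                    # "data.frame"
class(tb)                    # "tbl_df" "tbl" "data.frame"
class(back)                  # "data.frame"

identical(back$x, df$x)      # TRUE: the column values are untouched
```

One caveat worth knowing: `as_tibble()` drops row names by default, so a data frame with meaningful row names needs them moved into a column first.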

I am trying to think of a good analogy. Here's a so-so one. Making a data frame into a tibble is like me getting a Nexus card. It doesn't change my citizenship or my legal rights in Canada or the US. I can still prove my citizenship or cross the border with my passport if I like. But having the Nexus card gives me a smoother experience at high-traffic border crossings or airports. Sometimes there is no Nexus line, in which case I just show my passport and wait in line like everyone else. My status as a Nexus card holder is pure convenience, but since I cross to/from the US a lot, it is a very important convenience.

Tibbles are valid data frames, and any data frame can be made into a tibble if someone cares enough to do it. Certain actions, like printing, happen in a nicer way for tibbles than for data frames. Other actions, like summary() or nrow(), happen exactly the same way for both tibbles and data frames.
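Here is a small sketch of "same for some actions, nicer for others", assuming the tibble package is installed. The 50-row data set is invented so that the printing difference is visible.

```r
library(tibble)

df <- data.frame(id = 1:50, sq = (1:50)^2)
tb <- as_tibble(df)

# Same answers for both objects
identical(nrow(df), nrow(tb))   # TRUE
identical(dim(df), dim(tb))     # TRUE
summary(df)
summary(tb)

# Different default printing: the data frame prints every row,
# the tibble prints a header plus only the first rows.
df_lines <- capture.output(print(df))
tb_lines <- capture.output(print(tb))
length(df_lines)
length(tb_lines)
tb_lines[1]                     # the "# A tibble: ..." header line
```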

Does that help at all?

samhinshaw commented 7 years ago

To expand on Jenny's point here:

> No, any data frame can be made into a tibble. Literally, a class is being added to the object, so that instead of just being of class data.frame, it is also of class tbl_df.

The class tbl_df can be applied to a data.frame with the dplyr function tbl_df(). (Later dplyr releases deprecate tbl_df() in favour of as_tibble().)

This may not help answer the more abstract questions, but if you wanted to experiment with the behavior of two otherwise identical data.frames, I would recommend downloading the gapminder TSV file from Jenny's gapminder package.

Then read it into your R environment as such:

library(dplyr)  # provides tbl_df(); later dplyr releases deprecate it in favour of as_tibble()
gapminderDF <- read.delim("gapminder.tsv")
gapminderTibble <- tbl_df(gapminderDF)

and play around with the differences between gapminderDF and gapminderTibble
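If you'd rather not download anything, the same experiment can be sketched with a built-in data set written to a temporary TSV file. This assumes the tibble package is installed and uses `as_tibble()`; `tsv_path`, `exampleDF`, and `exampleTibble` are stand-in names, not part of the gapminder package.

```r
library(tibble)

# Stand-in for the downloaded file: write a built-in data set to a temporary TSV
tsv_path <- tempfile(fileext = ".tsv")
write.table(iris, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

exampleDF     <- read.delim(tsv_path)    # plain data frame
exampleTibble <- as_tibble(exampleDF)    # same data, tibble classes added

# head() with n works on both, as it does for any data frame
head(exampleDF, n = 3)
head(exampleTibble, n = 3)

# str() shows the same columns underneath
str(exampleDF)
str(exampleTibble)
```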