Tidying data: Comparing gather to pivot_longer and spread to pivot_wider

Who is the audience?

@vintented and @Clare2D proposed this topic, which should be useful to anyone who uses or is interested in using tidy data for their analyses. This includes not only folks at 2DII but also beyond.

Why is this important?

Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools are needed to deal with a wide range of un-tidy datasets. This structure also makes it easier to develop tidy tools for data analysis, tools that both input and output tidy datasets. The advantages of a consistent data structure and matching tools are demonstrated with a case study free from mundane data manipulation chores. -- https://vita.had.co.nz/papers/tidy-data.pdf

What should be covered? Suggested speakers or contributors

The speaker will be @cjyetman, who will run a live example and discuss it. Many folks at 2DII have experience reshaping data (e.g. @vintented, @Clare2D, and @jdhoffa) and may be able to contribute with questions, comments, and recomendations.

Some example questions to inspire the discussion:

What is tidy data?
How tidy data relates to tidyverse tools.
Why gather() is superseded by pivot_longer() and spread() by pivot_wider()?

Resources

GitHub doesn't support attaching a R file (?!), so here it is in formatted code...

# original Tidy Data treatise: https://vita.had.co.nz/papers/tidy-data.pdf
# current Tidy Data discussion: https://r4ds.had.co.nz/tidy-data.html

library(dplyr)
library(tidyr)

billboard

# gather(data, key = "key", value = "value", ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)

billboard %>% gather(key, value, wk1, wk2, wk3)

billboard %>% gather('key', 'value', wk1, wk2, wk3)

billboard %>% gather('key', 'value', wk1, wk2, wk3) %>% select(artist, track, date.entered, key, value)

billboard %>% gather('key', 'value', 4:ncol(billboard))

billboard %>% gather('key', 'value', -artist, -track, -date.entered)

billboard %>% gather('key', 'value', -(1:3))

billboard %>% gather('key', 'value', wk1:wk76)

billboard %>% gather('key', 'value', starts_with('wk'))

billboard %>% gather('key', 'value', matches('wk[0-9]*$'))

billboard %>% gather('week', 'rank', matches('wk[0-9]*$'))

billboard %>% gather(key = 'week', value = 'rank', matches('wk[0-9]*$'))

# pivot_longer(
#   data,
#   cols,
#   names_to = "name",
#   names_prefix = NULL,
#   names_sep = NULL,
#   names_pattern = NULL,
#   names_ptypes = list(),
#   names_repair = "check_unique",
#   values_to = "value",
#   values_drop_na = FALSE,
#   values_ptypes = list()
# )

billboard %>% pivot_longer(wk1, wk2, wk3)

billboard %>% pivot_longer(wk1:wk76)

billboard %>% pivot_longer(4:ncol(billboard))

billboard %>% pivot_longer(-(1:3))

billboard %>% pivot_longer(matches('wk[0-9]*$'))

billboard %>% pivot_longer(wk1:wk76, names_to = 'week', values_to = 'rank')

billboard %>% pivot_longer(wk1:wk76,
                           names_to = 'week', values_to = 'rank',
                           names_prefix = 'wk')

billboard %>% pivot_longer(wk1:wk76,
                           names_to = 'week', values_to = 'rank',
                           names_prefix = 'wk',
                           names_ptypes = list(week = integer()))

billboard_long <-
  billboard %>%
  pivot_longer(wk1:wk76, names_to = 'week', values_to = 'rank',
               names_prefix = 'wk', names_ptypes = list(week = integer()))

# spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

billboard_long

billboard_long %>% spread(week, rank)

billboard_long %>% spread(week, rank, sep = '_')

billboard_long %>% spread(week, rank, sep = '')

billboard_long %>% rename(wk = week) %>% spread(wk, rank, sep = '')

# pivot_wider(
#   data,
#   id_cols = NULL,
#   names_from = name,
#   names_prefix = "",
#   names_sep = "_",
#   names_repair = "check_unique",
#   values_from = value,
#   values_fill = NULL,
#   values_fn = NULL
# )

billboard_long %>% pivot_wider(names_from = week, values_from = rank)

billboard_long %>% pivot_wider(names_from = week, values_from = rank, names_prefix = 'wk')

2DegreesInvesting / ds-incubator