EvaMaeRey / mytidytuesday

little demos with tidytuesday data
https://evamaerey.github.io/mytidytuesday/
5 stars 3 forks source link

tidypivot feedback #3

Open brshallo opened 2 years ago

brshallo commented 2 years ago

Per Twitter DM conversation, here are my initial notes on your investigation into a pivot table like interface (tables)

On the whole, sounds like a really good idea -- am a bit surprised someone hasn't done a tidyverse <--> pivot-table like interface (which seems could be useful for analysts coming over to R from excel).

Notes through step 0

Did not know about uncount() , thanks!

base R's table() is a bit more pivot table like than the tidyr::pivot_* functions, but it is limited to just the 2-d case.

with(tidy_titanic, table(Sex, Survived))

Notes step 1

I see how you have y and y0, I'm wondering if the equivalent for x's may be limited by tibble/data.frame structure. It may work building this on top of data.frames but also makes me wonder if a new class would be better that allowed for multiple column headers and multiple row labels... or something... (but trying to do this would obviously add a lot of work... I do know tidyverse has documentation on building new classes on top of dataframes but haven't ever looked at.)

Notes on question at end

  1. I don't know if it does or doesn't. I feel like there is a lot of #rstats stuff out there that attempts to replicate bits of excel functionality so wouldn't be surprised if something like it did, but I don't know of it. You might ping David Gohel and see if he's aware of anything.

  2. I like the API you describe here better (allowing each to take multiple arguments), I also think I like arguments of rows and cols better.

I feel like rather than having a y0 argument, this should be handled by multiple variables passed into a vector or tidyselection specification.

For example rather than:

tidy_titanic %>% 
  pivot_count(y = Survived, x = Sex, y0 = Class)

Would have:

# function doesn't really work outside of specific examples below
pivot_table <- function(data, rows, cols){

  tidy <- data %>% 
    group_by(across(c({{rows}}, {{cols}}))) %>% 
    summarize(value = n()) %>%
    ungroup()

  tidy %>% 
    pivot_wider(names_from = {{cols}})
}

tidy_titanic %>% 
  pivot_table(cols = Sex, rows = c(Class, Survived))

(Which has the same output as your example.)

This works the other way too but has the issue mentioned above of only one column header (so levels just get concatenated).

tidy_titanic %>% 
  pivot_table(cols = c(Class, Survived), rows = Sex)

Like idea of cols and rows both be able to take either a vector (as shwon above) or any tidyselection (e.g. all_of("Class", "Survived")).

Also think would be nice to have a .fns argument that could take in things like list(n = n, sum = sum), etc. (similar to how dplyr::across() takes in functions). If you do this, could have a more generic name like pivot_table() and don't need to have separate functions for each aggregation type (and gives more flexibility to user-- there's a lot could do here in terms of interface, e.g. if wanted to make it possible for aggregations to specify variables inside of them...).

Similarly, I'm wondering if there's some way you wouldn't have to make a separate function like two_way_prop()and what kind of API make sense for getting across the "Show Values As" options in excel. Maybe a within-like argument, where this is just a grouping specification, or maybe there are different functions that get across the logic for the "Show Values As" logic...

Alternative approach

One thing about the approach described above is that while it supports tidyselection it's not a traditional lego-block/pipe-based approach that is characteristic of tidyverse style syntax -- an alternative more pipe-based API would be something where you build up a pivot table piece by piece (more like a ggplot2 or tidymodels workflow object), e.g.

tidy_titanic %>%
  pivot_table() %>%
  add_rows(c(Class, Survived)) %>%
  add_cols(Sex)
#  add_funs(n)

This mirrors the way pivot tables are created step by step (though am not sure if this is actually necessary).

One potential advantage with this approach is maybe you could have each addition have a within argument that was set to either TRUE or FALSE and would specify whether the specification was used to set the grouping level for how the values are displayed so as to cover the "Show Values As" functionality. May want to use a different argument than within ... maybe not though, again though handling proportions and an interface for the options in "Show values as" in excel may take more thought.

Other questions

Regarding 3, would probably take a little more thought to get really down, two good resources are rlang Tidy evaluation tab (which was updated late last year): https://rlang.r-lib.org/index.html along with "programming with dplyr" vignette in dplyr. You also might take a look at {dplyover} (though I know author plans to refactor code, so don't know if internals are that great, but I think the interface for variable selection is good -- is pretty much the same as how is done in my package {pwiser}, of which I actually didn't write the internals).

For 4. I'm not sure on best approach there -- was a discussion on dplyr about such a feature and why they don't include it in dplyr: https://github.com/tidyverse/dplyr/issues/3463 , the {janitor} package has janitor::adorn_totals() that maybe would be useful. Some of these parts are tricky because pivot tables are kind of 1/2 analysis 1/2 reporting/aesthetics (which you kind of suggest with your use of {gt}).

Hope notes are helpful!

brshallo commented 2 years ago

P.s. counter to how I wrote my example, would probably have rows before cols in argument order (given this is typical order in matrix notation).

brshallo commented 2 years ago

Here's an alternative version of pivot_table() that has the advantage that it won't break if either rows or cols aren't specified:

# Inspiration for using `rlang::quo_is_null()` came from here:
# https://stackoverflow.com/a/66580483/9059865

pivot_table <- function(data, rows = NULL, cols = NULL){

  cols_quo <- enquo(cols)

  tidy <- data %>% 
    group_by(across(c({{rows}}, {{cols}}))) %>% 
    summarize(value = n()) %>%
    ungroup()

  if(rlang::quo_is_null(cols_quo)) return(tidy)

  tidy %>% 
    pivot_wider(names_from = {{cols}})
}
shannonpileggi commented 2 years ago

pivot_table <- function(data, rows, cols)

where you can pass multiple vars to rows, cols - I really like this approach!

more pipe-based API

yes! where possible, i lean towards buildable code w/ pipes over functions w/ many arguments

brshallo commented 2 years ago

Writing pipeable packages...?

@shannonpileggi do you know any good example packages or tutorials on building pipeable packages? I feel like the tidymodels packages may be a place to look (e.g. {infer}, {recipes} etc) but the thought of going through the source code there is a bit intimidating (even if maybe less daunting than going through dplyr's). I imagine those packages are doing it the "right" way though with setting-up specifications, delaying evaluation etc. and I feel like that may be important if taking a more pipe-based API approach to tidypivot.

The couple toy pipe-based packages I've written are pretty rough and I just actively do the transformations in each step... though don't actually think this is the "right" way to go about it.

e.g. animatrixr:

image

But pretty sure isn't "right" way.


More notes on interface

Could also have it be optional, i.e. a user could either do

pivot_table(df, rows = c(var1, var2), cols = var3)

OR

df %>%
  pivot_table() %>%
  add_rows(var1, var2) %>%
  add_cols(var3)

(Similar to how tidymodels let's you do things in a number of different ways.)

Aggregation "values" may be better handled by pipe-based approach too, you might also want to have special functions to handle "show-values as" e.g.

mpg %>%
  pivot_table(rows = c(manufacturer, trans), 
                     cols = cyl)) %>%
  add_values(count = n(),
                     avg_hwy = mean(hwy)) %>% 
  add_values(prop = n(), show_as = prop_grand_total())

For something like...

image

image

But don't know...

Also don't know how to handle all the options for totals/subtotal that are commonly used in pivot tables (e.g. pivot_table() %>% add_total() ?) -- but again may really need to look outside of tibble / data.frame class. Your DM showing Shannon's comment about tabyl class in {janitor} looks like a good place to start, never used them before though (other than being familiar with janitor::adorn_totals())

shannonpileggi commented 2 years ago

I don't know any tutorials, unfortunately. One package that I think models this functionality well is gtsummary, and now that I think about it, I wonder if there is some overlap here.

EvaMaeRey commented 2 years ago

Adding a couple of links to related work: https://cran.r-project.org/web/packages/janitor/vignettes/tabyls.html
https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

EvaMaeRey commented 2 years ago

And package here. https://github.com/EvaMaeRey/tidypivot

brshallo commented 2 years ago

@EvaMaeRey on an R4DS community thread someone posted asking for how to do pivot tables in R. Listing links below that were shared there (note R4DS community threads are deleted every few months)