apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[R] Support mutate/summarize with implicit join #29537

Closed asfimport closed 6 months ago

asfimport commented 3 years ago

mtcars %>%
  group_by(cyl) %>%
  mutate(x = hp - mean(hp))

essentially means something like


mtcars %>%
  left_join(
    mtcars %>%
      group_by(cyl) %>%
      summarize(tmp = mean(hp)),
    by = "cyl"
  ) %>%
  mutate(x = hp - tmp) %>%
  select(-tmp)

Apparently you can do the same inside summarize() too (though IDK if that's behavior we want to encourage). Once we can do joins, we can support these queries.
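
As a quick sanity check on that equivalence, here is a minimal sketch in plain dplyr on an in-memory data frame (not Arrow). It assumes dplyr is installed; the names `direct` and `via_join` are only illustrative and do not come from the arrow package.

```r
library(dplyr)

# Grouped mutate: mean(hp) is evaluated separately within each cyl group.
direct <- mtcars %>%
  group_by(cyl) %>%
  mutate(x = hp - mean(hp)) %>%
  ungroup()

# Explicit rewrite: aggregate per group, join the result back, then subtract.
via_join <- mtcars %>%
  left_join(
    mtcars %>%
      group_by(cyl) %>%
      summarize(tmp = mean(hp)),
    by = "cyl"
  ) %>%
  mutate(x = hp - tmp) %>%
  select(-tmp)

# Both forms keep the original row order, so the columns can be compared directly.
same_result <- isTRUE(all.equal(direct$x, via_join$x))
print(same_result)
```

If the two forms agree, `same_result` is `TRUE`, which is the property an Arrow implementation based on a join would need to preserve.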

Reporter: Neal Richardson / @nealrichardson


Note: This issue was originally created as ARROW-13926. Please see the migration documentation for further details.

asfimport commented 3 years ago

Ian Cook / @ianmcook: FWIW, this is perhaps better construed as an implicit window function (i.e. an OVER expression in SQL). When you do this type of operation with dbplyr, the SQL it generates uses an OVER expression:


library(dplyr)

mtcars_db <- dbplyr::memdb_frame(mtcars)
mtcars_db %>%
  group_by(cyl) %>%
  transmute(x = hp - mean(hp)) %>%
  show_query()

#> <SQL>
#> SELECT `hp` - AVG(`hp`) OVER (PARTITION BY `cyl`) AS `x`
#> FROM `dbplyr_002`