gdemin / maditr

Fast Data Aggregation, Modification, and Filtering

Group computation in maditr? #7

Closed hope-data-science closed 4 years ago

hope-data-science commented 5 years ago

I am a loyal dplyr user, but I turn to data.table when efficiency matters. maditr offers a very good approach for users like me, but grouping and nesting are very important operations too. Are there any plans to add these to maditr? For reference: grouping: https://cran.r-project.org/web/packages/rqdatatable/vignettes/GroupedSampling.html, nesting: https://tysonbarrett.com/tidyfast/reference/dt_unnest.html

gdemin commented 5 years ago

By 'grouping', do you mean specifically sampling from each group?

I will add these functions in the next release.

hope-data-science commented 5 years ago

dplyr has a separate group_by function, while data.table always combines group_by and summarise in a single expression. dtplyr bridges the two with lazy evaluation, translating the whole pipeline to data.table code at the end. I wonder if there is a way to get a dt_group_by in maditr that only records the grouping and lets the user decide what to do next.
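To make the contrast concrete, here is a minimal sketch of how data.table fuses grouping and summarising into one call through `by`. It uses the built-in mtcars data; the dplyr equivalent is shown only as a comment so the block needs nothing but data.table.

```r
library(data.table)

dt = as.data.table(mtcars)

# dplyr would express this in two separate steps:
#   mtcars %>% group_by(am, vs) %>% summarise(mpg = mean(mpg))

# data.table: the grouping (`by`) and the summary (`j`) live in the same call
res = dt[, .(mpg = mean(mpg)), by = .(am, vs)]
print(res)
```

There is no intermediate "grouped" object here, which is why a standalone dt_group_by needs somewhere to store the grouping columns between steps.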

hope-data-science commented 4 years ago

I think I have nailed it down. If there are any mistakes, let me know. See https://hope-data-science.github.io/tidydt/reference/group_dt.html.

gdemin commented 4 years ago

If you really want a dplyr-style 'group_by', I think it is better to use data.table's built-in key functionality:

data(mtcars)
library(data.table)
library(magrittr)

dt_mt = as.data.table(mtcars)
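# record the grouping columns as the data.table key (the analogue of group_by)
group_dt = function(data, ...){
    if(!is.data.table(data)) data = as.data.table(data)
    # setkey() sorts by the given columns and returns the keyed table invisibly
    setkey(data, ..., verbose = FALSE)
}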

# summarise within the groups stored in the key (the analogue of summarise)
summarise_dt = function(data, ...){
    if(!is.data.table(data)) data = as.data.table(data)
    keys = key(data)
    # capture the summary expressions unevaluated, e.g. list(mpg = mean(mpg))
    args = substitute(list(...))
    res = data[, eval(args), by = keys]
    # in dplyr, summarising drops the last grouping variable
    new_keys = keys[-length(keys)]
    if(length(new_keys)==0) new_keys = NULL
    setkeyv(res, cols = new_keys, verbose = FALSE)
    res
}

mtcars %>% 
    group_dt(am, vs) %>%
    summarise_dt(mpg = mean(mpg)) %>% 
    print() %>% 
    summarise_dt(mpg = mean(mpg))

But, of course, it is up to you:)
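The idea the key-based approach relies on can be seen in isolation: `setkey()` stores the grouping columns on the table itself, and `key()` reads them back in a later, separate step. The following is only a small sketch of that mechanism on mtcars, not a full implementation.

```r
library(data.table)

dt = as.data.table(mtcars)  # as.data.table() makes a copy, so mtcars is untouched
setkey(dt, am, vs)          # sorts dt by am, vs and records them as its key

keys = key(dt)              # character vector of the key columns
print(keys)

# the stored key can be handed to `by` later, decoupled from the setkey() call
res = dt[, .(mpg = mean(mpg)), by = keys]
print(res)
```

One thing to keep in mind is that `setkey()` modifies a data.table by reference and reorders its rows, so a keyed "group_by" has a visible side effect that dplyr's group_by does not.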

hope-data-science commented 4 years ago

I think you have handled this in the most correct way. All tidydt does is "translate", whereas your code is the closest analogue to group_by. It should be implemented in maditr, though it may take more work to update every function so that it picks up the key.

Forgive me, my skills just do not allow me to write and maintain such advanced code. Please do add it to maditr if you have time, I love it!
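As a rough illustration of what "updating a function to pick up the key" could look like, here is a sketch of one more key-aware verb. The name `head_dt` and its `n` argument are hypothetical and are not part of maditr or tidydt; the sketch also assumes the data has no column called `n` or `keys`.

```r
library(data.table)

# head_dt is a hypothetical helper for illustration only
head_dt = function(data, n = 1){
    if(!is.data.table(data)) data = as.data.table(data)
    keys = key(data)                      # grouping recorded earlier by setkey()
    res = data[, head(.SD, n), by = keys] # first n rows within each key group
    setkeyv(res, keys)                    # keep the grouping for the next step
    res
}

dt = as.data.table(mtcars)
setkey(dt, am)
first_two = head_dt(dt, n = 2)
print(first_two)
```

Each verb would need the same two additions: read `key(data)` on the way in and decide which key to set on the way out.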