TidierOrg / Tidier.jl

Meta-package for data analysis in Julia, modeled after the R tidyverse.
MIT License
515 stars 14 forks source link

cumsum function in mutate does not return values as expected on grouped dataframe #105

Closed yahrMason closed 1 year ago

yahrMason commented 1 year ago

I am looking to implement a cumulative sum over a column, by group. I am using the following Tidier.jl code, but it does not return what I expect.

using DataFrames, Tidier

df = DataFrame(
    grp = ["one", "one", "one", "two", "two", "two"],
    t = [1,2,3,1,2,3],
    x = [4,5,6,8,1,2]
)

df2 = @chain df begin
    @group_by(grp)
    @mutate(x_cumsum = cumsum(x))
    @ungroup
end

Returns a dataframe that looks like this (apologies for horrid formatting)

Row │ grp t x x_cumsum │ String Int64 Int64 Array… ───┼──────────────────────────────── 1 │ one 1 4 fill(4) 2 │ one 2 5 fill(5) 3 │ one 3 6 fill(6) 4 │ two 1 8 fill(8) 5 │ two 2 1 fill(1) 6 │ two 3 2 fill(2)

kdpsingh commented 1 year ago

I think this is an auto-vectorization issue.

Can you try adding a tilde in front of cumsum() like this?

df2 = @chain df begin
    @group_by(grp)
    @mutate(x_cumsum = ~cumsum(x))
    @ungroup
end

If that works, I can update the code to remove the need for a tilde. Let me know. Thanks for reporting!

yahrMason commented 1 year ago

I can confirm that adding the tilde in front of the cumsum now works as intended. Thanks!

kdpsingh commented 1 year ago

Great, we'll get this fixed soon so that it will work both with and without the tilde.

What's happening is that cumsum() is being auto-vectorized to cumsum.() by Tidier. We have an array of exclusions that include which functions not to vectorize.

I'll add it to that list of exclusions, and in the longer term, the plan is to make that list modifiable by end-users so they can register other functions that they don't want to see auto-vectorized by Tidier in their workflow.

kdpsingh commented 1 year ago

This is now fixed on the GitHub version of Tidier. cumsum(), cumprod(), and accumulate() no longer require a tilde. Will get this update on the registry shortly.