TidierOrg / TidierData.jl

Tidier data transformations in Julia, modeled after the dplyr/tidyr R packages.
MIT License
86 stars 7 forks source link

Cannot @summarise a column of type DateTime #105

Closed metanoid closed 4 months ago

metanoid commented 5 months ago

I'm trying to get a count of the number of distinct values in a datetime column within a grouped dataframe. This works using the standard DataFrames.jl mini language, but not using @ summarise, and I'm at a loss to understand what I'm doing wrong.

using DataFrames
using Tidier
using Dates
using Chain

current_time =  Dates.DateTime(2024,05,26,07,32,01,01)
last_time =  Dates.DateTime(2024,05,27,07,32,01,01)
sequence = current_time:Hour(1):last_time
example = DataFrame(Group = repeat(1:5, outer = 5), Timing = sequence)
example

# outside of Tidier
test1 = @chain example begin
    groupby(:Group)
    combine(:Timing => (x -> length(unique(x))) => :num_periods)
end
# 5×2 DataFrame
#  Row │ Group  num_periods 
#      │ Int64  Int64       
# ─────┼────────────────────
#    1 │     1            5
#    2 │     2            5
#    3 │     3            5
#    4 │     4            5
#    5 │     5            5

# inside of Tidier
test2 = @chain example begin
    @group_by(Group)
    @summarise(num_periods = length(unique(Timing)))
end
# ERROR: MethodError: no method matching iterate(::DateTime)
drizk1 commented 5 months ago

So taking a quick look. I think this is a vectorization issue, similar to #103

In your first example, the function unique is not vectorized to timing column. If you were to write length(unique.(x)) you actually create the same error no method matching iterate(::DateTime) because this would vectorize unique.

From the referenced isssue "Since most functions and operators do require vectorization, TidierData defaults to vectorizing functions and operators unless it knows not to. The way it knows which ones not to vectorize is using a look-up table. This is called "auto-vectorization" and is part of the magic (for good and bad) of TidierData."

below I achieve the same results as you did for test1 with test2 by adding a ~ to prevent autovectorization.

test2 = @chain example begin
    @group_by(Group)
    @summarise(num_periods = length(~unique(Timing)))
end
5×2 DataFrame
 Row │ Group  num_periods 
     │ Int64  Int64       
─────┼────────────────────
   1 │     1            5
   2 │     2            5
   3 │     3            5
   4 │     4            5
   5 │     5            5

More details on this behavior and how to do this are located in the documentation page here: https://tidierorg.github.io/TidierData.jl/latest/examples/generated/UserGuide/autovec/

kdpsingh commented 5 months ago

Thanks @drizk1, and I'll plan to add unique() to the do-not-vectorize list to prevent the need for a tilde in the next update.

metanoid commented 5 months ago

User skill issue

kdpsingh commented 5 months ago

Let's leave this open until I fix how unique() is handled by default.

kdpsingh commented 4 months ago

This is fixed in #107