codedthinking / Kezdi.jl

An umbrella of Julia packages for data analysis, in loving memory of Gábor Kézdi

feat: speed up `egen` #83

Closed korenmiklos closed 2 days ago

korenmiklos commented 2 days ago

`@collapse` is fast. Collapsing 10 million rows into about 10,000 groups takes about 1.4 seconds, most of it compilation time:

julia> @time @collapse df mean_x = mean(x), by(y)
  1.352974 seconds (5.04 M allocations: 648.601 MiB, 8.88% gc time, 78.05% compilation time: 45% of which was recompilation)
10001×2 DataFrame
   Row │ y      mean_x
       │ Int64  Float64
───────┼─────────────────────
     1 │     0    500.0

By contrast, `@egen` with the same expression takes about 81 seconds and allocates roughly 840 GiB:

julia> @time @egen df mean_x = mean(x), by(y)
 80.653723 seconds (1.40 M allocations: 839.755 GiB, 12.42% gc time, 0.38% compilation time: 17% of which was recompilation)
10000000×3 DataFrame
      Row │ y      x         mean_x
          │ Int64  Int64     Float64?
──────────┼─────────────────────────────
        1 │     0         1  500.0
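
For reference, here is a minimal sketch of the operation `@egen` performs here (a per-group mean written back to every row), expressed directly with DataFrames.jl's `groupby` and `transform`. The issue does not show how `@egen` is implemented or how it was eventually sped up, so this is not Kezdi.jl's method. The toy data is hypothetical and is only meant to show the shape of a grouped computation that evaluates each group's statistic once.

```julia
using DataFrames, Statistics

# Hypothetical toy data; the 10M-row table from the issue is not reproduced here.
df = DataFrame(y = [0, 0, 1, 1, 1], x = [1, 3, 10, 20, 30])

# Compute the mean of x once per group of y, then broadcast it to that group's rows.
# `transform` on a GroupedDataFrame keeps the parent table's row count and row order.
result = transform(groupby(df, :y), :x => mean => :mean_x)
```

Timing this form against `@egen` on the same table would give a rough baseline for how much of the 81 seconds is overhead beyond the grouped computation itself.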