TidierOrg / TidierData.jl

Tidier data transformations in Julia, modeled after the dplyr/tidyr R packages.

speed comparison of dplyr and tidierdata #24

Open Zhaoju-Deng opened 1 year ago

Zhaoju-Deng commented 1 year ago

Hi Karandeep, it's nice to have the tidyverse in Julia! I tried TidierData to create two new columns on a ~1 GB dataset. I was expecting TidierData to be much faster than dplyr, but the results showed the opposite (0.9 s in dplyr vs. 5.5-8 s in TidierData). Would it be possible to fine-tune the speed so that TidierData is more efficient at manipulating large datasets? Personally, I think that matters a lot for data analysis.

kind regards, Zhaoju

kdpsingh commented 1 year ago

Benchmarking is a tricky issue in Julia because the first time you run code, it is compiled (leading to a compilation delay). The code usually runs much, much faster after it is compiled. This is true even if you change the underlying dataset or certain parameters, so it's not that Julia is cheating by having cached the answer; it is legitimately much faster on the second run.

This issue has been mitigated in Julia 1.9 by precompilation, which caches compiled code. DataFrames.jl (which TidierData.jl wraps) takes advantage of precompilation workflows, but TidierData.jl doesn't yet do any additional precompilation of its own (which we probably should).
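
For reference, Julia's @time reports what fraction of a run was spent compiling, which helps separate the one-time compile cost from steady-state speed. Here's a minimal sketch on synthetic data (the data frame and column names are made up for illustration):

using TidierData, DataFrames, Statistics

df = DataFrame(g = rand(1:100, 1_000_000), x = rand(1_000_000))

function grouped_means(d)
    @chain d begin
        @group_by(g)
        @mutate(mean_x = mean(x))
        @ungroup
    end
end

@time grouped_means(df)  # first call: the output includes "% compilation time"
@time grouped_means(df)  # second call: compiled code only, much faster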

I'll move this issue to the TidierData.jl repo, because I agree we should periodically run some basic benchmarks to understand how TidierData.jl stacks up against DataFrames.jl (to understand how much overhead is added) and against R tidyverse.

tl;dr: I agree speed is important. Many published benchmarks show DataFrames.jl to be faster than the R tidyverse, but on the first run, compilation can introduce a delay.

Two questions:

kdpsingh commented 1 year ago

From some initial testing, there is likely room for optimization within TidierData.jl. We will do some profiling on our end to understand the bottlenecks, as well as the relationship between data size and overhead (the overhead could scale with data size if the extra allocations in TidierData come from inadvertent copies of the data).
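
As a rough way to quantify that overhead, one could benchmark the same grouped transformation through TidierData.jl and directly through DataFrames.jl. A sketch using synthetic data (sizes and names are illustrative, not taken from this issue):

using TidierData, DataFrames, BenchmarkTools, Statistics

df = DataFrame(g = rand(1:1_000, 5_000_000), x = rand(5_000_000))

function via_tidier(d)
    @chain d begin
        @group_by(g)
        @mutate(mean_x = mean(x))
        @ungroup
    end
end

via_dataframes(d) = transform(groupby(d, :g), :x => mean => :mean_x)

@benchmark via_tidier($df)      # time and allocations including TidierData's wrapper work
@benchmark via_dataframes($df)  # baseline: the same operation in DataFrames.jl alone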

Zhaoju-Deng commented 1 year ago

Hi kdpsingh, I tested it on Julia 1.9.0 in VS Code on Windows 10 (Intel i7-9750H CPU, 64 GB RAM). I have already run TidierData multiple times; the initial run took ~9 seconds, and subsequent runs were in the range of 5-7 seconds. I really appreciate this package (and its companion packages), and I think it will make Julia more attractive to R users for data manipulation. Looking forward to the next version of this package!

kdpsingh commented 1 year ago

Thanks for sharing. Right now, this package does a lot of extra stuff on top of DataFrames.jl for the sake of user convenience, and I imagine some of that is responsible for the slowdown.

However, I do think some of it is fixable: we can avoid certain steps, which should speed things up.

So in summary, the package's main selling point at the moment is the consistent syntax. I'm hoping that in the near future the speed penalty won't be as large.

Zhaoju-Deng commented 1 year ago

Sounds great! While I can't contribute to the development of this package myself, I will test it again when the next release comes out.

kdpsingh commented 1 year ago

Ok, I did some initial exploration and think I know what's responsible for the slowdown. Some of the functions call an extra select() and/or transform(), and I believe that's the underlying cause of the extra allocations and slowness.
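
To illustrate why that matters, here is a rough sketch (synthetic data, made-up column names) of how an extra select() or transform() pass in DataFrames.jl translates into an extra copy of the data:

using DataFrames

df = DataFrame(a = rand(1_000_000), b = rand(1_000_000))

# single pass: one transform() call returns one new DataFrame
single_pass(d) = transform(d, :a => (x -> x .* 2) => :a2)

# extra pass: a trailing select() copies every column again
extra_pass(d) = select(transform(d, :a => (x -> x .* 2) => :a2), :)

@time single_pass(df)
@time extra_pass(df)   # roughly one extra copy's worth of time and allocations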

We will try the following things in future releases.

kdpsingh commented 1 year ago

@Zhaoju-Deng, thanks for bringing this up. This issue is mostly resolved in v0.10.0, which is on the registry now. I haven't yet added support for PrecompileTools (which will minimize differences between the first run of the code and subsequent runs), but otherwise you should see major speed-ups in the performance in v0.10.0.

I'll leave this issue open mostly as a placeholder so that we can return to it and add support for PrecompileTools.

Feel free to check it out and see if you notice any difference on your end.

Zhaoju-Deng commented 1 year ago

@kdpsingh I just upgraded Julia to v1.10-beta1 and tested it again; however, the first run (including compilation) increased to 13.9 seconds and subsequent runs took 6-7 seconds (see the attached screenshot). It is not a big issue for now, but I hope it can be resolved soon.

kdpsingh commented 1 year ago

Thanks for sharing the screenshot!

I'll try to recreate this on my end. If the dataset happens to be publicly available, please let me know -- otherwise I'll create some synthetic data with similar properties.

The precompilation issue will be fixed in a future update.

However, we shouldn't be several-fold slower than dplyr so let me look at this carefully.

kdpsingh commented 1 year ago

I think I know what is going on. Tidier.jl currently points to the old version of TidierData.jl, so you're not seeing the changes from the new version yet.

I bet if you go to the package REPL by pressing ] and type in st, it'll point to the older version of TidierData.

I'm fixing the Tidier dependencies right now.

For TidierData.jl: slow version = 0.9.2; fast version = 0.10.0.

Feel free to confirm.
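
For anyone following along, a quick way to confirm which versions are actually installed (illustrative; the same information is available via st at the pkg> prompt):

using Pkg

Pkg.status("Tidier")      # shows the installed Tidier.jl version
Pkg.status("TidierData")  # should report 0.10.0 or later for the fast release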

kdpsingh commented 1 year ago

A simple way to fix this is to remove Tidier.jl and to directly update TidierData.jl.

I just pushed the updated version of Tidier.jl to the Julia registry, so that should be fixed soon.

kdpsingh commented 1 year ago

The new version of Tidier.jl is now on the registry. If you update it using ] update Tidier, that should install the latest version of TidierData.jl, which should be much faster.

Zhaoju-Deng commented 1 year ago

Hi @kdpsingh, with Tidier v0.7.6 on Julia v1.10-beta1, the first run took ~8 seconds and subsequent runs were in the range of 4.8-5.1 seconds. With TidierData v0.10.0 directly, the first run took 4.63 seconds and subsequent runs took 2.6-2.9 seconds. That is a big improvement! But it is still slower than dplyr; I hope you can refine it to be much faster than dplyr!

kdpsingh commented 1 year ago

We'll keep working on it! Step 1 is for us to try to reproduce this result. I'm surprised it is slower than dplyr here but have some ideas.

kdpsingh commented 1 year ago

Note to self: My suspicion is that there is still some recompilation happening here because this code isn't wrapped in a function. Will test it out.

Zhaoju-Deng commented 1 year ago

Great, I am very interested to see its lightning-fast performance!

drizk1 commented 1 year ago

@Zhaoju-Deng Thanks for your efforts here!

I was wondering, to keep it consistent with the style of benchmarking I have been trying, would it be too much trouble for you to try the following:

using BenchmarkTools

function trial()
    @chain dt begin
        @group_by(tmvFrmId, tmvLifeNumber)
        @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
                mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
        @ungroup
    end
end

@benchmark trial()

This will keep the methods consistent with what I was trying.
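
A variant worth trying, per the BenchmarkTools documentation's advice on globals: pass the data frame in as an argument and interpolate it with $, so the measurement doesn't include access to the non-const global dt. A sketch, assuming the same dt and column names as above:

using TidierData, BenchmarkTools, Statistics

function trial(d)
    @chain d begin
        @group_by(tmvFrmId, tmvLifeNumber)
        @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
                mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
        @ungroup
    end
end

@benchmark trial($dt)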

I'll have a benchmark.jl in the git later this week.

Thank you!

kdpsingh commented 1 year ago

I'm fairly convinced that the @time macro being used in the global scope is leading to unreliable results in the timing here (see discussion for a different package at https://github.com/kdpsingh/TidyTable.jl/issues/3).

Wrapping it in a function as @drizk1 suggests is the easiest way to check this.

@drizk1, if we work on a benchmark.jl, we may want to show why folks may get different results when using @time in the global scope, and what the implications of this are (which I can help with).

I don't want to assume that this is the issue (until we check it), so we'll hold off on further optimization until after we generate a set of benchmarks and explain the implications of benchmarking within functions vs. the global scope.

drizk1 commented 1 year ago

I just ran a quick test with @time vs. @benchmark on the file I've been working with. @time took over twice as long as @benchmark, with 80% of the @time result being recompilation. Very curious to see what @Zhaoju-Deng might find.

Zhaoju-Deng commented 1 year ago

I just ran @benchmark, and the estimated time was indeed only about half of the time reported by @time. However, the @benchmark figure was not the total time for the code; the actual running time was much longer than @benchmark's estimate. I am not familiar with how @benchmark and @time compute their timings, but to my mind, the @time result is closer to the "actual" running time of the code.

kdpsingh commented 1 year ago

@Zhaoju-Deng, thanks for doing that.

The short answer is this. If you write code like this...

@chain dt begin
    @group_by(tmvFrmId, tmvLifeNumber)
    @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
            mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
    @ungroup
end

...in the global scope, then Julia first compiles the code (which takes a second), and then runs it. It sometimes has to do less compilation the second time around, but still has to do compilation.

However, if you wrap that same code in a function like this...

function analysis()
    @chain dt begin
        @group_by(tmvFrmId, tmvLifeNumber)
        @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
                mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
        @ungroup
    end
end

...and then you run analysis(), the first time you run it, it compiles, and then it doesn't have to compile again.

Now you might wonder, well how does that help in interactive usage?

Well, if you redefine the function like this, with the data frame dt as an argument...

function analysis(dataset)
    @chain dataset begin
        @group_by(tmvFrmId, tmvLifeNumber)
        @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
                mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
        @ungroup
    end
end

...then you can update the data frame and re-run the function with the updated data frame, and it'll be lightning fast (often 10x+ faster than tidyverse).

So to summarize:

I had the same questions as you about @benchmark() vs. @time, so let me show you a way you can test this out using only the @time macro.

Try the following set-up using @time(). Is it any faster the second time around?

function analysis(dataset)
    @chain dataset begin
        @group_by(tmvFrmId, tmvLifeNumber)
        @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
                mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
        @ungroup
    end
end

# the first time, the function compiles
@time analysis(dt) # assuming your dataset is named `dt`

# the second time, there should be no recompilation
@time analysis(dt)

drizk1 commented 10 months ago

With the recent updates to TidierData.jl, I was curious to revisit some benchmarks.

I benchmarked DataFrames.jl vs. TidierData.jl on a data frame of about 7.4 million rows x 11 columns.

Overall they performed nearly identically, coming within 15-20 ms of each other (different cases would lead one or the other to be faster, but only minimally, e.g. 812 ms vs. 828 ms). The only significant difference was when the summarize macro was used, where TidierData was notably slower.

Overall, the progress and performance of TidierData.jl is incredible! Just thought I'd share the update here.

kdpsingh commented 10 months ago

Thanks for that update. This is a great reminder that I need to review the benchmarking page you had prepared for our documentation site, clean it up a bit, and make it public.

I'll try to run the @summarize() benchmark to see if I can reproduce it. I think I know why it's happening: we probably make an extra copy of the data. I have to look and see whether that's avoidable (it probably is). The reason the code needs to be slightly different here than for @mutate is that while DataFrames.jl has a transform!() function, there's understandably no combine!() function, so I think I end up making an up-front copy that isn't needed.
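
For context, a minimal sketch of the underlying DataFrames.jl distinction (illustrative only, not TidierData's actual internals):

using DataFrames, Statistics

df = DataFrame(g = rand(1:100, 1_000_000), x = rand(1_000_000))

# @mutate-style: transform! adds the new column in place, with no copy of the input
transform!(groupby(df, :g), :x => mean => :mean_x)

# @summarize-style: combine necessarily builds a new (smaller) DataFrame; there is
# no in-place combine!(), so any defensive copy of df made beforehand is pure overhead
summary_df = combine(groupby(df, :g), :x => mean => :mean_x)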

kdpsingh commented 10 months ago

Also, at some point we should add precompilation to TidierData to remove any lag from first usage. Even though we are primarily wrapping DataFrames.jl (which already caches precompiled code), the parsing functions should be precompiled.
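
For reference, a rough sketch of what a PrecompileTools.jl workload could look like. This is hypothetical and not the package's actual source; in practice the block would live inside the TidierData module itself so the workload is compiled and cached when the package is precompiled:

using TidierData, PrecompileTools, DataFrames, Statistics

@setup_workload begin
    # a tiny toy table; only the column types matter for precompilation
    df = DataFrame(g = [1, 1, 2], x = [1.0, 2.0, 3.0])

    @compile_workload begin
        # running one representative pipeline during precompilation caches the
        # compiled parsing/translation code paths, shrinking the first-use delay
        @chain df begin
            @group_by(g)
            @mutate(mean_x = mean(x))
            @ungroup
        end
    end
end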