Inclusion of Julia in the benchmarks #4

bkamins commented 1 year ago

Are you open to re-include Julia DataFrames.jl in the benchmarks?

The scripts in https://github.com/duckdblabs/db-benchmark/tree/master/juliadf should just work. However, I can also update them to the latest versions of Julia and the related packages. The change would be roughly https://github.com/h2oai/db-benchmark/pull/232.

A related big question is precompilation. In the H2O benchmark, the compilation time for Julia was included in the benchmark (that is the reason why there are two runs per test). Also, there was a requirement that no named functions be created and called from the code performing the operations. Do you want to keep these two restrictions?

Tmonster commented 1 year ago

Hi Bogumił,

Apologies for leaving out the JuliaDF results. I have started a run to include them using the latest Julia version. The results should be up by the end of the week. I will also upload the repository with the changes.

As for the restrictions, we do plan on keeping precompilation time. Precompilation time is still time the user has to wait before results are produced, and the spirit of the H2O benchmark is to avoid bias towards results for "hot" runs.
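To make the "two runs per test" idea concrete, here is a minimal sketch of timing the same call twice in a fresh session. The function `sum_of_squares` and the data size are hypothetical stand-ins and are not taken from the benchmark scripts. In a fresh session the first call typically includes compilation and the second does not, but the exact timings depend on the machine and the Julia version.

```julia
# Hypothetical workload standing in for a benchmark query (an assumption,
# not the code used in db-benchmark).
sum_of_squares(v) = sum(abs2, v)

data = rand(1_000_000)

# First call in a fresh session: the measured time usually includes compiling
# sum_of_squares for Vector{Float64}.
first_run = @elapsed sum_of_squares(data)

# Second call: the compiled code is reused, so this is closer to a "hot" run.
second_run = @elapsed sum_of_squares(data)

println("first run:  ", first_run, " s")
println("second run: ", second_run, " s")
```

Reporting both numbers, rather than only the second, is what keeps the wait experienced by a first-time user visible in the results.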

For named functions, I'll have to think about this a bit more. I'm unfamiliar with Julia and don't quite understand all the benefits of named functions. I've run some microbenchmarks for groupby q1 on the 0.5GB and 5GB workloads, and I wasn't able to notice any major performance improvement from using named functions. Could you point me to some documentation describing the benefits of named functions for the benchmark? Or maybe provide a PR where groupby-juliadf uses named functions?

bkamins commented 1 year ago

> I'm unfamiliar with Julia and don't quite understand all the benefits of named functions.

Here is an example of the difference:

With anonymous functions:

julia> using DataFrames

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> @time combine(df, :a => (x -> x) => :b);
  0.094240 seconds (34.31 k allocations: 2.272 MiB, 85.24% compilation time)

julia> @time combine(df, :a => (x -> x) => :b);
  0.022870 seconds (7.31 k allocations: 506.077 KiB, 97.94% compilation time)

With named function (fresh session):

julia> using DataFrames

julia> f(x) = x
f (generic function with 1 method)

julia> df = DataFrame(a=1:3)
3×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> @time combine(df, :a => f => :b);
  0.018758 seconds (9.72 k allocations: 677.229 KiB, 96.25% compilation time)

julia> @time combine(df, :a => f => :b); # no compilation here
  0.000150 seconds (113 allocations: 6.656 KiB)

The differences would probably be visible on small datasets. The issue is that a named function is compiled once, whereas an anonymous function written in global scope is a fresh function each time it is passed, so it has to be compiled again on every call. (An anonymous function defined inside another function is also compiled only once. The problem arises only when multiple anonymous functions with the same body are created in global scope, but this is exactly what we had in the H2O benchmarks.)
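The reason a "fresh function" needs fresh compilation can be sketched with plain Julia, without DataFrames.jl. Every function has its own type, and compiled code is specialized on argument types, so two anonymous functions with the same body written separately in global scope look like unrelated arguments to the callee. The names below (`identity_named`, `anon_a`, `anon_b`, `make_anon`) are hypothetical and used only for illustration.

```julia
# A named function: one definition, one type, reused on every call.
identity_named(x) = x

# Two anonymous functions with the same body, each written separately in
# global scope: each occurrence gets its own distinct type.
anon_a = x -> x
anon_b = x -> x

# An anonymous function created inside another function: every call to
# make_anon returns a function of the same type.
make_anon() = x -> x

same_global_types = typeof(anon_a) == typeof(anon_b)
same_inner_types = typeof(make_anon()) == typeof(make_anon())

println("global anonymous functions share a type: ", same_global_types)
println("inner anonymous functions share a type:  ", same_inner_types)
```

Code specialized for `typeof(anon_a)` cannot be reused for `anon_b`, which is consistent with the repeated compilation time in the first transcript above.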

In parallel, I will open a PR similar to https://github.com/h2oai/db-benchmark/pull/232 to update the code to the latest version of DataFrames.jl.

PallHaraldsson commented 1 year ago

Confusingly, I already see:

Environment configuration R 4.2.2 python 3.10 Julia 1.6.1

This is despite there being no DataFrames.jl results. Since Julia 1.9 seems to be just around the corner, I think it is OK to wait for that, and you could also update to the latest Python (and is R also outdated?). I'm guessing the text is just hardcoded...

> we do plan on keeping precompilation time.

Julia 1.9 precompiles to native code, so at least the first run after a fresh start of Julia is now faster (excluding precompilation, which happens just once and should be excluded). [The second run is then still faster, which I think is the case you said you did not want to favor.]

bkamins commented 1 year ago

> I'm guessing the text is just hardcoded...

Yes, in #7 I update it to Julia 1.9.0-rc2.

> excluding precompilation, that just happens once, and should be excluded

It is not counted.

> runs faster now after a fresh start of Julia

Yes, it should run faster.

Tmonster commented 1 year ago

I will also update the benchmark results tomorrow. I decided to wait since you opened the PR with the updated Julia version.

bkamins commented 1 year ago

Thank you!

Tmonster commented 1 year ago

Results have been published to https://duckdblabs.github.io/db-benchmark/. It seems that the upgrade to Julia 1.9.0-rc2 caused the group by 50GB benchmark to run out of memory. Looking at the logs, the initial run I made (on Julia 1.8.5, I believe) finished all group by queries for the G1_1e9_1e2_0_0 dataset: https://github.com/duckdblabs/db-benchmark/blob/gh-pages/logs.csv#L1222 (the last two columns are what was logged to stderr and the return code).

I ran the benchmark again on Julia 1.9.0-rc2 with your changes, and it looks like an OOM error for the same dataset, G1_1e9_1e2_0_0: https://github.com/duckdblabs/db-benchmark/blob/gh-pages/logs.csv#L1286

bkamins commented 1 year ago

Thank you very much for running the benchmarks. They are extremely useful.

How many physical cores does the machine on which you run the tests have?

In particular (self-note):

it is clear that we need to enable multithreading (which we do not have for all operations now), as it matters significantly, especially for the larger tests;
we need to think about precompilation more (as it affects the results a lot, especially for the smallest data case).

I am adding @quinnj to the discussion, as we need to assess whether the OOM is caused by the CSV reader (CSV.jl) or by DataFrames.jl.

Tmonster commented 1 year ago

Hi Bogumił,

You're welcome! The benchmarks are run on an m4.10xlarge AWS machine, which advertises 40 vCPUs. I'm not sure how that translates to physical CPUs, though. The instance has 160 GB of memory. https://aws.amazon.com/ec2/instance-types/

As for the OOM and the group by queries: the data is first loaded from CSV and then all 10 queries are executed, first the 5 basic group by queries, then the 5 advanced ones. If one query reports OOM, all queries after it aren't run and are reported as OOM failures. This doesn't mean they actually failed due to OOM; they were simply never run because of the earlier OOM failure.
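The reporting rule described above can be sketched as follows. This is only an illustration under stated assumptions: `run_all` and the status symbols are hypothetical, and in the real benchmark the failure is presumably detected at the process level (the process being killed) rather than by catching an exception.

```julia
# Hypothetical runner: once one query hits an OOM, every later query is
# recorded as :oom without being executed.
function run_all(queries)
    statuses = Symbol[]
    oom_seen = false
    for query in queries
        if oom_seen
            # Never run; reported as OOM only because of the earlier failure.
            push!(statuses, :oom)
            continue
        end
        try
            query()
            push!(statuses, :ok)
        catch err
            # Only out-of-memory errors are handled here; anything else propagates.
            err isa OutOfMemoryError || rethrow()
            oom_seen = true
            push!(statuses, :oom)
        end
    end
    return statuses
end

# Usage: the second query simulates an OOM, so the third one is never run.
statuses = run_all([() -> 1, () -> throw(OutOfMemoryError()), () -> 3])
println(statuses)
```

With this rule, the first `:oom` entry in a log is the only one that points at a query that actually ran and failed.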

For JuliaDF, it seems that query 3 of the advanced queries (query 8 in groupby-juliadf.jl) is the first query that doesn't finish, so that query is a good place to start an investigation. I'm not sure how Julia handles data once it is read into memory. It could be that by query 8 memory fragmentation sets in and blocks of the requested size can no longer be allocated, so the process is killed.
