duckdblabs / db-benchmark

reproducible benchmark of database-like ops
https://duckdblabs.github.io/db-benchmark/
Mozilla Public License 2.0

New polars and Duckdb Times #87

Closed Tmonster closed 2 months ago

Tmonster commented 2 months ago

Keep v3 a float64 in polars, otherwise answers are no longer consistent with previous polars versions

ritchie46 commented 2 months ago

Keep v3 a float64 in polars, otherwise answers are no longer consistent with previous polars versions

I wouldn't say that's fair. There is quite a performance benefit with regard to cache sizes on f32. That the numbers deviate isn't wrong; it is just a result of operating at lower precision. I think the check numbers just need to be updated now that the precision of the queries has been adapted.
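The cache argument comes down to bytes per value; a quick sketch (numpy here is just a stand-in for any columnar buffer, not the benchmark code itself):

```python
import numpy as np

# numpy arrays stand in for any columnar buffer; not the benchmark code itself
col64 = np.zeros(1_000_000, dtype=np.float64)
col32 = col64.astype(np.float32)

# float32 halves the bytes per value, so twice as many values fit in cache
assert col32.nbytes == col64.nbytes // 2  # 4 MB vs 8 MB
```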

Tmonster commented 2 months ago

Hi @ritchie46,

Sorry for the wait, but I had other duckdblabs matters to attend to, and wanted to be sure the deviations were within tolerance.

I understand that the answer might differ because of the change from float64 to float32, but a tolerance of 0.001 (or 0.1%) is allowed between answers to the same question across versions. I think this is an acceptable tolerance, and solutions/answers should remain precise within it between versions. I've gone through many of the non-matching answers, and while most respect the tolerance, there are a few deviations that concern me.

Take the following results for example (groupby q5).

┌───────────────────────────────────────────┬─────────────────────────────────────────┬────────────────────────────┬────────────────┬───────────┬───────────┐
│                 t1.chk                    │                t2.chk                   │          question          │      data      │t1.version │t2.version │
│                  varchar                  │                 varchar                 │          varchar           │    varchar     │ varchar   │ varchar   │
├───────────────────────────────────────────┼─────────────────────────────────────────┼────────────────────────────┼────────────────┼───────────┼───────────┤
│ 2849922064.0;7600000111.0;47498842805.648 │ 2849922064.0;7600000111.0;47271718912.0 │ sum v1:v3 by id6           │ G1_1e9_1e2_5_0 │ 0.19.8    │ 1.0.0     │
│ 2849922064.0;7600000111.0;47498842805.648 │ 2849922064.0;7600000111.0;47271723008.0 │ sum v1:v3 by id6           │ G1_1e9_1e2_5_0 │ 0.19.8    │ 1.0.0     │
│ 2849922064.0;7600000111.0;47498842805.648 │ 2849922064.0;7600000111.0;47271718912.0 │ sum v1:v3 by id6           │ G1_1e9_1e2_5_0 │ 0.20.31   │ 1.0.0     │
└───────────────────────────────────────────┴─────────────────────────────────────────┴────────────────────────────┴────────────────┴───────────┴───────────┘

chk is created using the following SQL queries:

CREATE TABLE ans AS SELECT id6, sum(v1) AS v1, sum(v2) AS v2, sum(v3) AS v3 FROM x GROUP BY id6;
-- chk value
SELECT sum(v1) AS v1, sum(v2) AS v2, sum(v3) AS v3 FROM ans;

v1, v2, and v3 are then combined with a ';' separator.
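As a hypothetical illustration (the exact float formatting in each solution script may differ), the combination can be sketched in Python:

```python
# hypothetical illustration of the chk string construction;
# the values are the float64 polars results from the table above
v1, v2, v3 = 2849922064.0, 7600000111.0, 47498842805.648
chk = ";".join(str(v) for v in (v1, v2, v3))
assert chk == "2849922064.0;7600000111.0;47498842805.648"
```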

Polars with float64 vs. float32 produces chk.v3 results of 47498842805.648 and 47271718912 respectively. The difference is 227123894, or about 0.48%. This deviation is too high and can potentially lead to more incorrect results in future runs, which I would prefer not to debug.
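The arithmetic can be checked directly; a small Python sketch using the two chk.v3 values from the table above:

```python
# the two chk.v3 values from the table above
v3_f64 = 47498842805.648  # polars 0.19.8 / 0.20.31 (float64)
v3_f32 = 47271718912.0    # polars 1.0.0 (float32)

abs_diff = abs(v3_f64 - v3_f32)  # ~2.27e8
rel_diff = abs_diff / v3_f64     # ~0.0048, i.e. about 0.48%
tolerance = 0.001                # the benchmark's 0.1% tolerance
exceeds = rel_diff > tolerance   # the answer fails validation
```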

Other systems' results for q5: duckdb: 47498842806, spark: 47498842805.649, data.table: 47498842806, arrow: 47498842806.

From what I've seen, no other systems use float32, most likely because the results would then no longer be valid within the tolerances.
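For a sense of scale: float32 values near 4.7e10 are only representable to the nearest 4096, so the observed drift of ~2.27e8 must come almost entirely from accumulated rounding across the 1e9-row aggregation rather than representation error alone. A quick numpy check (numpy used here only for illustration):

```python
import numpy as np

target = 47498842805.648            # the float64 chk.v3 from the table above
as_f32 = float(np.float32(target))  # nearest representable float32: 47498842112.0
gap = float(np.spacing(np.float32(target)))  # ulp at this magnitude: 4096.0

repr_error = target - as_f32                      # ~694: representation error alone
observed_drift = 47498842805.648 - 47271718912.0  # ~2.27e8: what the benchmark saw
# the observed drift is several orders of magnitude larger than the
# representation error, so it is dominated by accumulated float32 rounding
```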

Because of the deviation, I will not be merging results that use float32. I considered posting partial results, but if one question is omitted, the whole solution shows "exception/internal error etc." on the leaderboard and is not ranked, which I also don't think is fair.

You can check these answers with the following SQL queries (in the branch Tmonster/polars_results_july_5)

-- create table
create table timings as select * from read_csv('time.csv');
-- polars results
select t1.chk, t2.chk, t1.question, t1.data, t1.version, t2.version
from timings t1, timings t2
where t1.chk != t2.chk
  and t1.question = t2.question
  and t1.task = t2.task
  and t1.solution = 'polars'
  and t1.data not like 'G1_1e9_2e%'
  and t1.solution = t2.solution
  and t1.task = 'groupby'
  and t1.version < t2.version
  and t2.version = '1.0.0'
  and t1.data = t2.data
group by all
order by t1.data;

select chk, question, data, version from timings 
     where question = 'sum v1:v3 by id6' 
     and data = 'G1_1e9_1e2_5_0' 
     and solution = 'duckdb'; 
Tmonster commented 2 months ago

I am going to merge these results today. If you want to continue the discussion on moving to float32, we can move it to another issue/PR.

ritchie46 commented 2 months ago

From what I've seen, no other systems use float32, most likely because the results would then no longer be valid within the tolerances.

I assumed DuckDB used float32 because of this line: https://github.com/duckdblabs/db-benchmark/blob/e54b17fd80fc6bc57be214bb88e5c849fa5b515a/duckdb/groupby-duckdb.R#L47

Of course I am fine with float64, and if you say all systems do so, great! I just want to make sure we all benchmark the same data types; otherwise it's not apples to apples.