cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.87k stars 3.77k forks source link

Order of evaluation of aggregations causes precision issues for float #95665

Closed cockroach-teamcity closed 1 year ago

cockroach-teamcity commented 1 year ago

roachtest.unoptimized-query-oracle/disable-rules=half/seed-multi-region failed with artifacts on master @ b21379bb56dd206e6f63cc7d07ca72e85db7a4c4:

test artifacts and logs in: /artifacts/unoptimized-query-oracle/disable-rules=half/seed-multi-region/run_1
(query_comparison_util.go:251).runOneRoundQueryComparison: failed to set random seed. 1026 statements run: dial tcp 34.138.224.32:26257: connect: connection refused

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

Jira issue: CRDB-23664

cockroach-teamcity commented 1 year ago

roachtest.unoptimized-query-oracle/disable-rules=half/seed-multi-region failed with artifacts on master @ dc974cd698364a4db5acb630162dd8b1856cfad6:

test artifacts and logs in: /artifacts/unoptimized-query-oracle/disable-rules=half/seed-multi-region/run_1
(query_comparison_util.go:251).runOneRoundQueryComparison: . 1043 statements run: expected unoptimized and optimized results to be equal
  []string{
    "0",
-   "9.39932155415957e-48",
+   "1.4148125887129337e-47",
  }
sql: SELECT
    stddev(tab53._float8::FLOAT8)::FLOAT8 AS co̎l154
FROM
    defaultdb.public.seed_mr_table@[0] AS "%9ftab52"
    JOIN defaultdb.public.seed_mr_table@[0] AS tab53 ON ("%9ftab52"._int4) = (tab53._int8)
GROUP BY
    tab53._float8
HAVING
    every(tab53._bool::BOOL)::BOOL

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: [roachtest README](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/README.md) See: [How To Investigate \(internal\)](https://cockroachlabs.atlassian.net/l/c/SSSBr8c7)

This test on roachdash | Improve this report!

michae2 commented 1 year ago

It would be good to check that in both cases we're using the same stddev algorithm, one of the algorithms that's more numerically stable. (I think we have at least one implementation based on Welford-Knuth but maybe we have more than one implementation?)

cucaroach commented 1 year ago

Second failure is dupe of https://github.com/cockroachdb/cockroach/issues/95264

msirek commented 1 year ago

I couldn't reconstruct the stddev issue on first attempt, but here's a simplified test case with SUM illustrating the order of evaluation problem:

create table t1 (a float8, b int, c int, index c_idx(c desc) storing (a, b));
insert into t1 values (1e4, 1, 0);
insert into t1 values (1.00000000000000001e-48, 1, 1);
insert into t1 values (-1e4, 1, 2);
select sum(a) from t1@t1_pkey;
  sum
-------
    0

select sum(f) from (values (1e4), (1.00000000000000001e-48), (-1e4)) v(f);
            sum
---------------------------
  1.00000000000000001E-48

If the order of insertion of rows 2 and 3 in the table is swapped, we get the result:

select sum(a) from t1@t1_pkey;
   sum
---------
  1e-48

Here's a test case where selecting from the primary key vs. an index yields different results:

create table t1 (a float8, b int, c int, index c_idx(c desc) storing (a, b));
insert into t1 values (1e4, 1, 0);
insert into t1 values (-1e4, 1, 2);
insert into t1 values (1e-12, 1, 1);
select sum(a) from t1@t1_pkey;
   sum
---------
  1e-12

select sum(a) from t1@c_idx;
           sum
--------------------------
  1.8189894035458565e-12
msirek commented 1 year ago

The same issue occurs on postgres:

test=# create table t1 (a float8, b int, c int);
CREATE TABLE
test=# insert into t1 values (1e4, 1, 0);
INSERT 0 1
test=# insert into t1 values (1.00000000000000001e-48, 1, 1);
INSERT 0 1
test=# insert into t1 values (-1e4, 1, 2);
INSERT 0 1
test=# select sum(a) from t1;
 sum
-----
   0
(1 row)

test=# delete from t1;
DELETE 3
test=# insert into t1 values (1e4, 1, 0);
INSERT 0 1
test=# insert into t1 values (-1e4, 1, 2);
INSERT 0 1
test=# insert into t1 values (1.00000000000000001e-48, 1, 1);
INSERT 0 1
test=# select sum(a) from t1;
  sum
-------
 1e-48
msirek commented 1 year ago

Removing release-blocker label. Order of execution precision differences are nothing new, and are expected.

msirek commented 1 year ago

Precision differences could possibly be eliminated by using a different data type under the covers to perform the operation, like decimal. But this may use more memory and perform worse. Using float vs. decimal is a trade-off of performance vs. precision.

msirek commented 1 year ago

TLP may have a way to avoid this issue. Maybe this was not extended to unoptimized-query-oracle or costfuzz.

rharding6373 commented 1 year ago

We discussed this family of issues in our collab session today. We decided this might be a good meta issue for tracking investigation into float precision in aggregation operations.

Options we discussed (in no particular order or recommendation):

  1. Only compare floats to some amount of precision.
  2. Don't use floats in unoptimized-query-oracle testing.
  3. Don't run queries with stddev or avg on floats in unoptimized-query-oracle testing.
  4. Constrain floats inserted into the test tables to a narrow range of precision.
  5. Use higher precision types (like decimal) internally for the sum and convert back after taking the average.