[BUG] STDDEV produces wrong results

drabastomek commented 3 years ago

Describe the bug The STDDEV SQL function produces wrong result -- given how the standard deviation, population and sample standard deviation are defined, it should produce the same results as STDDEV_POP but produces the same results as STDDEV_SAMP.

Steps/Code to reproduce bug Steps to reproduce

import cudf
from blazingsql import BlazingContext
bc = BlazingContext()

df = cudf.DataFrame({
    'np_float64': np.float64(np.arange(-5.5, 5.5))
    , 'category': [
        'A', 'A', 'A'
        , 'B', 'B', 'B'
        , 'C', 'C', 'C'
        , 'D', 'D'
        ]
})
bc.create_table('dtype_test', df)
bc.sql('''
    SELECT category
        , STDDEV(np_float64)
        , STDDEV_POP(np_float64)
        , STDDEV_SAMP(np_float64)
    FROM dtype_test
    GROUP BY category
''')

produces

category	EXPR$1	EXPR$2	EXPR$3
A	1.000000	0.816497	1.000000
B	1.000000	0.816497	1.000000
C	1.000000	0.816497	1.000000
D	0.707107	0.500000	0.707107

Expected behavior Table should produce the following

category	EXPR$1	EXPR$2	EXPR$3
A	0.816497	0.816497	1.000000
B	0.816497	0.816497	1.000000
C	0.816497	0.816497	1.000000
D	0.500000	0.500000	0.707107

The same results were replicated in MariaDB/MySQL database.

Environment overview (please complete the following information)

Environment location: Bare-metal
Method of BlazingSQL install: source
BlazingSQL Version: 'ba4a97a3ce3f1c733a135bf88518f2f67b12b519'

wmalpica commented 3 years ago

STDDEV and STDDEV_POP are not explicitly implemented in BSQL, instead, calcite converts them into a composition of different operations, which we should support. The first step to troubleshoot this is to make sure that the logical plan provided by looks correct

drabastomek commented 3 years ago

Logical plans are as follows:

bc.explain('SELECT category, STDDEV(np_float64) FROM dtype_test GROUP BY category')

produces

LogicalProject(category=[$0], EXPR$1=[POWER(/(-($1, /(*($2, $2), $3)), CASE(=($3, 1), null:BIGINT, -($3, 1))), 0.5:DECIMAL(2, 1))])
  LogicalProject(category=[$0], $f1=[CASE(=($2, 0), null:DOUBLE, $1)], $f2=[CASE(=($4, 0), null:DOUBLE, $3)], $f3=[$4])
    LogicalAggregate(group=[{0}], agg#0=[$SUM0($2)], agg#1=[COUNT($2)], agg#2=[$SUM0($1)], agg#3=[COUNT($1)])
      LogicalProject(category=[$0], np_float64=[$1], $f2=[*($1, $1)])
        BindableTableScan(table=[[main, dtype_test]], projects=[[1, 0]], aliases=[[category, np_float64]])

bc.explain('SELECT category, STDDEV_POP{(np_float64) FROM dtype_test GROUP BY category')

produces

LogicalProject(category=[$0], EXPR$1=[POWER(/(-($1, /(*($2, $2), $3)), $3), 0.5:DECIMAL(2, 1))])
  LogicalProject(category=[$0], $f1=[CASE(=($2, 0), null:DOUBLE, $1)], $f2=[CASE(=($4, 0), null:DOUBLE, $3)], $f3=[$4])
    LogicalAggregate(group=[{0}], agg#0=[$SUM0($2)], agg#1=[COUNT($2)], agg#2=[$SUM0($1)], agg#3=[COUNT($1)])
      LogicalProject(category=[$0], np_float64=[$1], $f2=[*($1, $1)])
        BindableTableScan(table=[[main, dtype_test]], projects=[[1, 0]], aliases=[[category, np_float64]])

3.

bc.explain('SELECT category, STDDEV_SAMP{(np_float64) FROM dtype_test GROUP BY category')

produces

LogicalProject(category=[$0], EXPR$1=[POWER(/(-($1, /(*($2, $2), $3)), CASE(=($3, 1), null:BIGINT, -($3, 1))), 0.5:DECIMAL(2, 1))])
  LogicalProject(category=[$0], $f1=[CASE(=($2, 0), null:DOUBLE, $1)], $f2=[CASE(=($4, 0), null:DOUBLE, $3)], $f3=[$4])
    LogicalAggregate(group=[{0}], agg#0=[$SUM0($2)], agg#1=[COUNT($2)], agg#2=[$SUM0($1)], agg#3=[COUNT($1)])
      LogicalProject(category=[$0], np_float64=[$1], $f2=[*($1, $1)])
        BindableTableScan(table=[[main, dtype_test]], projects=[[1, 0]], aliases=[[category, np_float64]])

The logical plans are identical for STDDEV and STDDEV_SAMP thus this is a Calcite issue rather than ours.

BlazingDB / blazingsql

[BUG] STDDEV produces wrong results #1501