BlazingDB / blazingsql

BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.
https://blazingsql.com
Apache License 2.0
1.92k stars 181 forks source link

[BUG] STDDEV produces wrong results #1501

Closed drabastomek closed 3 years ago

drabastomek commented 3 years ago

Describe the bug The STDDEV SQL function produces wrong result -- given how the standard deviation, population and sample standard deviation are defined, it should produce the same results as STDDEV_POP but produces the same results as STDDEV_SAMP.

Steps/Code to reproduce bug Steps to reproduce

import cudf
from blazingsql import BlazingContext
bc = BlazingContext()

df = cudf.DataFrame({
    'np_float64': np.float64(np.arange(-5.5, 5.5))
    , 'category': [
        'A', 'A', 'A'
        , 'B', 'B', 'B'
        , 'C', 'C', 'C'
        , 'D', 'D'
        ]
})
bc.create_table('dtype_test', df)
bc.sql('''
    SELECT category
        , STDDEV(np_float64)
        , STDDEV_POP(np_float64)
        , STDDEV_SAMP(np_float64)
    FROM dtype_test
    GROUP BY category
''')

produces

  category EXPR$1 EXPR$2 EXPR$3
A 1.000000 0.816497 1.000000
B 1.000000 0.816497 1.000000
C 1.000000 0.816497 1.000000
D 0.707107 0.500000 0.707107

Expected behavior Table should produce the following

  category EXPR$1 EXPR$2 EXPR$3
A 0.816497 0.816497 1.000000
B 0.816497 0.816497 1.000000
C 0.816497 0.816497 1.000000
D 0.500000 0.500000 0.707107

The same results were replicated in MariaDB/MySQL database.

Environment overview (please complete the following information)

wmalpica commented 3 years ago

STDDEV and STDDEV_POP are not explicitly implemented in BSQL, instead, calcite converts them into a composition of different operations, which we should support. The first step to troubleshoot this is to make sure that the logical plan provided by looks correct

drabastomek commented 3 years ago

Logical plans are as follows:

  1. bc.explain('SELECT category, STDDEV(np_float64) FROM dtype_test GROUP BY category')

    produces

LogicalProject(category=[$0], EXPR$1=[POWER(/(-($1, /(*($2, $2), $3)), CASE(=($3, 1), null:BIGINT, -($3, 1))), 0.5:DECIMAL(2, 1))])
  LogicalProject(category=[$0], $f1=[CASE(=($2, 0), null:DOUBLE, $1)], $f2=[CASE(=($4, 0), null:DOUBLE, $3)], $f3=[$4])
    LogicalAggregate(group=[{0}], agg#0=[$SUM0($2)], agg#1=[COUNT($2)], agg#2=[$SUM0($1)], agg#3=[COUNT($1)])
      LogicalProject(category=[$0], np_float64=[$1], $f2=[*($1, $1)])
        BindableTableScan(table=[[main, dtype_test]], projects=[[1, 0]], aliases=[[category, np_float64]])
  1. bc.explain('SELECT category, STDDEV_POP{(np_float64) FROM dtype_test GROUP BY category')

    produces

LogicalProject(category=[$0], EXPR$1=[POWER(/(-($1, /(*($2, $2), $3)), $3), 0.5:DECIMAL(2, 1))])
  LogicalProject(category=[$0], $f1=[CASE(=($2, 0), null:DOUBLE, $1)], $f2=[CASE(=($4, 0), null:DOUBLE, $3)], $f3=[$4])
    LogicalAggregate(group=[{0}], agg#0=[$SUM0($2)], agg#1=[COUNT($2)], agg#2=[$SUM0($1)], agg#3=[COUNT($1)])
      LogicalProject(category=[$0], np_float64=[$1], $f2=[*($1, $1)])
        BindableTableScan(table=[[main, dtype_test]], projects=[[1, 0]], aliases=[[category, np_float64]])

3.

bc.explain('SELECT category, STDDEV_SAMP{(np_float64) FROM dtype_test GROUP BY category')

produces

LogicalProject(category=[$0], EXPR$1=[POWER(/(-($1, /(*($2, $2), $3)), CASE(=($3, 1), null:BIGINT, -($3, 1))), 0.5:DECIMAL(2, 1))])
  LogicalProject(category=[$0], $f1=[CASE(=($2, 0), null:DOUBLE, $1)], $f2=[CASE(=($4, 0), null:DOUBLE, $3)], $f3=[$4])
    LogicalAggregate(group=[{0}], agg#0=[$SUM0($2)], agg#1=[COUNT($2)], agg#2=[$SUM0($1)], agg#3=[COUNT($1)])
      LogicalProject(category=[$0], np_float64=[$1], $f2=[*($1, $1)])
        BindableTableScan(table=[[main, dtype_test]], projects=[[1, 0]], aliases=[[category, np_float64]])

The logical plans are identical for STDDEV and STDDEV_SAMP thus this is a Calcite issue rather than ours.