apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.49k stars 1.02k forks source link

coalesce() of two timestamps returns Utf8 result #8790

Open sergiimk opened 5 months ago

sergiimk commented 5 months ago

Describe the bug

The coalesce() function when applied to two Timestamp(_, _) columns unexpectedly returns Utf8 result.

To Reproduce

In datafusion-cli:

create or replace table x as (
    select 
        cast(null as timestamp) as n, 
        cast('2020-01-01T00:00:00Z' as timestamp) as v
);

describe x;
+-------------+-----------------------------+-------------+
| column_name | data_type                   | is_nullable |
+-------------+-----------------------------+-------------+
| n           | Timestamp(Nanosecond, None) | YES         |
| v           | Timestamp(Nanosecond, None) | NO          |
+-------------+-----------------------------+-------------+

select * from x;
+---+---------------------+
| n | v                   |
+---+---------------------+
|   | 2020-01-01T00:00:00 |
+---+---------------------+

create or replace table y as (
    select coalesce(n, v) as v from x
);

select * from y;
+---------------------+
| v                   |
+---------------------+
| 2020-01-01T00:00:00 |
+---------------------+

describe y;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| v           | Utf8      | YES         |
+-------------+-----------+-------------+

Expected behavior

1) Result column is of Timestamp type 2) Result column is NOT nullable, as one of the arguments to coalesce() cannot be NULL

Additional context

Note that this works fine (in respect to expectation 1, but not 2):

create or replace table y as (
    select case when n is null then v else n end as v from x
);

describe y;
+-------------+-----------------------------+-------------+
| column_name | data_type                   | is_nullable |
+-------------+-----------------------------+-------------+
| v           | Timestamp(Nanosecond, None) | YES         |
+-------------+-----------------------------+-------------+
alamb commented 5 months ago

I agree this is unexpected and confusing

dhamotharan-ps commented 5 months ago

@alamb In /expr/src/conditional_expressions.rs file, it is mentioned that coalesce() is supported only for below datatypes and not supported for Timestamp. And so, the return value of coalesce() with timestamp is considered as UTF8. Is there any specific reason for not supporting other datatypes?

/// Currently supported types by the coalesce function.
/// The order of these types correspond to the order on which coercion applies
/// This should thus be from least informative to most informative
pub static SUPPORTED_COALESCE_TYPES: &[DataType] = &[
    DataType::Boolean,
    DataType::UInt8,
    DataType::UInt16,
    DataType::UInt32,
    DataType::UInt64,
    DataType::Int8,
    DataType::Int16,
    DataType::Int32,
    DataType::Int64,
    DataType::Float32,
    DataType::Float64,
    DataType::Utf8,
    DataType::LargeUtf8,
];
alamb commented 5 months ago

Is there any specific reason for not supporting other datatypes?

Hi @dhamotharan-ps -- looking at the actual functions implementation, it does not appear to depend on any specific type, so I don't see a reason for not supporting other data types

https://github.com/apache/arrow-datafusion/blob/e78f4bd9289db4ed41e445e56d59540491aeced2/datafusion/physical-expr/src/conditional_expressions.rs#L26-L80

Maybe we just need to change https://github.com/apache/arrow-datafusion/blob/e78f4bd9289db4ed41e445e56d59540491aeced2/datafusion/expr/src/built_in_function.rs#L996-L998 to Signature::uniform 🤔

dhamotharan-ps commented 5 months ago

@alamb Signature::uniform is used for fixed number of arguments with same type where as variadic() is used for An arbitrary number of arguments with the same type.

To fix this issue, we need to add missing datatypes to conditional_expressions::SUPPORTED_COALESCE_TYPES. I will fix this issue.

pub static SUPPORTED_COALESCE_TYPES: &[DataType] = &[
    DataType::Boolean,
    DataType::UInt8,
    DataType::UInt16,
    DataType::UInt32,
    DataType::UInt64,
    DataType::Int8,
    DataType::Int16,
    DataType::Int32,
    DataType::Int64,
    DataType::Float32,
    DataType::Float64,
    DataType::Utf8,
    DataType::LargeUtf8,
    DataType::Timestamp(Nanosecond, None),
    DataType::Timestamp(Microsecond, None),
    DataType::Timestamp(Millisecond, None),
    DataType::Timestamp(Second, None),
    DataType::Date32,
    DataType::Date64,
    DataType::Binary,
    ];
alamb commented 5 months ago

Maybe we also need Time and Time64?

@dhamotharan-ps -- What about using variadic_equal? https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.Signature.html#method.variadic_equal

An arbitrary number of arguments of the same type.

That way we don't restrict the types to a fixed list but support any data type

dhamotharan-ps commented 5 months ago

@alamb yes, using variadic_equal would be the better option. May I fix this and add testcases for other datatypes? Or will you commit it?

alamb commented 5 months ago

May I fix this and add testcases for other datatypes?

Please do! That would be most appreciated

dhamotharan-ps commented 5 months ago

@alamb When I use variadic_equal(), the following tests were failing.

failures:
    tpcds_logical_q40
    tpcds_logical_q49
    tpcds_logical_q67
    tpcds_logical_q75
    tpcds_logical_q77
    tpcds_logical_q78
    tpcds_logical_q80
    tpcds_physical_q40
    tpcds_physical_q49
    tpcds_physical_q67
    tpcds_physical_q75
    tpcds_physical_q77
    tpcds_physical_q78
    tpcds_physical_q80

The error messages are:

---- tpcds_physical_q78 stdout ----
Error: Plan("No function matches the given name and argument types 'coalesce(Decimal128(17, 2), Int64)'. You might need to add explicit type casts.\n\tCandidate functions:\n\tcoalesce(CoercibleT, .., CoercibleT)")

---- tpcds_physical_q80 stdout ----
Error: Plan("No function matches the given name and argument types 'coalesce(Decimal128(7, 2), Int64)'. You might need to add explicit type casts.\n\tCandidate functions:\n\tcoalesce(CoercibleT, .., CoercibleT)")

---- Another error for UInt64
External error: query failed: DataFusion error: Error during planning: No function matches the given name and argument types 'coalesce(UInt64, Int64)'. You might need to add explicit type casts.
    Candidate functions:
    coalesce(CoercibleT, .., CoercibleT)
[SQL] select col1, col2, coalesce(sum_col3, 0) as sum_col3
from (select distinct col2 from tbl) AS q1
cross join (select distinct col1 from tbl) AS q2
left outer join (SELECT col1, col2, sum(col3) as sum_col3 FROM tbl GROUP BY col1, col2) AS q3
USING(col2, col1)
ORDER BY col1, col2
at test_files/joins.slt:1785

The root-cause is coerced_from() is not handling Decimcal128() and UInt64 datatypes.

Before this fix, Decimal types are typecasted to Float64 before coming to coerced_from() and so they were working.

We would require to improve coerced_from() function to handle more datatypes. I also try other approaches to fix this issue.

dhamotharan-ps commented 5 months ago

Without fix, the following are return types of coalesce() function.

create or replace table x as (
        select 
            cast(null as timestamp) as n, 
            cast('2020-01-01T00:00:00Z' as timestamp) as v
    );
    select coalesce(n, v) from x;   -- return datatype is Utf8

create or replace table x as (
        select 
            cast(null as Decimal) as n, 
            cast(8 as BIGINT) as v
    );
select coalesce(n, v) from x;   -- return datatype is Float64
create or replace table x as (
        select 
            cast(null as Date) as n, 
            cast('2020-01-01T00:00:00Z' as Timestamp) as v
    );
select coalesce(n, v) from x;   -- return datatype is Utf8

create or replace table x as (
        select 
            cast(null as  Bigint unsigned) as n, 
            cast(8 as BIGINT) as v
    );
select coalesce(n, v) from x;   -- return datatype is Float32

With fix,

create or replace table x as (
        select 
            cast(null as timestamp) as n, 
            cast('2020-01-01T00:00:00Z' as timestamp) as v
    );
    select coalesce(n, v) from x;   -- return datatype is Timestamp(Nanosecond, None)

create or replace table x as (
        select 
            cast(null as Decimal) as n, 
            cast(8 as BIGINT) as v
    );
select coalesce(n, v) from x;   -- return datatype is Decimal128(38, 10)
create or replace table x as (
        select 
            cast(null as Date) as n, 
            cast('2020-01-01T00:00:00Z' as Timestamp) as v
    );
select coalesce(n, v) from x;   -- return datatype is Timestamp(Nanosecond, None)

create or replace table x as (
        select 
            cast(null as  Bigint unsigned) as n, 
            cast(8 as BIGINT) as v
    );
select coalesce(n, v) from x;   -- return datatype is Int64
alamb commented 5 months ago

Maybe we need to add automatic coercsion from int64 to decimal 🤔 @viirya / @liukun4515 do you remember if there is some reason we don't do this?

comphead commented 4 months ago

Adding concise test

❯ select arrow_typeof(coalesce(cast(null as timestamp), cast(null as timestamp)));
+-----------------------------------+
| arrow_typeof(coalesce(NULL,NULL)) |
+-----------------------------------+
| Utf8                              |
+-----------------------------------+
1 row in set. Query took 0.002 seconds.