Open sergiimk opened 5 months ago
I agree this is unexpected and confusing
@alamb In /expr/src/conditional_expressions.rs file, it is mentioned that coalesce() is supported only for below datatypes and not supported for Timestamp. And so, the return value of coalesce() with timestamp is considered as UTF8. Is there any specific reason for not supporting other datatypes?
/// Currently supported types by the coalesce function.
/// The order of these types correspond to the order on which coercion applies
/// This should thus be from least informative to most informative
pub static SUPPORTED_COALESCE_TYPES: &[DataType] = &[
DataType::Boolean,
DataType::UInt8,
DataType::UInt16,
DataType::UInt32,
DataType::UInt64,
DataType::Int8,
DataType::Int16,
DataType::Int32,
DataType::Int64,
DataType::Float32,
DataType::Float64,
DataType::Utf8,
DataType::LargeUtf8,
];
Is there any specific reason for not supporting other datatypes?
Hi @dhamotharan-ps -- looking at the actual functions implementation, it does not appear to depend on any specific type, so I don't see a reason for not supporting other data types
Maybe we just need to change https://github.com/apache/arrow-datafusion/blob/e78f4bd9289db4ed41e445e56d59540491aeced2/datafusion/expr/src/built_in_function.rs#L996-L998 to Signature::uniform
🤔
@alamb Signature::uniform is used for fixed number of arguments with same type where as variadic() is used for An arbitrary number of arguments with the same type.
To fix this issue, we need to add missing datatypes to conditional_expressions::SUPPORTED_COALESCE_TYPES. I will fix this issue.
pub static SUPPORTED_COALESCE_TYPES: &[DataType] = &[
DataType::Boolean,
DataType::UInt8,
DataType::UInt16,
DataType::UInt32,
DataType::UInt64,
DataType::Int8,
DataType::Int16,
DataType::Int32,
DataType::Int64,
DataType::Float32,
DataType::Float64,
DataType::Utf8,
DataType::LargeUtf8,
DataType::Timestamp(Nanosecond, None),
DataType::Timestamp(Microsecond, None),
DataType::Timestamp(Millisecond, None),
DataType::Timestamp(Second, None),
DataType::Date32,
DataType::Date64,
DataType::Binary,
];
Maybe we also need Time
and Time64
?
@dhamotharan-ps -- What about using variadic_equal
? https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.Signature.html#method.variadic_equal
An arbitrary number of arguments of the same type.
That way we don't restrict the types to a fixed list but support any data type
@alamb yes, using variadic_equal would be the better option. May I fix this and add testcases for other datatypes? Or will you commit it?
May I fix this and add testcases for other datatypes?
Please do! That would be most appreciated
@alamb When I use variadic_equal(), the following tests were failing.
failures:
tpcds_logical_q40
tpcds_logical_q49
tpcds_logical_q67
tpcds_logical_q75
tpcds_logical_q77
tpcds_logical_q78
tpcds_logical_q80
tpcds_physical_q40
tpcds_physical_q49
tpcds_physical_q67
tpcds_physical_q75
tpcds_physical_q77
tpcds_physical_q78
tpcds_physical_q80
The error messages are:
---- tpcds_physical_q78 stdout ----
Error: Plan("No function matches the given name and argument types 'coalesce(Decimal128(17, 2), Int64)'. You might need to add explicit type casts.\n\tCandidate functions:\n\tcoalesce(CoercibleT, .., CoercibleT)")
---- tpcds_physical_q80 stdout ----
Error: Plan("No function matches the given name and argument types 'coalesce(Decimal128(7, 2), Int64)'. You might need to add explicit type casts.\n\tCandidate functions:\n\tcoalesce(CoercibleT, .., CoercibleT)")
---- Another error for UInt64
External error: query failed: DataFusion error: Error during planning: No function matches the given name and argument types 'coalesce(UInt64, Int64)'. You might need to add explicit type casts.
Candidate functions:
coalesce(CoercibleT, .., CoercibleT)
[SQL] select col1, col2, coalesce(sum_col3, 0) as sum_col3
from (select distinct col2 from tbl) AS q1
cross join (select distinct col1 from tbl) AS q2
left outer join (SELECT col1, col2, sum(col3) as sum_col3 FROM tbl GROUP BY col1, col2) AS q3
USING(col2, col1)
ORDER BY col1, col2
at test_files/joins.slt:1785
The root-cause is coerced_from() is not handling Decimcal128() and UInt64 datatypes.
Before this fix, Decimal types are typecasted to Float64 before coming to coerced_from() and so they were working.
We would require to improve coerced_from() function to handle more datatypes. I also try other approaches to fix this issue.
Without fix, the following are return types of coalesce() function.
create or replace table x as (
select
cast(null as timestamp) as n,
cast('2020-01-01T00:00:00Z' as timestamp) as v
);
select coalesce(n, v) from x; -- return datatype is Utf8
create or replace table x as (
select
cast(null as Decimal) as n,
cast(8 as BIGINT) as v
);
select coalesce(n, v) from x; -- return datatype is Float64
create or replace table x as (
select
cast(null as Date) as n,
cast('2020-01-01T00:00:00Z' as Timestamp) as v
);
select coalesce(n, v) from x; -- return datatype is Utf8
create or replace table x as (
select
cast(null as Bigint unsigned) as n,
cast(8 as BIGINT) as v
);
select coalesce(n, v) from x; -- return datatype is Float32
With fix,
create or replace table x as (
select
cast(null as timestamp) as n,
cast('2020-01-01T00:00:00Z' as timestamp) as v
);
select coalesce(n, v) from x; -- return datatype is Timestamp(Nanosecond, None)
create or replace table x as (
select
cast(null as Decimal) as n,
cast(8 as BIGINT) as v
);
select coalesce(n, v) from x; -- return datatype is Decimal128(38, 10)
create or replace table x as (
select
cast(null as Date) as n,
cast('2020-01-01T00:00:00Z' as Timestamp) as v
);
select coalesce(n, v) from x; -- return datatype is Timestamp(Nanosecond, None)
create or replace table x as (
select
cast(null as Bigint unsigned) as n,
cast(8 as BIGINT) as v
);
select coalesce(n, v) from x; -- return datatype is Int64
Maybe we need to add automatic coercsion from int64
to decimal
🤔 @viirya / @liukun4515 do you remember if there is some reason we don't do this?
Adding concise test
❯ select arrow_typeof(coalesce(cast(null as timestamp), cast(null as timestamp)));
+-----------------------------------+
| arrow_typeof(coalesce(NULL,NULL)) |
+-----------------------------------+
| Utf8 |
+-----------------------------------+
1 row in set. Query took 0.002 seconds.
Describe the bug
The
coalesce()
function when applied to twoTimestamp(_, _)
columns unexpectedly returnsUtf8
result.To Reproduce
In
datafusion-cli
:Expected behavior
1) Result column is of
Timestamp
type 2) Result column is NOT nullable, as one of the arguments tocoalesce()
cannot be NULLAdditional context
Note that this works fine (in respect to expectation 1, but not 2):