Open goldmedal opened 2 days ago
This is a good example of the kind of bug that @findepi 's proposal will hopefully help with:
Specifically once we get a standardized way to make specialized function bodies maybe this bug would be fixed by switching to that standardized way
@goldmedal can you please check the following queries?
orignal query from this isuse
> select strpos(arrow_cast('123', '(Dictionary(Int32, Utf8)'), '123');
Error during planning: Unsupported type '(Dictionary(Int32, Utf8)'. [...]
a query without (
before Dictionary
in the cast definition:
> select strpos(arrow_cast('123', 'Dictionary(Int32, Utf8)'), '123');
+-----------------------------------------------------------------------------+
| strpos(arrow_cast(Utf8("123"),Utf8("Dictionary(Int32, Utf8)")),Utf8("123")) |
+-----------------------------------------------------------------------------+
| 1 |
+-----------------------------------------------------------------------------+
1 row(s) fetched.
It seems true that strpos
doesn't natively handle dictionaries as input type. It seems they are handled here by requesting a coercion
https://github.com/apache/datafusion/blob/1b3608da7ca59d8d987804834d004e8b3e349d18/datafusion/functions/src/unicode/strpos.rs#L83-L84
but i don't see this reflected in the EXPLAIN plan, so not sure how it works.
BTW strpos use of function-driven coercion rules is non-orthodox and seems not exactly in line with generally recommended approach (see @alamb 's https://github.com/apache/datafusion/pull/12308#discussion_r1763545855), which may lead to inconsistencies between this function and others. It seems that function implementors have too much control over planning/coercions. cc @alamb @jayzhan211 @notfilippo
a query without
(
beforeDictionary
in the cast definition:> select strpos(arrow_cast('123', 'Dictionary(Int32, Utf8)'), '123'); +-----------------------------------------------------------------------------+ | strpos(arrow_cast(Utf8("123"),Utf8("Dictionary(Int32, Utf8)")),Utf8("123")) | +-----------------------------------------------------------------------------+ | 1 | +-----------------------------------------------------------------------------+ 1 row(s) fetched.
Oh, you're right. The scalar value works well. I actually found this bug in another case. I'll update the reproducing steps. Thanks!
> create table test_source as values
('Andrew', 'X', 'datafusion📊🔥', '🔥'),
('Xiangpeng', 'Xiangpeng', 'datafusion数据融合', 'datafusion数据融合'),
('Raphael', 'R', 'datafusionДатаФусион', 'аФус'),
(NULL, 'R', NULL, '🔥');
0 row(s) fetched.
Elapsed 0.068 seconds.
> create table test_basic_operator as
select
arrow_cast(column1, 'Dictionary(Int32, Utf8)') as ascii_1,
arrow_cast(column2, 'Dictionary(Int32, Utf8)') as ascii_2,
arrow_cast(column3, 'Dictionary(Int32, Utf8)') as unicode_1,
arrow_cast(column4, 'Dictionary(Int32, Utf8)') as unicode_2
from test_source;
0 row(s) fetched.
Elapsed 0.043 seconds.
> SELECT
STRPOS(ascii_1, 'e'),
STRPOS(ascii_1, 'ang'),
STRPOS(ascii_1, NULL),
STRPOS(unicode_1, 'и'),
STRPOS(unicode_1, 'ион'),
STRPOS(unicode_1, NULL)
FROM test_basic_operator;
Execution error: Unsupported data type combination (Dictionary(Int32, Utf8), Utf8) for function strpos
It seems to have something to do with NULL handling. Your last query runs fine on current main
if you remove STRPOS(..., NULL)
cases. And a query with a NULL second arg fails even in single-row / VALUES case:
> select strpos(arrow_cast('123', 'Dictionary(Int32, Utf8)'), NULL);
Execution error: Unsupported data type combination (Dictionary(Int32, Utf8), Utf8) for function strpos
I am unsure why this function would need manual type coercion. I haven't tested but I would think that a signature such as :
Signature::one_of(
vec![
Exact(vec![Utf8View, Utf8View]),
Exact(vec![Utf8View, Utf8]),
Exact(vec![Utf8View, LargeUtf8]),
Exact(vec![Utf8, Utf8View]),
Exact(vec![Utf8, Utf8]),
Exact(vec![Utf8, LargeUtf8]),
Exact(vec![LargeUtf8, Utf8View]),
Exact(vec![LargeUtf8, Utf8]),
Exact(vec![LargeUtf8, LargeUtf8]),
],
Volatility::Immutable,
)
should be enough?
I am unsure why this function would need manual type coercion. I haven't tested but I would think that a signature such as
I agree that seems reasonable to me. Maybe the code just needs to be cleaned up
I am unsure why this function would need manual type coercion. I haven't tested but I would think that a signature such as :
Signature::one_of( vec![ Exact(vec![Utf8View, Utf8View]), Exact(vec![Utf8View, Utf8]), Exact(vec![Utf8View, LargeUtf8]), Exact(vec![Utf8, Utf8View]), Exact(vec![Utf8, Utf8]), Exact(vec![Utf8, LargeUtf8]), Exact(vec![LargeUtf8, Utf8View]), Exact(vec![LargeUtf8, Utf8]), Exact(vec![LargeUtf8, LargeUtf8]), ], Volatility::Immutable, )
should be enough?
Because we need to handle Nulls as well
Because we need to handle Nulls as well
I feel like we need to teach the coercion logic that it can substitute Null
for any data type passed in
So given a signature like
Exact(vec![Utf8View, Utf8View]),
I would expect the coercion logic to be able to handle inputs like the following (by casting Null
to Utf8View)
(Null, Null)
(Null, Utf8View)
(Utf8View, Null)
(Utf8View, Utf8View)
Filed https://github.com/apache/datafusion/issues/12698 to track trying to improve this
Describe the bug
While working on #12415, I found
STRPOS
can't accept a dictionary string as an argument.To Reproduce
The following SQL can reproduce this bug
Expected behavior
This SQL should work.
Additional context
Actually, it may be a bug in recent commits. I am really sure it worked until 26c8004dc4cfd1a55d080409477ee7bef1b54a3b (#12619). After rebasing to the latest main branch, the test case added by #12618 is broken. The bug maybe caused by 524e56d8e59932fd643e6a72c53d5089b7812db7..c21d025df463ce623f9193c4b24d86141fce81ca