apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.02k stars 1.14k forks source link

`STRPOS` doesn't support `(Dictionary(Int32, Utf8)` as input type #12670

Open goldmedal opened 2 days ago

goldmedal commented 2 days ago

Describe the bug

While working on #12415, I found STRPOS can't accept a dictionary string as an argument.

To Reproduce

The following SQL can reproduce this bug

> create table test_source as values
  ('Andrew', 'X', 'datafusion📊🔥', '🔥'),
  ('Xiangpeng', 'Xiangpeng', 'datafusion数据融合', 'datafusion数据融合'),
  ('Raphael', 'R', 'datafusionДатаФусион', 'аФус'),
  (NULL, 'R', NULL, '🔥');
0 row(s) fetched. 
Elapsed 0.068 seconds.

> create table test_basic_operator as
select
    arrow_cast(column1, 'Dictionary(Int32, Utf8)') as ascii_1,
    arrow_cast(column2, 'Dictionary(Int32, Utf8)') as ascii_2,
    arrow_cast(column3, 'Dictionary(Int32, Utf8)') as unicode_1,
    arrow_cast(column4, 'Dictionary(Int32, Utf8)') as unicode_2
from test_source;
0 row(s) fetched. 
Elapsed 0.043 seconds.

> SELECT
  STRPOS(ascii_1, 'e'),
  STRPOS(ascii_1, 'ang'),
  STRPOS(ascii_1, NULL),
  STRPOS(unicode_1, 'и'),
  STRPOS(unicode_1, 'ион'),
  STRPOS(unicode_1, NULL)
FROM test_basic_operator;
Execution error: Unsupported data type combination (Dictionary(Int32, Utf8), Utf8) for function strpos

Expected behavior

This SQL should work.

Additional context

Actually, it may be a bug in recent commits. I am really sure it worked until 26c8004dc4cfd1a55d080409477ee7bef1b54a3b (#12619). After rebasing to the latest main branch, the test case added by #12618 is broken. The bug maybe caused by 524e56d8e59932fd643e6a72c53d5089b7812db7..c21d025df463ce623f9193c4b24d86141fce81ca

alamb commented 2 days ago

This is a good example of the kind of bug that @findepi 's proposal will hopefully help with:

Specifically once we get a standardized way to make specialized function bodies maybe this bug would be fixed by switching to that standardized way

findepi commented 1 day ago

@goldmedal can you please check the following queries?

orignal query from this isuse

> select strpos(arrow_cast('123', '(Dictionary(Int32, Utf8)'), '123');
Error during planning: Unsupported type '(Dictionary(Int32, Utf8)'. [...]

a query without ( before Dictionary in the cast definition:

> select strpos(arrow_cast('123', 'Dictionary(Int32, Utf8)'), '123');
+-----------------------------------------------------------------------------+
| strpos(arrow_cast(Utf8("123"),Utf8("Dictionary(Int32, Utf8)")),Utf8("123")) |
+-----------------------------------------------------------------------------+
| 1                                                                           |
+-----------------------------------------------------------------------------+
1 row(s) fetched.
findepi commented 1 day ago

It seems true that strpos doesn't natively handle dictionaries as input type. It seems they are handled here by requesting a coercion https://github.com/apache/datafusion/blob/1b3608da7ca59d8d987804834d004e8b3e349d18/datafusion/functions/src/unicode/strpos.rs#L83-L84

but i don't see this reflected in the EXPLAIN plan, so not sure how it works.

BTW strpos use of function-driven coercion rules is non-orthodox and seems not exactly in line with generally recommended approach (see @alamb 's https://github.com/apache/datafusion/pull/12308#discussion_r1763545855), which may lead to inconsistencies between this function and others. It seems that function implementors have too much control over planning/coercions. cc @alamb @jayzhan211 @notfilippo

goldmedal commented 1 day ago

a query without ( before Dictionary in the cast definition:

> select strpos(arrow_cast('123', 'Dictionary(Int32, Utf8)'), '123');
+-----------------------------------------------------------------------------+
| strpos(arrow_cast(Utf8("123"),Utf8("Dictionary(Int32, Utf8)")),Utf8("123")) |
+-----------------------------------------------------------------------------+
| 1                                                                           |
+-----------------------------------------------------------------------------+
1 row(s) fetched.

Oh, you're right. The scalar value works well. I actually found this bug in another case. I'll update the reproducing steps. Thanks!

> create table test_source as values
  ('Andrew', 'X', 'datafusion📊🔥', '🔥'),
  ('Xiangpeng', 'Xiangpeng', 'datafusion数据融合', 'datafusion数据融合'),
  ('Raphael', 'R', 'datafusionДатаФусион', 'аФус'),
  (NULL, 'R', NULL, '🔥');
0 row(s) fetched. 
Elapsed 0.068 seconds.

> create table test_basic_operator as
select
    arrow_cast(column1, 'Dictionary(Int32, Utf8)') as ascii_1,
    arrow_cast(column2, 'Dictionary(Int32, Utf8)') as ascii_2,
    arrow_cast(column3, 'Dictionary(Int32, Utf8)') as unicode_1,
    arrow_cast(column4, 'Dictionary(Int32, Utf8)') as unicode_2
from test_source;
0 row(s) fetched. 
Elapsed 0.043 seconds.

> SELECT
  STRPOS(ascii_1, 'e'),
  STRPOS(ascii_1, 'ang'),
  STRPOS(ascii_1, NULL),
  STRPOS(unicode_1, 'и'),
  STRPOS(unicode_1, 'ион'),
  STRPOS(unicode_1, NULL)
FROM test_basic_operator;
Execution error: Unsupported data type combination (Dictionary(Int32, Utf8), Utf8) for function strpos
findepi commented 1 day ago

It seems to have something to do with NULL handling. Your last query runs fine on current main if you remove STRPOS(..., NULL) cases. And a query with a NULL second arg fails even in single-row / VALUES case:

> select strpos(arrow_cast('123', 'Dictionary(Int32, Utf8)'), NULL);
Execution error: Unsupported data type combination (Dictionary(Int32, Utf8), Utf8) for function strpos
Omega359 commented 1 day ago

I am unsure why this function would need manual type coercion. I haven't tested but I would think that a signature such as :

       Signature::one_of(
                vec![
                    Exact(vec![Utf8View, Utf8View]),
                    Exact(vec![Utf8View, Utf8]),
                    Exact(vec![Utf8View, LargeUtf8]),
                    Exact(vec![Utf8, Utf8View]),
                    Exact(vec![Utf8, Utf8]),
                    Exact(vec![Utf8, LargeUtf8]),
                    Exact(vec![LargeUtf8, Utf8View]),
                    Exact(vec![LargeUtf8, Utf8]),
                    Exact(vec![LargeUtf8, LargeUtf8]),
                ],
                Volatility::Immutable,
            )

should be enough?

alamb commented 1 day ago

I am unsure why this function would need manual type coercion. I haven't tested but I would think that a signature such as

I agree that seems reasonable to me. Maybe the code just needs to be cleaned up

jayzhan211 commented 18 hours ago

I am unsure why this function would need manual type coercion. I haven't tested but I would think that a signature such as :

       Signature::one_of(
                vec![
                    Exact(vec![Utf8View, Utf8View]),
                    Exact(vec![Utf8View, Utf8]),
                    Exact(vec![Utf8View, LargeUtf8]),
                    Exact(vec![Utf8, Utf8View]),
                    Exact(vec![Utf8, Utf8]),
                    Exact(vec![Utf8, LargeUtf8]),
                    Exact(vec![LargeUtf8, Utf8View]),
                    Exact(vec![LargeUtf8, Utf8]),
                    Exact(vec![LargeUtf8, LargeUtf8]),
                ],
                Volatility::Immutable,
            )

should be enough?

Because we need to handle Nulls as well

alamb commented 17 hours ago

Because we need to handle Nulls as well

I feel like we need to teach the coercion logic that it can substitute Null for any data type passed in

So given a signature like

                    Exact(vec![Utf8View, Utf8View]),

I would expect the coercion logic to be able to handle inputs like the following (by casting Null to Utf8View)

(Null, Null)
(Null, Utf8View) 
(Utf8View, Null)
(Utf8View, Utf8View)
alamb commented 17 hours ago

Filed https://github.com/apache/datafusion/issues/12698 to track trying to improve this