apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Implement equality `=` and inequality `<>` support for `StringView` #10919

Closed alamb closed 1 week ago

alamb commented 2 weeks ago

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/10918, [StringViewArray](https://docs.rs/arrow/latest/arrow/array/type.StringViewArray.html) support in DataFusion

Several queries in the ClickBench suite look like the following:

SELECT "MobilePhone", "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhone", "MobilePhoneModel" ORDER BY u DESC LIMIT 10;
SELECT "SearchPhrase", COUNT(*) AS c FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;
SELECT "SearchPhrase", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY u DESC LIMIT 10;
SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;

Here "MobilePhoneModel" and "SearchPhrase" are string columns filtered by a predicate (in this case, a check against the empty string).

Describe the solution you'd like

To improve the performance of these queries, we need to be able to compare StringViewArrays to constant strings (and likely to each other).

Thus I would like to be able to run

StringViewColumn = scalar
StringViewColumn = StringViewColumn

(and likewise for BinaryView)
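To illustrate why view comparisons can be cheap, here is a minimal, std-only sketch of comparing a view column to a scalar. It is not the arrow-rs implementation: `StringViews`, `View`, `eq_scalar` and `neq_scalar` are hypothetical names, the struct fields only approximate the 16-byte Arrow view (a 4-byte length, then either up to 12 inline bytes, or a 4-byte prefix plus a buffer index and offset), a single data buffer is assumed, and nulls are ignored.

```rust
/// Strings of at most this many bytes are stored inline in the view.
const MAX_INLINE: usize = 12;

/// Simplified model of one Arrow string view (hypothetical, for illustration).
#[derive(Clone, Copy, Debug)]
struct View {
    len: u32,
    /// First four bytes of the string, zero-padded.
    prefix: [u8; 4],
    /// Bytes 4..12 of a short string, zero-padded (unused for long strings).
    inline_rest: [u8; 8],
    /// Start of a long string in the shared data buffer (unused for short strings).
    offset: u32,
}

/// A column of views plus one data buffer for the long strings.
struct StringViews {
    views: Vec<View>,
    buffer: Vec<u8>,
}

/// Zero-padded four-byte prefix of `bytes`.
fn prefix_of(bytes: &[u8]) -> [u8; 4] {
    let mut prefix = [0u8; 4];
    let n = bytes.len().min(4);
    prefix[..n].copy_from_slice(&bytes[..n]);
    prefix
}

impl StringViews {
    fn from_strs(values: &[&str]) -> Self {
        let mut views = Vec::with_capacity(values.len());
        let mut buffer = Vec::new();
        for value in values {
            let bytes = value.as_bytes();
            let mut inline_rest = [0u8; 8];
            let mut offset = 0u32;
            if bytes.len() <= MAX_INLINE {
                if bytes.len() > 4 {
                    inline_rest[..bytes.len() - 4].copy_from_slice(&bytes[4..]);
                }
            } else {
                offset = buffer.len() as u32;
                buffer.extend_from_slice(bytes);
            }
            views.push(View {
                len: bytes.len() as u32,
                prefix: prefix_of(bytes),
                inline_rest,
                offset,
            });
        }
        Self { views, buffer }
    }

    /// `column = scalar`: most rows are decided by length and prefix alone.
    fn eq_scalar(&self, needle: &str) -> Vec<bool> {
        let n = needle.as_bytes();
        let n_prefix = prefix_of(n);
        self.views
            .iter()
            .map(|v| {
                // Cheap rejection without touching the data buffer.
                if v.len as usize != n.len() || v.prefix != n_prefix {
                    return false;
                }
                if n.len() <= 4 {
                    // The whole string fits in the prefix.
                    true
                } else if n.len() <= MAX_INLINE {
                    v.inline_rest[..n.len() - 4] == n[4..]
                } else {
                    let start = v.offset as usize;
                    self.buffer[start + 4..start + n.len()] == n[4..]
                }
            })
            .collect()
    }

    /// `column <> scalar`, defined as the negation of `eq_scalar` (no nulls here).
    fn neq_scalar(&self, needle: &str) -> Vec<bool> {
        self.eq_scalar(needle).into_iter().map(|b| !b).collect()
    }
}
```

In this model a predicate such as `"SearchPhrase" <> ''` only ever inspects the length field, which is the kind of shortcut that makes the ClickBench filters above attractive targets.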

I basically want to run the following queries (where table foo has StringView columns):

> create table foo as values ('Andrew', 'X'), ('Xiangpeng', 'Xiangpeng'), ('Raphael', 'R');
0 row(s) fetched.
Elapsed 0.002 seconds.

> select * from foo where column1 = 'Andrew';
+---------+---------+
| column1 | column2 |
+---------+---------+
| Andrew  | X       |
+---------+---------+
1 row(s) fetched.
Elapsed 0.003 seconds.

> select * from foo where column1 <> 'Andrew';
+-----------+-----------+
| column1   | column2   |
+-----------+-----------+
| Xiangpeng | Xiangpeng |
| Raphael   | R         |
+-----------+-----------+
2 row(s) fetched.
Elapsed 0.001 seconds.

> select * from foo where column1 = column2;
+-----------+-----------+
| column1   | column2   |
+-----------+-----------+
| Xiangpeng | Xiangpeng |
+-----------+-----------+
1 row(s) fetched.
Elapsed 0.002 seconds.

> select * from foo where column1 <> column2;
+---------+---------+
| column1 | column2 |
+---------+---------+
| Andrew  | X       |
| Raphael | R       |
+---------+---------+
2 row(s) fetched.
Elapsed 0.001 seconds.

Describe alternatives you've considered

I suspect we will need to update the coercion logic and maybe also the arrow equality kernels like https://docs.rs/arrow/latest/arrow/compute/kernels/cmp/fn.eq.html
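As a rough sketch of what a coercion rule for comparisons might look like: the enum and function below are hypothetical stand-ins, not DataFusion's actual types or API, and the rule itself (prefer the view type whenever either side is a view, so a literal is cast to match the column rather than the other way around) is an assumption made for illustration.

```rust
/// Hypothetical stand-in for the string variants of Arrow's data types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum StrType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Hypothetical rule: pick the common type both sides of `=` / `<>` are cast to.
/// Assumption: a view on either side wins, then LargeUtf8, then Utf8.
fn comparison_coercion(lhs: StrType, rhs: StrType) -> StrType {
    use StrType::*;
    match (lhs, rhs) {
        (Utf8View, _) | (_, Utf8View) => Utf8View,
        (LargeUtf8, _) | (_, LargeUtf8) => LargeUtf8,
        (Utf8, Utf8) => Utf8,
    }
}
```

Under this assumed rule, `column1 = 'Andrew'` with a StringView column and a Utf8 literal would be evaluated as a view-to-view comparison, which in turn needs the equality kernels to accept view arrays.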

Additional context

No response

Weijun-H commented 2 weeks ago

I am glad to pick this ticket.

Weijun-H commented 2 weeks ago

This issue must wait until #10920 because there is currently no convenient way to create a StringViewArray in DataFusion. If I am mistaken, please correct me.

alamb commented 2 weeks ago

> This issue must wait until #10920 because there is currently no convenient way to create a StringViewArray in DataFusion. If I am mistaken, please correct me.

I think you are right -- conveniently @XiangpengHao has one here https://github.com/apache/datafusion/pull/10925

XiangpengHao commented 2 weeks ago

Hi @Weijun-H, great to know you are working on this! I believe implementing this feature will eventually require https://github.com/apache/arrow-rs/issues/5897 to be solved, so I'm working on that issue to make sure you won't be blocked.

alamb commented 1 week ago

BTW I made a branch to work on StringView in DataFusion: https://github.com/apache/datafusion/issues/10961

alamb commented 1 week ago

StringView comparison added in https://github.com/apache/datafusion/pull/10985