apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.3k stars 1.19k forks source link

[Epic] Complete Initial `StringView` in DataFusion #11752

Closed alamb closed 2 weeks ago

alamb commented 3 months ago

Is your feature request related to a problem or challenge?

This ticket is a follow on to https://github.com/apache/datafusion/issues/10918 where we implemented enough initial support for StringView / BinaryView that we can show some pretty sweet ClickBench results

Describe the solution you'd like

This epic tracks remaining work to complete the "initial" work which I would like to define as "enable using StringView when reading Strings from Parquet by default"

I am sure there will be additional work / support to add StringView to various other features of DataFusion that we can maybe track with another follow on ticket

Required for enabling StringView by default:

Could work around but really should be fixed upstream

Additional "Nice to have" Features

alamb commented 2 months ago

An update here is that @XiangpengHao has a PR with various changes in https://github.com/apache/datafusion/pull/11862

We still need to check that PR and figure out what else is in that PR is needed to be enabled "for real" (with tests, etc)

alamb commented 2 months ago

My ideal resolution here is that we end up in the state where the only change we need to enable string view by default is switch the config setting. I will do some more ticket triage later today to outline other items I know of

2010YOUY01 commented 2 months ago

Do we have tickets for regexp binary operators? (like ~, !~...) https://datafusion.apache.org/user-guide/sql/operators.html#op-re-match

I noticed stringview is not supported on them yet and they have separate implementation than regexp functions

Details

``` /*DML*/CREATE TABLE t0(v0 DOUBLE, v1 DOUBLE, v2 BOOLEAN, v3 BOOLEAN, v4 BOOLEAN, v5 STRING); /*DML*/INSERT INTO t0(v1, v5, v2) VALUES (0.7183242196192607, 'Tn', true); /*DML*/CREATE TABLE t0_stringview AS SELECT v0, v1, v2, v3, v4, arrow_cast(v5, 'Utf8View') as v5 FROM t0; > select v5 ~ 'foo' from t0_stringview; Internal error: Data type Utf8View not supported for binary_string_array_flag_op_scalar operation 'regexp_is_match' on string array. This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker > select regexp_match(v5, 'foo') from t0_stringview; +--------------------------------------------+ | regexp_match(t0_stringview.v5,Utf8("foo")) | +--------------------------------------------+ | | +--------------------------------------------+ 1 row(s) fetched. Elapsed 0.034 seconds. ```

alamb commented 2 months ago

Do we have tickets for regexp binary operators? (like ~, !~...)

Not that I know of -- it would be great to add them

alamb commented 2 months ago

Do we have tickets for regexp binary operators? (like , !...)

Filed https://github.com/apache/datafusion/issues/12180

alamb commented 1 month ago

I am going to try and polish up PR to enable string view by default PR (with the arrow upgrade and various recent improvements) and see how close we are https://github.com/apache/datafusion/pull/12092

alamb commented 2 weeks ago

StringView by default is finally merged into DataFusion: https://github.com/apache/datafusion/pull/13101 so I am claiming success and completion of this issue