apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
[EPIC] Implement `StringViewArray` and `BinaryViewArray` #5374

Open alamb opened 4 months ago

alamb commented 4 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Recently two new types were added to the Arrow format that make it more suitable for certain types of operations on strings.

Specifically, when doing filter / take operations on string data, creating a new Utf8Array requires copying the strings to a new, packed binary buffer. The "VariableSizeBinaryView" layout was designed to remove this limitation and was recently added to the Arrow spec.
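To make the motivation concrete, here is a minimal standalone sketch of the 16-byte view layout described in the Arrow columnar spec, written without the arrow crate. The names `make_view`, `read_view`, and `take` are hypothetical helpers for illustration only, not the arrow-rs API. Strings of at most 12 bytes are stored inline in the view; longer strings store a 4-byte prefix, a buffer index, and an offset into a shared data buffer.

```rust
/// Strings of at most this many bytes are stored inline in the 16-byte view.
const MAX_INLINE: usize = 12;

/// Read a little-endian u32 starting at byte `i` of a view.
fn u32_at(bytes: &[u8; 16], i: usize) -> u32 {
    u32::from_le_bytes([bytes[i], bytes[i + 1], bytes[i + 2], bytes[i + 3]])
}

/// Build a view for `s` (hypothetical helper, not the arrow-rs API).
/// Layout, little-endian:
///   bytes 0..4   length
///   short value: bytes 4..16 hold the data, zero padded
///   long value:  bytes 4..8 prefix, 8..12 buffer index, 12..16 offset
fn make_view(s: &[u8], buffer: &mut Vec<u8>, buffer_index: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    if s.len() <= MAX_INLINE {
        bytes[4..4 + s.len()].copy_from_slice(s);
    } else {
        // Long values are appended to the data buffer once.
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(s);
        bytes[4..8].copy_from_slice(&s[..4]);
        bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// Resolve a view back to its bytes.
fn read_view(view: u128, buffers: &[Vec<u8>]) -> Vec<u8> {
    let bytes = view.to_le_bytes();
    let len = u32_at(&bytes, 0) as usize;
    if len <= MAX_INLINE {
        bytes[4..4 + len].to_vec()
    } else {
        let index = u32_at(&bytes, 8) as usize;
        let offset = u32_at(&bytes, 12) as usize;
        buffers[index][offset..offset + len].to_vec()
    }
}

/// "take" copies only the fixed-size views; the data buffers are untouched.
fn take(views: &[u128], indices: &[usize]) -> Vec<u128> {
    indices.iter().map(|&i| views[i]).collect()
}
```

In this sketch, `take` never touches the string bytes, whereas an offsets-based layout would need to rebuild a packed data buffer for the selected rows. A real implementation would also carry validity information and share the data buffers by reference counting, which is omitted here.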

Describe the solution you'd like

I would like to implement StringViewArray and BinaryViewArray following the spec:

https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
https://github.com/apache/arrow/blob/3fe598ae4dfd7805ab05452dd5ed4b0d6c97d8d5/format/Schema.fbs#L187-L205

Initially, I would suggest we get the basic types in place:

Then as follow on PRs, add support for key features

Describe alternatives you've considered

I think a good plan would be to dust off the prototype on https://github.com/apache/arrow-rs/pull/4585 from @tustvold (linked from https://github.com/apache/arrow-rs/issues/4253).

Initially, the idea would be to dust off the PR and split it into a few smaller PRs with tests and docs.

Additional context

Polars implemented it recently in Rust, so that can serve as motivation.

Blog post: https://pola.rs/posts/polars-string-type/
https://twitter.com/RitchieVink/status/1749466861069115790

Facebook/Velox's take: https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/

Related PRs: https://github.com/pola-rs/polars/pull/13748 https://github.com/pola-rs/polars/pull/13839 https://github.com/pola-rs/polars/pull/13489

alamb commented 4 months ago

@tustvold pointed out the previous PR he made was https://github.com/apache/arrow-rs/pull/4585

We can probably dust that off and get it ready to merge as part of this project

alamb commented 4 months ago

I believe @kallisti-dev may do some work on this.

I suggested breaking it down into a sequence of PRs (keeping notes of what is not yet implemented along the way)

Specifically, I suggest the first PR should have:

  1. The new DataType
  2. The new StringViewArray and impl Array
  3. A basic constructor for creating and validating StringViewArray
  4. Any feature not yet supported (e.g. IPC support, parquet support, etc.) should return an ArrowError::NotYetImplemented error (rather than panic)

That will likely be a sizeable PR on its own.
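As a rough illustration of item 4, the sketch below shows the "return an error rather than panic" pattern using local stand-in types. `SketchError`, `SketchType`, and `encode_ipc` are hypothetical names invented for this example; they are not the arrow-rs types or functions, and the real code paths will look different.

```rust
use std::fmt;

/// Hypothetical stand-in for the crate's error type.
#[derive(Debug, PartialEq)]
enum SketchError {
    NotYetImplemented(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::NotYetImplemented(what) => {
                write!(f, "Not yet implemented: {}", what)
            }
        }
    }
}

/// Hypothetical subset of data types: one existing, one new.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SketchType {
    Utf8,
    Utf8View,
}

/// Hypothetical entry point for a feature that does not support views yet.
/// Unsupported types produce an error the caller can handle, not a panic.
fn encode_ipc(data_type: SketchType) -> Result<Vec<u8>, SketchError> {
    match data_type {
        // Placeholder for the existing, supported path.
        SketchType::Utf8 => Ok(Vec::new()),
        SketchType::Utf8View => Err(SketchError::NotYetImplemented(format!(
            "IPC support for {:?}",
            data_type
        ))),
    }
}
```

Returning an error keeps the gaps visible and testable while the remaining features land in follow-on PRs.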

alamb commented 4 months ago

Post from Velox on the importance of StringView: https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/

alamb commented 4 months ago

Moving an update from @sundy-li and @ariesdevil from https://github.com/apache/arrow-rs/pull/4585#issuecomment-1965896483 to here for visibility

I believe @ariesdevil is considering working on this relatively shortly

Hi @alamb, I'm willing to work on this feature after datafuselabs/databend#14662 is finished.

@XiangpengHao, who will be working with us at InfluxData this summer (starting in May) is also interested in this project, so when we get closer to May we'll engage more fully and perhaps we can figure out how to split up some of the work.

ariesdevil commented 4 months ago

Do we have a group similar to Discord or Slack?

alamb commented 4 months ago

Do we have a group similar to Discord or Slack?

@ariesdevil There are both Discord and Slack groups -- links are here: https://github.com/apache/arrow-rs?tab=readme-ov-file#arrow-rust-community

The Discord server doesn't require an invite and seems to have been a bit more active recently.

alamb commented 3 months ago

@ariesdevil says they have time to begin work on this so I will begin filing subtasks

Update: I filed these two tasks to get us started.

alamb commented 3 months ago

Update: @ariesdevil has a PR, https://github.com/apache/arrow-rs/pull/5481, with the initial array implementations. We plan to merge it tomorrow and continue iterating from there.

Once that is merged, I'll plan to write some more tickets up

alamb commented 3 months ago

I filed a bunch of follow-on tickets, in case anyone cares:

https://github.com/apache/arrow-rs/issues/5506
https://github.com/apache/arrow-rs/issues/5507
https://github.com/apache/arrow-rs/issues/5508
https://github.com/apache/arrow-rs/issues/5509
https://github.com/apache/arrow-rs/issues/5510
https://github.com/apache/arrow-rs/issues/5511
https://github.com/apache/arrow-rs/issues/5513

cgbur commented 3 months ago

Is there an issue tracking Parquet support for the new Arrow view types? Currently they cannot be used with the Parquet writer when they are specified in the schema.

I am actually more familiar with Arrow than I am with Parquet, so excuse the question: is there an official Parquet spec or implementation plan for the view types?

ariesdevil commented 3 months ago

I'm working on read/write parquet for view types.

alamb commented 3 months ago

I'm working on read/write parquet for view types.

I filed https://github.com/apache/arrow-rs/issues/5530 to track

alamb commented 1 month ago

BTW if anyone is interested, I have a proposal for how to improve manipulating the views in Rust https://github.com/apache/arrow-rs/pull/5619 (I really enjoy the rust type system in general -- it is quite cool)

alamb commented 2 weeks ago

Update here: thanks to @ariesdevil, @XiangpengHao, and @RinChanNOWWW we have pretty good basic support for StringView in arrow-rs. Basic construction and cast are complete, and comparison functionality should follow soon. Also, we can now read directly from parquet into StringView arrays: https://github.com/apache/arrow-rs/issues/5530

We have also begun the work to integrate this into DataFusion, which is tracked in https://github.com/apache/datafusion/issues/10918