Open alamb opened 4 months ago
@tustvold pointed out the previous PR he made was https://github.com/apache/arrow-rs/pull/4585
We can probably dust that off and get it ready to merge as part of this project
I believe @kallisti-dev may do some work on this.
I suggested breaking it down into a sequence of PRs (keeping notes of what is not yet implemented along the way)
Specifically, I suggest the first PR should have:
- DataType
- StringViewArray and impl Array
- ArrowError::NotYetImplemented error (rather than panic)

That will likely be a sizeable PR on its own.
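As a rough illustration of the "error rather than panic" suggestion, here is a minimal std-only sketch. `SketchError`, `SketchType` and `total_len` are hypothetical stand-ins invented for this example; they are not arrow-rs types or kernels, and only mirror the idea of returning a NotYetImplemented error from code paths that do not handle the new view types yet.

```rust
use std::fmt;

/// Hypothetical stand-in for an Arrow-style error type; only the one
/// variant relevant to this sketch is modelled.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    NotYetImplemented(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::NotYetImplemented(msg) => {
                write!(f, "Not yet implemented: {}", msg)
            }
        }
    }
}

impl std::error::Error for SketchError {}

/// Hypothetical subset of data types, just enough to show the dispatch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SketchType {
    Utf8,
    Utf8View,
}

/// Hypothetical kernel: sums the byte lengths of the values.
/// Only `Utf8` is handled; every other type yields an error the caller
/// can inspect, instead of panicking.
pub fn total_len(data_type: SketchType, values: &[&str]) -> Result<usize, SketchError> {
    match data_type {
        SketchType::Utf8 => Ok(values.iter().map(|s| s.len()).sum()),
        other => Err(SketchError::NotYetImplemented(format!(
            "total_len for {:?}",
            other
        ))),
    }
}
```

With this shape, a caller that hits an unsupported type gets an `Err` it can report or route around, and the list of `NotYetImplemented` sites doubles as the notes of what is left to do.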
Post from Velox on the importance of StringView: https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
Moving an update from @sundy-li and @ariesdevil from https://github.com/apache/arrow-rs/pull/4585#issuecomment-1965896483 to here for visibility
I believe @ariesdevil is considering working on this relatively shortly
Hi @alamb, I'm willing to work on this feature after datafuselabs/databend#14662 is finished.
@XiangpengHao, who will be working with us at InfluxData this summer (starting in May) is also interested in this project, so when we get closer to May we'll engage more fully and perhaps we can figure out how to split up some of the work.
Do we have a group similar to Discord or Slack?
@ariesdevil There are both Discord and Slack -- links are here https://github.com/apache/arrow-rs?tab=readme-ov-file#arrow-rust-community
The Discord server doesn't require an invite and seems to be a bit more active recently
@ariesdevil says they have time to begin work on this so I will begin filing subtasks
Update: I filed these two tasks to get us started.
Update here is that @ariesdevil has a PR https://github.com/apache/arrow-rs/pull/5481 that we plan to merge tomorrow with the initial array implementations and we'll continue iterating from there
Once that is merged, I'll plan to write some more tickets up
I filed a bunch of follow on tickets, in case anyone cares
https://github.com/apache/arrow-rs/issues/5506 https://github.com/apache/arrow-rs/issues/5507 https://github.com/apache/arrow-rs/issues/5508 https://github.com/apache/arrow-rs/issues/5509 https://github.com/apache/arrow-rs/issues/5510 https://github.com/apache/arrow-rs/issues/5511 https://github.com/apache/arrow-rs/issues/5513
Is there an issue tracking Parquet support for the new Arrow view types? Currently they cannot be used with the Parquet writer when they are specified in the schema.
I am actually more familiar with Arrow than I am with Parquet, so excuse the question: is there an official Parquet spec or implementation plan for the view types?
I'm working on read/write parquet for view types.
I filed https://github.com/apache/arrow-rs/issues/5530 to track
BTW if anyone is interested, I have a proposal for how to improve manipulating the views in Rust https://github.com/apache/arrow-rs/pull/5619 (I really enjoy the rust type system in general -- it is quite cool)
Update here: thanks to @ariesdevil and @XiangpengHao and @RinChanNOWWW we have pretty good basic support for StringView in arrow-rs -- we have basic construction, cast and soon comparison functionality complete. Also, we can now read directly from parquet into StringView arrays https://github.com/apache/arrow-rs/issues/5530
We have also begun the work to integrate this into DataFusion, which is tracked in https://github.com/apache/datafusion/issues/10918
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Recently two new types were added to the Arrow format that make it more suitable for certain types of operations on strings.
Specifically when doing filtering / take with string data, creating a new Utf8Array requires copying the strings to a new, packed binary buffer. The "VariableSizeBinaryView" was designed to solve this limitation and recently added to the Arrow spec.

Describe the solution you'd like
I would like to implement StringViewArray and BinaryViewArray following the spec:
https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
https://github.com/apache/arrow/blob/3fe598ae4dfd7805ab05452dd5ed4b0d6c97d8d5/format/Schema.fbs#L187-L205
Initially, I would suggest we get the basic types in place:
- BinaryViewArray implementation and layout and basic construction

Then as follow on PRs, add support for key features
- interleave (used in Sort)

Describe alternatives you've considered
I think a good plan would be to dust off the prototype on https://github.com/apache/arrow-rs/pull/4585 from @tustvold (linked from https://github.com/apache/arrow-rs/issues/4253).
Initially, the idea would be to dust off the PR and split it into a few smaller PRs with tests and docs.
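For readers unfamiliar with the layout, here is a minimal std-only sketch of why views avoid the copy on filter / take. It assumes the layout from the Arrow spec (16-byte views, values of at most 12 bytes stored inline, longer values stored as length + 4-byte prefix + buffer index + offset). `View`, `SketchStringViews` and their methods are hypothetical names for illustration only; this is not the arrow-rs implementation, and it ignores nulls and validation.

```rust
use std::sync::Arc;

/// Values of at most this many bytes are stored inside the view itself.
const MAX_INLINE: usize = 12;

/// Simplified model of one view (the real layout packs this into 16 bytes).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum View {
    Inline { len: u8, data: [u8; MAX_INLINE] },
    Ref { len: u32, prefix: [u8; 4], buffer_index: u32, offset: u32 },
}

/// Hypothetical string-view container: views plus shared data buffers.
#[derive(Debug, Clone)]
pub struct SketchStringViews {
    pub views: Vec<View>,
    pub buffers: Vec<Arc<Vec<u8>>>,
}

impl SketchStringViews {
    /// Builds views over a single data buffer holding the long strings.
    pub fn from_strs(values: &[&str]) -> Self {
        let mut data = Vec::new();
        let mut views = Vec::with_capacity(values.len());
        for v in values {
            let bytes = v.as_bytes();
            if bytes.len() <= MAX_INLINE {
                let mut inline = [0u8; MAX_INLINE];
                inline[..bytes.len()].copy_from_slice(bytes);
                views.push(View::Inline { len: bytes.len() as u8, data: inline });
            } else {
                let mut prefix = [0u8; 4];
                prefix.copy_from_slice(&bytes[..4]);
                views.push(View::Ref {
                    len: bytes.len() as u32,
                    prefix,
                    buffer_index: 0,
                    offset: data.len() as u32,
                });
                data.extend_from_slice(bytes);
            }
        }
        Self { views, buffers: vec![Arc::new(data)] }
    }

    pub fn len(&self) -> usize {
        self.views.len()
    }

    /// Resolves the i-th view to its string value.
    pub fn value(&self, i: usize) -> &str {
        let bytes = match &self.views[i] {
            View::Inline { len, data } => &data[..*len as usize],
            View::Ref { len, buffer_index, offset, .. } => {
                let start = *offset as usize;
                &self.buffers[*buffer_index as usize][start..start + *len as usize]
            }
        };
        std::str::from_utf8(bytes).expect("views were built from valid UTF-8")
    }

    /// Filter copies only the selected views; the data buffers are shared
    /// by bumping reference counts, so no string bytes are moved.
    pub fn filter(&self, keep: &[bool]) -> Self {
        let views: Vec<View> = self
            .views
            .iter()
            .zip(keep)
            .filter(|(_, k)| **k)
            .map(|(v, _)| *v)
            .collect();
        Self { views, buffers: self.buffers.clone() }
    }

    /// True if both containers point at the very same data buffers.
    pub fn shares_buffers_with(&self, other: &Self) -> bool {
        self.buffers.len() == other.buffers.len()
            && self.buffers.iter().zip(&other.buffers).all(|(a, b)| Arc::ptr_eq(a, b))
    }
}
```

The trade-off in this sketch is that a filtered result keeps the whole original buffer alive even if only a few views still point into it, which is the kind of detail the follow-on PRs would need to address.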
Additional context
Polars implemented it recently in Rust, so that can serve as motivation.
Blog post: https://pola.rs/posts/polars-string-type/
https://twitter.com/RitchieVink/status/1749466861069115790
Facebook/Velox's take: https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
Related PRs: https://github.com/pola-rs/polars/pull/13748 https://github.com/pola-rs/polars/pull/13839 https://github.com/pola-rs/polars/pull/13489