Open mapleFU opened 2 months ago
Besides, can we separate it into two tasks:
Put
for encoderI think this making finishing the task more smothly
I'm currently trying on part of this feature.
I found each encoder uses an assert function AssertBaseBinary
which wraps on is_base_binary_like
https://github.com/apache/arrow/blob/de17643e3176d821e02873c776ee91f6e53d3314/cpp/src/parquet/encoding.cc#L251-L255
However, this trait will return false on {String, Binary}ViewArray
https://github.com/apache/arrow/blob/de17643e3176d821e02873c776ee91f6e53d3314/cpp/src/arrow/type_traits.h#L1124-L1142
I didn't find a trait that represents is_base_binary_like + {String, Binary}View
or just represents {String, Binary}View
, do we need a new trait? @mapleFU @felipecrv
@IndifferentArea Nice catch, I think we can do some changes like:
void AssertBaseBinary(const ::arrow::Array& values) {
if (!::arrow::is_base_binary_like(values.type_id())) {
throw ParquetException("Only BaseBinaryArray and subclasses supported");
}
}
void AssertViewBinary(const ::arrow::Array& values);
Then dispatch by view or binary type?
For now, I implement a local is_binary_view_like
trait in encoding.cc
.
bool is_binary_view_like(::arrow::Type::type type_id){
return type_id == ::arrow::Type::BINARY_VIEW || type_id == ::arrow::Type::STRING_VIEW;
}
But I still think this trait may be necessary in type_traits.h
. eg, in current implementation, BinaryArray
's constructor is implemented as:
https://github.com/apache/arrow/blob/c66b3f149f92e1fae0b33cc63c6093db2deedd29/cpp/src/arrow/array/array_binary.cc#L35-L38
However, BinaryViewArray
's constructor is implemented as:
https://github.com/apache/arrow/blob/c66b3f149f92e1fae0b33cc63c6093db2deedd29/cpp/src/arrow/array/array_binary.cc#L93-L96
As a result, we can't build a BinaryViewArray
from StringViewArray
. Comparing to BinaryArray
and StringArray
, I think it's not expected?
As a result, we can't build a BinaryViewArray from StringViewArray.
Lol, this is funny but StringViewArray
just call default ctor in BinaryViewArray
and construct like below:
StringViewArray::StringViewArray(std::shared_ptr<ArrayData> data) {
ARROW_CHECK_EQ(data->type->id(), Type::STRING_VIEW);
SetData(std::move(data));
}
https://github.com/apache/arrow/issues/42247 @felipecrv Hi, just wonder during our testing, currently we don't support reading string_views from arrow, so maybe "cast" or something is required for testing. Would we have cast or there is some workaround?
@mapleFU here is my draft PR: https://github.com/apache/arrow/pull/43302
Blocked on dictionary casts. I will see if I can make it pass the tests and keep dictionary casts as a TODO.
Describe the enhancement requested
This part requires write arrow's StringView/BinaryView/LargeStringView/LargeBinaryView to Parquet file.
The Parquet library has the layers below:
In this part, we should:
View
types inparquet/arrow/schema
module, and add tests.Dictionary<View>
might also be testTypedColumnWriterImpl<ByteArrayType>::WriteArrowDense
and allowing it write view typeDeltaLengthByteArrayEncoder<DType>::Put
,DeltaByteArrayEncoder<DType>::Put
,DictEncoderImpl<ByteArrayType>::Put
andPlainEncoder<ByteArrayType>::Put
supports view typeComponent(s)
C++, Parquet