apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.35k stars 3.49k forks source link

[C++][Parquet] Supports write BinaryView/StringView to Parquet file #43244

Open mapleFU opened 2 months ago

mapleFU commented 2 months ago

Describe the enhancement requested

This part requires write arrow's StringView/BinaryView/LargeStringView/LargeBinaryView to Parquet file.

The Parquet library has the layers below:

  1. arrow/parquet: which accepts the input "Array" and able to write them to the underlying Parquet api
  2. parquet: the core Parquet libs

In this part, we should:

  1. Check and test View types in parquet/arrow/schema module, and add tests. Dictionary<View> might also be test
  2. Update TypedColumnWriterImpl<ByteArrayType>::WriteArrowDense and allowing it write view type
  3. Allowing DeltaLengthByteArrayEncoder<DType>::Put, DeltaByteArrayEncoder<DType>::Put, DictEncoderImpl<ByteArrayType>::Put and PlainEncoder<ByteArrayType>::Put supports view type

Component(s)

C++, Parquet

mapleFU commented 2 months ago

Besides, can we separate it into two tasks:

I think this making finishing the task more smothly

IndifferentArea commented 2 months ago

I'm currently trying on part of this feature. I found each encoder uses an assert function AssertBaseBinary which wraps on is_base_binary_like https://github.com/apache/arrow/blob/de17643e3176d821e02873c776ee91f6e53d3314/cpp/src/parquet/encoding.cc#L251-L255

However, this trait will return false on {String, Binary}ViewArray https://github.com/apache/arrow/blob/de17643e3176d821e02873c776ee91f6e53d3314/cpp/src/arrow/type_traits.h#L1124-L1142

I didn't find a trait that represents is_base_binary_like + {String, Binary}View or just represents {String, Binary}View, do we need a new trait? @mapleFU @felipecrv

mapleFU commented 2 months ago

@IndifferentArea Nice catch, I think we can do some changes like:

 void AssertBaseBinary(const ::arrow::Array& values) { 
   if (!::arrow::is_base_binary_like(values.type_id())) { 
     throw ParquetException("Only BaseBinaryArray and subclasses supported"); 
   } 
 } 

void AssertViewBinary(const ::arrow::Array& values);

Then dispatch by view or binary type?

IndifferentArea commented 2 months ago

For now, I implement a local is_binary_view_like trait in encoding.cc.

bool is_binary_view_like(::arrow::Type::type type_id){
  return type_id == ::arrow::Type::BINARY_VIEW || type_id == ::arrow::Type::STRING_VIEW;
}

But I still think this trait may be necessary in type_traits.h. eg, in current implementation, BinaryArray's constructor is implemented as: https://github.com/apache/arrow/blob/c66b3f149f92e1fae0b33cc63c6093db2deedd29/cpp/src/arrow/array/array_binary.cc#L35-L38 However, BinaryViewArray's constructor is implemented as: https://github.com/apache/arrow/blob/c66b3f149f92e1fae0b33cc63c6093db2deedd29/cpp/src/arrow/array/array_binary.cc#L93-L96 As a result, we can't build a BinaryViewArray from StringViewArray. Comparing to BinaryArray and StringArray, I think it's not expected?

mapleFU commented 2 months ago

As a result, we can't build a BinaryViewArray from StringViewArray.

Lol, this is funny but StringViewArray just call default ctor in BinaryViewArray and construct like below:

StringViewArray::StringViewArray(std::shared_ptr<ArrayData> data) {
  ARROW_CHECK_EQ(data->type->id(), Type::STRING_VIEW);
  SetData(std::move(data));
}
mapleFU commented 2 months ago

https://github.com/apache/arrow/issues/42247 @felipecrv Hi, just wonder during our testing, currently we don't support reading string_views from arrow, so maybe "cast" or something is required for testing. Would we have cast or there is some workaround?

felipecrv commented 2 months ago

@mapleFU here is my draft PR: https://github.com/apache/arrow/pull/43302

Blocked on dictionary casts. I will see if I can make it pass the tests and keep dictionary casts as a TODO.