apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] ByteArray Reader: Extend the current DictReader to support building a LargeBinary #41104

Open mapleFU opened 3 months ago

mapleFU commented 3 months ago

Describe the enhancement requested

Previously, a PR ( https://github.com/apache/arrow/pull/35825 ) showed that reading a large binary directly through the dictionary path is not supported.

When writing to Parquet, we don't allow a single ByteArray to exceed 2GB, so any single binary value is guaranteed to be smaller than 2GB.
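For context, the Parquet C++ value type for BYTE_ARRAY carries a 32-bit length. A simplified sketch of the struct (see parquet/types.h); the comments are mine, not from the header:

#include <cstdint>

// Simplified sketch of parquet::ByteArray: the length is a uint32_t, and the
// writer additionally rejects values over 2GB.
struct ByteArray {
  uint32_t len;        // value length in bytes
  const uint8_t* ptr;  // pointer into the page's value buffer
};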

The Parquet binary reader is split into two styles of API, shown below:

class BinaryRecordReader : virtual public RecordReader {
 public:
  virtual std::vector<std::shared_ptr<::arrow::Array>> GetBuilderChunks() = 0;
};

/// \brief Read records directly to dictionary-encoded Arrow form (int32
/// indices). Only valid for BYTE_ARRAY columns
class DictionaryRecordReader : virtual public RecordReader {
 public:
  virtual std::shared_ptr<::arrow::ChunkedArray> GetResult() = 0;
};

Neither of the APIs above supports reading "LargeBinary". However, the first API is able to split the output into multiple separate chunks: when a BinaryBuilder reaches 2GB, it rotates and switches to a new builder. The API below can then cast the resulting chunks to segments of LargeBinary:

Status TransferColumnData(RecordReader* reader, const std::shared_ptr<Field>& value_field,
                          const ColumnDescriptor* descr, MemoryPool* pool,
                          std::shared_ptr<ChunkedArray>* out) 
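To illustrate the idea, here is a minimal sketch (not the actual TransferColumnData implementation; CastChunksToLargeBinary and its argument are hypothetical):

#include <arrow/api.h>
#include <arrow/compute/cast.h>

// Sketch: widen the int32-offset binary chunks produced by
// BinaryRecordReader::GetBuilderChunks() into LargeBinary segments.
arrow::Result<std::shared_ptr<arrow::ChunkedArray>> CastChunksToLargeBinary(
    const std::vector<std::shared_ptr<arrow::Array>>& chunks) {
  arrow::ArrayVector large_chunks;
  large_chunks.reserve(chunks.size());
  for (const auto& chunk : chunks) {
    // binary -> large_binary only widens the offsets buffer; the value bytes
    // are shared, so no data copy should happen here.
    ARROW_ASSIGN_OR_RAISE(auto casted,
                          arrow::compute::Cast(*chunk, arrow::large_binary()));
    large_chunks.push_back(casted);
  }
  return std::make_shared<arrow::ChunkedArray>(std::move(large_chunks),
                                               arrow::large_binary());
}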

For the dictionary path, although the API returns a std::shared_ptr<::arrow::ChunkedArray>, only one dictionary builder is ever used. I think we can apply the same chunk-rotation approach to it.

Pros: we can support reading more than 2GB of data into a dictionary column.
Cons: values might be repeated across the different dictionary chunks; the user may need to call "Concat" on the result (see the sketch below).
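A minimal sketch of that user-side "Concat" step (MergeDictionaryChunks is a hypothetical helper; whether Concatenate unifies the per-chunk dictionaries depends on the Arrow version in use):

#include <arrow/api.h>
#include <arrow/array/concatenate.h>

// Sketch: merge the chunked dictionary result into one array. Each chunk may
// carry its own dictionary, so values can be repeated across chunks until
// they are unified by the concatenation.
arrow::Result<std::shared_ptr<arrow::Array>> MergeDictionaryChunks(
    const std::shared_ptr<arrow::ChunkedArray>& chunked) {
  return arrow::Concatenate(chunked->chunks(), arrow::default_memory_pool());
}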

Component(s)

C++, Parquet

mapleFU commented 3 months ago

cc @pitrou @jorisvandenbossche @felipecrv

Also cc @jp0317 as Dictionary Reader user

mapleFU commented 3 months ago

More complex is the nested case: for array<string>, the string / dictionary-encoded string chunks can be cast to the large type first and then concatenated into a single "large" dictionary array. This would cause a slight performance loss, and only when reading large binary data.
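A rough sketch of that cast-then-concat flow (WidenConcatAndReencode is a hypothetical helper, and the availability of these casts for dictionary inputs is an assumption):

#include <arrow/api.h>
#include <arrow/array/concatenate.h>
#include <arrow/compute/api.h>

// Sketch: decode/widen each chunk to large_utf8, concatenate, then
// dictionary-encode again to get a single "large" dictionary array.
arrow::Result<std::shared_ptr<arrow::Array>> WidenConcatAndReencode(
    const std::shared_ptr<arrow::ChunkedArray>& chunked) {
  arrow::ArrayVector widened;
  widened.reserve(chunked->num_chunks());
  for (const auto& chunk : chunked->chunks()) {
    // e.g. utf8 or dictionary<int32, utf8> -> large_utf8; if the dictionary
    // cast is not available, decode to utf8 first and widen afterwards.
    ARROW_ASSIGN_OR_RAISE(auto wide,
                          arrow::compute::Cast(*chunk, arrow::large_utf8()));
    widened.push_back(wide);
  }
  // Concatenating after widening avoids the int32 offset overflow that a
  // direct concat of the small-offset chunks could hit.
  ARROW_ASSIGN_OR_RAISE(
      auto merged, arrow::Concatenate(widened, arrow::default_memory_pool()));
  ARROW_ASSIGN_OR_RAISE(auto encoded, arrow::compute::DictionaryEncode(merged));
  return encoded.make_array();
}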

wgtmac commented 3 months ago

I can't remember the exact problem in that PR, but the messy part was in the Parquet decoder where it outputs to an Arrow array.

For the chunked-array solution, I think one challenge is how we deal with the higher-level APIs, like GetRecordBatchReader, where arrays must be concatenated into a single array.
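For reference, a sketch of that higher-level path (exact overloads differ between Arrow versions, so treat the signatures as illustrative only):

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

// Sketch: each RecordBatch column is a single ::arrow::Array, so chunked
// LargeBinary output from the low-level readers would have to be merged
// before it can be surfaced through this API.
arrow::Status ReadBatches(std::shared_ptr<arrow::io::RandomAccessFile> file) {
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      std::move(file), arrow::default_memory_pool(), &reader));
  std::unique_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader({0}, &batch_reader));
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;
    // batch->column(i) is a plain Array here, not a ChunkedArray.
  }
  return arrow::Status::OK();
}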

mapleFU commented 3 months ago

BinaryRecordReader already supports this. The user needs to pass a LargeBinary/LargeString type to read it.

DictionaryRecordReader only has a single builder, which cannot store more than 2GB of data. And ListReader concatenates the underlying ChunkedArray into a single Array, so I think we would need to test it, but in theory we can read this.

However, this proposal may introduce higher peak memory and more memcpy than the previous patch (because multiple "large" buffers need to be concatenated together during the concat).

felipecrv commented 3 months ago

However, this proposal may introduce higher peak memory and more memcpy than the previous patch (because multiple "large" buffers need to be concatenated together during the concat).

Better than not supporting 64-bit-length strings at all.

mapleFU commented 3 months ago

@pitrou What do you think of this? If it's ok I'd like to add this later this month

jonkeane commented 1 month ago

I'm doing a bit of cleanup before the release and noticed that this was here. @pitrou do you have any thoughts on the above?