Open mapleFU opened 3 months ago
cc @pitrou @jorisvandenbossche @felipecrv
Also cc @jp0317 as Dictionary Reader user
More complexly, for array<string>
, the string / dictionary-of-string chunks can be cast to "large" types first and then concatenated into a single "large" dictionary array. This would cause only a slight performance loss, and only when reading large binary data.
I can't remember the exact problem in that PR, but the messy part was in the Parquet decoder where it outputs to an Arrow array.
For the chunked-array solution, I think one challenge is how we deal with higher-level APIs like GetRecordBatchReader
, where arrays must be concatenated into a single array.
BinaryRecordReader
already supports this: the user needs to pass a LargeBinary/LargeString type to read it.
DictionaryRecordReader
will only have a single builder, which cannot store more than 2GB of data. And ListReader
will concatenate the underlying ChunkedArray
into a single Array
, so I think we need to test it, but in theory we can read this.
However, this proposal may introduce higher peak memory usage and more memcpy than the previous patch (because during concat, multiple "large" buffers need to be concatenated together).
Better than not supporting 64-bit-length strings at all.
@pitrou What do you think of this? If it's ok I'd like to add this later this month
I'm doing a bit of cleanup before the release and noticed that this was here. @pitrou do you have any thoughts on the above?
Describe the enhancement requested
Previously, an issue ( https://github.com/apache/arrow/pull/35825 ) showed that directly reading large binary via dictionary is not supported.
When writing to Parquet, we don't allow a single ByteArray to exceed 2GB, so any single binary value will be smaller than 2GB.
The Parquet binary reader is separated into two styles of API, as shown below:
Neither of these APIs supports reading "LargeBinary"; however, the first API is able to split the string data into multiple separate chunks. When a
BinaryBuilder
reaches 2GB, it rotates and switches to a new binary chunk; the resulting chunks can then be cast to segments of large binary. For
Dictionary
, although the API returns a std::shared_ptr<::arrow::ChunkedArray>
, only one dictionary builder is used. I think we can apply the same approach to it. Pros: we can support reading more than 2GB of data into a dictionary column. Cons: data might be repeated among the different dictionary chunks; maybe the user should call "Concat" on the result.
Component(s)
C++, Parquet