apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] Utilities to estimate average (de)serialized row size #34712

Open wjones127 opened 1 year ago

wjones127 commented 1 year ago

Describe the enhancement requested

We often parameterize things by number of rows, but what we would often rather do is set the batch size in bytes. This comes up when reading/writing files or IPC streams. One solution would be to provide utilities to estimate the average row size. For example, the Velox project's file readers provide an estimatedRowSize() (although I'm not sure how often that is used):

https://github.com/facebookincubator/velox/blob/33c40fda3a7654891c506bf23d078c0da0cd4f0d/velox/dwio/common/Reader.h#L71

The interface for the Parquet reader might be something like:

/// \brief Provides the average bytes per row as in-memory Arrow data.
///
/// \param sample_rows: if true, read a sample of rows to estimate the average
/// size of variable-length columns
Result<int64_t> EstimateDeserializedRowSize(
    const parquet::arrow::FileReader& file,
    const std::vector<int64_t>& column_indices,
    bool sample_rows = false);

It would then be used something like this:

// Open the Parquet file (a parquet::arrow::FileReader; opening details elided)
std::unique_ptr<parquet::arrow::FileReader> file = /* ... */;
std::vector<int64_t> columns = {1, 3, 5};
ARROW_ASSIGN_OR_RAISE(int64_t row_size, EstimateDeserializedRowSize(*file, columns));

// Configure Arrow-specific Parquet reader settings
int64_t batch_size_bytes = 64 * 1024 * 1024;
auto arrow_reader_props = parquet::ArrowReaderProperties();
arrow_reader_props.set_batch_size(batch_size_bytes / row_size);

// Then use the properties to get a RecordBatchReader...
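
For that last step, the existing FileReaderBuilder and GetRecordBatchReader APIs could be used; only the row-size estimator above is the proposed addition. A rough sketch (reusing arrow_reader_props from above; exact overloads may vary by Arrow version):

// Build a FileReader with the tuned ArrowReaderProperties, then read the
// whole file through a RecordBatchReader whose batches target ~64 MiB.
parquet::arrow::FileReaderBuilder builder;
ARROW_RETURN_NOT_OK(builder.OpenFile("path/to/file"));
builder.properties(arrow_reader_props);

std::unique_ptr<parquet::arrow::FileReader> reader;
ARROW_RETURN_NOT_OK(builder.Build(&reader));

std::unique_ptr<arrow::RecordBatchReader> rb_reader;
ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader(&rb_reader));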

Similarly, when writing IPC we might want something like:

/// \brief Provides the average bytes per row in a serialized IPC message
Result<int64_t> EstimateSerializedRowSize(
    const RecordBatch& batch,
    const IpcWriteOptions& write_options);

So we can use this when writing to a Flight stream:

std::shared_ptr<Table> table = ...;

// Measure using the first batch of the table
TableBatchReader probe(*table);
std::shared_ptr<RecordBatch> first_batch;
ARROW_RETURN_NOT_OK(probe.ReadNext(&first_batch));
ARROW_ASSIGN_OR_RAISE(int64_t row_size,
                      EstimateSerializedRowSize(*first_batch, write_options));

int64_t batch_size_bytes = 10 * 1024 * 1024;
TableBatchReader reader(*table);
reader.set_chunksize(batch_size_bytes / row_size);

// Pass batches to DoPut
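
A hedged sketch of the "pass batches to DoPut" part, using the existing Flight client API (depending on the Arrow version, DoPut may instead have a Status-returning form); `client` (an arrow::flight::FlightClient) and `descriptor` are assumed to be set up elsewhere:

// Open a DoPut stream and forward the re-chunked batches from `reader` above.
ARROW_ASSIGN_OR_RAISE(auto stream,
                      client->DoPut(arrow::flight::FlightCallOptions{},
                                    descriptor, table->schema()));
std::shared_ptr<arrow::RecordBatch> batch;
while (true) {
  ARROW_RETURN_NOT_OK(reader.ReadNext(&batch));
  if (batch == nullptr) break;  // end of table
  ARROW_RETURN_NOT_OK(stream.writer->WriteRecordBatch(*batch));
}
ARROW_RETURN_NOT_OK(stream.writer->DoneWriting());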

Component(s)

C++

wjones127 commented 1 year ago

Alternatively, instead of returning a single value, it may make more sense to return a value per column. For columns where the size is fixed, it can always be present, and for variable-width ones it can be optional.

Note that this kind of metadata is really important for cost-based plan optimizers.
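
For illustration, a per-column variant might look something like this (purely hypothetical, following the same style as the sketch above):

// Hypothetical per-column result: fixed-width columns always carry an exact
// value; variable-width columns only carry one if rows were sampled.
struct ColumnRowSizeEstimate {
  int column_index;
  std::optional<double> avg_bytes_per_row;  // nullopt if unknown / not sampled
};

Result<std::vector<ColumnRowSizeEstimate>> EstimateDeserializedColumnSizes(
    const parquet::arrow::FileReader& file,
    const std::vector<int64_t>& column_indices,
    bool sample_rows = false);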

westonpace commented 1 year ago

I think the approach might be different for writing and for reading. For example, for writing, if you wanted your output batches to be a certain size (in bytes) then you need to either write the data twice (serialize once to measure, then slice to the target size) or guess a row count and correct it as you observe the actual serialized sizes.
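
A rough sketch of the measure-first option on the write side, using the existing arrow::ipc::GetRecordBatchSize helper (the function name ChooseChunkSize and the probe size are just for illustration):

#include <algorithm>
#include <memory>

#include <arrow/api.h>
#include <arrow/ipc/writer.h>

// Serialize-size a small probe batch, then derive rows-per-batch for the
// target byte size.
arrow::Status ChooseChunkSize(const std::shared_ptr<arrow::Table>& table,
                              int64_t target_bytes, int64_t* rows_per_batch) {
  arrow::TableBatchReader probe_reader(*table);
  probe_reader.set_chunksize(1024);  // arbitrary probe size
  std::shared_ptr<arrow::RecordBatch> probe;
  ARROW_RETURN_NOT_OK(probe_reader.ReadNext(&probe));
  if (probe == nullptr || probe->num_rows() == 0) {
    *rows_per_batch = 1024;  // empty table: any value works
    return arrow::Status::OK();
  }
  int64_t probe_bytes = 0;
  ARROW_RETURN_NOT_OK(arrow::ipc::GetRecordBatchSize(*probe, &probe_bytes));
  int64_t bytes_per_row = std::max<int64_t>(1, probe_bytes / probe->num_rows());
  *rows_per_batch = std::max<int64_t>(1, target_bytes / bytes_per_row);
  return arrow::Status::OK();
}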

However, when reading, your options are more limited. Typically you want to read a batch that has X bytes. You can't use the decoded & uncompressed size (unless that is written in the statistics / metadata somewhere). You can't read-twice in the same way you can write-twice. You are then left with guessing.

However, there is one other approach you can take when reading. Instead of asking your column decoder for X pages or X row groups worth of data, you can ask your column decoder for X bytes worth of data. The decoder can then advance through as many pages as it needs to deliver X bytes of data. This is a bit tricky because, if you are reading a batch, you might get a different number of rows from each decoder. However, that can be addressed as well.
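
To make the byte-budget idea concrete, a hypothetical decoder interface (not an existing Arrow API) might look like:

#include <memory>

#include <arrow/api.h>

// Hypothetical: a column decoder that advances through as many Parquet pages
// as needed to produce roughly `byte_budget` bytes of decoded Arrow data.
class ByteBudgetColumnDecoder {
 public:
  virtual ~ByteBudgetColumnDecoder() = default;

  // Returns the decoded chunk; chunk->length() may differ between columns,
  // so the caller trims every column to the smallest row count and carries
  // the remainder over to the next batch.
  virtual arrow::Result<std::shared_ptr<arrow::Array>> DecodeUpToBytes(
      int64_t byte_budget) = 0;
};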

wjones127 commented 1 year ago

Or, if this gets traction, we might not have to guess at all (for Parquet): https://lists.apache.org/thread/3sm9n6tgjxsb0k6j1b6dr2nv3zx68bjy

wgtmac commented 1 year ago

Alternatively, instead of returning a single value, it may make more sense to return a value per column. For columns where the size is fixed, it can always be present, and for variable-width ones it can be optional.

Note that this kind of metadata is really important for cost-based plan optimizers.

Just want to add that things are different depending on whether we are talking about in-memory size or on-disk raw size (decoded & decompressed), especially when there are substantial null values.

BTW, it is tricky to support this in the file formats themselves. We always have to deal with legacy files that do not have these metadata fields.
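
One way to see that gap with existing helpers is to compare the Parquet column chunk sizes from the footer metadata against the decoded Arrow buffer sizes (arrow::util::TotalBufferSize); just a sketch:

#include <arrow/table.h>
#include <arrow/util/byte_size.h>
#include <parquet/metadata.h>

// Sum of decompressed-but-still-encoded column chunk sizes from the Parquet
// footer metadata.
int64_t OnDiskUncompressedBytes(const parquet::FileMetaData& metadata) {
  int64_t total = 0;
  for (int rg = 0; rg < metadata.num_row_groups(); ++rg) {
    auto row_group = metadata.RowGroup(rg);
    for (int col = 0; col < row_group->num_columns(); ++col) {
      total += row_group->ColumnChunk(col)->total_uncompressed_size();
    }
  }
  return total;
}

// Size of the decoded Arrow buffers (includes validity bitmaps, offsets, etc.).
int64_t InMemoryBytes(const arrow::Table& table) {
  return arrow::util::TotalBufferSize(table);
}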

mapleFU commented 1 year ago

The utility would be great; however, it's a bit tricky. I've implemented a similar size hint in our system, and here are some problems I ran into:

  1. Null values. In an Arrow Array, null values still occupy space, but the raw field size does not account for them.
  2. Size of FLBA/ByteArray. Should its size be the sum of the variable-length data, or that sum plus sizeof(ByteArray) * value count?
  3. Sometimes the Arrow data size is not equal to the Parquet data size, e.g. Decimal stored as int32 or int64.

Hope that helps.
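
As a concrete illustration of points 1 and 2 above: for a utf8 column, the decoded Arrow size includes a validity bitmap and int32 offsets on top of the character data, which the raw data byte count does not reflect. A rough formula (assuming a non-large utf8 type and a single contiguous array; the helper name is just for illustration):

#include <cstdint>

#include <arrow/util/bit_util.h>

// Rough decoded size of a utf8 array: validity bitmap + (n + 1) int32 offsets
// + the character data itself. Nulls still cost bitmap and offset space.
int64_t EstimatedUtf8ArrayBytes(int64_t num_values, int64_t total_data_bytes,
                                bool has_nulls) {
  int64_t validity = has_nulls ? arrow::bit_util::BytesForBits(num_values) : 0;
  int64_t offsets = (num_values + 1) * static_cast<int64_t>(sizeof(int32_t));
  return validity + offsets + total_data_bytes;
}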