Open wjones127 opened 1 year ago
Alternatively, instead of returning a single value, it may make more sense to return the value per column. For columns where the size is fixed, the value can always be present, and for variable-width ones it can be optional.
Note that this kind of metadata is really important for cost-based plan optimizers.
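A minimal sketch of what that per-column shape could look like. All names here are invented for illustration; the only assumptions are the ones stated above (fixed-width columns always report a size, variable-width ones optionally):

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical per-column size report: fixed-width columns always carry a
// byte width; variable-width columns only when statistics allow an estimate.
struct ColumnSizeEstimate {
  bool fixed_width;                      // true for e.g. int64, float64
  std::optional<int64_t> bytes_per_row;  // always set when fixed_width
};

// Combine per-column estimates into a whole-row estimate, when possible.
// Returns nullopt if any variable-width column has no estimate.
std::optional<int64_t> EstimateRowBytes(
    const std::map<std::string, ColumnSizeEstimate>& columns) {
  int64_t total = 0;
  for (const auto& kv : columns) {
    if (!kv.second.bytes_per_row) return std::nullopt;  // unknown column
    total += *kv.second.bytes_per_row;
  }
  return total;
}
```

An optimizer could consume the per-column values directly, while batch-sizing code would typically only need the combined row estimate.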
I think the approach might be different for writing and for reading. For example, for writing, if you wanted your output batches to be a certain size (in bytes), then you need to either track the size of the encoded data as you buffer rows, or write the data twice (once to measure its size, once to emit batches of the right size).
However, when reading, your options are more limited. Typically you want to read a batch that has X bytes. You can't use the decoded & uncompressed size (unless that is written in the statistics / metadata somewhere). You can't read-twice in the same way you can write-twice. You are then left with guessing.
However, there is one other approach you can take when reading. Instead of asking your column decoder for X pages or X row groups worth of data, you can ask it for X bytes worth of data. The decoder can then advance through as many pages as it needs to deliver X bytes of data. This is a bit tricky because, if you are reading a batch, you might get a different number of rows from each decoder. However, that can be addressed as well.
Or if this gets traction, we might not have to guess at all (for Parquet): https://lists.apache.org/thread/3sm9n6tgjxsb0k6j1b6dr2nv3zx68bjy
Just want to add that things are different depending on whether we are talking about the in-memory size or the on-disk raw size (decoded & decompressed), especially when there are substantial null values.
BTW, it is tricky for file formats to support this. We always have to deal with legacy files that do not have these metadata fields.
The utility would be great; however, it's a bit tricky. I've implemented a similar size hint in our system; here are some problems I met:
Hope that helps.
Describe the enhancement requested
We often parameterize things by number of rows, but frequently we would rather set the batch size in bytes. This is often the case when reading/writing files or IPC streams. One solution would be to provide utilities to estimate the average row size. For example, the Velox project's file readers provide an estimatedRowSize() (although I'm not sure how often that is used): https://github.com/facebookincubator/velox/blob/33c40fda3a7654891c506bf23d078c0da0cd4f0d/velox/dwio/common/Reader.h#L71
This interface for the Parquet reader might be something like:
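A hedged sketch of such an interface, since the original snippet is not preserved here. The class and method names are assumptions modeled on the Velox example above; this is not the actual parquet::arrow::FileReader API. The toy implementation derives the estimate from per-column uncompressed sizes, as a Parquet reader could from row group metadata:

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Invented stand-in for a Parquet file reader exposing a row-size estimate.
class FileReader {
 public:
  FileReader(std::vector<int64_t> column_uncompressed_bytes, int64_t num_rows)
      : column_bytes_(std::move(column_uncompressed_bytes)),
        num_rows_(num_rows) {}

  // Average decoded bytes per row, derived from file metadata; nullopt when
  // the file has no rows or lacks the statistics needed for an estimate.
  std::optional<int64_t> EstimatedRowSize() const {
    if (num_rows_ <= 0 || column_bytes_.empty()) return std::nullopt;
    int64_t total = 0;
    for (int64_t b : column_bytes_) total += b;
    return total / num_rows_;
  }

 private:
  std::vector<int64_t> column_bytes_;
  int64_t num_rows_;
};
```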
Then it would be used something like:
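A sketch of the caller's side, assuming some `estimated_row_bytes` value obtained from the reader (the function name and fallback value below are invented for illustration):

```cpp
#include <algorithm>
#include <cstdint>

// Translate a byte target into a row-count batch size using a row-size
// estimate; fall back to a fixed row count when no estimate is available.
int64_t RowsForByteTarget(int64_t target_bytes, int64_t estimated_row_bytes,
                          int64_t fallback_rows = 1024) {
  if (estimated_row_bytes <= 0) return fallback_rows;  // no estimate: guess
  return std::max<int64_t>(1, target_bytes / estimated_row_bytes);
}
```

The computed row count would then be passed wherever the reader currently takes a `batch_size` in rows.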
Similarly, when writing IPC we might want something like:
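Since the original snippet is not preserved, here is one plausible shape for the writer side: a helper that estimates the serialized IPC size of a pending batch before writing it. The struct, function name, and flat overhead constant are all invented assumptions:

```cpp
#include <cstdint>

// Toy description of a batch about to be serialized.
struct BatchDescription {
  int64_t num_rows;
  int64_t avg_row_bytes;
};

// Estimate the serialized IPC message size so batches can be split or
// coalesced to hit a byte target before writing.
int64_t EstimatedSerializedSize(const BatchDescription& batch) {
  const int64_t kIpcOverheadBytes = 256;  // invented flat metadata overhead
  return batch.num_rows * batch.avg_row_bytes + kIpcOverheadBytes;
}
```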
So we can use this when writing to a Flight stream:
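For the Flight case, the motivating constraint is that each stream message must fit under the transport's per-message cap (gRPC defaults to roughly 4 MiB). A minimal sketch of slicing a large batch into byte-bounded row ranges, with all names invented:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Split `total_rows` into slice lengths whose estimated serialized size
// stays under `max_message_bytes`, given a per-row byte estimate.
std::vector<int64_t> SliceLengths(int64_t total_rows, int64_t est_row_bytes,
                                  int64_t max_message_bytes) {
  int64_t rows_per_slice = max_message_bytes / est_row_bytes;
  if (rows_per_slice < 1) rows_per_slice = 1;  // never emit an empty slice
  std::vector<int64_t> lengths;
  for (int64_t start = 0; start < total_rows; start += rows_per_slice) {
    lengths.push_back(std::min(rows_per_slice, total_rows - start));
  }
  return lengths;
}
```

Each slice would then be written to the Flight stream as its own message, e.g. via RecordBatch::Slice on the real API.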
Component(s)
C++