Open mapleFU opened 11 months ago
Is anyone working on this? I wouldn't mind taking a crack at it
Just move forward!
@mapleFU Is this issue resolved? If not, I want to contribute to it!!
@ShaiviAgarwal2 still open. Go for it!
@mapleFU @JacobOgle As far as I understand, we need to optimize the decoding of boolean values in the Parquet C++ library. We can do this by adding a condition that checks whether the size of the type T is 1 and, if so, using a specialized decoding method.
Here is a possible sketch we could try. It checks whether the type T is a boolean and, if it is, uses a more efficient byte-at-a-time decoding loop, which should speed up boolean decoding.
if (sizeof(T) == 1) {
  // Parquet PLAIN booleans are bit-packed LSB-first: bit k of each byte
  // holds the k-th value within that byte.
  const uint8_t* bool_buffer = reinterpret_cast<const uint8_t*>(buffer + byte_offset);
  while (i < batch_size) {
    int unpack_size = std::min(8, batch_size - i);
    uint8_t unpack_byte = bool_buffer[i / 8];
    for (int k = 0; k < unpack_size; ++k) {
      // was (unpack_byte >> (7 - (i % 8))), which re-read the same bit for every k
      v[i + k] = static_cast<T>((unpack_byte >> k) & 1);
    }
    i += unpack_size;
  }
  // advance past all consumed bytes, including a partial trailing byte
  byte_offset += (batch_size + 7) / 8;
} else {
  // Existing code for other cases
  // ...
}
This looks ok to me. You can also run encoding_benchmark and the boolean-related benchmarks to test your optimization.
@mapleFU Can we use the Google Benchmark library for this? Below I have used Google Benchmark to test the optimization. Here, we can replace BooleanDecodingFunction() with the actual boolean decoding function we are trying to test.
#include <benchmark/benchmark.h>

static void BM_BooleanDecoding(benchmark::State& state) {
  // Set up the data and context for the benchmark
  for (auto _ : state) {
    // Call the boolean decoding function we're benchmarking
    BooleanDecodingFunction();
  }
}

// Register the benchmark
BENCHMARK(BM_BooleanDecoding);
// Run the benchmark
BENCHMARK_MAIN();
@ShaiviAgarwal2 we already have an encoding_benchmark.cc under src/parquet/.... Just test it there.
Oh ok
@pitrou
Just curious: it seems that implementing unpack8 with pdep could yield some optimization. However, searching the arrow project, there are only two usages of BMI2:
src/parquet/level_conversion_inc.h, function ExtractBits
src/arrow/compute/util_avx2.cc, function bits_to_indexes_imp_avx2
If I want a pdep, should I write a similar impl to ExtractBits in src/parquet/level_conversion_inc.h?
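For reference, the pdep idea here is that `_pdep_u64(byte, 0x0101010101010101ULL)` deposits each of the 8 packed bits into the low bit of one output byte. Below is a minimal, portable sketch of the same unpack8 semantics; the BMI2 intrinsic is only shown in a comment since it needs hardware support, and the function name is illustrative rather than anything in the Arrow tree:

```cpp
#include <cstdint>
#include <cstring>

// Unpack 8 LSB-first packed booleans into 8 bytes (one value per byte).
// With BMI2 the whole loop collapses to a single instruction:
//   uint64_t expanded = _pdep_u64(packed, 0x0101010101010101ULL);
// A portable fallback is shown here so it runs on any CPU.
inline void unpack8(uint8_t packed, uint8_t out[8]) {
  uint64_t expanded = 0;
  for (int i = 0; i < 8; ++i) {
    expanded |= static_cast<uint64_t>((packed >> i) & 1) << (8 * i);
  }
  std::memcpy(out, &expanded, 8);  // assumes little-endian byte order
}
```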
Can you start with https://github.com/apache/arrow/issues/37623 ?
Aha, sure. I can add a PoC impl for unpack8 and benchmark it first; if performance improves, I can move forward.
I also don't understand 1) which method you are trying to speed up, 2) how BMI2 is related to this.
The answer here should be to stop using BitReader and rewrite PlainBooleanDecoder using something more efficient instead, such as CopyBitmap.
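For readers unfamiliar with CopyBitmap: Arrow's helper copies a range of bits between possibly unaligned bitmaps, so Parquet's packed PLAIN booleans can be moved straight into an Arrow value bitmap without a bool detour. A simplified bit-by-bit sketch of the semantics follows; it is not the optimized Arrow implementation, and all names here are illustrative:

```cpp
#include <cstdint>

// Read bit i of an LSB-first bitmap.
inline bool GetBit(const uint8_t* bits, int64_t i) {
  return (bits[i / 8] >> (i % 8)) & 1;
}

// Write bit i of an LSB-first bitmap.
inline void SetBitTo(uint8_t* bits, int64_t i, bool v) {
  if (v) bits[i / 8] |= static_cast<uint8_t>(1) << (i % 8);
  else   bits[i / 8] &= ~(static_cast<uint8_t>(1) << (i % 8));
}

// Copy `length` bits from src (starting at src_offset) into dest
// (starting at dest_offset); neither offset needs to be byte-aligned.
void CopyBitmapSketch(const uint8_t* src, int64_t src_offset, int64_t length,
                      uint8_t* dest, int64_t dest_offset) {
  for (int64_t i = 0; i < length; ++i) {
    SetBitTo(dest, dest_offset + i, GetBit(src, src_offset + i));
  }
}
```

The real implementation processes whole words at a time once the offsets are reconciled, which is where the speedup over per-value decoding comes from.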
@pitrou Yeah, I understand your meaning. Previously I got it wrong; it uses the bool type here: https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader_internal.cc#L361
I tested a BMI2-based unpack8 on a Ryzen but it was very slow; I guess this is because I'm benchmarking on Zen 2 ( https://stackoverflow.com/questions/76428057/do-all-cpus-that-support-avx2-also-support-bmi2-or-popcnt ). I'll switch to an Intel CPU for testing :-(
Again, instead of obsessing over SIMD optimizations, we should transfer the bitmap directly from encoded Parquet to Arrow boolean data, without going through some C++ bools.
You're right, let's shift to this. (Besides, our project uses 1 byte to store booleans, so I still hope an efficient Unpack would help.)
If shifting without C++ bools, I'm ok with this, but I'm afraid this might break the RecordReader interface (it expects typed records as output).
FYI, I've tested it with an Intel Xeon:
BMI2 unpack8:
Load Average: 0.87, 3.85, 3.34
----------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------
BM_PlainDecodingBoolean/1024 142 ns 141 ns 4948292 bytes_per_second=6.74001Gi/s
BM_PlainDecodingBoolean/4096 343 ns 343 ns 2037558 bytes_per_second=11.1335Gi/s
BM_PlainDecodingBoolean/32768 2090 ns 2090 ns 334798 bytes_per_second=14.5993Gi/s
BM_PlainDecodingBoolean/65536 4120 ns 4120 ns 172542 bytes_per_second=14.8152Gi/s
Current code:
----------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------
BM_PlainDecodingBoolean/1024 575 ns 575 ns 1220486 bytes_per_second=1.65757Gi/s
BM_PlainDecodingBoolean/4096 2056 ns 2056 ns 340757 bytes_per_second=1.85583Gi/s
BM_PlainDecodingBoolean/32768 15953 ns 15952 ns 43986 bytes_per_second=1.91309Gi/s
BM_PlainDecodingBoolean/65536 31741 ns 31739 ns 21954 bytes_per_second=1.923Gi/s
If this is on a REQUIRED column, then CopyBitmap would be much faster.
Yeah, I know. Both optimizations would be applied.
If shifting without C++ bools, I'm ok with this, but I'm afraid this might break the RecordReader interface (it expects typed records as output)
What do you think about this?
You are right, we should instead add a new RecordReader method to return bit-packed output.
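A hedged sketch of what such a bit-packed output path could look like; the class and method names below are purely illustrative, not the actual Arrow RecordReader API. Values are appended LSB-first into a packed buffer, so the Arrow reader could adopt the buffer without re-packing:

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch only: a record reader that accumulates boolean
// values bit-packed (LSB-first), instead of one C++ bool per value.
class PackedBoolRecordReaderSketch {
 public:
  void AppendValue(bool v) {
    if (bit_index_ % 8 == 0) packed_.push_back(0);  // start a new byte
    if (v) packed_.back() |= static_cast<uint8_t>(1) << (bit_index_ % 8);
    ++bit_index_;
  }

  // Bit-packed values, ready to hand to an Arrow boolean array builder.
  const std::vector<uint8_t>& packed_values() const { return packed_; }
  int64_t values_written() const { return bit_index_; }

 private:
  std::vector<uint8_t> packed_;
  int64_t bit_index_ = 0;
};
```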
I finally have time for this. I've tested BMI2 unpack8 on an Intel CPU, and it is 10 times faster than the current impl.
I'll first investigate whether we can implement unpack16 with AVX2 / BMI2, since levels are used as int16_t. If all these operations have optimizations, I'll start implementing software extractBits and depositBits and optimizing unpack.
I'll also investigate auto-vectorization for this problem; hope it works.
The end goal should ideally be to optimize Parquet to Arrow boolean conversion. So please make sure you address that use case :-)
Sigh, personally I think a bit-level extra RecordReader is hacky. But anyway I will try a PoC.
I mean an extra RecordReader method.
It's quite silly that the RecordReader unpacks a bitmap to a range of bool values, only to have the Arrow Parquet reader pack the bitmap again.
Sure, I'll separate the issue:
take
I created a separate issue for unpack16: https://github.com/apache/arrow/issues/40845
class BooleanDecoder : virtual public TypedDecoder<BooleanType> {
 public:
  using TypedDecoder<BooleanType>::Decode;

  /// \brief Decode and bit-pack values into a buffer
  ///
  /// \param[in] buffer destination for decoded values
  /// This buffer will contain bit-packed values.
  /// \param[in] max_values max values to decode.
  /// \return The number of values decoded. Should be identical to max_values except
  /// at the end of the current data page.
  virtual int Decode(uint8_t* buffer, int max_values) = 0;
};
RecordReader might read an un-aligned batch here 🤔, so maybe we need to re-copy the bitmap in the record reader. @pitrou we have 3 ways: handle the unaligned case inside the record reader:

if (values_written_ % 8 != 0) {
  // handle unaligned decoding
}
// Decode aligned
this->current_decoder_->Decode()

or change the Decode interface:

/// Candidate 1:
virtual int DecodeArrow(int max_values, ::arrow::BooleanBuilder*) = 0;
/// Candidate 2:
virtual int DecodeBitmap(int max_values, int begin, uint8_t* lsb_bitmap) = 0;
@pitrou @wgtmac what do you think of this?
int DecodeArrow(int num_values, int null_count, const uint8_t* valid_bits,
int64_t valid_bits_offset,
typename EncodingTraits<BooleanType>::Accumulator* out) override
Sigh, there is also a DecodeArrow. So many confusing interfaces...
Catching up on the discussion. I'm inclined to deprecate the bool* interface and optimize mainly for reading into an arrow bitmap. The uint8_t* variant looks like a separate topic.
I've decided to use DecodeArrow to implement this first. I'll create a separate issue for that, then create a new RecordReader and a flag.
Describe the enhancement requested
Plain boolean decoding is far slower than RLE boolean decoding. This is because:
PlainBooleanDecoder uses BitReader::GetBatch to decode bool.
BitReader::GetBatch is optimized for 32-bit and 64-bit input with unpack32/unpack64; for bool, the code falls back to the slower generic logic.
Maybe we can specialize the case with sizeof(T) == 1 to optimize this?
Component(s)
C++, Parquet
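To make the gap concrete, the generic path extracts one value per bit, while a sizeof(T) == 1 specialization can expand a whole byte of 8 booleans per iteration. A minimal self-contained comparison of the two loops; the names are illustrative of BitReader::GetBatch's behavior, not the actual Arrow code:

```cpp
#include <cstdint>

// Generic path: one bit-extraction per value (what bool falls back to).
int GetBatchGenericSketch(const uint8_t* bits, int num_values, bool* out) {
  for (int i = 0; i < num_values; ++i) {
    out[i] = (bits[i / 8] >> (i % 8)) & 1;  // LSB-first, one bit at a time
  }
  return num_values;
}

// Specialized path: unpack a full byte of 8 booleans per iteration,
// then handle the unaligned tail bit by bit.
int GetBatchBoolSketch(const uint8_t* bits, int num_values, bool* out) {
  int i = 0;
  for (; i + 8 <= num_values; i += 8) {
    uint8_t byte = bits[i / 8];
    for (int k = 0; k < 8; ++k) out[i + k] = (byte >> k) & 1;
  }
  for (; i < num_values; ++i) out[i] = (bits[i / 8] >> (i % 8)) & 1;
  return num_values;
}
```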