apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Decoding: allow Boolean RecordReader get raw LSB bitmap #39227

Open mapleFU opened 11 months ago

mapleFU commented 11 months ago

Describe the enhancement requested

Plain Boolean decoding is much slower than RLE Boolean decoding. This is because:

  1. PlainBooleanDecoder uses BitReader::GetBatch to decode bool values
  2. BitReader::GetBatch is optimized for 32-bit and 64-bit outputs via unpack32/unpack64
  3. When the output type is bool, however, the code falls back to the following logic:
  if (sizeof(T) == 4) {
    int num_unpacked =
        internal::unpack32(reinterpret_cast<const uint32_t*>(buffer + byte_offset),
                           reinterpret_cast<uint32_t*>(v + i), batch_size - i, num_bits);
    i += num_unpacked;
    byte_offset += num_unpacked * num_bits / 8;
  } else if (sizeof(T) == 8 && num_bits > 32) {
    // Use unpack64 only if num_bits is larger than 32
    // TODO (ARROW-13677): improve the performance of internal::unpack64
    // and remove the restriction of num_bits
    int num_unpacked =
        internal::unpack64(buffer + byte_offset, reinterpret_cast<uint64_t*>(v + i),
                           batch_size - i, num_bits);
    i += num_unpacked;
    byte_offset += num_unpacked * num_bits / 8;
  } else {
    // TODO: revisit this limit if necessary
    DCHECK_LE(num_bits, 32);
    const int buffer_size = 1024;
    uint32_t unpack_buffer[buffer_size];
    while (i < batch_size) {
      int unpack_size = std::min(buffer_size, batch_size - i);
      int num_unpacked =
          internal::unpack32(reinterpret_cast<const uint32_t*>(buffer + byte_offset),
                             unpack_buffer, unpack_size, num_bits);
      if (num_unpacked == 0) {
        break;
      }
      for (int k = 0; k < num_unpacked; ++k) {
#ifdef _MSC_VER
#pragma warning(push)
#pragma warning(disable : 4800)
#endif
        v[i + k] = static_cast<T>(unpack_buffer[k]);
#ifdef _MSC_VER
#pragma warning(pop)
#endif
      }
      i += num_unpacked;
      byte_offset += num_unpacked * num_bits / 8;
    }
  }

Maybe we can specialize the sizeof(T) == 1 case to optimize this?
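A sketch of what such a specialization could look like, as a standalone function (the name is hypothetical; Parquet plain booleans are bit-packed LSB-first, eight values per byte):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of a sizeof(T) == 1 fast path: expand `num_values`
// LSB-first bit-packed booleans from `packed` into one byte per value in
// `out`. Returns the number of values unpacked.
int UnpackBooleans(const uint8_t* packed, int num_values, uint8_t* out) {
  int i = 0;
  while (i < num_values) {
    const uint8_t byte = packed[i / 8];
    const int n = std::min(8, num_values - i);
    for (int k = 0; k < n; ++k) {
      // Bit k of the current byte holds value i + k (LSB-first).
      out[i + k] = (byte >> k) & 1;
    }
    i += n;
  }
  return i;
}
```

Because each iteration consumes a whole input byte, the inner loop is trivially unrollable and compilers can often auto-vectorize it.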

Component(s)

C++, Parquet

JacobOgle commented 10 months ago

Is anyone working on this? I wouldn't mind taking a crack at it

mapleFU commented 10 months ago

Just move forward!

ShaiviAgarwal2 commented 10 months ago

@mapleFU Is this issue resolved? If not, I want to contribute to it!!

JacobOgle commented 10 months ago

@ShaiviAgarwal2 still open. Go for it!

ShaiviAgarwal2 commented 10 months ago

@mapleFU @JacobOgle As far as I understand, we need to optimize the decoding of Boolean values in the Parquet C++ library. We can do this by checking whether the size of the type T is 1 and using a specialized decoding method in that case.

Here is a possible implementation we can try. It checks whether the type T is a Boolean and, if so, uses a more efficient decoding method, which should speed up Boolean decoding.

if (sizeof(T) == 1) {
    const uint8_t* bool_buffer = reinterpret_cast<const uint8_t*>(buffer + byte_offset);
    while (i < batch_size) {
        int unpack_size = std::min(8, batch_size - i);
        uint8_t unpack_byte = bool_buffer[i / 8];
        for (int k = 0; k < unpack_size; ++k) {
            // Plain booleans are bit-packed LSB-first, so bit k of the
            // current byte holds value i + k.
            v[i + k] = static_cast<T>((unpack_byte >> k) & 1);
        }
        i += unpack_size;
        byte_offset += unpack_size / 8;
    }
} else {
    // Existing code for other cases
    // ...
}

mapleFU commented 10 months ago

This looks OK to me. You can also run encoding_benchmark and the Boolean-related benchmarks to test your optimization.

ShaiviAgarwal2 commented 10 months ago

This looks OK to me. You can also run encoding_benchmark and the Boolean-related benchmarks to test your optimization.

@mapleFU Can we use the Google Benchmark library for this? Below I have used Google Benchmark to test the optimization; BooleanDecodingFunction() is a placeholder for the actual Boolean decoding function under test.

#include <benchmark/benchmark.h>

static void BM_BooleanDecoding(benchmark::State& state) {
    // Set up the data and context for the benchmark

    while (state.KeepRunning()) {
        // Call the boolean decoding function we're benchmarking
        BooleanDecodingFunction();
    }
}
// Register the benchmark
BENCHMARK(BM_BooleanDecoding);

// Run the benchmark
BENCHMARK_MAIN();
mapleFU commented 9 months ago

@ShaiviAgarwal2 we already have an encoding_benchmark.cc under src/parquet/.... Just test it there.

ShaiviAgarwal2 commented 9 months ago

@ShaiviAgarwal2 we already have an encoding_benchmark.cc under src/parquet/.... Just test it there.

Oh ok

mapleFU commented 8 months ago

@pitrou

Just curious: it seems that implementing unpack8 with pdep could yield some optimizations. However, searching the Arrow project, there seem to be two existing usages of BMI2:

  1. ARROW_HAVE_BMI2 is checked in src/parquet/level_conversion_inc.h, function ExtractBits
  2. ARROW_HAVE_RUNTIME_BMI2 is checked in src/arrow/compute/util_avx2.cc, function bits_to_indexes_imp_avx2

If I want a pdep, should I write a similar impl to ExtractBits in src/parquet/level_conversion_inc.h?

pitrou commented 8 months ago

Can you start with https://github.com/apache/arrow/issues/37623 ?

mapleFU commented 8 months ago

Aha, sure. I can add a POC impl of unpack8 and benchmark it first; if the performance improves, I can move forward.
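For num_bits == 1, the core of a pdep-based unpack8 is one deposit per input byte: _pdep_u64(byte, 0x0101010101010101) places bit i of the byte into bit 0 of output byte i. A portable SWAR sketch of the same operation (hypothetical name, not the actual POC):

```cpp
#include <cstdint>

// Portable equivalent of _pdep_u64(byte, 0x0101010101010101ULL): expand the
// eight bits of `byte` into eight 0/1 bytes, LSB-first.
uint64_t SpreadBitsToBytes(uint8_t byte) {
  // Replicate the byte into all eight lanes.
  const uint64_t replicated = byte * 0x0101010101010101ULL;
  // Keep only bit i in lane i (lane 0 keeps bit 0, ..., lane 7 keeps bit 7).
  const uint64_t selected = replicated & 0x8040201008040201ULL;
  // Normalize each nonzero lane to exactly 1: adding 0x7f sets bit 7 of a
  // lane iff the lane was nonzero, with no carry across lanes.
  return ((selected + 0x7f7f7f7f7f7f7f7fULL) >> 7) & 0x0101010101010101ULL;
}
```

One caveat: pdep/pext are microcoded and slow on AMD cores before Zen 3, so runtime dispatch matters for the hardware path.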

pitrou commented 8 months ago

I also don't understand 1) which method you are trying to speed up, and 2) how BMI2 is related to this.

pitrou commented 8 months ago

The answer here should be to stop using BitReader and rewrite PlainBooleanDecoder using something more efficient instead, such as CopyBitmap.
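The key property here is that CopyBitmap (declared in cpp/src/arrow/util/bitmap_ops.h) moves a range of bits between arbitrary bit offsets without materializing bytes. A simplified bit-at-a-time sketch of those semantics (the real implementation works word-at-a-time and is much faster):

```cpp
#include <cstdint>

// Simplified illustration of CopyBitmap semantics: copy `length` bits from
// `src` starting at bit `src_offset` into `dest` starting at bit
// `dest_offset`. LSB-first bit numbering, matching both Arrow validity
// bitmaps and Parquet plain booleans.
void CopyBitmapNaive(const uint8_t* src, int64_t src_offset, int64_t length,
                     uint8_t* dest, int64_t dest_offset) {
  for (int64_t i = 0; i < length; ++i) {
    const int64_t s = src_offset + i;
    const int64_t d = dest_offset + i;
    const uint8_t bit = (src[s / 8] >> (s % 8)) & 1;
    if (bit) {
      dest[d / 8] |= static_cast<uint8_t>(1 << (d % 8));
    } else {
      dest[d / 8] &= static_cast<uint8_t>(~(1 << (d % 8)));
    }
  }
}
```

For a REQUIRED Boolean column, plain decoding would then reduce to one such copy from the data page straight into the Arrow buffer.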

mapleFU commented 8 months ago

@pitrou Yeah, I understand what you mean. I got it wrong earlier; a bool type is used here: https://github.com/apache/arrow/blob/main/cpp/src/parquet/arrow/reader_internal.cc#L361

I tested BMI2-based unpack8 on a Ryzen, but it was very slow. I guess that's because I'm benchmarking on Zen 2, where pdep is microcoded ( https://stackoverflow.com/questions/76428057/do-all-cpus-that-support-avx2-also-support-bmi2-or-popcnt ). I'll switch to an Intel CPU for testing :-(

pitrou commented 8 months ago

Again, instead of obsessing over SIMD optimizations, we should transfer the bitmap directly from encoded Parquet to Arrow boolean data, without going through some C++ bools.

mapleFU commented 8 months ago

You're right, let's shift to this. (Besides, our project uses 1 byte to store booleans, so an efficient unpack would still help.)

If we skip C++ bools, I'm OK with that, but I'm afraid this might break the RecordReader interface (it expects typed records as output).

mapleFU commented 8 months ago

FYI, I've tested it on an Intel Xeon:

BMI2 unpack8:

Load Average: 0.87, 3.85, 3.34
----------------------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------
BM_PlainDecodingBoolean/1024         142 ns          141 ns      4948292 bytes_per_second=6.74001Gi/s
BM_PlainDecodingBoolean/4096         343 ns          343 ns      2037558 bytes_per_second=11.1335Gi/s
BM_PlainDecodingBoolean/32768       2090 ns         2090 ns       334798 bytes_per_second=14.5993Gi/s
BM_PlainDecodingBoolean/65536       4120 ns         4120 ns       172542 bytes_per_second=14.8152Gi/s

Current code:

----------------------------------------------------------------------------------------
Benchmark                              Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------
BM_PlainDecodingBoolean/1024         575 ns          575 ns      1220486 bytes_per_second=1.65757Gi/s
BM_PlainDecodingBoolean/4096        2056 ns         2056 ns       340757 bytes_per_second=1.85583Gi/s
BM_PlainDecodingBoolean/32768      15953 ns        15952 ns        43986 bytes_per_second=1.91309Gi/s
BM_PlainDecodingBoolean/65536      31741 ns        31739 ns        21954 bytes_per_second=1.923Gi/s
pitrou commented 8 months ago

If this is on a REQUIRED column, then CopyBitmap would be much faster.

mapleFU commented 8 months ago

If this is on a REQUIRED column, then CopyBitmap would be much faster.

Yeah, I know. Both optimizations could be applied.

If we skip C++ bools, I'm OK with that, but I'm afraid this might break the RecordReader interface (it expects typed records as output).

What do you think about this?

pitrou commented 8 months ago

If we skip C++ bools, I'm OK with that, but I'm afraid this might break the RecordReader interface (it expects typed records as output).

What do you think about this?

You are right, we should instead add a new RecordReader method to return bit-packed output.

mapleFU commented 7 months ago

I finally have time for this. I've tested BMI2 unpack8 on an Intel CPU, and it is about 10 times faster than the current impl.

I'll first investigate whether we can implement unpack16 with AVX2 / BMI2, since levels are stored as int16_t. If all of these operations can be optimized, I'll start implementing software extractBits and depositBits and optimizing unpack.

I'll also investigate auto-vectorization for this problem; hopefully it works.

pitrou commented 7 months ago

The end goal should ideally be to optimize Parquet to Arrow boolean conversion. So please make sure you address that use case :-)

mapleFU commented 7 months ago

Sigh, personally I think an extra bit-level RecordReader is quite hacky. But anyway, I will try a POC.

pitrou commented 7 months ago

I mean an extra RecordReader method.

pitrou commented 7 months ago

It's quite silly that the RecordReader unpacks a bitmap to a range of bool values, only to have the Arrow Parquet reader pack the bitmap again.

mapleFU commented 7 months ago

Sure, I'll split this into separate issues:

  1. Use a dedicated method for decoding booleans
  2. Investigate whether we can accelerate decoding int16_t with these techniques

mapleFU commented 7 months ago

take

mapleFU commented 7 months ago

I created a separate issue for unpack16: https://github.com/apache/arrow/issues/40845

mapleFU commented 7 months ago
class BooleanDecoder : virtual public TypedDecoder<BooleanType> {
 public:
  using TypedDecoder<BooleanType>::Decode;

  /// \brief Decode and bit-pack values into a buffer
  ///
  /// \param[out] buffer destination for the decoded, bit-packed values.
  /// \param[in] max_values maximum number of values to decode.
  /// \return The number of values decoded. Should be identical to max_values
  /// except at the end of the current data page.
  virtual int Decode(uint8_t* buffer, int max_values) = 0;
};

RecordReader might read un-aligned batches here 🤔, so we may need to re-copy the bitmap in the record reader. @pitrou we have a few options:

  1. Keep the current interface and add "aligned" handling in RecordReader:

if (values_written_ % 8 != 0) {
  // handle unaligned decoding
}
// Decode the aligned remainder
this->current_decoder_->Decode()

  2. Enhance the Decode interface:

/// Candidate 1:
  virtual int DecodeArrow(int max_values, ::arrow::BooleanBuilder*) = 0;

/// Candidate 2:
  virtual int DecodeBitmap(int max_values, int begin, uint8_t* lsb_bitmap) = 0;

@pitrou @wgtmac what do you think of this?
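To make option 1 concrete, here is a hedged sketch of the unaligned handling (names and callback signatures are hypothetical: decode_one stands in for pulling a single value from the decoder, and decode_packed for a bit-packed Decode(uint8_t*, int) that writes starting at a byte boundary):

```cpp
#include <cstdint>
#include <functional>

// Hypothetical sketch of option 1 in a RecordReader-style read path.
// `values_written` is the number of bits already present in `bitmap`.
// Returns the new number of values written.
int64_t ReadBooleanBits(uint8_t* bitmap, int64_t values_written, int64_t to_read,
                        const std::function<bool()>& decode_one,
                        const std::function<int(uint8_t*, int)>& decode_packed) {
  // Top up the current partial byte one value at a time.
  while (values_written % 8 != 0 && to_read > 0) {
    if (decode_one()) {
      bitmap[values_written / 8] |= static_cast<uint8_t>(1 << (values_written % 8));
    }
    ++values_written;
    --to_read;
  }
  // Now byte-aligned: the decoder can write bit-packed output directly.
  if (to_read > 0) {
    values_written +=
        decode_packed(bitmap + values_written / 8, static_cast<int>(to_read));
  }
  return values_written;
}
```

The DecodeBitmap candidate with its `begin` parameter would instead let the decoder absorb the unaligned prefix itself.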

mapleFU commented 7 months ago
int DecodeArrow(int num_values, int null_count, const uint8_t* valid_bits,
                  int64_t valid_bits_offset,
                  typename EncodingTraits<BooleanType>::Accumulator* out) override

Sigh, there is also a DecodeArrow. So many confusing interfaces...

wgtmac commented 7 months ago

Catching up on the discussion. I'm inclined to deprecate the bool* interface and optimize mainly for reading into an Arrow bitmap. The uint8_t* variant looks like a separate topic.

mapleFU commented 7 months ago

I've decided to use DecodeArrow to implement this first. I'll create a separate issue for that, then add a new RecordReader and a flag.