apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.37k stars 3.5k forks source link

[C++] Compute: Why compute::internal::BooleanKeyEncoder doesn't handle !valid scalar as not-null? #43733

Closed mapleFU closed 1 month ago

mapleFU commented 1 month ago

Describe the enhancement requested

I'm current learning the code of compute::RowEncoder. I found that in FixedWidthKeyEncoder, Scalar detects is_valid:

Status FixedWidthKeyEncoder::Encode(const ExecValue& data, int64_t batch_length,
                                    uint8_t** encoded_bytes) {
  if (data.is_array()) {
    // ...
  } else {
    const auto& scalar = data.scalar_as<arrow::internal::PrimitiveScalarBase>();
    if (scalar.is_valid) { // <-- is valid is checked here..
      const std::string_view data = scalar.view();
      DCHECK_EQ(data.size(), static_cast<size_t>(byte_width_));
      for (int64_t i = 0; i < batch_length; i++) {
        auto& encoded_ptr = *encoded_bytes++;
        *encoded_ptr++ = kValidByte;
        memcpy(encoded_ptr, data.data(), data.size());
        encoded_ptr += byte_width_;
      }
    } else {
      for (int64_t i = 0; i < batch_length; i++) {
        auto& encoded_ptr = *encoded_bytes++;
        *encoded_ptr++ = kNullByte;
        memset(encoded_ptr, 0, byte_width_);
        encoded_ptr += byte_width_;
      }
    }
  }
  return Status::OK();
}

However, in BooleanKeyEncoder, this is not handled:

Status BooleanKeyEncoder::Encode(const ExecValue& data, int64_t batch_length,
                                 uint8_t** encoded_bytes) {
  if (data.is_array()) {
    // ..
  } else {
    const auto& scalar = data.scalar_as<BooleanScalar>();
    bool value = scalar.is_valid && scalar.value;  <-- boolean is handled in this way
    for (int64_t i = 0; i < batch_length; i++) {
      auto& encoded_ptr = *encoded_bytes++;
      *encoded_ptr++ = kValidByte;
      *encoded_ptr++ = value;
    }
  }
  return Status::OK();
}

Is this expected?

Component(s)

C++

mapleFU commented 1 month ago

@felipecrv @zanmato1984 would you mind take a look? I'm a bit confused here

zanmato1984 commented 1 month ago

You mean a null boolean scalar will be encoded as a valid false?

If so, this looks like a bug to me.

mapleFU commented 1 month ago

You mean a null boolean scalar will be encoded as a valid false?

Yes, seems like it is :-(

felipecrv commented 1 month ago

In SQL, NULL (when in a boolean context) is considered falsy.

D SELECT CASE WHEN NULL THEN 'N' ELSE 'ELSE' END;
┌──────────────────────────────────────────────┐
│ CASE  WHEN (NULL) THEN ('N') ELSE 'ELSE' END │
│                   varchar                    │
├──────────────────────────────────────────────┤
│ ELSE                                         │
└──────────────────────────────────────────────┘

But in grouping (where the row encoded seems to be used), the null is its own value:

D select * from test1;
┌───────┬─────────┐
│   a   │    b    │
│ int32 │ boolean │
├───────┼─────────┤
│     1 │         │
│     2 │ true    │
│     3 │ false   │
└───────┴─────────┘
D select b, count(*) from test1 group by b;
┌─────────┬──────────────┐
│    b    │ count_star() │
│ boolean │    int64     │
├─────────┼──────────────┤
│ false   │            1 │
│ true    │            1 │
│         │            1 │
└─────────┴──────────────┘

So yes, looks like a bug 🤔

mapleFU commented 1 month ago

Aha, I've tried to fix that

pitrou commented 1 month ago

Issue resolved by pull request 43734 https://github.com/apache/arrow/pull/43734