apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.66k stars 3.56k forks source link

[C++] dictionary array transpose not handle null in kernel function `TransposeInts` #44827

Open chenbaggio opened 2 days ago

chenbaggio commented 2 days ago

Describe the enhancement requested

Due to some unify dictionary applications not handling the index value corresponding to null, a very large value may randomly occur. As a result, in the final TransposeInts function, src[index] may become a very large value, even exceeding the range of transpose_map, leading to a crash.”

If this is a description of a bug, it suggests that the merge dictionary application should properly handle null indices to avoid generating invalid or out-of-bounds index values, which can cause crashes in later stages of the program

Component(s)

C++

mapleFU commented 2 days ago

Would you mind provide the crash stack? Seems that TransposeInts itself should not handle validity logic, perhaps the caller of TransposeInts wrongly handles validity bitmap

chenbaggio commented 1 day ago

the core stack has been removed, and here fix the issue using constant index value 0 instead of random value,

the core stack is in TransposeInts, the code snippet:

··· dest[1] = static_cast(transpose_map[src[1]]);

··· here the src[1] should be null, but here not check, the src[1] is random value, a massive number, so crash

here question is :

should check the null bitmap in here or handle it in application, make sure the index would not overflow ?

chenbaggio commented 1 day ago

another fix version is implement the version which check the bitmap:


template <typename InputInt, typename OutputInt>
void TransposeIntsWithNullBitmap(const InputInt* src, OutputInt* dest, int64_t length,
                   const int32_t* transpose_map, const uint8_t * null_bitsmap, int64_t null_offset) {

  if (null_bitsmap != nullptr) {
     int64_t idx = 0;  
    while (length > 0) {  
    if (bit_util::GetBit(null_bitsmap, (null_offset + idx))) {
        *dest = static_cast<OutputInt>(transpose_map[*src]);
     }
      --length, ++src, ++idx, ++dest;
    }
  } else {
    TransposeInts(src, dest, length, transpose_map);
  }
}
template <typename SrcType>
struct TransposeIntsDestWithNullBitmap {
  const SrcType* src;
  uint8_t* dest;
  int64_t dest_offset;
  int64_t length;
  const int32_t* transpose_map;
  const uint8_t* null_bitmap;
  int64_t null_offset;

  template <typename T>
  enable_if_integer<T, Status> Visit(const T&) {
    using DestType = typename T::c_type;
    TransposeIntsWithNullBitmap(src, reinterpret_cast<DestType*>(dest) + dest_offset, length,
                  transpose_map, null_bitmap, null_offset);
    return Status::OK();
  }

  Status Visit(const DataType& type) {
    return Status::TypeError("TransposeInts received non-integer dest_type");
  }

  Status operator()(const DataType& type) { return VisitTypeInline(type, this); }
};

struct TransposeIntsSrcWithNullBitmap {
  const uint8_t* src;
  uint8_t* dest;
  int64_t src_offset;
  int64_t dest_offset;
  int64_t length;
  const int32_t* transpose_map;
  const DataType& dest_type;
  const uint8_t *null_bitmap;

  template <typename T>
  enable_if_integer<T, Status> Visit(const T&) {
    using SrcType = typename T::c_type;
    return TransposeIntsDestWithNullBitmap<SrcType>{reinterpret_cast<const SrcType*>(src) + src_offset,
                                      dest, dest_offset, length,
                                      transpose_map, null_bitmap, src_offset}(dest_type);
  }

  Status Visit(const DataType& type) {
    return Status::TypeError("TransposeInts received non-integer dest_type");
  }

  Status operator()(const DataType& type) { return VisitTypeInline(type, this); }
};

Status TransposeIntsWithNullBitmap(const DataType& src_type, const DataType& dest_type,
                     const uint8_t* src, uint8_t* dest, int64_t src_offset,
                     int64_t dest_offset, int64_t length, const int32_t* transpose_map,
                     const uint8_t *null_bitmap) {
  TransposeIntsSrcWithNullBitmap transposer{src, dest, src_offset, dest_offset,
                              length, transpose_map, dest_type, null_bitmap};
  return transposer(src_type);
}
mapleFU commented 1 day ago

I mean, the internal::TransposeInts might not designed for null-inputs, so, the I think TransposeIntsWithNullBitmap is a good way, but perhaps we need to find which caller triggers this problem.

There're TransposeDictIndices and concatenate which calls TransposeInts, so i mean which caller might trigger this problem?

chenbaggio commented 1 day ago

TransposeDictIndices is the trigger, you can consturct a dictionary array with null value, and its index set to a massive value, the issue may reproduced

mapleFU commented 1 day ago

I've check the code of CompactTransposeMapVisitor and RecursiveUnifier::Unify, in tuition they didn't generate invalid ranges... I don't know which caller would make this happens 🤔, would you mind show the upper caller of this case, or add a test in this case?

chenbaggio commented 20 hours ago

here the test case:

  1. read one group of data from database storage and construct a dictionary array dictionary ['ab', 'cd'] index [1, 1, 0, 0] => here index is int32 data type, in memory[0x00000001, 0x00000001, 0x00000000, 0x00000000] nullbitmap nullptr

  2. read another group of data from database storage and construct a dictionary array dictionary ['ab', 'ef'] index [1, 1, 0, null] => here index is int32 data type, in memory[0x00000001, 0x00000001, 0x00000000, random_value] nullbitmap [0001]

NOTE: the group of data would be unified the dictionary with the ahead dictionary ["ab", "cd"]

  1. the two group of data are unified in the application for purpose get unified dictionary dictionary ['ab', 'cd', 'ef'] index transpose array for the second group of data [1, 2] the index [1, 1, 0, null] after refresh should be [2, 2, 1, null]

here according to the code snippet:

in function `TransposeInts`

 while (length >= 4) {
    dest[0] = static_cast<OutputInt>(transpose_map[src[0]]);
    dest[1] = static_cast<OutputInt>(transpose_map[src[1]]);
    dest[2] = static_cast<OutputInt>(transpose_map[src[2]]);
    dest[3] = static_cast<OutputInt>(transpose_map[src[3]]);  
    length -= 4;
    src += 4;
    dest += 4;

NOTE : src[3] is random value may be a massive number, if so, index overflow, then crash

mapleFU commented 4 hours ago

@chenbaggio I may understand this. DictionaryArray::Transpose seems to be regarded as a internal function, the input of it is expected to be handled by DictionaryUnifier, which handles null well...

I think the problem is const int32_t* transpose_map given "non-null" values, which is unexpected. DictionaryUnifier might fixes this?

Would this ( https://github.com/apache/arrow/pull/44867 ) helps?