apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
13.87k stars 3.38k forks source link

[C++][Compute] A latent bug may caused by validity-buffer pre-allocation in ScalarExecutor with NullHandling::INTERSECTION #43036

Open ZhangHuiGui opened 1 week ago

ZhangHuiGui commented 1 week ago

Describe the bug, including details regarding any error messages, version, and platform.

For now, the validity-buffer for output-data may not be preallocated with NullHandling::INTERSECTION according the input-data's NullGeneration. https://github.com/apache/arrow/blob/fecd207e5dd553d226046f9d7bc5cd1a05529781/cpp/src/arrow/compute/exec.cc#L943-L951

We've talked about this in https://github.com/apache/arrow/pull/41975#discussion_r1638523247.

If null_handling is INTERSECTION the function might want to follow a single code path that uses the pre-allocated buffer to put the result of the bitmaps intersection.

If some future implementations of kernel function believe that INTERSECTION mode must pre-allocate a validity-buffer, and rely on this to build their own logic, then the current ScalarExecutor pre-allocation strategy for validity-buffer in INTERSECTION mode may cause potential SIGSEGV problems.

Component(s)

C++

ZhangHuiGui commented 1 week ago

cc @pitrou @felipecrv

pitrou commented 6 days ago

So, the concern is that the validity bitmap may be NULL if the executor determines that the output is all-valid, right?

felipecrv commented 6 days ago

So, the concern is that the validity bitmap may be NULL if the executor determines that the output is all-valid, right?

Yes.

ZhangHuiGui commented 6 days ago

Even if we do not pre-allocate in INTERSECTION mode, the validity-buffer will also be pre-allocated in ScalarExecutor::ExecuteNonSpans or VectorExecutor::Exec through PropagateNulls.

ExecuteNonSpans: https://github.com/apache/arrow/blob/237987ad6ab8ac95f15d6d9aa825e7ccfa2a1d99/cpp/src/arrow/compute/exec.cc#L914-L915 PropagateNulls : will do the same null check with NullGeneration and do allocate validity-buffer again if not allocated. https://github.com/apache/arrow/blob/237987ad6ab8ac95f15d6d9aa825e7ccfa2a1d99/cpp/src/arrow/compute/exec.cc#L546-L553

Therefore, the pre-allocation logic of ScalarExecutor should be kept consistent with that of VectorExecutor, and the logic of pre-allocating validity-buffer should be retained in INTERSECTION mode.

pitrou commented 5 days ago

So, the concern is that the validity bitmap may be NULL if the executor determines that the output is all-valid, right?

Yes.

But having a NULL validity bitmap when the output is determined to be all-valid is a feature, not a bug. There's nothing that needs to be changed here.

Therefore, the pre-allocation logic of ScalarExecutor should be kept consistent with that of VectorExecutor

I don't think consistency is a goal here, since there are important differences between what is possible in a scalar and vector kernel.

PropagateNulls : will do the same null check with NullGeneration and do allocate validity-buffer again if not allocated.

Well... if PropagateNulls allocates a validity bitmap even when we know there will be 0 nulls, it's the bug that needs to be fixed :-)

ZhangHuiGui commented 5 days ago

But having a NULL validity bitmap when the output is determined to be all-valid is a feature, not a bug. There's nothing that needs to be changed here.

Yes, wes mentioned this:

  // Used to account for the case where we do not preallocate a
  // validity bitmap because the inputs are all non-null and we're
  // using NullHandling::INTERSECTION to compute the validity bitmap
  bool elide_validity_bitmap_ = false;

ScalarExecutor support this feature by check NullGeneralization::Get in SetupPreallocation, but VectorExecutor not support this for now and do preallocate for NULL validity bitmap in PropagateNulls, i think we should support/fix this for VectorExecutor.

felipecrv commented 5 days ago

Well... if PropagateNulls allocates a validity bitmap even when we know there will be 0 nulls, it's the bug that needs to be fixed :-)

I raised the concern as I worried about the consequence to existing compute kernels. I still think a more sane approach is to make pre-allocation always on with INTERSECTION and ask kernels that can handle every case correctly to simply adopt NO_PREALLOCATE.

ZhangHuiGui commented 4 days ago

If we want to ensure that pre-allocation for validity-bitmap is easy to use for kernels and prevent latent bugs, then for NullHandling, only the COMPUTED_PREALLOCATE and COMPUTED_NO_PREALLOCATE strategies are retained, and the remaining two strategies seem unnecessary?

For INTERSECTION, I understand that kernels developers are not clear whether the input data is ALL_VALID or PERHAPS_NULL, so the validity-bitmap allocation is left to the Executor itself -:)

felipecrv commented 3 days ago

No. That's not a fair inference on what I said in the PR.

I think INTERSECTION handling should check the types of the input arrays and reason about the intersection with Type::NA and other types that don't have validity bitmaps like unions.

Generic kernel dispatching should as much as possible be type dependent and not value dependent to (1) make keep dispatching decidable (non-Turing Complete) and (2) more amenable to static analysis as a consequence of (1).

Another argument (more practical) in favor of making pre-allocation fully determined by the types: during kernel dispatching we're looking at the lazily computed null_count which might be kUnknownNullCount non-deterministically. Calling GetNullCount() on all the inputs would be wasteful because it's probably better to let the kernel use the pre-allocated validity bitmap to compute the intersection together with its own loop.

To summarize my opinion: it's OK for a kernel that takes Type::NA to get an empty validity bitmap in the pre-allocated out when null_handling is INTERSECTION, but it's not OK for kernels that take nullable types to non-deterministically get empty validity buffers.

@pitrou might feel like this is a risk worth taking and we should demand that kernels handle non-determinism correctly.