apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Stack Overflow with Deeply Nested Filter Expressions #8900

Open orlandohohmeier opened 8 months ago

orlandohohmeier commented 8 months ago

Describe the bug

When running a query with a deeply nested filter expression, the query fails with a stack overflow – the bug initially manifested as an EXC_BAD_ACCESS error on macOS in our application. The problem is that the filter expression is recursively normalized using transform_up, which can exhaust the call stack on deep trees. This probably also happens in other scenarios that end up with a deeply nested tree.

Tested/reproduced with: datafusion = "34.0.0", macOS 14.2.1

To Reproduce

Minimal Reproducible Example:

use datafusion::arrow::array::Int64Array;
use datafusion::arrow::datatypes::DataType;
use datafusion::arrow::datatypes::Field;
use datafusion::arrow::datatypes::Schema;
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::*;
use std::sync::Arc;
#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    let batch = RecordBatch::try_new(
        Arc::new(Schema::new(vec![Field::new("a", DataType::Int64, false)])),
        vec![Arc::new(Int64Array::from(vec![
            1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
        ]))],
    )?;
    let df = ctx.read_batch(batch)?;

    // Build a deeply nested OR chain: a = 1 OR a = 1 OR ... (1000 levels)
    let mut expr = col("a").eq(lit(1));
    for _ in 0..1000 {
        expr = expr.or(col("a").eq(lit(1)));
    }

    // Normalizing this filter recursively during planning overflows the stack
    let df = df.filter(expr).unwrap();

    df.show().await
}

To hit the stack overflow with an optimized release build, the nesting depth needs to be increased to 10,000.

Expected behavior

The query should complete without errors, despite the complexity of the filter expression.

Additional context

No response

/cc @nfnt

orlandohohmeier commented 8 months ago

Current quick fix: increasing the stack size, for example via the RUST_MIN_STACK environment variable (e.g. to 16 MB), temporarily resolves the issue.
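
A related workaround, when the entry point is under your control, is to run planning/execution on a thread whose stack size is set explicitly (note that RUST_MIN_STACK only sets the default for threads spawned by Rust, not the main thread's stack). A rough sketch – the 16 MB figure and the closure body are purely illustrative, not DataFusion API:

use std::thread;

fn main() {
    let handle = thread::Builder::new()
        // Give this worker an explicitly larger stack instead of relying on
        // the RUST_MIN_STACK environment variable.
        .stack_size(16 * 1024 * 1024)
        .spawn(|| {
            // Build the SessionContext and run the query here, e.g. inside
            // tokio::runtime::Runtime::new().unwrap().block_on(async { ... })
        })
        .expect("failed to spawn worker thread");
    handle.join().unwrap();
}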

Long-term solution: we should probably implement and use a stackless variant of transform_up, transform_down, and their mutable counterparts. I'd be happy to draft a PR; however, given their widespread use, the limited set of functions in TreeNode, and the need for post-order traversal, I'm unsure how best to fix this bug. A rough sketch of the idea follows below.
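
For illustration only, the kind of "stackless" rewrite I have in mind looks roughly like this. It uses a toy node type rather than the real TreeNode/Expr API, and all names here are made up; the point is just that a post-order (bottom-up) rewrite can be driven by an explicit, heap-allocated work stack instead of recursion:

// Toy stand-in for a DataFusion expression; not the real Expr type.
enum Node {
    Leaf(i64),
    Or(Box<Node>, Box<Node>),
}

// Post-order ("transform_up"-like) rewrite without recursion: children are
// rewritten before their parent, and the traversal depth is bounded by the
// heap-allocated Vecs instead of the call stack.
fn transform_up_iterative<F>(root: Node, mut f: F) -> Node
where
    F: FnMut(Node) -> Node,
{
    enum Frame {
        Visit(Node), // node whose children still have to be processed
        Rebuild,     // pop the two rewritten children and rebuild an Or
    }

    let mut work = vec![Frame::Visit(root)];
    let mut done: Vec<Node> = Vec::new();

    while let Some(frame) = work.pop() {
        match frame {
            Frame::Visit(Node::Leaf(v)) => done.push(f(Node::Leaf(v))),
            Frame::Visit(Node::Or(l, r)) => {
                // Defer rebuilding the parent until both children are done.
                work.push(Frame::Rebuild);
                work.push(Frame::Visit(*r));
                work.push(Frame::Visit(*l));
            }
            Frame::Rebuild => {
                let r = done.pop().expect("right child");
                let l = done.pop().expect("left child");
                done.push(f(Node::Or(Box::new(l), Box::new(r))));
            }
        }
    }
    done.pop().expect("exactly one rewritten root")
}

fn main() {
    // A chain this deep overflows a recursive rewrite but works here.
    let mut e = Node::Leaf(1);
    for _ in 0..100_000 {
        e = Node::Or(Box::new(e), Box::new(Node::Leaf(1)));
    }
    let rewritten = transform_up_iterative(e, |n| n); // identity rewrite
    // Leaked on purpose: the derived recursive Drop would hit the same stack
    // limit; a real fix would also need an iterative Drop for deep trees.
    std::mem::forget(rewritten);
}

In the real API the closure returns a Result/Transformed and nodes can have any number of children, so the bookkeeping is more involved, but the shape should be the same.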

alamb commented 8 months ago

I'd be happy to draft a PR; however, given their widespread use, the limited set of functions in TreeNode, and the need for post-order traversal, I'm unsure how best to fix this bug.

Thanks @orlandohohmeier -- note there is some significant work currently being undertaken on the TreeNode APIs -- see https://github.com/apache/arrow-datafusion/issues/8913

Maybe once that is done, it will be easier to implement a stackless variant 🤔

We have some version of this pattern in the SQL planner already (SqlToRel)

orlandohohmeier commented 8 months ago

@alamb, thanks for the pointers! I'll dive into #8913 to understand the ongoing changes in the TreeNode APIs better. Glancing over it, and assuming the interface stays the same, the two efforts could proceed in parallel. I'll also review the existing SqlToRel pattern in the SQL planner; it might provide valuable insights or a potential pathway for implementing a stackless variant of transform_up and related functions. I'll get back once I have an idea of how to do the refactoring.