apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.92k stars 1.12k forks source link

Dictionary columns (with a `Boolean` values) can't be used directly as filters #12380

Closed samuelcolvin closed 6 days ago

samuelcolvin commented 1 week ago

Describe the bug

Running

select count(*) v from dicts where json_contains(json_data, str_key2)

Where json_contains returns a Dictionary(UInt32, Boolean) (as of https://github.com/datafusion-contrib/datafusion-functions-json/pull/39), results in:

called `Result::unwrap()` on an `Err` value: Plan("Cannot create filter with non-boolean predicate 'json_contains(dicts.json_data, dicts.str_key2)' returning Dictionary(UInt32, Boolean)")

If I change the sql to

select count(*) v from dicts where json_contains(json_data, str_key2) is true

Works fine.

Is this intended or a mistake?

To Reproduce

No response

Expected behavior

I would have assumed it should just work, but perhaps I'm mistaken?

Additional context

No response

samuelcolvin commented 1 week ago

Same seems to be the case for is not null, e.g. it never evaluates to true, is that intentional?

adriangb commented 1 week ago

I can reproduce with:

    let values = vec![Some(true), Some(false), None, Some(true)];
    let keys = vec![0, 0, 1, 2, 1, 3, 1];
    let values_array = BooleanArray::from(values);
    let keys_array = Int8Array::from(keys);
    let array = DictionaryArray::new(
        keys_array,
        Arc::new(values_array) as Arc<dyn Array>,
    );

    let field = Field::new(
        "my_dict",
        DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Boolean)),
        true,
    );
    let schema = Arc::new(Schema::new(vec![field]));

    let batch = RecordBatch::try_new(schema, vec![Arc::new(array)]).unwrap();

    let ctx = SessionContext::new();

    ctx.register_batch("dict_batch", batch).unwrap();

    let df = ctx.table("dict_batch").await.unwrap();

    // view_all
    let expected = [
        "+---------+",
        "| my_dict |",
        "+---------+",
        "| true    |",
        "| true    |",
        "| false   |",
        "|         |",
        "| false   |",
        "| true    |",
        "| false   |",
        "+---------+",
    ];
    assert_batches_eq!(expected, &df.clone().collect().await.unwrap());

    // filter where is null
    let result_df = df.clone().filter(col("my_dict")).unwrap();
    let expected = [
        "+---------+",
        "| my_dict |",
        "+---------+",
        "| true    |",
        "| true    |",
        "| false   |",
        "| false   |",
        "| true    |",
        "| false   |",
        "+---------+",
    ];
    assert_batches_eq!(expected, &result_df.collect().await.unwrap());