String compare for Substrait roundtrip check

tgujar commented 7 months ago

Describe the bug

roundtrip_with_ctx function checks for string equality when comparing plans. However, string compare seems to not give correct results. For e.g Here are two plans which are equivalent but considered different by the function.

Plan 1:

LeftSemi Join: data.a = __correlated_sq_1.a
  Inner Join: data.a = data2.a
    TableScan: data projection=[a, b, c, d, e, f]
    TableScan: data2 projection=[a, b, c, d, e, f]
  SubqueryAlias: __correlated_sq_1
    Projection: data2.a
      Filter: data2.f NOT IN ([Utf8("a"), Utf8("b"), Utf8("c"), Utf8("d")])
        TableScan: data2 projection=[a, f], partial_filters=[data2.f NOT IN ([Utf8("a"), Utf8("b"), Utf8("c"), Utf8("d")])]

Plan 2:

LeftSemi Join: data.a = data2.a
  Inner Join: data.a = data2.a
    TableScan: data projection=[a, b, c, d, e, f]
    TableScan: data2 projection=[a, b, c, d, e, f]
  Projection: data2.a
    Filter: data2.f NOT IN ([Utf8("a"), Utf8("b"), Utf8("c"), Utf8("d")])
      TableScan: data2 projection=[a, f], partial_filters=[data2.f NOT IN ([Utf8("a"), Utf8("b"), Utf8("c"), Utf8("d")])]

It might also give incorrect comparison result in case where there are more than one partial_filters but in a different ordering inside the vector.

To Reproduce

Here is an example testcase I used to produce the plans in the Describe the bug section.

#[tokio::test]
async fn roundtrip_inlist_5() -> Result<()> {
    roundtrip("SELECT * FROM data, data2 WHERE data.a = data2.a AND data.a IN (SELECT data2.a FROM data2 WHERE f NOT IN ('a', 'b', 'c', 'd'))").await
}

Expected behavior

The two plans should be considered equivalent.

Additional context

No response

alamb commented 7 months ago

It looks like some sort of optimization pass has been applied to plan 2. Maybe we need to not optimize it somehow 🤔

tgujar commented 6 months ago

Plan 1 is the produced Logical plan, and Plan 2 is Plan1 converted to substrait and then converted back to Logical plan. I think the issue arises because of differences in what can be expressed by substrait grammar vs the logical plan generated by Datafusion. For tests where the plans differ, using assert_expected_plan instead of roundtrip_with_ctx should work after manual inspection. I think I want to look into this further but maybe we could normalize both plans somehow and then check for equality

alamb commented 6 months ago

I think I want to look into this further but maybe we could normalize both plans somehow and then check for equality

I think this sounds like a good plan to me. Thank you for the investigation

tgujar commented 6 months ago

Maybe a simple solution would be to just convert to substrait again and then compare?

async fn roundtrip_with_ctx(sql: &str, ctx: SessionContext) -> Result<()> {
    let df = ctx.sql(sql).await?;

    let plan = df.into_optimized_plan()?;
    let proto = to_substrait_plan(&plan, &ctx)?;

    let plan2 = from_substrait_plan(&ctx, &proto).await?;
    let plan2 = ctx.state().optimize(&plan2)?;
    let proto2 = to_substrait_plan(&plan2, &ctx)?;

    println!("{plan:#?}");
    println!("{plan2:#?}");

    assert_eq!(proto, proto2);
    Ok(())
}

haohuaijin commented 6 months ago

maybe the below function helpful https://github.com/apache/arrow-datafusion/blob/33fc1104c199904fb0ee019546ac6587e7088316/datafusion/substrait/tests/cases/roundtrip_logical_plan.rs#L814-L835

tgujar commented 6 months ago

Maybe, I am unsure how to rewrite the query to avoid alias here

SELECT * FROM data WHERE a IN (SELECT a FROM data2 WHERE f NOT IN ('a', 'b', 'c', 'd'))

assert_expected_plan works fine for cases like this where plans are equivalent but don't have the same representation as a string. I am not sure if its usage in tests is encouraged though. If I understand correctly, changes to the plan generation might break tests with assert_expected_plan where we compare for exact string equality.

apache / datafusion