facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.53k stars 1.16k forks source link

Merge Join Full Outer Join with filter is not filtering on some equality clause matches #11599

Open DanielHunte opened 2 days ago

DanielHunte commented 2 days ago

Bug description

Full Outer Joins results are expected to have unmatched rows from either table with nulls for the other table's columns. There is a bug in Velox's implementation where when x rows (x >= 2) match on the equality clause (ex. t0 = u0) but don't pass the filter (ex. t0 > 4), for some reason x - 1 rows are still considered a match and therefore results in missing rows in the output. See the following example.

    SELECT
        *
    FROM (
        VALUES (1), (2), (3)
    ) t(t0)
),
u AS (
    SELECT
        *
    FROM (
        VALUES (2), (3), (4)
    ) t(u0)
)
SELECT
    *
FROM t
FULL OUTER JOIN u
    ON t.t0 = u.u0
    AND t.t0 > 4;

Nothing should match since nothing passes the filter. This query should produce the table:

  t0 | u0
   1 | NULL
   2 | NULL
NULL | 2
   3 | NULL
NULL | 3
NULL | 4

However Velox thinks there is a match on 2 and produces the table:

  t0 | u0
   1 | NULL
   2 | 2
   3 | NULL
NULL | 3
NULL | 4

2 | 2 is an extra row and 2 | NULL and NULL | 2 are missing rows.

Here is the unit test I made within MergeJoinTest.cpp:

TEST_F(MergeJoinTest, fullOuterJoinFilteredOutMatches) {
  auto left = makeRowVector({"t0"}, {makeFlatVector<int64_t>({1, 2, 3})});
  auto right = makeRowVector({"u0"}, {makeFlatVector<int64_t>({2, 3, 4})});
  createDuckDbTable("t", {left});
  createDuckDbTable("u", {right});

  // Full outer join.
  auto planNodeIdGenerator = std::make_shared<core::PlanNodeIdGenerator>();
  auto plan =
      PlanBuilder(planNodeIdGenerator)
          .values({left})
          .mergeJoin(
              {"t0"},
              {"u0"},
              PlanBuilder(planNodeIdGenerator).values({right}).planNode(),
              "t0 > 4",
              {"t0", "u0"},
              core::JoinType::kFull)
          .planNode();

  AssertQueryBuilder(plan, duckDbQueryRunner_)
      .assertResults(
          "SELECT * FROM t FULL OUTER JOIN u ON t.t0 = u.u0 AND t.t0 > 4");
}

System information

Wrote unit test on my linux dev server.

Relevant logs

No response

Yuhta commented 2 days ago

CC: @pedroerp

kagamiori commented 1 day ago

@DanielHunte Could you also paste the Velox unit test of merge join you created here?

pedroerp commented 1 day ago

I think this PR addresses this issue (still under review):

https://github.com/facebookincubator/velox/pull/11068

Cc: @JkSelf

DanielHunte commented 1 day ago

@DanielHunte Could you also paste the Velox unit test of merge join you created here?

Sure. I added it to the end of the description.

JkSelf commented 16 hours ago

It seems https://github.com/facebookincubator/velox/pull/11068 can fix this issue. I will follow up to resolve @pedroerp's comments. Thanks.