facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.54k stars 1.17k forks source link

fix(merge-join): Produce output before advancing key comparison #11605

Open pedroerp opened 1 week ago

pedroerp commented 1 week ago

Summary: Before we start reading keys from the next batch of input, we need to make sure we are not holding output_ wrapped around lazy vector from the last batch, since lazy vectors need to be materialized in order.

Differential Revision: D66169184

facebook-github-bot commented 1 week ago

This pull request was exported from Phabricator. Differential Revision: D66169184

netlify[bot] commented 1 week ago

Deploy Preview for meta-velox canceled.

Name Link
Latest commit 82021a20e3fc7cf4e36f49b9be0367329ba4301e
Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/673f79c04844a80008a94b61
facebook-github-bot commented 1 week ago

This pull request was exported from Phabricator. Differential Revision: D66169184

JkSelf commented 1 week ago

@pedroerp I have verified that we have this case in the following eight locations. Do we need to implement this check in all eight places?

  1. https://github.com/pedroerp/velox-1/blob/export-D66169184/velox/exec/MergeJoin.cpp#L786
  2. https://github.com/pedroerp/velox-1/blob/export-D66169184/velox/exec/MergeJoin.cpp#L890
  3. https://github.com/pedroerp/velox-1/blob/export-D66169184/velox/exec/MergeJoin.cpp#L918
  4. https://github.com/pedroerp/velox-1/blob/export-D66169184/velox/exec/MergeJoin.cpp#L945
  5. https://github.com/pedroerp/velox-1/blob/export-D66169184/velox/exec/MergeJoin.cpp#L978
  6. https://github.com/pedroerp/velox-1/blob/export-D66169184/velox/exec/MergeJoin.cpp#L1028
  7. https://github.com/pedroerp/velox-1/blob/export-D66169184/velox/exec/MergeJoin.cpp#L1059
  8. https://github.com/pedroerp/velox-1/blob/export-D66169184/velox/exec/MergeJoin.cpp#L1119
facebook-github-bot commented 6 days ago

This pull request was exported from Phabricator. Differential Revision: D66169184

pedroerp commented 6 days ago

@Yuhta I factored out some helper methods to make it a bit more readable. Please take another look.

pedroerp commented 2 days ago

@pedroerp I have verified that we have this case in the following eight locations. Do we need to implement this check in all eight places?

@JkSelf thanks for helping checking this. This is a bit tricky, but it's only needed when you are advancing the left side. The problem is because you can't advance the left while you are wrapping around a previous left buffer, since the previous left buffer could be wrapping around a lazy vector not yet materialized. So if you advance, you will materialize the keys from the next left batch, and later on another operator may try to materialize the previous left batch, and violates the constraint that lazy vectors need to be materialized in order.

For right-hand side buffers, we don't support lazy vectors across pipelines, so when the driver feeds the right batches it guarantees to materialize them. For the other left call sites, I believe I covered all of them in this change.

JkSelf commented 2 days ago

@pedroerp I have verified that we have this case in the following eight locations. Do we need to implement this check in all eight places?

@JkSelf thanks for helping checking this. This is a bit tricky, but it's only needed when you are advancing the left side. The problem is because you can't advance the left while you are wrapping around a previous left buffer, since the previous left buffer could be wrapping around a lazy vector not yet materialized. So if you advance, you will materialize the keys from the next left batch, and later on another operator may try to materialize the previous left batch, and violates the constraint that lazy vectors need to be materialized in order.

For right-hand side buffers, we don't support lazy vectors across pipelines, so when the driver feeds the right batches it guarantees to materialize them. For the other left call sites, I believe I covered all of them in this change.

@pedroerp I see. Thanks for your detailed explanations.