walterddr commented 1 year ago

Observed multiple times: random timeouts occur on queries that join 3 or more tables.

example: https://github.com/apache/pinot/actions/runs/3660353652/jobs/6187392596

2022-12-09T20:09:11.9697742Z [ERROR] org.apache.pinot.query.runtime.queries.ResourceBasedQueriesTest.testQueryTestCasesWithH2[where_clause_tests, SELECT * FROM where_clause_tests_tbl WHERE intCol IN (SELECT a.intCol FROM where_clause_tests_tbl AS a JOIN where_clause_tests_tbl AS b ON a.strCol = b.strCol WHERE MOD(a.intCol, 2) = MOD(b.intCol, 2)), null](8)  Time elapsed: 10.109 s  <<< FAILURE!

walterddr commented 1 year ago

Several others I could find: https://github.com/apache/pinot/actions/runs/3651519056/jobs/6168825587 https://github.com/apache/pinot/actions/runs/3653818305/jobs/6173667324 https://github.com/apache/pinot/actions/runs/3642421827/jobs/6149508171

All occurrences are after Dec 6, 2022 (76c649258c625d431a42ff1fbc1b3003fe013066).

walterddr commented 1 year ago

Seems like this query fails the most often

org.apache.pinot.query.runtime.queries.ResourceBasedQueriesTest.testQueryTestCasesWithH2[where_clause_tests, SELECT * FROM where_clause_tests_tbl WHERE intCol IN (SELECT a.intCol FROM where_clause_tests_tbl AS a JOIN where_clause_tests_tbl AS b ON a.strCol = b.strCol WHERE MOD(a.intCol, 2) = MOD(b.intCol, 2)), null](8)  Time elapsed: 10.079 s  <<< FAILURE!

agavra commented 1 year ago

I think this might be related to the threading model. I just realized that for joins the notification system might be problematic. Imagine the following:

1. You get a notification that the probe table has data available and EOS.
2. The join operator is scheduled, but the broadcast table is incomplete, so nothing happens.
3. The broadcast table completes.
4. !!! The join is never scheduled again because we already "used" the notification for (1).

#9934 will fix this in a non-ideal way; I'll think about how to fix it.
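The sequence above can be reproduced with a small, single-threaded model. This is only a sketch: the class, fields, and methods are hypothetical and are not Pinot's scheduler or operator classes. It also assumes, as the steps imply, that the broadcast side completing does not raise a fresh notification on its own.

```java
// Hypothetical model of the race; names are illustrative, not Pinot's actual classes.
final class LostNotificationSketch {
  private boolean notified = false;          // a pending "mail arrived" notification
  private boolean probeReady = false;        // probe side has data and EOS
  private boolean broadcastComplete = false; // broadcast side fully received
  boolean joinEmitted = false;               // true once the join produced output

  // Step 1: probe data (and EOS) arrives and raises a notification.
  void probeArrives() {
    probeReady = true;
    notified = true;
  }

  // Assumption: completion of the broadcast side raises no fresh notification.
  void broadcastCompletes() {
    broadcastComplete = true;
  }

  // One scheduler pass: runs the join only if a notification is pending,
  // and "uses" the notification whether or not the join made progress.
  boolean schedulerTick() {
    if (!notified) {
      return false;
    }
    notified = false;
    if (probeReady && broadcastComplete) {
      joinEmitted = true;
    }
    return true;
  }

  // Replays the four steps and reports whether the join ever produced output.
  static boolean reproduce() {
    LostNotificationSketch s = new LostNotificationSketch();
    s.probeArrives();       // step 1
    s.schedulerTick();      // step 2: scheduled, but broadcast is incomplete
    s.broadcastCompletes(); // step 3
    s.schedulerTick();      // step 4: nothing pending, so the join never runs
    return s.joinEmitted;
  }
}
```

In this model `reproduce()` returns `false`: the join stalls until something else, such as the test timeout, intervenes.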

agavra commented 1 year ago

Three potential fixes:

1. Have two callbacks, onDataAvailable and onDataConsumed, and only “use” a seen-mail notification when onDataConsumed is called. The upside is that this gives the scheduler a lot of flexibility. The downside is that if data is available from the probing side of the join but not from the broadcast side, the join will keep being scheduled unless I add some really fancy scheduling logic that knows to schedule joins only when one mailbox is complete.
2. Make the HashJoinOperator cache the data it reads from the probing mailbox. The obvious issue there is potential memory pressure, which would be mitigated with flow control in place.
3. Only schedule when _seenMail contains the “first” mailbox in the operator's list of mailboxes, instead of any mailbox the operator reads from. We could make this more generic: instead of just using the first, the API could return any mailboxes that we’re ready to read from. The downside is that this requires some pretty tightly coupled abstractions, so we need to think through the API design well.
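To make the trade-off in the first option concrete, here is a rough sketch of the two-callback idea. Only the callback names onDataAvailable and onDataConsumed come from the proposal; the class, the other members, and the mailbox names are hypothetical and do not reflect Pinot's real API. A notification stays in the seen-mail set until the operator reports that it consumed the data, so the join is not lost, but it is rescheduled pointlessly while the broadcast side is incomplete.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the two-callback option; not Pinot's actual scheduler.
final class ConsumedCallbackSketch {
  private final Set<String> seenMail = new HashSet<>();
  private boolean broadcastComplete = false;
  int wastedRuns = 0;          // times the join was scheduled but could not progress
  boolean joinEmitted = false; // true once the join produced output

  // A mailbox has data: remember it until the operator consumes it.
  void onDataAvailable(String mailbox) {
    seenMail.add(mailbox);
  }

  // The operator actually read the data: only now is the notification "used".
  void onDataConsumed(String mailbox) {
    seenMail.remove(mailbox);
  }

  void broadcastCompletes() {
    broadcastComplete = true;
  }

  // One scheduler pass: the operator runs whenever any notification is outstanding.
  boolean schedulerTick() {
    if (seenMail.isEmpty()) {
      return false;
    }
    if (broadcastComplete) {
      joinEmitted = true;
      onDataConsumed("probe");
    } else {
      wastedRuns++; // scheduled, but the broadcast side is not ready yet
    }
    return true;
  }

  // Replays the problematic sequence with the retained notification.
  static ConsumedCallbackSketch simulate() {
    ConsumedCallbackSketch s = new ConsumedCallbackSketch();
    s.onDataAvailable("probe");
    s.schedulerTick();      // wasted run
    s.schedulerTick();      // wasted run
    s.broadcastCompletes();
    s.schedulerTick();      // join finally runs and consumes the probe data
    return s;
  }
}
```

In this model the join does complete, at the cost of two wasted scheduling passes; that busy rescheduling is the downside the option describes.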

[multistage][flakytest] ResourceBasedQueriesTest.testQueryTestCasesWithH2 is flaky #9959