Open dorlevi opened 2 weeks ago
Thanks for your report.
Apache Pinot has some OOM protection mechanisms that apply to the single-stage query engine but not to the multi-stage query engine. I've created https://github.com/apache/pinot/issues/13436 to track them. We are actively working on some of them (especially automatic query killing).
This specific query can be executed in the single-stage engine. Have you tried it there? Does it fail there as well?
I need to find a suitable time to test it, because killing our cluster is obviously not something I can do at any time (we don't have a test cluster with tables this big).
Regardless, we care about the multi-stage engine, and we've seen this happen multiple times with different queries, some of which can only be executed on the multi-stage engine.
cc: @Jackie-Jiang
We have a realtime table (6 partitions, 140 GB). When querying the table with a timeout of 3 minutes, all servers (6 servers, each with 24 cores and ~100 GB allocated to Pinot) run out of memory and hang.
Query:
Explain plan:
We understand that such a query is perhaps not the best fit for Pinot, but crashing every queried server seems like a bug, especially as we haven't overridden any of the engine's built-in protections (besides the timeout). We reproduced it live for @mayankshriv, and he suggested we open this issue.
OOM Logs from one of the servers (not super informative):
Running server args (Pinot 1.1):
Table config