apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Query-level priority and control #5627

Open yupeng9 opened 4 years ago

yupeng9 commented 4 years ago

I want to open a discussion around how to add query-level priority and control. Though Pinot has tenants at the server and broker level, we often need finer-grained control at the query level.

In production, we are facing the following common challenges:

I'd like to hear your thoughts on solving this type of challenge. While it's possible to explore a formal resource-based quota control framework at finer granularity, as other query engines like Presto or Hive/YARN do, I'm also interested in quick solutions for challenge #1, which can happen often and easily. For example, can we reject expensive queries based on a cost estimate at the query-planning phase? Or build tools to identify expensive in-flight queries and kill them?

agrawaldevesh commented 4 years ago

One quick hack: if we query Pinot via Presto, then we can leverage Presto's resource groups (which are more of an admission-control mechanism than a YARN-queue-like mechanism). Doing up-front costing of the query shouldn't be too hard either if we can get a priori stats from Pinot, again leveraging Presto's pre-existing CBO.
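For illustration, with Presto's file-based resource group manager, an admission-control group for ad-hoc Pinot users could look roughly like the sketch below (the group name and selector pattern are made up):

```json
{
  "rootGroups": [
    {
      "name": "pinot-adhoc",
      "softMemoryLimit": "20%",
      "hardConcurrencyLimit": 5,
      "maxQueued": 50
    }
  ],
  "selectors": [
    {
      "source": ".*adhoc.*",
      "group": "pinot-adhoc"
    }
  ]
}
```

Queries beyond hardConcurrencyLimit wait in the queue (up to maxQueued) instead of being fanned out to the Pinot servers, which is the admission-control behavior described above.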

yupeng9 commented 4 years ago

Thanks for sharing your thoughts. I think it's a promising direction for users who query Pinot through Presto, as we can leverage the resource management layer in Presto.

At a minimum, we could add policies to control the queue size, e.g. a policy for the ad-hoc users, which might solve challenge #2. To further make use of the constraints on CPU and memory, I think we need an integration with the Pinot server/broker for statistics collection?

I was also thinking of another direction, like a query gateway that detects and blocks bad queries with some heuristics, e.g. a SELECT with a large time range that scans too many segments.

kishoreg commented 4 years ago

There is a per-query quota and there is an overall quota.
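If the overall quota refers to the table-level quota, that already lives in the table config; a rough sketch (field names should be double-checked against the QuotaConfig class, and the values are only examples):

```json
{
  "tableName": "myTable_REALTIME",
  "quota": {
    "storage": "100G",
    "maxQueriesPerSecond": "300"
  }
}
```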

Let's start with protecting Pinot against bad/poison queries. @Jackie-Jiang, @mayankshriv, @mcvsubbu do you have any thoughts on this?

siddharthteotia commented 4 years ago

At the minimum, we should leverage the per-table query timeout config (part of the table config). Jackie added this. The server can use it to kill long-running queries, or queries that have waited long enough in the scheduler queue that they are likely to overrun their time limit. Through query options, we can also pass a per-query timeout.
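As a sketch of the per-query route, the timeout can be passed as a query option in the broker request, something along these lines (the exact option syntax may vary by version):

```json
{
  "sql": "SELECT COUNT(*) FROM myTable WHERE tsMs > 1590000000000",
  "queryOptions": "timeoutMs=5000"
}
```

This would be POSTed to the broker's /query/sql endpoint.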

In general, a potential long-term solution is to implement workload management in Pinot to improve multi-tenancy and isolation overall. I feel we should also revisit the priority token bucket scheduler, since FCFS isn't suitable for multi-tenant deployments.

We can also consider forking a separate JVM (running query execution in a separate process) so that an extremely bad query (high CPU, heavy GC, with the potential to cause an OOM) doesn't crash the server.

yupeng9 commented 4 years ago

> At the minimum, we should leverage the per-table query timeout config (part of the table config). Jackie added this. The server can use it to kill long-running queries, or queries that have waited long enough in the scheduler queue that they are likely to overrun their time limit. Through query options, we can also pass a per-query timeout.

Ah, that's a nice feature. @Jackie-Jiang can we add this config to the docs?

> In general, a potential long-term solution is to implement workload management in Pinot to improve multi-tenancy and isolation overall. I feel we should also revisit the priority token bucket scheduler, since FCFS isn't suitable for multi-tenant deployments.

> We can also consider forking a separate JVM (running query execution in a separate process) so that an extremely bad query (high CPU, heavy GC, with the potential to cause an OOM) doesn't crash the server.

I like this idea. The server has two important functions: ingestion and querying. I think it is worth considering spawning child processes for query execution so that ingestion is not impacted.

siddharthteotia commented 4 years ago

> > At the minimum, we should leverage the per-table query timeout config (part of the table config). Jackie added this. The server can use it to kill long-running queries, or queries that have waited long enough in the scheduler queue that they are likely to overrun their time limit. Through query options, we can also pass a per-query timeout.
>
> Ah, that's a nice feature. @Jackie-Jiang can we add this config to the docs?
>
> > In general, a potential long-term solution is to implement workload management in Pinot to improve multi-tenancy and isolation overall. I feel we should also revisit the priority token bucket scheduler, since FCFS isn't suitable for multi-tenant deployments. We can also consider forking a separate JVM (running query execution in a separate process) so that an extremely bad query (high CPU, heavy GC, with the potential to cause an OOM) doesn't crash the server.
>
> I like this idea. The server has two important functions: ingestion and querying. I think it is worth considering spawning child processes for query execution so that ingestion is not impacted.

@yupeng9, take a look at the QueryConfig class. I think you just need to set it in the table config. And yes, this has to be added to the docs.
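For the docs, a minimal table config snippet with the timeout might look roughly like this (the section name should be verified against the QueryConfig class, and the value is only an example):

```json
{
  "tableName": "myTable_OFFLINE",
  "tableType": "OFFLINE",
  "query": {
    "timeoutMs": 30000
  }
}
```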

mcvsubbu commented 4 years ago

Using off-heap (mmap) memory for query processing may be a good first step to take. This will come at the cost of performance, so it may need to be configurable. The max memory used for processing a query can then be reported back to the broker in the server's response.

The brokers can, over time, tune themselves so that they either ask servers to limit the number of threads and/or the amount of memory used, or even reject a query up front if there is no capacity.