apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0
12.56k stars 3.25k forks

[Feature] Limit cluster resource usage in user granularity #7129

Open MorningLight5 opened 2 years ago

MorningLight5 commented 2 years ago

Description

In production environments, a Doris cluster often faces pressure from several sources (mainly stream load and queries), which causes resource-shortage problems such as OOM, especially in shared clusters. [image] As the picture above shows, memory usage fluctuates heavily. I think it would be better to have a way to limit the resource usage of each user. Limiting how frequently each user can issue operations may be a suitable approach.

MorningLight5 commented 2 years ago

I'd prefer to add a LIMITER syntax to represent the frequency limiter. The limiter is an abstract concept and can have many types, such as query or stream load. The SQL to manipulate a LIMITER would look like this:

CREATE LIMITER name PROPERTIES("key1"="value1", "key2"="value2");
DROP LIMITER name;
SHOW LIMITER;

morningman commented 2 years ago

What if a request exceeds the limit? Should it return an error or slow down? And is there any other system we can refer to?

MorningLight5 commented 2 years ago

As far as I know, MySQL has the max_connections variable to limit the number of connections. When the number of connections exceeds the limit, it returns an error.

MorningLight5 commented 2 years ago

I think we can put the LIMITER-related config in FE and the metric data in BE. The procedure is shown below: [image] [image] [image]

morningman commented 2 years ago

I see. Doris already has a max_connection limit that can be set for each user. But I think what you need is not just to limit the number of connections, but to limit the rate of requests.

As far as I know, Guava's RateLimiter may meet the requirement. But what is more important is how to define the rate. Simply put, it could be a limit on QPS. But the essence is to "control the consumption of cluster resources per unit time."

So I think the first version can implement this with simple rules (such as QPS). But the design itself must reflect the abstraction of "system resources" so that we can add more rules later.
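One way to read the "simple rule first, abstraction for later" idea is sketched below in plain Java. Everything here is hypothetical and not taken from Doris: the `ResourceRule` interface, the `QpsRule` class, and the injected clock are illustrative names. The rule is a simple fixed one-second window counter per user, not Guava's RateLimiter, which smooths permits over time instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical abstraction: a rule decides whether one more request from a user may run. */
interface ResourceRule {
    boolean tryAcquire(String user);
}

/** First concrete rule (assumption): at most maxQps requests per user per one-second window. */
final class QpsRule implements ResourceRule {
    private final int maxQps;
    private final LongSupplier clockMillis; // injected so the rule can be exercised without sleeping
    private final Map<String, long[]> windows = new HashMap<>(); // user -> {windowStartMillis, count}

    QpsRule(int maxQps, LongSupplier clockMillis) {
        this.maxQps = maxQps;
        this.clockMillis = clockMillis;
    }

    @Override
    public synchronized boolean tryAcquire(String user) {
        long now = clockMillis.getAsLong();
        long[] w = windows.computeIfAbsent(user, u -> new long[] {now, 0});
        if (now - w[0] >= 1000) { // the previous window has ended, start a new one
            w[0] = now;
            w[1] = 0;
        }
        if (w[1] >= maxQps) {
            return false; // over the limit: the caller decides whether to return an error or wait
        }
        w[1]++;
        return true;
    }
}

final class QpsRuleDemo {
    public static void main(String[] args) {
        long[] now = {0}; // fake clock in milliseconds
        ResourceRule rule = new QpsRule(2, () -> now[0]);
        System.out.println(rule.tryAcquire("alice")); // first request in the window
        System.out.println(rule.tryAcquire("alice")); // second request in the window
        System.out.println(rule.tryAcquire("alice")); // third request, over the limit of 2
        now[0] = 1000;                                // move the fake clock to the next window
        System.out.println(rule.tryAcquire("alice")); // counter has been reset
    }
}
```

Other rules (for example one based on memory or load count) could implement the same interface later, which is the point of keeping the QPS rule behind an abstraction.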

Looking forward to your PR!

MorningLight5 commented 2 years ago

Where do you think the limiter should be put, BE or FE? Since Guava is a Java library, do you think the limiter fits better in FE?

xinyiZzz commented 2 years ago

What if a request exceeds the limit? Should it return an error or slow down? And is there any other system we can refer to?

Impala's AdmissionController does something similar. An introduction is here: https://shimo.im/docs/6qxjctpyDHJgPwtw

MorningLight5 commented 2 years ago

The operation frequency limit is implemented in #7474. Users can configure the thresholds through the frontend config, like this:

ADMIN SET FRONTEND CONFIG ('key' = 'value')

You can limit the number of queries (max_running_query_num) in a given period (report_stats_period); the default period is 10 seconds (10 * 1000 ms). You can also limit the number of loads through max_running_txn_num.

The design of this feature is straightforward: each FE keeps its query count locally and reports it to the Master every period, so every FE can get the query count of each FE through metadata synchronization. When a query arrives, if the total query count in the last period exceeds the threshold, the system rejects the query. The user can only query again in the next period.
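The admission check described above can be sketched roughly as follows. This is not the code from #7474: the class and method names are hypothetical, the reporting and metadata synchronization are reduced to plain method calls, and counting only admitted queries is an assumption.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical per-FE limiter: admits queries based on the cluster-wide count of the last period. */
final class PeriodQueryLimiter {
    private final long maxRunningQueryNum;                // threshold, as in the frontend config
    private final AtomicLong localCount = new AtomicLong(); // queries admitted by this FE in the current period
    private final Map<String, Long> lastPeriodCounts = new ConcurrentHashMap<>(); // FE id -> last reported count

    PeriodQueryLimiter(long maxRunningQueryNum) {
        this.maxRunningQueryNum = maxRunningQueryNum;
    }

    /** Called when a query arrives: reject if last period's total over all FEs exceeds the threshold. */
    boolean tryAdmit() {
        long total = 0;
        for (long count : lastPeriodCounts.values()) {
            total += count;
        }
        if (total > maxRunningQueryNum) {
            return false;
        }
        localCount.incrementAndGet(); // assumption: only admitted queries are counted
        return true;
    }

    /** Called once per period: returns the local count to report to the Master and resets it. */
    long reportAndReset() {
        return localCount.getAndSet(0);
    }

    /** Called when synchronized metadata delivers the count that some FE reported. */
    void onSynced(String feId, long count) {
        lastPeriodCounts.put(feId, count);
    }
}

final class PeriodQueryLimiterDemo {
    public static void main(String[] args) {
        PeriodQueryLimiter fe1 = new PeriodQueryLimiter(3);
        for (int i = 0; i < 3; i++) {
            fe1.tryAdmit();                      // nothing reported yet, so all three are admitted
        }
        fe1.onSynced("fe1", fe1.reportAndReset()); // fe1 reports its own count
        fe1.onSynced("fe2", 2);                    // another FE's count arrives via synchronization
        System.out.println(fe1.tryAdmit());        // last period's total is above the threshold of 3
    }
}
```

Because the decision uses the previous period's total, a burst is only throttled one period later, which matches the behavior that a rejected user can query again in the next period.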