No protections for select query

niketh commented 7 years ago

Select query (unlike groupBy) has no protection against bad queries. I propose adding a limit of 50K like in the case of groupBy so that historicals/brokers don't get overwhelmed. An unchecked select query causes historicals to OOM while finalizing results to be sent to the broker.

b-slim commented 7 years ago

My 2 cents, Unlike group By select query is not suppose to produce an unexpectedly large amount of rows due to some grouping by high cardinality dimensions. In fact, the select query has paginations that limit the amount of row to pull at a given query, I think that is sort of enough to limit the number of rows pulled in a single query.

RestfulBlue commented 6 years ago

Any updates? Druid still can be killed by OOM using simple select query. of course I can use restrictions, but not all users understand what they are doing and the fact that users can very simply drop the entire cluster does not please me. Is it really hard to implement restriction for select query in configuration?

gianm commented 6 years ago

@RestfulBlue Have you considered the Scan query (http://druid.io/docs/latest/querying/scan-query.html) is meant to be a more memory friendly replacement for the Select query. It won't run your historicals out of memory. The only issue it really has is #4933, and that's a broker thing, not historicals. And it only can happen with very large or unlimited scan limits.

You could also consider adding a property for maxResults to SelectQueryConfig, and having that be a cap on the number of results that a Select query is allowed to return. With the way Select is written, though, it might be challenging to implement. Select's real problem is that it selects threshold-number of rows per each segment in parallel and then limits them later. This causes it to use up a lot more memory than you expect, and can also make it tough to implement a maxResults limit.

RestfulBlue commented 6 years ago

@gianm yes, i have considered, but currently druid api can be used only from some ui apps, like grafana/metabase/etc. if we let every analytic/data scientist in our company access to api , then entire druid cluster will crash really soon :( only one way to protect it now is timeout with low value ( about 10s). in 12.2 druid ingestion via kafka indexing task looks stable. We have separated cluster for stress testing, with throughput about 500000rows/sec on hardware with 8 cores, 64gb mem, 6 servers and now there s no failed tasks ( in 12.1 1/3 of task was failed), but currently query api looks really unstable and too many things can go wrong

gianm commented 6 years ago

@RestfulBlue I'd really try to move to Scan queries if you can… ultimately I believe we are going to be deprecating and removing Select queries. They are broken beyond repair, imo, and Scan is a good replacement (especially after #6088).

KenjiTakahashi commented 6 years ago

I'm with @gianm on this. I have examined the select in quite a detail in the past (before scan came to be, it is to be found somewhere in the issues/mailing list) and, sorry to say, but IMHO the whole design is just wrong, there's no use in trying to improve on it. Especially now, when we have scan.

FYI: We have since moved all but one (and a very rarely occurring one, that needs #6088, actually) of our queries to scan and never looked back.

asdf2014 commented 6 years ago

some ui apps, like grafana/metabase/etc. if we let every analytic/data scientist in our company access to api , then entire druid cluster will crash really soon

Hi, @RestfulBlue . Maybe you can try Apache Superset, which supports permission control to prevent ordinary users from sending queries that consume a lot of server resources, in addition, Superset also can predefine some dashboards instead of directly exposing the query api.

Tips: There is an article about superset on my personal blog that might help you.

RestfulBlue commented 6 years ago

@asdf2014 we use grafana as main gui, we have our self developed druid datasource and we have good restrictions as well. Using grafana users just cant build bad query and it is not a problem. Problems come then analysts/devs/others start using direct druid api , for example using jdbc

asdf2014 commented 6 years ago

Hi, @RestfulBlue . Indeed, we should analyze the query sql based on cost, then we can add some limitations for malicious query to protect druid cluster.

RestfulBlue commented 6 years ago

@asdf2014 jdbc its just example, in some places native queryes preferred, as sql has some problems.i think every query should have limitations like groupby. groupby looks really good query : disk spilling, limitation applyes to total rows count(not per granularity), when timeseries and topn doesnt has this restrictions. timeseries with large interval and none granularity - oom, same with topn. i think every query should have limitations like groupby

asdf2014 commented 6 years ago

@RestfulBlue Good point. I agree with you.

stale[bot] commented 5 years ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

gianm commented 5 years ago

Since https://github.com/apache/incubator-druid/pull/7133 is done, even more workloads can be moved to Scan now. That is what I'd recommend.

apache / druid

No protections for select query #5006