KSQL should scale to handle more concurrent queries as cluster size grows.

confluentinc / ksql

The database purpose-built for stream processing applications.

https://ksqldb.io

Other

118 stars 1.04k forks source link

KSQL should scale to handle more concurrent queries as cluster size grows. #4168

Open big-andy-coates opened 4 years ago

big-andy-coates commented 4 years ago

As each node currently runs ever topology, with a default 4 stream threads, there is an upper limit to the number of concurrent queries the cluster can run, and this number does not increase as cluster size grows. Instead, we'll potentially have many threads sat doing no work, just eating up resources.

For example, say you have a 10 node KSQL cluster. Each query will start 4 streams threads by default on each server. That's 40 threads across the cluster, when there may only be 10 partitions in the source topic, meaning 30 threads will be sat doing nothing, but will be eating up memory.

We need to fix this! :D

big-andy-coates commented 4 years ago

For large clusters, it's certainly worth considering tuning num.stream.threads, This is the number of threads each query creates on each server.

So if you have a 10 node KSQL and only 20 partitions in your source topic, there will be 20 threads with no work to do. Therefore, you may want to set num.stream.threads to something in the region of 1 - 3:

1 will give you 10 threads, with each thread processing 2 partitions worth of data, and no idle threads.
2 will give you 20 threads, with each processing 1 partition, and no idle threads.
3 will give you 30 threads, with each processing 1 partition, and idle threads ready to pick up if any of the nodes fail.

Of course, if you throw hot-standbys into the mix, then your threading requirements may change.

AgoloKarimTawfik commented 4 years ago

@big-andy-coates so this means if all the topics I have are single partitions KSQL scaling will have no effect, I was under the impression that KSQL scaling not only spreading the stream itself across multiple nodes in the cluster, but also handle load balancing the queries on the clusters.

How KSQL would behave in the following scenario, I have 100 topic each with 1 partition, and I have 100 streams running on these topics, so based on what you mentioned, I should have a single node KSQL cluster, and can't scale it as it will be wasting resources, right?

My proposal, why can't KSQL cluster load the balance from the streams perspective as well, so that if I have a 100 streams and 2 KSQL nodes, why KSQL don't distribute them 50/50 or even better based on the resources usage, does it make sense?

big-andy-coates commented 4 years ago

@AgoloKarimTawfik, 100 topics, each with 1 partition, is 100 topic-partitions. (The topic-partition is the scalable unit in KSQL).

If all topics were involved in a single query then KSQL would distribute 50 topic-partitions to each node.

However, if each topic is handled by its own query, (as I believe is the case for you), then the distribution would be somewhat random. This is because each query has its own consumer group, which will attempt to balance the single topic-partition across the two nodes, picking one. (I'd have to check the Kafka code to see if assignment is random or deterministic - I can't remember of the top of my head).

This is a slightly different issue to the one in the description, but definitely worth also considering / fixing.

To fix, we'd need custom consumer group balancing logic though...

AgoloKarimTawfik commented 4 years ago

@big-andy-coates, regarding the 100 topics, each with 1 partition vs 100 topic-partitions, it looks the same from the processing point of view, however from scaling perspective, it is not, due to the fact that if I tried to increase the single topic from 100 to 200 partitions, this will result in rebalancing and stopping other partitions from from being processed (I am aware of the Static consumer group feature, but this will not help in my case), which is not a desired effect, specially in my case, where each partition is a total independant unit, that I don't want it to be affected by other partitions processing.

on the second part, I agree with you it might be slightly different than the description, however it is really important that KSQL handle this case, I think this will unlock other use cases (like the one I trying right now).

Do you think opening another issue would be better?

true-def commented 4 years ago

@big-andy-coates Any updates on an action plan to fix this ? I am planning to use ksqldb in production but stuck with this issue because I typically need to use large number of streams (>200) which run independent queries and stuck with similar issue as explained by @AgoloKarimTawfik

vinitpatni commented 2 years ago

@big-andy-coates We also have the same usecase where we are planning to add large number of streams running independent queries on single kafka topic. Any plans to fix this issue