Closed: rsaggino closed this issue 1 year ago
Hi @rsaggino - one thing I notice is that the telemetry capture for your cluster is extremely slow:
:observe/kafka-telemetry snapshots captured in 96778 ms
That's over 96 seconds, where normally Kpow would snapshot a cluster in 1 to 5 seconds.
Can I ask, is Kpow installed in close network proximity to your cluster? Network lag can cause snapshotting time to blow out, and in turn the UI can be slow to load data.
Also, can you tell me roughly how many topics/partitions/groups that cluster contains?
Thanks for the stacktrace of that error, it's possibly unrelated but I'll pick that up with the team on Monday.
Hi @rsaggino - just checking in to see if moving Kpow in close network proximity to your cluster helped.
Also regarding the MAX_QUEUE_SIZE issue, this is resolved in v90.3 of our community edition, available now.
Hi Derek, sorry for the delay, I will try to dedicate a bit of time to this.
Can I ask, is Kpow installed in close network proximity to your cluster?
No more than 20ms.
how many topics/partitions/groups that cluster contains?
This is a test cluster with ~6000 topics (currently 80% empty), ~30000 partitions and ~4000 consumer groups.
Also regarding the MAX_QUEUE_SIZE issue, this is resolved in v90.3 of our community edition, available now.
I will try it and comment back.
Thanks.
I don't see the MAX_QUEUE_SIZE issue anymore. Unfortunately the UI hangs heavily: it doesn't respond at all for seconds at a time, and there are no errors in the log. It looks like the backend is keeping the frontend from responding?
13:01:55.359 INFO [OperatrScheduler_Worker-7] operatr.observe.kafka – cluster abc: [69962] :observe/kafka-telemetry snapshots captured in 101103 ms
13:04:16.572 INFO [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materialize [69988] {:id "abc", :domain :cluster, :observation :observe/kafka-telemetry}
13:04:17.358 INFO [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca produce [6419] :cluster abc simple metrics
13:04:17.358 INFO [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materializing [69988] [:cluster "abc"]
13:04:18.420 INFO [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materializing [58734] [:cluster "abc"] :kafka/topic-summary __kpow_global
13:04:23.296 INFO [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materializing [17918] [:cluster "abc"] :kafka/group-summary __kpow_global
13:04:24.861 INFO [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materializing [17918] [:cluster "abc"] :kafka/simple-consumer-summary __kpow_global
13:04:24.912 INFO [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materializing [15] [:cluster "abc"] :kafka/broker __kpow_global
13:04:24.912 INFO [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materializing [1] [:cluster "abc"] :kafka/cluster __kpow_global
13:04:25.514 INFO [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca produce [13370] [:cluster "abc" :kafka/topic-summary] ["__kpow_global"] materialized metrics
13:04:25.515 INFO [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca produce [2] [:cluster "abc" :kafka/simple-consumer-summary] ["__kpow_global"] materialized metrics
13:04:26.015 INFO [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca produce [7036] [:cluster "abc" :kafka/group-summary] ["__kpow_global"] materialized metrics
13:04:26.015 INFO [OperatrScheduler_Worker-8] operatr.observe.kafka – cluster abc: [69988] :observe/kafka-telemetry snapshots captured in 131168 ms
Since the original problem is fixed I will close this. I can open a different issue for the performance if you agree.
Thanks.
Hi @rsaggino, thanks for the update.
Please do open a new ticket for performance tuning. Can you include in the ticket:
6k topics, 4k groups, 30k partitions lands on the very large side of things. Normally in that case we would spend some time optimizing and tuning with a customer, so I'm not super surprised that the Community Edition is running very slowly.
We have some parallelism / compute settings we can tweak though. The goal is to get Kpow snapshotting your cluster within 30s, and I can tell that currently that's taking around 120s, so we're not massively off.
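For anyone who wants to track that gap while tuning, here is a minimal sketch that pulls the snapshot durations out of Kpow log output and compares them with the 30s goal. It assumes the log line format is exactly as quoted in this thread; the function names, `TARGET_MS`, and the trimmed sample lines are illustrative and not part of Kpow.

```python
import re

# Matches the snapshot summary line as it appears in the logs quoted in this thread.
# Assumption: the wording of this line is stable across Kpow versions.
SNAPSHOT_RE = re.compile(r":observe/kafka-telemetry snapshots captured in (\d+) ms")

TARGET_MS = 30_000  # the 30s snapshot goal mentioned above (hypothetical constant name)


def snapshot_durations_ms(log_text: str) -> list[int]:
    """Return every snapshot duration (in ms) found in the log text, in order."""
    return [int(ms) for ms in SNAPSHOT_RE.findall(log_text)]


def report(log_text: str, target_ms: int = TARGET_MS) -> list[tuple[int, float]]:
    """Pair each duration with its ratio to the target (1.0 means on target)."""
    return [(ms, round(ms / target_ms, 2)) for ms in snapshot_durations_ms(log_text)]


# Sample lines trimmed from the logs earlier in this thread.
sample = """\
cluster abc: [69962] :observe/kafka-telemetry snapshots captured in 101103 ms
cluster abc: [69988] :observe/kafka-telemetry snapshots captured in 131168 ms
"""

for ms, ratio in report(sample):
    print(f"{ms / 1000:.1f}s is {ratio}x the {TARGET_MS // 1000}s target")
```

Running the same check after each tuning change (more CPU, more heap, a newer release) gives a quick before/after comparison without reading the full log.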
As a starter, I would suggest 2 CPU and 8GB heap if you are currently running with less than that.
Also we have v90.4 out tomorrow with some performance improvements included.
Version of Kpow Release: 90.2.1
Describe the issue Hi,
I just kept testing Kpow. The console backend is a bit slow to load data and a few dashboards are empty (broker produce read and write).
This is the only ERROR I see in the log:
Are the symptoms related to the error?
Thanks.