factorhouse / kpow

Kpow for Apache Kafka Release Artifacts
https://factorhouse.io/kpow
Apache License 2.0

No more than 1024 pending puts are allowed on a single channel #9

Closed: rsaggino closed this issue 1 year ago

rsaggino commented 2 years ago

Version of Kpow Release: 90.2.1

Describe the issue

Hi,

I've kept testing Kpow. The console backend is a bit slow to load data, and a few dashboards are empty (broker produce, read, and write).

This is the only ERROR I see in the log:

10:40:51.371 ERROR [OperatrScheduler_Worker-5] operatr.compute.v3.materialization – Error creating materialized metrics for b7b580cb-9792-4776-9522-717d064dd76a [:cluster "xyz" :kafka/group-summary]
java.lang.AssertionError: Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
(< (.size puts) impl/MAX-QUEUE-SIZE)
        at clojure.core.async.impl.channels.ManyToManyChannel.put_BANG_(channels.clj:156)
        at clojure.core.async$put_BANG_.invokeStatic(async.clj:199)
        at clojure.core.async$put_BANG_.invoke(async.clj:187)
        at operatr.observe.produce$fn__26696.invokeStatic(produce.clj:14)
        at operatr.observe.produce$fn__26696.invoke(produce.clj:11)
        at operatr.observe.produce$fn__26677$G__26672__26690.invoke(produce.clj:8)
        at operatr.compute.v3.materialization$materialized_metrics.invokeStatic(materialization.clj:54)
        at operatr.compute.v3.materialization$materialized_metrics.invoke(materialization.clj:48)
        at operatr.compute.v3.materialization$materialize$fn__27635$fn__27642.invoke(materialization.clj:134)
        at clojure.core$map$fn__5935.invoke(core.clj:2772)
        at clojure.lang.LazySeq.sval(LazySeq.java:42)
        at clojure.lang.LazySeq.seq(LazySeq.java:51)
        at clojure.lang.Cons.next(Cons.java:39)
        at clojure.lang.RT.next(RT.java:713)
        at clojure.core$next__5451.invokeStatic(core.clj:64)
        at clojure.core$dorun.invokeStatic(core.clj:3143)
        at clojure.core$doall.invokeStatic(core.clj:3149)
        at clojure.core$doall.invoke(core.clj:3149)
        at operatr.compute.v3.materialization$materialize$fn__27635.invoke(materialization.clj:134)
        at clojure.core$map$fn__5935.invoke(core.clj:2772)
        at clojure.lang.LazySeq.sval(LazySeq.java:42)
        at clojure.lang.LazySeq.seq(LazySeq.java:51)
        at clojure.lang.RT.seq(RT.java:535)
        at clojure.core$seq__5467.invokeStatic(core.clj:139)
        at clojure.core$dorun.invokeStatic(core.clj:3134)
        at clojure.core$doall.invokeStatic(core.clj:3149)
        at clojure.core$doall.invoke(core.clj:3149)
        at operatr.compute.v3.materialization$materialize.invokeStatic(materialization.clj:127)
        at operatr.compute.v3.materialization$materialize.invoke(materialization.clj:121)
        at operatr.observation$conclude.invokeStatic(observation.clj:104)
        at operatr.observation$conclude.invoke(observation.clj:100)
        at operatr.observe.kafka$capture_telemetry.invokeStatic(kafka.clj:253)
        at operatr.observe.kafka$capture_telemetry.invoke(kafka.clj:235)
        at operatr.observe.kafka$kafka_telemetry_job$reify__28375$fn__28376.invoke(kafka.clj:292)
        at operatr.observe.kafka.proxy$java.lang.Object$Callable$7da976d4.call(Unknown Source)
        at com.codahale.metrics.Timer.time(Timer.java:101)
        at operatr.observe.kafka$kafka_telemetry_job$reify__28375.execute(kafka.clj:290)
        at cronut.SerialProxyJob.execute(cronut.clj:73)
        at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
10:40:51.372 INFO  [OperatrScheduler_Worker-5] operatr.observe.kafka – cluster xyz: [68011] :observe/kafka-telemetry snapshots captured in 96778 ms

Are the symptoms related to the error?

Thanks.
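For background, the assertion in that stacktrace is thrown by clojure.core.async itself, not by Kafka: put! on a channel queues a "pending put" when there is no buffer space and no consumer ready, and core.async caps that queue at 1024 (impl/MAX-QUEUE-SIZE). A minimal sketch that reproduces the same error, entirely unrelated to Kpow's internals:

(require '[clojure.core.async :as async])

;; An unbuffered channel with no consumer: every put! stays pending.
(def ch (async/chan))

;; core.async asserts (< (.size puts) impl/MAX-QUEUE-SIZE) before each
;; pending put is queued, so the 1025th call throws the AssertionError
;; seen in the log above.
(dotimes [_ 1025]
  (async/put! ch :metric))
;; => AssertionError: No more than 1024 pending puts are allowed on a
;;    single channel. Consider using a windowed buffer.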

d-t-w commented 2 years ago

Hi @rsaggino - one thing I notice is that the telemetry capture for your cluster is extremely slow:

:observe/kafka-telemetry snapshots captured in 96778 ms

96 seconds (where normally Kpow would snapshot a cluster in 1 to 5 seconds).

Can I ask, is Kpow installed in close network proximity to your cluster? Network lag can cause snapshotting time to blow out, which in turn can make the UI slow to load data.

Also, can you tell me roughly how many topics/partitions/groups that cluster contains?

Thanks for the stacktrace of that error; it's possibly unrelated, but I'll pick it up with the team on Monday.

d-t-w commented 1 year ago

Hi @rsaggino - just checking in to see if moving Kpow into close network proximity to your cluster helped.

Also regarding the MAX_QUEUE_SIZE issue, this is resolved in v90.3 of our community edition, available now.
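For anyone hitting the same assertion in their own core.async code, the error message's own suggestion (a windowed buffer) is one generic way out; this sketch assumes nothing about how Kpow's v90.3 fix actually works:

(require '[clojure.core.async :as async])

;; A sliding buffer drops the oldest value once full, so put! always
;; completes immediately and pending puts never accumulate.
(def ch (async/chan (async/sliding-buffer 1024)))

(dotimes [_ 10000]
  (async/put! ch :metric))  ;; no AssertionError; older values are dropped

;; async/dropping-buffer is the other windowed option: it discards the
;; newest value instead when the buffer is full.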

rsaggino commented 1 year ago

Hi Derek, sorry for the delay; I will try to dedicate a bit of time to this.

Can I ask, is Kpow installed in close network proximity to your cluster?

No more than 20ms.

how many topics/partitions/groups that cluster contains?

This is a test cluster with ~6000 topics (currently ~80% empty), ~30000 partitions, and ~4000 consumer groups.

Also regarding the MAX_QUEUE_SIZE issue, this is resolved in v90.3 of our community edition, available now.

I will try it and comment back.

Thanks.

rsaggino commented 1 year ago

I don't see the MAX_QUEUE issue anymore. Unfortunately, the UI hangs heavily: it doesn't respond at all for seconds at a time, with no errors in the log. It looks like the backend is keeping the frontend from responding?

13:01:55.359 INFO  [OperatrScheduler_Worker-7] operatr.observe.kafka – cluster abc: [69962] :observe/kafka-telemetry snapshots captured in 101103 ms
13:04:16.572 INFO  [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materialize [69988] {:id "abc", :domain :cluster, :observation :observe/kafka-telemetry}
13:04:17.358 INFO  [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca produce [6419] :cluster abc simple metrics
13:04:17.358 INFO  [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materializing [69988] [:cluster "abc"]
13:04:18.420 INFO  [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materializing [58734] [:cluster "abc"] :kafka/topic-summary __kpow_global
13:04:23.296 INFO  [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materializing [17918] [:cluster "abc"] :kafka/group-summary __kpow_global
13:04:24.861 INFO  [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materializing [17918] [:cluster "abc"] :kafka/simple-consumer-summary __kpow_global
13:04:24.912 INFO  [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materializing [15] [:cluster "abc"] :kafka/broker __kpow_global
13:04:24.912 INFO  [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca materializing [1] [:cluster "abc"] :kafka/cluster __kpow_global
13:04:25.514 INFO  [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca produce [13370] [:cluster "abc" :kafka/topic-summary] ["__kpow_global"] materialized metrics
13:04:25.515 INFO  [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca produce [2] [:cluster "abc" :kafka/simple-consumer-summary] ["__kpow_global"] materialized metrics
13:04:26.015 INFO  [OperatrScheduler_Worker-8] operatr.compute.v3.materialization – 200f867f-cc86-45b6-9d44-f9531e0886ca produce [7036] [:cluster "abc" :kafka/group-summary] ["__kpow_global"] materialized metrics
13:04:26.015 INFO  [OperatrScheduler_Worker-8] operatr.observe.kafka – cluster abc: [69988] :observe/kafka-telemetry snapshots captured in 131168 ms

rsaggino commented 1 year ago

Since the original problem is fixed, I will close this. I can open a separate issue for the performance problem if you agree.

Thanks.

d-t-w commented 1 year ago

Hi @rsaggino, thanks for the update.

Please do open a new ticket for performance tuning. Could you include in the ticket:

  1. The CPU available to the docker container (request/limit if deploying to kube via helm)
  2. The Memory available to the docker container (request/limit if deploying to kube via helm)

6k topics, 4k groups, 30k partitions lands on the very-large side of things. Normally in that case we would spend some time optimizing and tuning with a customer, so I'm not super surprised that the Community Edition is running very slowly.

We have some parallelism/compute knobs we can tweak, though. The goal is to get Kpow snapshotting your cluster within 30s; I can tell that it's currently taking 120s, so we're not massively off.

As a starting point, I would suggest 2 CPU and 8GB heap if you are currently running with less than that.
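If you want to confirm the heap the container actually granted the JVM, a generic Runtime check from any JVM REPL works (this is plain Java interop, not a Kpow API):

;; Max heap the JVM will attempt to use, reported in GiB; roughly 8.0
;; when the process was started with -Xmx8g.
(/ (.maxMemory (Runtime/getRuntime)) 1024.0 1024.0 1024.0)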

Also, we have v90.4 out tomorrow with some performance improvements included.