Kotlin / kotlinx.coroutines

Library support for Kotlin coroutines
Apache License 2.0

Coroutine scheduler monitoring #1360

Open asad-awadia opened 5 years ago

asad-awadia commented 5 years ago

Are there any monitoring tools available for how many coroutines are currently active, their state, etc.? It would be nice if this could be exposed so that something like Prometheus can scrape it and visualize it in Grafana.

It would also help in debugging leaks and errors: if we see the number of coroutines rising linearly, something is wrong.

If not, can this be done by looking at thread stats instead?

Go exposes this via runtime.NumGoroutine()

Related: https://github.com/Kotlin/kotlinx.coroutines/issues/494

elizarov commented 5 years ago

Please take a look at the kotlinx-coroutines-debug module: https://github.com/Kotlin/kotlinx.coroutines/blob/master/kotlinx-coroutines-debug/README.md
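For reference, the debug module can already answer the "how many coroutines are alive" part of the question, albeit with the overhead discussed below. A minimal sketch, assuming the kotlinx-coroutines-debug artifact is on the classpath:

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.debug.DebugProbes

fun main() {
    // Requires the kotlinx-coroutines-debug artifact on the classpath.
    DebugProbes.install()
    runBlocking {
        val jobs = List(3) { launch { delay(1_000) } }
        // Snapshot of every coroutine the probes currently track,
        // including the enclosing runBlocking coroutine.
        val infos = DebugProbes.dumpCoroutinesInfo()
        println("live coroutines: ${infos.size}")
        jobs.forEach { it.cancelAndJoin() }
    }
    DebugProbes.uninstall()
}
```

Note that DebugProbes works by instrumenting coroutine internals at install time, so it is intended for debugging rather than always-on production metrics.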

glasser commented 5 years ago

This does look pretty useful, but it also seems like it might have a notable performance impact?

The monitoring that looks attractive to me would be getting a gauge on the sizes of the CoroutineScheduler queues (global and local).

Our biggest fear is accidentally putting slow blocking work (or worse, deadlocks) in our main dispatcher (which happened to us once on a previous project using Kotlin coroutines incorrectly, and also when using Ratpack’s coroutine-style execution).

So getting alerted if work is building up over time (ie, if the queues are getting too big/growing indefinitely) seems helpful.

Would it be reasonable to expose some of these stats somewhere? These stats are specific to the CoroutineScheduler so I don't think kotlinx-coroutines-debug is relevant.

As an awful hack we are considering parsing (Dispatchers.Default as ExecutorCoroutineDispatcher).executor.toString(), with full understanding that it may break at any time.
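The string-scraping hack mentioned above can be sketched roughly as follows. The dump format in the regex and the sample string are assumptions modeled on observed CoroutineScheduler output; the format is undocumented and may change in any release, which is exactly why this is an awful hack:

```kotlin
// Hypothetical sketch: scrape queue sizes out of the scheduler's toString().
// The "global ... queue size" labels are assumed from observed output and
// may break at any time -- treat this purely as a stopgap.
val schedulerStats =
    Regex("""global CPU queue size = (\d+), global blocking queue size = (\d+)""")

fun parseQueueSizes(dump: String): Pair<Int, Int>? =
    schedulerStats.find(dump)?.let { m ->
        m.groupValues[1].toInt() to m.groupValues[2].toInt()
    }

fun main() {
    // In the real hack the string would come from
    // (Dispatchers.Default as ExecutorCoroutineDispatcher).executor.toString();
    // a canned sample is used here.
    val sample = "DefaultDispatcher@7a81197d[..., " +
        "global CPU queue size = 4, global blocking queue size = 2, ...]"
    println(parseQueueSizes(sample))  // (4, 2)
}
```

A polling loop at the application level could then feed the parsed numbers into a metrics gauge.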

elizarov commented 5 years ago

> The monitoring that looks attractive to me would be getting a gauge on the sizes of the CoroutineScheduler queues (global and local).

@glasser Yes, that can be done without the slow debug mode and makes sense. I'll keep it open as an enhancement.

glasser commented 5 years ago

Thanks! Should I interpret that as "you're going to do it" or "you'd accept patches"?

qwwdfsad commented 5 years ago

Unfortunately, we are not ready to accept patches right now because the scheduler is being actively reworked.

But it would be really helpful if you could provide a more detailed example of the desired API shape and problem you want to solve with this API.

For example: "Ideally, we'd see it as a pluggable SPI service for the dispatcher with the following methods ..., so we could use it to trigger our monitoring if ..."

glasser commented 5 years ago

Interesting — is there a branch or design doc or something for the reworking? Curious how it's changing.

My proposal is pretty simple. A few of the core objects involved with coroutine scheduling should (a) be publicly accessible and (b) expose a few properties that provide statistics about them. It's fine if these are documented as "experimental, subject to change, don't rely on this" and as "fetching these properties may have a performance impact if done frequently" (e.g., ConcurrentLinkedQueue.size is O(n)).

Most specifically, I'd want to have access to

I don't need kotlinx.coroutines to provide any machinery for hooking this up to my metrics service: I'm happy to keep at application (or external library) level the code that takes the dispatchers I care about, polls them for metrics, and publishes to my metrics service of choice.
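To make the requested API shape concrete, here is one hypothetical sketch. None of these names exist in kotlinx.coroutines; SchedulerStats and QueueGauge are purely illustrative of the split described above, where the library exposes read-only counters and the application does the polling:

```kotlin
// Hypothetical API shape (illustrative names, not a real kotlinx API):
// a read-only stats view a dispatcher could expose for polling.
interface SchedulerStats {
    val globalCpuQueueSize: Int
    val globalBlockingQueueSize: Int
    val localQueueSizes: List<Int>
}

// Application-level polling: the app, not the library, wires up metrics.
class QueueGauge(
    private val stats: SchedulerStats,
    private val alertThreshold: Int,
) {
    // Returns true when the total backlog exceeds the threshold,
    // i.e. when the app should fire an alert or export a gauge sample.
    fun sample(): Boolean {
        val backlog = stats.globalCpuQueueSize +
            stats.globalBlockingQueueSize +
            stats.localQueueSizes.sum()
        return backlog > alertThreshold
    }
}

fun main() {
    val fake = object : SchedulerStats {
        override val globalCpuQueueSize = 10
        override val globalBlockingQueueSize = 3
        override val localQueueSizes = listOf(1, 2)
    }
    println(QueueGauge(fake, alertThreshold = 8).sample())  // true: 16 > 8
}
```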

qwwdfsad commented 5 years ago

> Interesting — is there a branch or design doc or something for the reworking? Curious how it's changing.

No to both, though the changes will, of course, be properly documented. Mostly it's about changing the parking/spinning strategy without violating the liveness property, in order to reduce CPU consumption at low request rates and to get robust idle-thread termination. The change is just too intrusive and touches every part of the scheduler.

Thanks for the details! Could you please clarify: is this for an Android app or for some backend service? Asking because there is also a chance that Dispatchers.Default will be backed by ForkJoinPool on Android by default (mostly to reduce dex size and thread count), so we would have to make this observability interoperate with FJP as well.

glasser commented 5 years ago

This is for server usage.

We are currently porting a few web servers from Ratpack to Ktor. Ratpack has an async structure similar to Kotlin coroutines (its recommended usage is a pool of "compute" threads roughly equal in size to the number of CPUs, plus a scaling "blocking" pool). But because all work has to be done with explicit Promise composition rather than the nice syntax of Kotlin coroutines, we've found that developers often don't bother to keep blocking work out of the compute pool, and often implement error handling incorrectly (e.g. by putting try/catch/finally or retry loops around functions that return Promises rather than properly using the Promise API). Our hope is that Kotlin coroutines will be much more accessible. But we still want to monitor that we're not clogging up the pools!

(Ratpack Promises also have some other odd behavior. For example, Blocking.get {}, which is somewhat like withContext(Dispatchers.IO) {}, does not actually invoke the given block on the scalable thread pool until the currently running code fully returns to the event loop (the equivalent of suspension). This meant that some misguided attempts to make a blocking call within a non-Promise-returning function use the "right" thread pool by writing (effectively) Blocking.get {}.get() not only tied up the current thread as you might expect, but actually blocked indefinitely, because the block never got run! Hopefully our complete rewrite will avoid these corner cases.)

cprice404 commented 4 years ago

+1 to everything that @glasser said. Looking to start replacing some thread pools with coroutines in our high-volume, production, back-end service, and would feel a lot better about it if we had some way to emit metrics about the health of the pools/scheduler. Thanks!

lfmunoz commented 4 years ago

I have an app that launches millions of CPU-bound coroutines, and they are taking longer than expected to complete. I am wondering whether the overhead of scheduling and executing them is the cause. I would like monitoring on the queue size for this reason.

damian-pacierpnik-jamf commented 4 years ago

Any updates on this? Any news on when it may be implemented? We are also interested in monitoring the number of coroutines, and it is really disappointing that such a basic metric is not available by default.

anderssv commented 3 years ago

Any updates on this? Any other ways of getting similar numbers? Wanting metrics for basically the same reasons as @glasser. :)

vikiselev commented 3 years ago

Any updates? I'm interested as well.

premnirmal commented 3 years ago

Also interested in this

qwwdfsad commented 3 years ago

We aim to implement it in the next releases after 1.5.0

joost-de-vries commented 3 years ago

Our use case is also high load server side. In addition to the metrics glasser mentioned:

soudmaijer commented 3 years ago

@qwwdfsad any updates? Also very much interested in this.

joost-de-vries commented 3 years ago

@soudmaijer for us this is so critical that I implemented the 'awful hack' that glasser mentioned. See https://github.com/joost-de-vries/spring-reactor-coroutine-metrics/tree/coroutineDispatcherMetrics/src/main/kotlin/metrics

cprice404 commented 2 years ago

> We aim to implement it in the next releases after 1.5.0

Does that mean that this will be addressed in 1.6.0 (which appears to be close to release)?

dovchinnikov commented 2 years ago

In IJ we have our own unbounded executor (let's call it ApplicationPool). We log a thread dump when the number of threads exceeds a certain value, but we don't prevent spawning new threads. I'd like to replace ApplicationPool with Dispatchers.IO.limitedParallelism(Int.MAX_VALUE), but I'm missing the diagnostics part.

Using an effectively unlimited IO dispatcher would allow us to drop our own executor service (a single-pool-for-the-whole-app approach) and avoid the unnecessary thread switches that inevitably happen between Dispatchers.Default and ApplicationPool.asCoroutineDispatcher().
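The replacement described above might look like the following sketch. applicationPool is an illustrative name standing in for the executor being replaced; limitedParallelism is the real kotlinx.coroutines API:

```kotlin
import kotlinx.coroutines.*

// Sketch: an effectively unbounded view of Dispatchers.IO that could
// stand in for the ApplicationPool executor described above.
@OptIn(ExperimentalCoroutinesApi::class)
val applicationPool: CoroutineDispatcher =
    Dispatchers.IO.limitedParallelism(Int.MAX_VALUE)

fun main() = runBlocking {
    val threadName = withContext(applicationPool) {
        // Blocking work would run here without starving Dispatchers.Default.
        Thread.currentThread().name
    }
    println("ran on: $threadName")
}
```

Because the view shares the IO dispatcher's threads, switching between it and Dispatchers.Default avoids spinning up a wholly separate pool, which is the thread-switch saving mentioned above.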

jaredjstewart commented 1 year ago

Is there any update on this issue?

chenzhihui28 commented 1 year ago

any update?

glasser commented 1 year ago

@joost-de-vries is your hack still working out reasonably well for you?

cleidiano commented 4 months ago

Is there any update on this issue?