clj-commons / manifold

A compatibility layer for event-driven abstractions

Understanding the stats->map output #232

Open KaliszAd opened 1 year ago

KaliszAd commented 1 year ago

I have trouble understanding the (manifold-exec/stats->map (.getStats executor)) output. After a trivial transformation, I get:

#:thread-pool{:num-workers 93,
              :utilization #:summary{:permille-950 0.7666895163808638,
                                     :permille-999 0.8500435866770647,
                                     :permille-900 0.7023507840369302,
                                     :permille-500 0.0,
                                     :permille-990 0.8200846893009659},
              :queue-latency #:summary{:permille-950 2.602875999999999,
                                       :permille-999 11.711050468000012,
                                       :permille-900 1.6487680000000007,
                                       :permille-500 0.085797,
                                       :permille-990 6.60346608},
              :task-completion-rate #:summary{:permille-950 400.0,
                                              :permille-999 1060.7999999999993,
                                              :permille-900 240.0,
                                              :permille-500 0.0,
                                              :permille-990 640.0},
              :task-latency #:summary{:permille-950 903.397588,
                                      :permille-999 2057.2366079300305,
                                      :permille-900 761.185869,
                                      :permille-500 1.472786,
                                      :permille-990 1242.8296557000006},
              :queue-length #:summary{:permille-950 0.0,
                                      :permille-999 0.0,
                                      :permille-900 0.0,
                                      :permille-500 0.0,
                                      :permille-990 0.0},
              :task-arrival-rate #:summary{:permille-950 380.0,
                                           :permille-999 710.3999999999996,
                                           :permille-900 240.0,
                                           :permille-500 0.0,
                                           :permille-990 560.0},
              :task-rejection-rate #:summary{:permille-950 0.0,
                                             :permille-999 0.0,
                                             :permille-900 0.0,
                                             :permille-500 0.0,
                                             :permille-990 0.0}}

I don't get how the task arrival rate can be 0 in the Q-50. Why is there a queue latency when the queue length is 0? This particular executor is a utilization executor (0 queue length by default). It is created using (flow/utilization-executor 0.9 512 {:initial-thread-count 10}).
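The "trivial transformation" is not shown in the issue. One guess at its shape, assuming `stats->map` returns plain keywords at the top level and quantiles (0.5, 0.9, ...) as the inner keys, is the sketch below. `quantile->key` and `namespace-stats` are hypothetical names, not part of Manifold.

```clojure
;; Hypothetical helpers; they only reshape keys and leave the values untouched.
(defn quantile->key
  "Turns a quantile such as 0.95 into :summary/permille-950."
  [q]
  (keyword "summary"
           (str "permille-" (Math/round (double (* 1000.0 q))))))

(defn namespace-stats
  "Namespaces top-level keys under thread-pool and renames the quantile keys."
  [stats]
  (into {}
        (map (fn [[k v]]
               [(keyword "thread-pool" (name k))
                (if (map? v)
                  (into {} (map (fn [[q x]] [(quantile->key q) x])) v)
                  v)]))
        stats))

;; Assumed input shape (illustrative values taken from the output above).
(namespace-stats {:num-workers 93
                  :task-arrival-rate {0.5 0.0, 0.95 380.0}})
;; equal to {:thread-pool/num-workers 93
;;           :thread-pool/task-arrival-rate {:summary/permille-500 0.0
;;                                           :summary/permille-950 380.0}}
```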

KingMob commented 1 year ago

I don't know off the top of my head. I'd have to dig deep into the Dirigiste code to refresh my memory. But I'll give you my immediate guesses. @arnaudgeiser may also have some insights.

> I don't get how the task arrival rate can be 0 in the Q-50.

If no tasks arrive for at least half the recording period that stats were collected for, then the median (50th percentile) will be 0. If you start stuff up in the background, and only use it occasionally (like in the REPL, or a low-use server), this seems pretty natural to me. Let me turn it around: why do you think it couldn't be 0?
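To make the median argument concrete, here is a small sketch with made-up per-interval arrival rates. It uses a simple nearest-rank quantile, which is an assumption for illustration and not necessarily how Dirigiste estimates its quantiles. `quantile` and `arrival-rates` are hypothetical names.

```clojure
(defn quantile
  "Nearest-rank quantile of a non-empty collection of numbers, for 0 < q <= 1."
  [q xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        rank   (long (Math/ceil (* q n)))]
    (nth sorted (dec (max 1 rank)))))

;; Made-up samples: the pool is idle in 6 of the 10 sampling intervals.
(def arrival-rates [0 0 0 0 0 0 240 380 560 710])

(quantile 0.5 arrival-rates)   ;; => 0   (more than half the samples are 0)
(quantile 0.95 arrival-rates)  ;; => 710 (the tail still shows the bursts)
```

As long as more than half of the samples are zero, the median stays at 0 no matter how large the bursts in the remaining intervals are.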

> Why is there a queue latency when the queue length is 0? This particular executor is a utilization executor (0 queue length by default).

Yeah, this one's a little confusing. What queue-latency is actually measuring is time to start executing submitted tasks, which is always non-zero. (The name, or docs, might be clarified on this point. PRs welcome.)

Even if you specify a queue length of 0, the code must still deal with the situation of the executor not being ready to run immediately. To handle that, it uses a SynchronousQueue, which blocks the submitting thread until the executor can accept the Runnable. And even if the executor was always ready, time still elapses between when the job is submitted, and when it starts, regardless.
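The hand-off behavior can be seen in isolation with a plain `java.util.concurrent.SynchronousQueue`. This is only a sketch of the blocking hand-off described above, not Dirigiste's actual code; `handoff-latency-ms` and its `worker-busy-ms` parameter are hypothetical.

```clojure
(import '(java.util.concurrent SynchronousQueue))

(defn handoff-latency-ms
  "Hands one task to a worker that is busy for `worker-busy-ms` first.
  Returns the milliseconds between submitting the task and the task starting."
  [worker-busy-ms]
  (let [q       (SynchronousQueue.)
        started (promise)
        worker  (Thread.
                  (fn []
                    (Thread/sleep (long worker-busy-ms)) ; worker not ready yet
                    (let [task (.take q)]                ; now accepts the hand-off
                      (task))))]
    (.start worker)
    (let [submitted (System/nanoTime)]
      ;; put blocks the submitting thread until the worker calls take;
      ;; the queue never stores the task, so its size stays 0 throughout.
      (.put q (fn [] (deliver started (System/nanoTime))))
      (.join worker)
      (/ (- @started submitted) 1e6))))

;; Roughly the time the worker was busy, even though nothing was ever queued.
(handoff-latency-ms 50)
```

`(.size q)` on a `SynchronousQueue` always returns 0, which is consistent with a reported queue length of 0 alongside a non-zero wait before the task starts.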

It should probably be called something like job-start-latency.

KaliszAd commented 1 year ago

> I don't get how the task arrival rate can be 0 in the Q-50.
>
> If no tasks arrive for at least half the recording period that stats were collected for, then the median (50th percentile) will be 0. If you start stuff up in the background, and only use it occasionally (like in the REPL, or a low-use server), this seems pretty natural to me. Let me turn it around: why do you think it couldn't be 0?

Ah OK, that makes sense. I somehow didn't realize there could actually be no tasks in at least half of the possible arrival time slots. Now it also makes sense why arrival rates are often modeled using exponential distributions.

Yes, there could definitely be a bit more explanation of what the metrics and their stats mean in Dirigiste and Manifold, and why they behave as they do. It would also be useful to document the units (AFAIK milliseconds for latencies), so that people don't need to sift through the source code.

KingMob commented 1 year ago

PRs welcome 😉