Closed: essen closed this 4 years ago
This looks great, will check it out as soon as I get a minute.
hey, I'm eager to merge this or #101. thoughts?
I'm fine either way. @gerhard can see which one he prefers. :-)
Monday - 6th of Jan - is the first day that I can context switch to this. Will try both approaches and see which one works better. Either way, the intention is to get one of them merged next week. 👍 if ok
Thanks @lukebakken for taking over this, my priorities are not allowing me to indulge in taking #98 and #101 for a spin.
These are the questions - in no particular order - which I was going to use in determining which of these 2 approaches best delivers on what is expected:

- What does `msacc:print/0` return?
- With `--with-microstate-accounting=extra`, are the extra metrics returned as well?
- Can the `/metrics` HTTP endpoint, as used by the Prometheus server, be scraped?
- Stats per type metrics? Show us!

If it helps, let's chat after today's sync-up @lukebakken 👍
@gerhard -
erlang_vm_msacc_gc{type="async",id="0"} 0
All stats are in the `erlang_vm_thread_microstates_microseconds` category, differentiated via the `thread_type` attribute:
erlang_vm_thread_microstates_microseconds{thread_type="async",thread_id="0",thread_state="gc"} 0.0
Yes in both PRs
Both use microseconds.
TODO
Yes. Since #98 is not in an all-in-one category, each stat has more detail in its help text.
Not sure, to be honest.
@gerhard

> `application:get_env` on every stats collection run (ets lookup).

> `msacc:stats/0` to get stats, which I believe removes the need to call `erlang:convert_time_unit/3` and could be used in #98 (?)

This PR doesn't require `runtime_tools`, which it would if using `msacc` functions. `msacc` is just a convenience wrapper above the microstate accounting functionality, see https://github.com/erlang/otp/blob/master/lib/runtime_tools/src/msacc.erl#L68. And `msacc:stats/0` actually does more than just convert time units; it also sorts and puts the data into maps, which we don't need: https://github.com/erlang/otp/blob/master/lib/runtime_tools/src/msacc.erl#L146
The `application:get_env` lookup to enable/disable certain stats is done by all exporters. There's no real penalty in doing this.
I don't know if enabling microstate accounting automatically is a good idea, you generally don't want microstate accounting to be always enabled, and it's simpler to set a system_flag to start/stop it than to add/remove a Prometheus exporter.
From a Prometheus metric & label naming perspective, neither format adheres to the expected standard:
- `erlang_vm_msacc_gc{type="async",id="0"} 0` is missing the unit
- `erlang_vm_thread_microstates_microseconds{thread_type="async",thread_id="0",thread_state="gc"} 0.0` is using the wrong scale for the unit & is missing `_total`
Before we settle on appropriate names & labels, I would like to go back to the original `msacc:print/0` output.
Given the following (`--with-microstate-accounting=extra` is enabled):
erl -noshell -eval 'msacc:start(1000), msacc:print(), halt().'
Average thread real-time : 1003114 us
Accumulated system run-time : 7321 us
Average scheduler run-time : 2811 us
Thread alloc aux bifbusy_wait check_io emulator ets gc gc_full nif other port send sleep timers
Stats per thread:
async( 0) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
aux( 1) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_cpu_( 1) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_cpu_( 2) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_io_s( 1) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_io_s( 2) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_io_s( 3) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_io_s( 4) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_io_s( 5) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_io_s( 6) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_io_s( 7) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_io_s( 8) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_io_s( 9) 0.00% 0.00% 0.00% 0.02% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 99.97% 0.00%
dirty_io_s(10) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
poll( 0) 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
scheduler( 1) 0.00% 0.00% 0.00% 0.07% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.03% 0.00% 0.00% 99.90% 0.00%
scheduler( 2) 0.00% 0.00% 0.01% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 99.99% 0.00%
Stats per type:
async 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
aux 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_cpu_sche 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
dirty_io_sched 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
poll 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 100.00% 0.00%
scheduler 0.00% 0.00% 0.01% 0.03% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.01% 0.00% 0.00% 99.94% 0.00%
I would expect the following metrics:
erlang_vm_msacc_thread_type_async_seconds_total{state="alloc"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="aux"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="bifbusy_wait"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="check_io"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="emulator"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="ets"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="gc"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="gc_full"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="nif"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="other"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="port"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="send"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="sleep"} 1
erlang_vm_msacc_thread_type_async_seconds_total{state="timers"} 0
# repeat for all other thread types
The above example output incorporates the learnings from https://github.com/rabbitmq/rabbitmq-prometheus/issues/26#issuecomment-569959281. If we track the thread id by default, then we will end up with an explosion of metrics, directly proportional to the number of CPUs.
- Given a host with 16 CPUs (we regularly see 32 or 64), there will be 66 combinations of thread type & state by default: `16*4 + 2`
- Multiplied by the number of default thread states we get a minimum of 462 metrics: `66*7`
- Enable extra msacc (14 thread states) and we get 924 metrics: `66*14`
- Obviously, 1 async thread is unrealistic, so take `+A 128` (which is what RabbitMQ would do by default), combined with extra msacc on a 16 CPUs system, and we end up with 2,688 metrics just for msacc.

If we decide to go down the aggregation path by default - which I am recommending - regardless of the number of CPUs/schedulers/async threads, we would end up with either 42 metrics - `6*7` - or at most 84 metrics - `6*14` - if msacc extra is enabled. OK, we will not be able to see scheduler imbalances, but the first goal is to see where different schedulers spend most time. We should introduce a config option to enable per thread metrics, but not enable it by default, because of the metrics explosion described above.
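The series-count arithmetic above can be sanity-checked with a quick sketch (Python purely for illustration; the 16-CPU figures, state counts, and 6 thread types are the ones quoted in this comment, not derived independently):

```python
# Sanity-check of the metric-count arithmetic quoted above.
cpus = 16

# 66 combinations of thread type & id by default: 16*4 + 2
default_threads = cpus * 4 + 2   # 66

default_states = 7               # default msacc thread states
extra_states = 14                # with --with-microstate-accounting=extra

print(default_threads * default_states)  # per-thread metrics, default states
print(default_threads * extra_states)    # per-thread metrics, extra states

# Aggregating per thread type instead caps the series count
# regardless of CPU count:
thread_types = 6
print(thread_types * default_states)
print(thread_types * extra_states)
```

This makes the trade-off explicit: the per-thread series count grows linearly with CPUs, while the per-type aggregate stays constant.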
As a separate point, we are missing thread real-time & system + scheduler run-time metrics. For the above example, I would expect to see:
erlang_vm_average_thread_real_time_seconds 1
erlang_vm_accumulated_system_run_time_seconds_total 0.007321
erlang_vm_average_scheduler_run_time_seconds 0.002811
If this collector is enabled and msacc is not running, I think that it would make sense to start/stop it for brief periods of time - maybe 1s - and return stats from this period. If msacc was already running, which I believe is what you are thinking @essen, then do not stop it. The return value of `msacc:start/0` makes it easy to determine this. If you think that this is a bad idea, tell me why 🙂
I think we need to determine what this collector is for. Microstate accounting is a profiling tool that's not meant to be always on, so the assumptions are different for this exporter compared to other exporters. If you envision to enable this at all times you probably need to confirm that you can take the performance hit from collecting those statistics (regardless of the metrics output). More details below.
> From a Prometheus metric & label naming perspective, neither format adheres to the expected standard:
Bummer! But not really surprising.
> [...] The above example output incorporates the learnings from rabbitmq/rabbitmq-prometheus#26 (comment). If we track the thread id by default, then we will end up with an explosion of metrics, directly proportional to the number of CPUs.
Yeah. But sometimes you really are looking into issues that affect one or a limited number of threads. That's pretty much the point of these stats. So it's best to keep that in mind.
> - Given a host with 16 CPUs (we regularly see 32 or 64), there will be 66 combinations of thread type & state by default: `16*4 + 2`
> - Multiplied by the number of default thread states we get a minimum of 462 metrics: `66*7`
> - Enable extra msacc (14 thread states) and we get 924 metrics: `66*14`
> - Obviously, 1 async thread is unrealistic, so take `+A 128` (which is what RabbitMQ would do by default), combined with extra msacc on a 16 CPUs system, and we end up with 2,688 metrics just for msacc.
Actually 1 async thread is very realistic with modern Erlang/OTP, because it's not used anymore. I don't know the exact version the VM stopped using them but it's possible that RabbitMQ doesn't need to set `+A` anymore. It's using dirty schedulers now. I'm not sure whether you need as many dirty schedulers as you did async threads. Probably don't.
> If we decide to go down the aggregation path by default - which I am recommending - regardless of the number of CPUs/schedulers/async threads, we would end up with either 42 metrics - `6*7` - or at most 84 metrics - `6*14` - if msacc extra is enabled. OK, we will not be able to see scheduler imbalances, but the first goal is to see where different schedulers spend most time. We should introduce a config option to enable per thread metrics, but not enable it by default, because of the metrics explosion described above.
Are we talking about the RabbitMQ default or the exporter default? Will this aggregate better fit people than per thread? Also what is the real cost of aggregating/not aggregating? Or aggregating in Grafana directly?
I doubt one or the other is the better answer, which is why `msacc:print/0` prints both per thread and aggregated. But you can aggregate from per thread so that seemed like the better course of action.
Are you perhaps thinking in terms of leaving microstate accounting always on? I'm not sure that's what it should be, I think it's something you want to have there to enable when you need to debug a performance issue, then disable when you are done investigating. That's why I'm not terribly worried about producing potentially large number of metrics, they're only going to be produced when necessary.
> As a separate point, we are missing thread real-time & system + scheduler run-time metrics. For the above example, I would expect to see:
>
> erlang_vm_average_thread_real_time_seconds 1
> erlang_vm_accumulated_system_run_time_seconds_total 0.007321
> erlang_vm_average_scheduler_run_time_seconds 0.002811
No, that's an aggregate as well. Aggregated time excluding sleep time, aggregated time including sleep time, and aggregated time of the scheduler threads excluding sleep time which is then divided by the number of schedulers.
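To make those three aggregates concrete, here is an illustrative sketch (Python, with a simplified data shape loosely modelled on `msacc:stats/0` output; it is not the exporter's actual code). The made-up counters are chosen so the totals reproduce the 7321 us and 2811 us figures from the msacc output quoted earlier:

```python
# Hypothetical per-thread microstate counters, in microseconds.
threads = [
    {"type": "scheduler", "id": 1,
     "counters": {"emulator": 700, "port": 300, "sleep": 999_000}},
    {"type": "scheduler", "id": 2,
     "counters": {"emulator": 4500, "other": 122, "sleep": 995_378}},
    {"type": "aux", "id": 1,
     "counters": {"other": 1699, "sleep": 998_301}},
]

def run_time_us(t):
    # Run-time is time spent doing work, i.e. everything except sleep.
    return sum(us for state, us in t["counters"].items() if state != "sleep")

# Average thread real-time: run + sleep time, averaged over all threads.
real = sum(run_time_us(t) + t["counters"]["sleep"] for t in threads) / len(threads)

# Accumulated system run-time: total non-sleep time across all threads.
system_run = sum(run_time_us(t) for t in threads)

# Average scheduler run-time: non-sleep time of scheduler threads,
# divided by the number of schedulers.
scheds = [t for t in threads if t["type"] == "scheduler"]
sched_run = sum(run_time_us(t) for t in scheds) / len(scheds)

print(real, system_run, sched_run)
```

With these numbers, `system_run` comes out at 7321 us and `sched_run` at 2811 us, matching the "Accumulated system run-time" and "Average scheduler run-time" lines above.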
> If this collector is enabled and msacc is not running, I think that it would make sense to start/stop it for brief periods of time - maybe 1s - and return stats from this period. If msacc was already running, which I believe is what you are thinking @essen, then do not stop it. The return value of `msacc:start/0` makes it easy to determine this. If you think that this is a bad idea, tell me why
Well microstate accounting works by recording data over a period of time, if we have to enable it when we produce metrics then we'll need to wait for that period of time before we send those metrics. If we enable it automatically when metrics are pulled then we won't be able to disable it until Prometheus stops pulling metrics. If we enable it automatically at startup then it will run until it is manually stopped.
The latter might be useful as a convenience but it shouldn't be the default IMO.
The way I see things, is that I would enable this exporter by default but that it would only produce metrics when I enable microstate accounting. The rest of the time it would basically do nothing.
I agree with this comment and have opened essen/prometheus.erl#1 to simply add the `_microseconds` suffix as suggested by the prometheus guidelines. Using seconds would lose valuable information.
> But sometimes you really are looking into issues that affect one or a limited number of threads.

OK, agreed to emitting per thread `id` metrics.
> Actually 1 async thread is very realistic with modern Erlang/OTP, because it's not used anymore. I don't know the exact version the VM stopped using them but it's possible that RabbitMQ doesn't need to set `+A` anymore. It's using dirty schedulers now. I'm not sure whether you need as many dirty schedulers as you did async threads. Probably don't.
This is most interesting, I would very much like to see the result of various RabbitMQ benchmarks with no `+A` setting. Took a note, thanks!
> The way I see things, is that I would enable this exporter by default but that it would only produce metrics when I enable microstate accounting. The rest of the time it would basically do nothing.
How would you recommend for microstate accounting to be enabled on a node? Based on this comment, I would think via remote shell, but maybe there is a better/simpler way that I don't know. I am assuming that `erlang:system_flag(microstate_accounting, false)` is what disables microstate accounting.
> simply add the `_microseconds` suffix as suggested by the prometheus guidelines. Using seconds would lose valuable information.

Prometheus guidelines recommend using seconds for the base unit, and using floats to capture the precision. That's what we are doing in `rabbitmq_core_metrics_collector`, and that's what all other good exporters do, like node_exporter for example. Also, since these are accumulating metrics, we need to use `_seconds_total`.
Other than this last issue, once we have some details about this collector in the README - how to enable it, how to enable/disable microstate accounting, why it is a bad idea to leave it on all the time, maybe something else? - I am happy to get this merged and start working on the Grafana dashboard, and get @lukebakken involved as well, so that we can share the how-to.
@essen out of curiosity, why did you choose msacc_state as the primary dimension (the metric name) & msacc_type as the secondary one? `msacc:print/0` uses msacc_type & id as the primary dimension - the rows - and msacc_state as the secondary one - the columns - which makes understanding thread imbalances straightforward.
> > The way I see things, is that I would enable this exporter by default but that it would only produce metrics when I enable microstate accounting. The rest of the time it would basically do nothing.
>
> How would you recommend for microstate accounting to be enabled on a node? Based on this comment, I would think via remote shell, but maybe there is a better/simpler way that I don't know. I am assuming that `erlang:system_flag(microstate_accounting, false)` is what disables microstate accounting.
RabbitMQ could just have a `rabbitmqctl` command to start/stop it (or do it through its eval command). Erlang projects that use releases also have an eval command. So it's fairly painless, just need to document it somewhere so that it can be copy-pasted.
> > simply add the `_microseconds` suffix as suggested by the prometheus guidelines. Using seconds would lose valuable information.
>
> Prometheus guidelines recommend using seconds for the base unit, and using floats to capture the precision. That's what we are doing in `rabbitmq_core_metrics_collector`, and that's what all other good exporters do, like node_exporter for example. Also, since these are accumulating metrics, we need to use `_seconds_total`.
Using floats to capture the precision. Now that's a funny thought. :-) Doesn't really matter I guess, on the Erlang side we would still need to work in microseconds first and only convert for the output. @lukebakken if you have time feel free to do those changes. Thanks!
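A minimal sketch of what that conversion could look like (Python for illustration; the metric name and label set here are hypothetical - only the microsecond-internal/seconds-on-output split and the `_seconds_total` suffix come from the discussion above):

```python
# Keep counters in integer microseconds internally; convert to seconds
# with a float division only when rendering the exposition line.
def render_seconds_total(name, labels, microseconds):
    # name and labels are made up for this example.
    label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
    return f"{name}_seconds_total{{{label_str}}} {microseconds / 1_000_000}"

line = render_seconds_total(
    "erlang_vm_msacc_emulator",
    {"type": "scheduler", "id": "1"},
    7321,  # 7321 us, the accumulated system run-time from the msacc output above
)
print(line)
```

The float keeps the full microsecond precision (1 us = 1e-6 s), so nothing is lost by exposing seconds.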
> Other than this last issue, once we have some details about this collector in the README - how to enable it, how to enable/disable microstate accounting, why it is a bad idea to leave it on all the time, maybe something else? - I am happy to get this merged and start working on the Grafana dashboard, and get @lukebakken involved as well, so that we can share the how-to.
About leaving it on all the time, by the way, I think experimentation would be needed to know the real cost. There's a cost, but I don't know how big. Anyway that cost is mentioned in the OTP documentation so perhaps there's no need to repeat it here.
> @essen out of curiosity, why did you choose msacc_state as the primary dimension (the metric name) & msacc_type as the secondary one? `msacc:print/0` uses msacc_type & id as the primary dimension - the rows - and msacc_state as the secondary one - the columns - which makes understanding thread imbalances straightforward.
Because the state is what we're concerned about, regardless of how we render things out afterwards. For example the documentation works out better when done this way compared to documenting what the types are about. We want to know what the different states are, not so much the different types of threads (most of them are self-describing if you know a little Erlang). And from them you can aggregate across all threads, or across all threads of a certain type, or get a thread-specific view. You can't aggregate across all threads easily if you use the thread type as primary.
> @essen out of curiosity, why did you choose msacc_state as the primary dimension (the metric name) & msacc_type as the secondary one? msacc:print/0 uses msacc_type & id as the primary dimension - the rows - and msacc_state as the secondary one - the columns - which makes understanding thread imbalances straightforward.
FWIW I think the current dimensions make sense (to me at least) when comparing with `msacc:print` output. I'll make the change to use seconds and will add the `_total` suffix. I'm assuming we should do conversion in the same manner as this code.
Ready for re-review!
Hey, just want to let you know I will look at this in the next days. Not ignoring!
It should work fine but we have not really tested it yet or written a dashboard for it.
/cc @gerhard