deadtrickster / prometheus.erl

Prometheus.io client in Erlang
MIT License

Add microstate accounting collector #98

Closed · essen closed this 4 years ago

essen commented 4 years ago

It should work fine but we have not really tested it yet or written a dashboard for it.

/cc @gerhard

gerhard commented 4 years ago

This looks great, will check it out as soon as I get a minute.

deadtrickster commented 4 years ago

Hey, I'm eager to merge this or #101. Thoughts?

essen commented 4 years ago

I'm fine either way. @gerhard can see which one he prefers. :-)

gerhard commented 4 years ago

Monday - 6th of Jan - is the first day that I can context switch to this. Will try both approaches and see which one works better. Either way, the intention is to get one of them merged next week. 👍 if OK

gerhard commented 4 years ago

Thanks @lukebakken for taking over this; my priorities are not allowing me to indulge in taking #98 and #101 for a spin.

These are the questions - in no particular order - which I was going to use to determine which of these 2 approaches best delivers on what is expected:

  1. Do the metrics correspond to what msacc:print/0 returns?
  2. When Erlang is compiled with --with-microstate-accounting=extra, are the extra metrics returned as well?
  3. What time span is used to collect msacc metrics? Does this impose a restriction on how frequently /metrics HTTP endpoint, as used by the Prometheus server, can be scraped?
  4. How long do the 2 approaches take to return metrics? Is one slower than the other? By how much?
  5. Are all returned metrics described via HELP & TYPE?
  6. Which approach - if any - is easier for building a single stacked graph in Grafana for the "Stats per type" metrics? Show us!

If it helps, let's chat after today's sync-up @lukebakken 👍

lukebakken commented 4 years ago

@gerhard -

  1. Yes, but with a different format. #98 exposes one metric per state, with the thread type and id as labels:

erlang_vm_msacc_gc{type="async",id="0"} 0

In #101, all stats are in the single erlang_vm_thread_microstates_microseconds metric, differentiated via the thread_type, thread_id and thread_state labels:

erlang_vm_thread_microstates_microseconds{thread_type="async",thread_id="0",thread_state="gc"} 0.0
  2. Yes, in both PRs.

  3. Both use microseconds.

  4. TODO

  5. Yes. Since #98 does not use an all-in-one metric, each stat has more detail in its help text.

  6. Not sure, to be honest.

lukebakken commented 4 years ago

@gerhard

essen commented 4 years ago

This PR doesn't require runtime_tools, which it would if it used the msacc functions. msacc is just a convenience wrapper around the microstate accounting functionality, see https://github.com/erlang/otp/blob/master/lib/runtime_tools/src/msacc.erl#L68. And msacc:stats/0 actually does more than just convert time units: it also sorts the data and puts it into maps, which we don't need: https://github.com/erlang/otp/blob/master/lib/runtime_tools/src/msacc.erl#L146
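
For illustration, a minimal sketch of reading the raw counters without runtime_tools (hypothetical function name; the unit conversion mirrors what msacc:stats/0 does internally):

%% Read the raw microstate accounting counters directly. They are in
%% perf_counter units and only advance while accounting is enabled, so
%% convert them to microseconds for the exporter output.
raw_msacc_microseconds() ->
    [{Type, Id,
      maps:map(fun(_State, PerfCount) ->
                       erlang:convert_time_unit(PerfCount, perf_counter, 1000000)
               end, Counters)}
     || #{type := Type, id := Id, counters := Counters}
        <- erlang:statistics(microstate_accounting)].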

The application:get_env call to enable/disable certain stats is done by all exporters. There's no real penalty in doing this.

I don't know if enabling microstate accounting automatically is a good idea, you generally don't want microstate accounting to be always enabled, and it's simpler to set a system_flag to start/stop it than to add/remove a Prometheus exporter.
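
Concretely, toggling it from a shell is just (a sketch; the 1s window is arbitrary):

%% Enable microstate accounting, let it accumulate for a while, then
%% disable it again. The counters are kept after disabling.
erlang:system_flag(microstate_accounting, true),
timer:sleep(1000),
erlang:system_flag(microstate_accounting, false).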

gerhard commented 4 years ago

From a Prometheus metric & label naming perspective, neither format adheres to the expected standard.

Before we settle on appropriate names & labels, I would like to go back to the original msacc:print/0 output.

Given the following (--with-microstate-accounting=extra is enabled):

erl -noshell -eval 'msacc:start(1000), msacc:print(), halt().'
Average thread real-time    : 1003114 us
Accumulated system run-time :    7321 us
Average scheduler run-time  :    2811 us

        Thread    alloc      aux      bif busy_wait check_io emulator      ets       gc  gc_full      nif    other     port     send    sleep   timers

Stats per thread:
     async( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
       aux( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 4)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 5)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 6)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 7)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 9)    0.00%    0.00%    0.00%    0.02%    0.00%    0.01%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.97%    0.00%
dirty_io_s(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
      poll( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
 scheduler( 1)    0.00%    0.00%    0.00%    0.07%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.03%    0.00%    0.00%   99.90%    0.00%
 scheduler( 2)    0.00%    0.00%    0.01%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.99%    0.00%

Stats per type:
         async    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
           aux    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_sche    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_sched    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
          poll    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
     scheduler    0.00%    0.00%    0.01%    0.03%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.01%    0.00%    0.00%   99.94%    0.00%

I would expect the following metrics:

erlang_vm_msacc_thread_type_async_seconds_total{state="alloc"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="aux"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="bif"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="busy_wait"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="check_io"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="emulator"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="ets"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="gc"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="gc_full"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="nif"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="other"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="port"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="send"} 0
erlang_vm_msacc_thread_type_async_seconds_total{state="sleep"} 1
erlang_vm_msacc_thread_type_async_seconds_total{state="timers"} 0
# repeat for all other thread types

The above example output incorporates the learnings from https://github.com/rabbitmq/rabbitmq-prometheus/issues/26#issuecomment-569959281. If we track the thread id by default, then we will end up with an explosion of metrics, directly proportional to the number of CPUs.

If we decide to go down the aggregation path by default - which I am recommending - regardless of the number of CPUs/schedulers/async threads, we would end up with either 42 metrics - 6*7 - or at most 84 metrics - 6*14 - if msacc extra is enabled. OK, we will not be able to see scheduler imbalances, but the first goal is to see where schedulers spend most of their time. We should introduce a config option to enable per-thread metrics, but not enable it by default, because of the metrics explosion described above.
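
For reference, a rough sketch of that aggregation on the Erlang side (hypothetical helper names, folding the per-thread counter maps into per-type totals):

%% Collapse per-thread msacc counters into per-type totals, so label
%% cardinality no longer grows with the number of CPUs.
aggregate_by_type(Stats) ->
    lists:foldl(
      fun(#{type := Type, counters := Counters}, Acc) ->
              maps:update_with(Type,
                               fun(Old) -> merge_counters(Old, Counters) end,
                               Counters, Acc)
      end, #{}, Stats).

%% Sum two state-counter maps key by key.
merge_counters(A, B) ->
    maps:fold(fun(State, V, Acc) ->
                      maps:update_with(State, fun(Old) -> Old + V end, V, Acc)
              end, A, B).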

As a separate point, we are missing thread real-time & system + scheduler run-time metrics. For the above example, I would expect to see:

erlang_vm_average_thread_real_time_seconds 1
erlang_vm_accumulated_system_run_time_seconds_total 0.007321
erlang_vm_average_scheduler_run_time_seconds 0.002811

If this collector is enabled and msacc is not running, I think that it would make sense to start/stop it for brief periods of time - maybe 1s - and return stats from this period. If msacc was already running, which I believe is what you are thinking @essen, then do not stop it. The return value of msacc:start/0 makes it easy to determine this. If you think that this is a bad idea, tell me why 🙂
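
A sketch of that behaviour (hypothetical function name; this uses system_flag/2 directly, whose return value tells us the previous state just like msacc:start/0 does):

%% If accounting is off, profile a short window and switch it back off;
%% if it was already on, just read the counters and leave it running.
sample_msacc(WindowMs) ->
    case erlang:system_flag(microstate_accounting, true) of
        true ->
            erlang:statistics(microstate_accounting);
        false ->
            erlang:system_flag(microstate_accounting, reset),
            timer:sleep(WindowMs),
            Stats = erlang:statistics(microstate_accounting),
            erlang:system_flag(microstate_accounting, false),
            Stats
    end.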

essen commented 4 years ago

I think we need to determine what this collector is for. Microstate accounting is a profiling tool that's not meant to be always on, so the assumptions are different for this exporter compared to other exporters. If you envision enabling this at all times, you probably need to confirm that you can take the performance hit from collecting those statistics (regardless of the metrics output). More details below.

> From a Prometheus metric & label naming perspective, neither format adheres to the expected standard.

Bummer! But not really surprising.

> [...] The above example output incorporates the learnings from rabbitmq/rabbitmq-prometheus#26 (comment). If we track the thread id by default, then we will end up with an explosion of metrics, directly proportional to the number of CPUs.

Yeah. But sometimes you really are looking into issues that affect one or a limited number of threads. That's pretty much the point of these stats. So it's best to keep that in mind.

>   • Given a host with 16 CPUs (we regularly see 32 or 64), there will be 66 combinations of thread type & id by default: 16*4 + 2
>   • Multiplied by the number of default thread states we get a minimum of 462 metrics: 66*7
>   • Enable extra msacc (14 thread states) and we get 924 metrics: 66*14
>   • Obviously, 1 async thread is unrealistic, so take +A 128 (which is what RabbitMQ would do by default), combined with extra msacc on a 16 CPUs system, and we end up with 2,688 metrics just for msacc.

Actually 1 async thread is very realistic with modern Erlang/OTP, because it's not used anymore. I don't know the exact version in which the VM stopped using them, but it's possible that RabbitMQ doesn't need to set +A anymore. It's using dirty schedulers now. I'm not sure whether you need as many dirty schedulers as you did async threads. You probably don't.

> If we decide to go down the aggregation path by default - which I am recommending - regardless of the number of CPUs/schedulers/async threads, we would end up with either 42 metrics - 6*7 - or at most 84 metrics - 6*14 - if msacc extra is enabled. OK, we will not be able to see scheduler imbalances, but the first goal is to see where schedulers spend most of their time. We should introduce a config option to enable per-thread metrics, but not enable it by default, because of the metrics explosion described above.

Are we talking about the RabbitMQ default or the exporter default? Will this aggregate fit people better than per-thread metrics? Also, what is the real cost of aggregating/not aggregating? Or of aggregating in Grafana directly?

I doubt one or the other is the better answer, which is why msacc:print/0 prints both per-thread and aggregated stats. But you can aggregate from per-thread data, so that seemed like the better course of action.

Are you perhaps thinking in terms of leaving microstate accounting always on? I'm not sure that's what it should be; I think it's something you want to have there to enable when you need to debug a performance issue, then disable when you are done investigating. That's why I'm not terribly worried about producing a potentially large number of metrics: they're only going to be produced when necessary.

> As a separate point, we are missing thread real-time & system + scheduler run-time metrics. For the above example, I would expect to see:
>
> erlang_vm_average_thread_real_time_seconds 1
> erlang_vm_accumulated_system_run_time_seconds_total 0.007321
> erlang_vm_average_scheduler_run_time_seconds 0.002811

No, those are aggregates as well: the aggregated time excluding sleep time (run-time), the aggregated time including sleep time (real-time), and the aggregated time of the scheduler threads excluding sleep time, which is then divided by the number of schedulers.

> If this collector is enabled and msacc is not running, I think that it would make sense to start/stop it for brief periods of time - maybe 1s - and return stats from this period. If msacc was already running, which I believe is what you are thinking @essen, then do not stop it. The return value of msacc:start/0 makes it easy to determine this. If you think that this is a bad idea, tell me why

Well, microstate accounting works by recording data over a period of time; if we have to enable it when we produce metrics, then we'll need to wait for that period of time before we can send those metrics. If we enable it automatically when metrics are pulled, then we won't be able to disable it until Prometheus stops pulling metrics. If we enable it automatically at startup, then it will run until it is manually stopped.

The latter might be useful as a convenience but it shouldn't be the default IMO.

The way I see things, I would enable this exporter by default, but it would only produce metrics when I enable microstate accounting. The rest of the time it would basically do nothing.

lukebakken commented 4 years ago

I agree with this comment and have opened essen/prometheus.erl#1 to simply add the _microseconds suffix as suggested by the prometheus guidelines. Using seconds would lose valuable information.

gerhard commented 4 years ago

> But sometimes you really are looking into issues that affect one or a limited number of threads.

OK, agreed to emitting per-thread-id metrics.

> Actually 1 async thread is very realistic with modern Erlang/OTP, because it's not used anymore. I don't know the exact version in which the VM stopped using them, but it's possible that RabbitMQ doesn't need to set +A anymore. It's using dirty schedulers now. I'm not sure whether you need as many dirty schedulers as you did async threads. You probably don't.

This is most interesting; I would very much like to see the results of various RabbitMQ benchmarks with no +A setting. Took a note, thanks!

> The way I see things, I would enable this exporter by default, but it would only produce metrics when I enable microstate accounting. The rest of the time it would basically do nothing.

How would you recommend enabling microstate accounting on a node? Based on this comment, I would think via a remote shell, but maybe there is a better/simpler way that I don't know. I am assuming that erlang:system_flag(microstate_accounting, false) is what disables microstate accounting.

> simply add the _microseconds suffix as suggested by the prometheus guidelines. Using seconds would lose valuable information.

Prometheus guidelines recommend using seconds for the base unit, and using floats to capture the precision. That's what we are doing in rabbitmq_core_metrics_collector, and that's what all other good exporters do, like node_exporter for example. Also, since these are accumulating metrics, we need to use _seconds_total.

Other than this last issue, once we have some details about this collector in the README - how to enable it, how to enable/disable microstate accounting, why it is a bad idea to leave it on all the time, maybe something else? - I am happy to get this merged and start working on the Grafana dashboard, and to get @lukebakken involved as well, so that we can share the how-to.

gerhard commented 4 years ago

@essen out of curiosity, why did you choose msacc_state as the primary dimension (the metric name) & msacc_type as the secondary one?

msacc:print/0 uses msacc_type & id as the primary dimension - the rows - and msacc_state as the secondary one - the columns - which makes understanding thread imbalances straightforward.

essen commented 4 years ago

> The way I see things, I would enable this exporter by default, but it would only produce metrics when I enable microstate accounting. The rest of the time it would basically do nothing.
>
> How would you recommend enabling microstate accounting on a node? Based on this comment, I would think via a remote shell, but maybe there is a better/simpler way that I don't know. I am assuming that erlang:system_flag(microstate_accounting, false) is what disables microstate accounting.

RabbitMQ could just have a rabbitmqctl command to start/stop it (or do it through its eval command). Erlang projects that use releases also have an eval command. So it's fairly painless; we just need to document it somewhere so that it can be copy-pasted.
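
For example, something along these lines could go in the docs (a sketch; rabbitmqctl eval evaluates an Erlang expression on the node):

# start microstate accounting on the node
rabbitmqctl eval 'erlang:system_flag(microstate_accounting, true).'
# ...reproduce the issue, let Prometheus scrape /metrics...
rabbitmqctl eval 'erlang:system_flag(microstate_accounting, false).'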

> simply add the _microseconds suffix as suggested by the prometheus guidelines. Using seconds would lose valuable information.
>
> Prometheus guidelines recommend using seconds for the base unit, and using floats to capture the precision. That's what we are doing in rabbitmq_core_metrics_collector, and that's what all other good exporters do, like node_exporter for example. Also, since these are accumulating metrics, we need to use _seconds_total.

Using floats to capture the precision. Now that's a funny thought. :-) Doesn't really matter, I guess; on the Erlang side we would still need to work in microseconds first and only convert for the output. @lukebakken if you have time, feel free to make those changes. Thanks!
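
Presumably something like this (a sketch; the internal counters stay integral microseconds and only the emitted sample becomes a float):

%% Keep integer microseconds internally; divide only when rendering, so
%% the exported sample is a float with full precision, e.g.
%% erlang_vm_msacc_sleep_seconds_total{type="async",id="0"} 1.003114
microseconds_to_seconds(Us) when is_integer(Us) ->
    Us / 1000000.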

> Other than this last issue, once we have some details about this collector in the README - how to enable it, how to enable/disable microstate accounting, why it is a bad idea to leave it on all the time, maybe something else? - I am happy to get this merged and start working on the Grafana dashboard, and to get @lukebakken involved as well, so that we can share the how-to.

About leaving it on all the time, by the way: I think experimentation would be needed to know the real cost. There's a cost, but I don't know how big it is. Anyway, that cost is mentioned in the OTP documentation, so perhaps there's no need to repeat it here.

> @essen out of curiosity, why did you choose msacc_state as the primary dimension (the metric name) & msacc_type as the secondary one?
>
> msacc:print/0 uses msacc_type & id as the primary dimension - the rows - and msacc_state as the secondary one - the columns - which makes understanding thread imbalances straightforward.

Because the state is what we're concerned about, regardless of how we render things out afterwards. For example, the documentation works out better when done this way compared to documenting what the types are about. We want to know what the different states are, not so much the different types of threads (most of them are self-describing if you know a little Erlang). And from there you can aggregate across all threads, or across all threads of a certain type, or get a thread-specific view. You can't aggregate across all threads easily if you use the thread type as primary.

lukebakken commented 4 years ago

> @essen out of curiosity, why did you choose msacc_state as the primary dimension (the metric name) & msacc_type as the secondary one? msacc:print/0 uses msacc_type & id as the primary dimension - the rows - and msacc_state as the secondary one - the columns - which makes understanding thread imbalances straightforward.

FWIW I think the current dimensions make sense (to me at least) when compared with the msacc:print/0 output. I'll make the change to use seconds and will add the _total suffix. I'm assuming we should do the conversion in the same manner as this code.

lukebakken commented 4 years ago

Ready for re-review!

deadtrickster commented 4 years ago

Hey, just want to let you know that I will look at this in the next few days. Not ignoring it!