deadtrickster / prometheus.erl

Prometheus.io client in Erlang
MIT License

Mnesia Collector Throws Error with RabbitMQ #133

Closed by tcurley1 2 years ago

tcurley1 commented 3 years ago

When upgrading RabbitMQ from 3.8.x -> 3.9.x, the prometheus client version was bumped from 4.6.0 to 4.8.1, which introduced a bug in collecting and reporting metrics in the prometheus_mnesia collector.

Using rabbitmq-diagnostics checks, I was able to verify that Mnesia does have a memory usage value associated with it, but the prometheus client throws the following error:


2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     initial call: cowboy_stream_h:request_process/3
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     pid: <0.3484.0>
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     registered_name: []
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     exception error: an error occurred when evaluating an arithmetic expression
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>       in operator  +/2
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>          called as undefined + 0
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>       in call from prometheus_mnesia_collector:'-get_memory_usage/0-fun-0-'/2 (src/collectors/mnesia/prometheus_mnesia_collector.erl, line 199)
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>       in call from lists:foldl/3 (lists.erl, line 1267)
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>       in call from prometheus_mnesia_collector:get_memory_usage/0 (src/collectors/mnesia/prometheus_mnesia_collector.erl, line 201)
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>       in call from prometheus_mnesia_collector:metrics/1 (src/collectors/mnesia/prometheus_mnesia_collector.erl, line 124)
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>       in call from prometheus_mnesia_collector:collect_mf/2 (src/collectors/mnesia/prometheus_mnesia_collector.erl, line 108)
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>       in call from prometheus_collector:collect_mf/3 (src/prometheus_collector.erl, line 156)
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>       in call from prometheus_registry:'-collect/2-lc$^0/1-0-'/3 (src/prometheus_registry.erl, line 86)
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     ancestors: [<0.735.0>,<0.686.0>,<0.684.0>,<0.683.0>,<0.681.0>,
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>                   rabbit_web_dispatch_sup,<0.599.0>]
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     message_queue_len: 0
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     messages: []
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     links: [<0.735.0>,#Port<0.723>]
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     dictionary: []
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     trap_exit: false
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     status: running
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     heap_size: 6772
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     stack_size: 28
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>     reductions: 4853
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0>   neighbours:
2021-08-18 15:36:21.731849+00:00 [erro] <0.3484.0> 
2021-08-18 15:36:21.732461+00:00 [erro] <0.735.0> Ranch listener {acceptor,{0,0,0,0,0,0,0,0},15692}, connection process <0.735.0>, stream 107 had its request process <0.3484.0> exit with reason badarith and stacktrace [{erlang,'+',[undefined,0],[{error_info,#{module => erl_erts_errors}}]},{prometheus_mnesia_collector,'-get_memory_usage/0-fun-0-',2,[{file,"src/collectors/mnesia/prometheus_mnesia_collector.erl"},{line,199}]},{lists,foldl,3,[{file,"lists.erl"},{line,1267}]},{prometheus_mnesia_collector,get_memory_usage,0,[{file,"src/collectors/mnesia/prometheus_mnesia_collector.erl"},{line,201}]},{prometheus_mnesia_collector,metrics,1,[{file,"src/collectors/mnesia/prometheus_mnesia_collector.erl"},{line,124}]},{prometheus_mnesia_collector,collect_mf,2,[{file,"src/collectors/mnesia/prometheus_mnesia_collector.erl"},{line,108}]},{prometheus_collector,collect_mf,3,[{file,"src/prometheus_collector.erl"},{line,156}]},{prometheus_registry,'-collect/2-lc$^0/1-0-',3,[{file,"src/prometheus_registry.erl"},{line,86}]}]
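
The badarith boils down to the collector summing per-table memory values: if mnesia:table_info(Tab, memory) yields the atom undefined for any table, adding it to the integer accumulator fails. A minimal sketch of that failing pattern (illustrative only, not the collector's actual source):

```erlang
%% Illustrative sketch, not the collector's actual code: sum the reported
%% memory (in words) of every Mnesia table. If any table reports the atom
%% `undefined`, the fold raises badarith on `undefined + Acc`.
Memory = [mnesia:table_info(T, memory) || T <- mnesia:system_info(tables)],
lists:foldl(fun(Words, Acc) -> Words + Acc end, 0, Memory).
```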
strange commented 3 years ago

What could cause mnesia:table_info/2 to return undefined? At least locally on OTP 24 it returns 0 even if Mnesia is down or the table does not exist. We're seeing the same problem in a clustered environment under some load. Timeouts?
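
For anyone hitting this, a quick way to see what each table actually reports on an affected node (a diagnostic sketch only, runnable from an attached shell or via rabbitmqctl eval):

```erlang
%% List every Mnesia table together with what it reports for `memory`.
%% Non-integer results (`undefined`, or an {'EXIT', ...} tuple if the call
%% fails) are the values that would break the collector's sum.
[{T, catch mnesia:table_info(T, memory)} || T <- mnesia:system_info(tables)].
```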

KangyiZengVizio commented 2 years ago

Same problem encountered, any updates?

gustav-b commented 2 years ago

I'm seeing the same issue when upgrading RabbitMQ from v3.8.3 to v3.8.27 (v3.8.23 apparently bumped the client from 4.6.0 to 4.8.1). This is on a healthy 3 node cluster.

Are there any known workarounds except for downgrading?

KangyiZengVizio commented 2 years ago

> I'm seeing the same issue when upgrading RabbitMQ from v3.8.3 to v3.8.27 (v3.8.23 apparently bumped the client from 4.6.0 to 4.8.1). This is on a healthy 3 node cluster.
>
> Are there any known workarounds except for downgrading?

Have you tried restarting the RabbitMQ services? It looks like that fixes it; I tested it on one box.

gustav-b commented 2 years ago

Thanks, I tried to restart one of three nodes, but that didn't solve it – I see the same error after a clean start. I've also tried disabling and then enabling the rabbitmq_prometheus plugin but that had no effect either.

I have another RabbitMQ cluster with just a single node also running v3.8.27 and there the Prometheus exporter works fine with no errors in the logs.

KangyiZengVizio commented 2 years ago

> Thanks, I tried to restart one of three nodes, but that didn't solve it – I see the same error after a clean start. I've also tried disabling and then enabling the rabbitmq_prometheus plugin but that had no effect either.
>
> I have another RabbitMQ cluster with just a single node also running v3.8.27 and there the Prometheus exporter works fine with no errors in the logs.

My RabbitMQ is v3.9.5; 3 out of 4 servers have this problem, and when I restarted one of the services the problem went away. I have also tried methods other than Prometheus: rabbitmq_cloudwatch_exporter requires a service restart but can do the job, and I used rabbitmq_exporter too, but it crashed after a while.

KangyiZengVizio commented 2 years ago

> Thanks, I tried to restart one of three nodes, but that didn't solve it – I see the same error after a clean start. I've also tried disabling and then enabling the rabbitmq_prometheus plugin but that had no effect either.
>
> I have another RabbitMQ cluster with just a single node also running v3.8.27 and there the Prometheus exporter works fine with no errors in the logs.

Since the problem comes from the get_memory_usage() function, I tried to disable it, but given my lack of Erlang knowledge I haven't figured out how.
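
One possible stopgap, assuming the bundled prometheus.erl exposes prometheus_registry:deregister_collector/2 and that the Mnesia collector is registered in the default registry (both assumptions worth checking against your installed version), is to deregister that collector at runtime, for example via rabbitmqctl eval:

```erlang
%% Workaround sketch (assumptions noted above): remove the Mnesia collector
%% from the `default` registry so it is no longer invoked on scrape. The
%% collector will come back after a node restart.
prometheus_registry:deregister_collector(default, prometheus_mnesia_collector).
```

Whether the /metrics endpoint actually stops hitting the collector depends on which registry the plugin scrapes, so treat this as an experiment rather than a fix.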

adamski007 commented 2 years ago

Hello, does anyone have a fix for this? We restarted the brokers one by one on our cluster, and it is still producing a huge amount of log output because of this... it is really polluting the logs.

Is there a known procedure to run to fix it? Thanks.

essen commented 2 years ago

Can someone please open a ticket against RabbitMQ instead? https://github.com/rabbitmq/rabbitmq-server

adamski007 commented 2 years ago

Actually, there is a thread discussing the issue here: https://groups.google.com/g/rabbitmq-users/c/JXF8ryEqvyk/m/4L4xiklXAwAJ, and they suggest pointing back to this GitHub repo, as it is related to the Prometheus collector...

We are going in circles with no solution so far.

essen commented 2 years ago

Fine, let me see what I can do tomorrow. cc @lhoguin

lhoguin commented 2 years ago

I have opened https://github.com/erlang/otp/issues/5830 to figure out whether that's normal. If it is, the fix is very simple.
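
For reference, the kind of guard that would make the sum tolerant of a non-integer value looks roughly like this (a sketch only, not necessarily the patch that will land):

```erlang
%% Defensive version of the memory sum: treat anything that is not an
%% integer (e.g. `undefined` from mnesia:table_info/2) as zero.
SumMemory = fun(Tab, Acc) ->
                case mnesia:table_info(Tab, memory) of
                    Words when is_integer(Words) -> Acc + Words;
                    _ -> Acc
                end
            end,
lists:foldl(SumMemory, 0, mnesia:system_info(tables)).
```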

adamski007 commented 2 years ago

If you need more info on how to reproduce it, let us know. Currently, my only way to fix it is to completely stop my RabbitMQ cluster and only then start it again.

lhoguin commented 2 years ago

I have opened https://github.com/deadtrickster/prometheus.erl/pull/140 with a fix.

deadtrickster commented 2 years ago

released as 4.8.2

adamski007 commented 2 years ago

Thanks for all the action here... My next question is: how can I test that new code/release with the RabbitMQ code? Any pointers? Thanks.

lhoguin commented 2 years ago

Hello! I have opened https://github.com/rabbitmq/rabbitmq-server/pull/4376 that upgrades Prometheus.erl to 4.8.2.

How to test depends on how you run RabbitMQ. You can always build RabbitMQ from this branch directly in any case.

michaelklishin commented 2 years ago

Building the plugin on Erlang 23 or 24, replacing it in the plugins directory and restarting the node on Erlang 24 would do. No need to build the entire distribution ;)

lhoguin commented 2 years ago

Gotcha! Thanks.

If you need a Docker image you can find one at https://hub.docker.com/r/pivotalrabbitmq/rabbitmq/tags?page=1&name=loic-update-prometheus, but it will run the most recent RabbitMQ code.

michaelklishin commented 2 years ago

Here is a build of this plugin with a fix: prometheus-4.8.2.ez.zip. It's a .zip only because GitHub does not allow for .ez files. Uncompress it and replace the prometheus .ez you will find in the plugins directory of your installation.

Then run any tests you need to run.

If we do not get any feedback on this by next week, we will proceed with merging rabbitmq/rabbitmq-server#4376 and consider this done.

michaelklishin commented 2 years ago

This was addressed in https://github.com/rabbitmq/rabbitmq-server/pull/4376, scheduled to ship in RabbitMQ 3.10.0 and 3.9.15.

michaelklishin commented 2 years ago

@tcurley1 @deadtrickster should we close this issue?

essen commented 2 years ago

Thanks everyone!