Feuerlabs / exometer_core

Core components of exometer
Mozilla Public License 2.0

exometer_proc timeout #99

Open iguberman opened 7 years ago

iguberman commented 7 years ago

This started happening when we started testing the system under high load; we never saw the problem before that, and it shows up a few minutes into our high-load test. The problem is that exometer never recovers after that, and we don't get any stats anymore. I tried significantly increasing this timeout in our fork of exometer_core, but that didn't help.


2017-10-20 17:34:35.715 [error] <0.702.0> CRASH REPORT Process exometer_report with 0 neighbours crashed with reason: timeout in exometer_proc:call/2 line 131
2017-10-20 17:34:35.780 [error] <0.699.0> Supervisor exometer_core_sup had child exometer_report started with exometer_report:start_link() at <0.702.0> exit with reason timeout in exometer_proc:call/2 line 131 in context child_terminated
uwiger commented 7 years ago

Hi Irina!

Odd that it's the exometer_report process that crashes in an exometer_proc:call/2 call – it should never make such a call in the first place. The second problem is of course that it doesn't restore operation after restarting.

Can you reproduce the problem?

uwiger commented 7 years ago

Hmm, I take that back (well, not the "should never" ...). In do_report/2 (https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_report.erl#L1213) [handle_info(report_batch, ...)], which is called from the handle_info() clauses for report and report_batch, exometer_report calls exometer:get_value/2, which may end up calling out to a probe via exometer_proc:call().

Chances are then, that one of your probes becomes unresponsive. Of course, the exometer_report process should not crash because of that – it should not even be blocked. I'll have to look into that.
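For reference, exometer_proc:call/2 looks roughly like this (reconstructed from the snippet quoted later in this thread); the error(timeout) in the crash report is the after clause firing when a probe has not answered within five seconds:

```erlang
%% Approximate shape of exometer_proc:call/2 in exometer_core, reconstructed
%% from the snippet quoted later in this thread.
call(Pid, Req) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {exometer_proc, {self(), MRef}, Req},
    receive
        {MRef, Reply} ->
            erlang:demonitor(MRef, [flush]),
            Reply;
        {'DOWN', MRef, _, _, Reason} ->
            error(Reason)
    after 5000 ->
            %% A hung probe ends up here; the error propagates up through
            %% exometer:get_value/2 and crashes the exometer_report process.
            error(timeout)
    end.
```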

iguberman commented 7 years ago

Thanks so much Ulf!

That gives me something to look into on my end as well. Unfortunately, I can reproduce it every time I run the high-load test on our server, and pretty soon, right around the time I need to start examining the stats, exometer_proc crashes and the stats disappear.

What I did as a temporary workaround, however, was to remove the timeout from exometer_proc entirely, so that it receives without any timeout, and now it doesn't crash at all (though possibly I'm just sweeping the real cause of the original issue under the rug).


%% Workaround: exometer_proc:call/2 with the 5-second timeout commented out,
%% so the caller blocks until the probe replies (or goes down).
call(Pid, Req) ->
    MRef = erlang:monitor(process, Pid),
    Pid ! {exometer_proc, {self(), MRef}, Req},
    receive
        {MRef, Reply} ->
            erlang:demonitor(MRef, [flush]),
            Reply;
        {'DOWN', MRef, _, _, Reason} ->
            error(Reason)
%%    after 5000 ->
%%            error(timeout)
    end.
iguberman commented 7 years ago

Hi Ulf,

Apparently I had my reply sitting in this browser window unsent for 9 days and only just noticed it (it's been a crazy 9 days, but still...). Very sorry for not replying sooner!!! :(

ruttenberg commented 6 years ago

I have observed this problem intermittently. This is an example:

2018 Feb 27 18:02:08 52.23.248.173 erlang: error | gen_server exometer_report terminated with reason: timeout in exometer_proc:call/2 line 131
2018 Feb 27 18:02:13 52.23.248.173 erlang: error | CRASH REPORT Process exometer_report with 0 neighbours exited with reason: timeout in exometer_proc:call/2 line 131 in gen_server:terminate/7 line 812
2018 Feb 27 18:02:14 52.23.248.173 erlang: error | Supervisor exometer_core_sup had child exometer_report started with exometer_report:start_link() at <0.1178.23> exit with reason timeout in exometer_proc:call/2 line 131 in context child_terminated

exometer_report is restarted, and the cycle continues. This repeats for a while until the VM crashes.

At the point this begins, I notice the following in the general VM statistics: a dramatic drop in GC words reclaimed, I/O, and reductions, and a dramatic increase in scheduler utilization and message queue length.

Please let me know what other information might be helpful in solving this problem.

uwiger commented 6 years ago

@ruttenberg, that drop you observe, is it from stats reported by exometer or the actual numbers presented by the VM?

ruttenberg commented 6 years ago

That is from stats reported by exometer.

uwiger commented 6 years ago

Ok. I have the flu right now, so I don't think I'll be able to do much about this today, but I will gladly accept a PR that at least catches the calls to exometer:get_value/2. One could also argue that get_value() shouldn't crash in the first place; that would also be an acceptable solution.

As for the problem of reporting drift when a particular metric keeps timing out, that's perhaps best detected in the analytics backend, followed by simply fixing the probe that keeps hanging.

ruttenberg commented 6 years ago

@uwiger I hope you feel better soon. I think you overestimate my understanding of the exometer implementation. I don't really understand what you are suggesting for the PR.

Thanks.

--Jon Ruttenberg

uwiger commented 6 years ago

@ruttenberg, for example:

https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_report.erl#L1438

The exometer_report process expects exometer:get_value/2 not to fail. Reading the docs, this seems like an appropriate assumption, but exometer_report could still wrap these calls in a try ... catch here.
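A minimal sketch of such a wrapper, assuming a hypothetical helper inside exometer_report (the name safe_get_value and the error terms are illustrative, not part of the actual code):

```erlang
%% Hypothetical helper for exometer_report: fetch a metric's values, but turn
%% a hung probe into an error tuple instead of letting the timeout crash the
%% reporter process.
safe_get_value(Metric, DataPoints) ->
    try exometer:get_value(Metric, DataPoints) of
        {ok, _} = Ok       -> Ok;
        {error, _} = Error -> Error
    catch
        error:timeout ->
            {error, probe_timeout};
        Class:Reason ->
            {error, {Class, Reason}}
    end.
```

The reporting code could then log and skip a metric on {error, _} instead of taking the whole reporter down.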

More appropriate places to put try ... catch might be (fixing the get_value/2 function): https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer.erl#L362

and https://github.com/Feuerlabs/exometer_core/blob/master/src/exometer_probe.erl#L592 (as well as #L595)
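And a sketch of the alternative, catching the timeout where the probe is actually called, so that get_value/2 returns an error tuple to its callers instead of crashing them (the request and return terms below are assumptions for illustration, not the code at the lines referenced above):

```erlang
%% Sketch only: wrap the probe call so a timeout becomes an error tuple.
%% {get_value, DataPoints} and {error, unavailable} are illustrative; the
%% actual exometer_probe request and error terms may differ.
probe_get_value(Pid, DataPoints) ->
    try exometer_proc:call(Pid, {get_value, DataPoints})
    catch
        error:timeout ->
            {error, unavailable}
    end.
```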

ruttenberg commented 6 years ago

@uwiger Thanks. I will take a look at those.

Some additional information: I have not implemented any probes. The only probe is the built-in histogram.