Closed andrewvmail closed 6 years ago
@andrewvmail Hi! This is only a suspicion, but have you checked the available disk space? I see that you have debug logs turned on - are you logging to files in addition to logging to the console? (if you don't know please paste the contents of etc/app.config
file).
@arkgil, hmm df
gives. so looks alright
overlay 40470828 9242960 31211484 23% /
tmpfs 65536 0 65536 0% /dev
tmpfs 1024080 0 1024080 0% /sys/fs/cgroup
/dev/vda1 40470828 9242960 31211484 23% /.r
shm 65536 0 65536 0% /dev/shm
tmpfs 1024080 0 1024080 0% /proc/scsi
tmpfs 1024080 0 1024080 0% /sys/firmware
app.config
[
{setup, [{verify_directories, false}]},
{mnesia, [{dir, "/var/lib/mongooseim/Mnesia.mongooseim@chat-chat-1"}]},
{ssl, [{session_lifetime, 600}]}, %% 10 minutes
{lager, [
{colored, true},
%% Limit the number of messages per second allowed from error_logger
{error_logger_hwm, 100},
%% Make logging more async
%% If some very heavy loaded process want to log something, it's better to
%% not block the process.
{async_threshold, 2000},
{async_threshold_window, 500},
%% Kill sink if it has more than 10k messages
{killer_hwm, 10000},
{killer_reinstall_after, 5000},
{log_root, "/var/log/mongooseim"},
{crash_log, "crash.log"},
{handlers, [
{lager_console_backend, [{level, info}]},
%% use below line to add syslog backend for Lager
% {lager_syslog_backend, [ "mongooseim", local0, info]},
{lager_file_backend, [{file, "ejabberd.log"}, {level, info}, {size, 2097152}, {date, "$D0"}, {count, 5}]}
]}
]}
%{exometer_core, [
% {mongooseim_report_interval, 60000}, %% 60 seconds
% {report, [
% {reporters, [
% {exometer_report_graphite, [
% {prefix, "mongooseim"},
% {connect_timeout, 5000},
% {host, "127.0.0.1"},
% {port, 2003},
% {api_key, ""}
% ]}
% ]}
% ]}
% ]}
%% use below line to enable FIPS mode
%%{crypto, [{fips_mode, true}]},
].
%% vim: filetype=erlang
I think I'm logging to file. Interestingly this week the problem with the freezing hasn't happened once. Come to think of it I think its because of Riak being offline the server crashed silently.
Will re-open if i see it again.
Hmm I guess its not beacuse of Riak... it looks like it's happening again, I think it did crash but the crash log is empty.
The log folder contains following files.
drwxr-xr-x 2 root root 4096 Jul 15 05:27 .
drwxr-xr-x 1 root root 4096 Jul 8 20:47 ..
-rw-r--r-- 1 root root 0 Jul 15 04:49 crash.log
-rw-r--r-- 1 root root 0 Jul 9 01:59 crash.log.4
-rw-r--r-- 1 root root 909642 Jul 15 05:33 ejabberd.log
-rw-r--r-- 1 root root 2097361 Jul 15 03:32 ejabberd.log.0
-rw-r--r-- 1 root root 1545196 Jul 14 23:59 ejabberd.log.1
-rw-r--r-- 1 root root 2145725 Jul 14 23:18 ejabberd.log.2
-rw-r--r-- 1 root root 2111147 Jul 14 22:07 ejabberd.log.3
-rw-r--r-- 1 root root 2106857 Jul 14 21:16 ejabberd.log.4
@arkgil im curious is there any way to trigger a crash from mongooseimctl to test if crash.log is actually working? I'm trying to figure out why crash.log empty
@andrewvmail
The easiest method I can think of is to attach to node's shell and execute e.g. exit(whereis(ejabberd_c2s_sup), test).
This should trigger following log in crash.log
:
2018-07-19 16:01:13 =SUPERVISOR REPORT====
Supervisor: {local,ejabberd_c2s_sup}
Context: child_terminated
Reason: test
Offender: [{pid,<0.1159.0>},{id,undefined},{mfargs,{ejabberd_c2s,start_link,undefined}},{restart_type,temporary},{shutdown,brutal_kill},{child_type,worker}]
@fenek Thanks I will try this to deduce the issue
Looks like crashing function works fine. I think host may run out of memory. I ran this just now
%% Enable "monitoring".
F = fun(F, I) ->
Timestamp = calendar:now_to_local_time(erlang:now()),
NSessions = ejabberd_sm:get_vh_session_number(<<"kosmosmobile.com">>),
FreeRam = "free -m | sed '2q;d' | awk '{ print $4 }'",
io:format("~p ~p no of users: ~p, free ram: ~s",
[I, Timestamp, NSessions, os:cmd(FreeRam)]),
timer:sleep(timer:seconds(2)),
F(F, I+1)
end.
G = fun() -> F(F, 1) end.
f(P).
P = spawn(G).
no of users: 0, free ram: 70
.
Then i ran it again after i restart the stack now its
no of users: 0, free ram: 615
I'm trying to load test this setup with Tsung to have its memory used up and I cant get it working with wss. A couple of years back i managed to load test using tsung tcp. Maybe its not supported yet based on this thread https://github.com/processone/tsung/issues/132 it just get stuck at the after the initial ssl handshake
Do you guys have any suggestions on how to get mim to eat up a bunch of memory? And if the system do run out of memory what should be the expected thing mim would do?
Okay just an update here from my end. Looks like it is a leak not sure whats causing it. I was monitoring it for a couple of days. and memory just decreases over time. Havent gotten a crash yet but I think it's coming soon. I think i might have build from a non stable commit https://github.com/esl/MongooseIM/commit/e108c81e1
I was able to use a patched up Tsung they just patched up recently to load-test the server.
Can you please execute erlang:memory().
in MongooseIM shell at three points in time? Right after container start and twice when memory usage becomes higher and higher?
Sunday, August 19 1:26AM
[{total,340398192},
{processes,308276416},
{processes_used,308275968},
{system,32121776},
{atom,785337},
{atom_used,760506},
{binary,1595152},
{code,15809584},
{ets,4000448}]
Monday, August 20 12:16PM
[{total,344188096},
{processes,308940168},
{processes_used,308936096},
{system,35247928},
{atom,883657},
{atom_used,877075},
{binary,1465248},
{code,18366460},
{ets,4076352}]
Tuesday August 21 6:10PM
[{total,347584264},
{processes,312081816},
{processes_used,312080296},
{system,35502448},
{atom,883657},
{atom_used,879705},
{binary,1477472},
{code,18496075},
{ets,4102368}]
@fenek see above for memory usage
OK, let's try two more commands, as clearly half of the memory consumption increase is caused by processes:
recon:proc_count(memory, 10).
recon:proc_count(message_queue_len, 10).
@fenek
Eshell V9.3 (abort with ^G)
(mongooseim@chat-chat-1)1> recon:proc_count(memory, 10).
[{<0.1618.0>,1344720,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1639.0>,973288,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.646.0>,973288,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.38.0>,426512,
[code_server,
{current_function,{code_server,loop,1}},
{initial_call,{erlang,apply,2}}]},
{<0.1620.0>,372528,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1619.0>,372528,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1758.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1757.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1725.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1724.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]}]
(mongooseim@chat-chat-1)2> recon:proc_count(message_queue_len, 10).
[{<0.12748.2>,0,
[{current_function,{shell,shell_rep,4}},
{initial_call,{erlang,apply,2}}]},
{<0.12745.2>,0,
[{current_function,{dist_util,con_loop,2}},
{initial_call,{inet_tcp_dist,do_accept,7}}]},
{<0.28284.1>,0,
[{current_function,{io,execute_request,2}},
{initial_call,{erlang,apply,2}}]},
{<0.28275.1>,0,
[{current_function,{shell,eval_loop,3}},
{initial_call,{erlang,apply,2}}]},
{<0.28274.1>,0,
[{current_function,{shell,get_command1,5}},
{initial_call,{erlang,apply,2}}]},
{<0.28271.1>,0,
[{current_function,{dist_util,con_loop,2}},
{initial_call,{inet_tcp_dist,do_accept,7}}]},
{<0.471.1>,0,
[{current_function,{io,execute_request,2}},
{initial_call,{erlang,apply,2}}]},
{<0.461.1>,0,
[{current_function,{shell,eval_loop,3}},
{initial_call,{erlang,apply,2}}]},
{<0.460.1>,0,
[{current_function,{shell,get_command1,5}},
{initial_call,{erlang,apply,2}}]},
{<0.457.1>,0,
[{current_function,{dist_util,con_loop,2}},
{initial_call,{inet_tcp_dist,do_accept,7}}]}]
Are these for a system after a fresh start, during the load test or after finished test?
@fenek its just after a few days of very low usage of me logging in and out. Have not put any load to it.
OK, please try the same commands (memory
and recon
) during/after some load test. With very little traffic it may be expected that some processes are not garbage collected.
Just a note ill be on this in a week or so. Just have a backlog of stuff I need to take care. Thanks for the assist!
Starting no load (1st attempt crashed)
(mongooseim@chat-chat-1)5> erlang:memory().
[{total,340255280},
{processes,307801440},
{processes_used,307799736},
{system,32453840},
{atom,785337},
{atom_used,761692},
{binary,1618104},
{code,15894237},
{ets,4218272}]
(mongooseim@chat-chat-1)6> recon:proc_count(memory, 10).
[{<0.38.0>,426512,
[code_server,
{current_function,{code_server,loop,1}},
{initial_call,{erlang,apply,2}}]},
{<0.1738.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1737.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1705.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1704.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1703.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1702.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1701.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1700.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1699.0>,372384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]}]
3 {{2018,9,19},{0,2,34}} no of users: 0, free ram: 912
(mongooseim@chat-chat-1)7> recon:proc_count(message_queue_len, 10).
[{<0.2163.0>,0,
[{current_function,{timer,sleep,1}},
{initial_call,{erlang,apply,2}}]},
{<0.2154.0>,0,
[{current_function,{shell,shell_rep,4}},
{initial_call,{erlang,apply,2}}]},
{<0.2151.0>,0,
[{current_function,{dist_util,con_loop,2}},
{initial_call,{inet_tcp_dist,do_accept,7}}]},
{<0.2140.0>,0,
[nksip_router_15,
{current_function,{gen_server,loop,7}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.2139.0>,0,
[nksip_router_14,
{current_function,{gen_server,loop,7}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.2138.0>,0,
[nksip_router_13,
{current_function,{gen_server,loop,7}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.2137.0>,0,
[nksip_router_12,
{current_function,{gen_server,loop,7}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.2136.0>,0,
[nksip_router_11,
{current_function,{gen_server,loop,7}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.2135.0>,0,
[nksip_router_10,
{current_function,{gen_server,loop,7}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.2134.0>,0,
[nksip_router_9,
{current_function,{gen_server,loop,7}},
{initial_call,{proc_lib,init_p,5}}]}]
4 {{2018,9,19},{0,2,36}} no of users: 0, free ram: 911
5 {{2018,9,19},{0,2,38}} no
while load testing
8199 {{2018,9,19},{0,9,10}} no of users: 1001, free ram: 401
(mongooseim@chat-chat-1)10> erlang:memory().
[{total,387548256},
{processes,348664408},
{processes_used,348149280},
{system,38883848},
{atom,793529},
{atom_used,781177},
{binary,3405416},
{code,16605815},
{ets,5920800}]
(mongooseim@chat-chat-1)11> recon:proc_count(memory, 10).
200 {{2018,9,19},{0,9,12}} no of users: 1001, free ram: 402
[{<0.2163.0>,4119720,
[{current_function,{io,execute_request,2}},
{initial_call,{erlang,apply,2}}]},
{<0.1583.0>,2917880,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1580.0>,1945584,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.660.0>,1945584,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.898.0>,1875888,
['mod_offline_kosmosmobile.com',
{current_function,{gen_server,loop,7}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.863.0>,973336,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.719.0>,973336,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.716.0>,973336,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.698.0>,973296,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1619.0>,973288,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]}]
(mongooseim@chat-chat-1)12> recon:proc_count(message_queue_len, 10).
[{<0.356.0>,621,
[lager_event,
{current_function,{io,execute_request,2}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.898.0>,12,
['mod_offline_kosmosmobile.com',
{current_function,{mnesia_locker,receive_wlocks,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.4943.0>,2,
[{current_function,{erlang,bif_return_trap,2}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.4954.0>,1,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.4952.0>,1,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.4950.0>,1,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.4948.0>,1,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.4946.0>,1,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.4351.0>,1,
[{current_function,{erlang,hibernate,3}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.4243.0>,1,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]}]
201 {{2018,9,19},{0,9,14}} no of users: 1001, free ram: 363
one minute after load test
367 {{2018,9,19},{0,14,48}} no of users: 83, free ram: 74
(mongooseim@chat-chat-1)16> erlang:memory().
[{total,641918536},
{processes,605680848},
{processes_used,605679472},
{system,36237688},
{atom,793529},
{atom_used,781177},
{binary,2026312},
{code,16605815},
{ets,4789792}]
(mongooseim@chat-chat-1)17> recon:proc_count(memory, 10).
[{<0.356.0>,262071800,
[lager_event,
{current_function,{io,execute_request,2}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.1580.0>,4720712,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1583.0>,2917880,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.698.0>,1945584,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.898.0>,1811976,
['mod_offline_kosmosmobile.com',
{current_function,{gen_server,loop,7}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.660.0>,1574208,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1587.0>,973624,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.720.0>,973432,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.888.0>,973384,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]},
{<0.1591.0>,973336,
[{current_function,{exometer_probe,loop,1}},
{initial_call,{proc_lib,init_p,3}}]}]
(mongooseim@chat-chat-1)18> recon:proc_count(message_queue_len, 10).
[{<0.356.0>,55137,
[lager_event,
{current_function,{io,execute_request,2}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.718.0>,169,
['ejabberd_mod_pubsub_loop_kosmosmobile.com',
{current_function,{gen,do_call,4}},
{initial_call,{mod_pubsub,send_loop,1}}]},
{<0.6115.0>,11,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.4984.0>,9,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.4788.0>,9,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.6111.0>,8,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.6109.0>,8,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.5901.0>,8,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.5487.0>,8,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]},
{<0.5036.0>,8,
[{current_function,{gen,do_call,4}},
{initial_call,{proc_lib,init_p,5}}]}]
368 {{2018,9,19},{0,14,50}} no of users: 83, free ram: 74
@fenek after that last load test i exited out of mongooseimctl debug and it stopped working and had to restart the container.
im disabling transparent huge page and running the container in host mode now gonna see if this helps
Hi @andrewvmail
Have the changes mentioned in your last comment helped?
@fenek, hasn't crashed yet. I think it did helped. Its stable like a rock now. I was gonna wait 1 more day before closing 😄. But I think its safe to close this today.
Thanks for the ping.
@fenek,
actually i pulled the trigger to soon ... it crashed last night at 5AM, it was working fine for 2 weeks up till yesterday. I just got to work and found out now.
According to the console snippets you've pasted, logging framework is a bottleneck. What log level have you configured in ejabberd.cfg
? If it's 4
, then try 3
.
Its actually at log 5, I'll set it to 3 and let you know how it is. But I think disabling THP and running it under host mode helped because it lasted 2 weeks instead of couple of days.
@andrewvmail
It's been more than 2 weeks since our conversation. How is the stability now?
Hi @fenek , it has been really stable all this time. I just checked it again just now to make sure, and we are still good. Thanks for all your help. I will close this issue.
MongooseIM version: 3.0.0-78-ge108c81 Installed from: mongooseim-docker (multistage) Erlang/OTP version: Erlang/OTP 20
Hello MongooseIM team,
Reporting something strange I am experiencing, happened twice now, after a few days of running mongoose in a container. mongooseim freezes no error message, i tried to dig around for an error message but i cant find any.
find . | grep crash.log
gives mewhich contains nothing
i tried to attach with
./bin/mongooseimctl
but got node downps aux
i still seethe mongoose process is started with this https://github.com/esl/mongooseim-docker/blob/master/member/start.sh
I'm running on a 2GB host and right now the docker stats out put:
From the looks of this I think its not because of exhausted memory
the last thing mongooseim outputs to tty is
if i restart the docker container it will start working again
Thanks!