It can happen when someone has renamed the file and reopened it before Blackhole attempted to write into it. Possibly logrotate does that - please check its config.
However, the current check-to-reopen logic is buggy and is subject to change.
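For context, check-to-reopen logic of this kind usually boils down to comparing the inode behind the open descriptor with the inode currently at the path, and reopening when they differ. A minimal sketch with a hypothetical helper (this is not Blackhole's actual code):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string>

// Reopen `path` if the descriptor no longer refers to the file currently
// living at that path (e.g. after logrotate renamed it away).
int reopen_if_rotated(int fd, const std::string &path) {
    struct stat by_fd, by_path;
    const bool rotated =
        ::fstat(fd, &by_fd) != 0 ||
        ::stat(path.c_str(), &by_path) != 0 ||   // path is gone: recreate it
        by_fd.st_dev != by_path.st_dev ||
        by_fd.st_ino != by_path.st_ino;          // different inode: renamed
    if (rotated) {
        ::close(fd);
        fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_APPEND, 0644);
    }
    return fd;
}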
Here is the logrotate section, nothing special:
{
daily
maxsize 2G
missingok
compress
dateext
dateformat -%Y.%m.%d-%s
rotate 300
}
Logrotate doesn't write into rotated logs, but it does rename them. It also compresses and eventually removes them. Blackhole doesn't see that the file has been renamed and continues writing into the old file descriptor, which is associated with the already renamed and compressed file.
I saw a check in the file sink code of v0.5 - the write() method checks whether rotation has to start right after writing a log message. Is that enough? There is no such code in v0.2.
What is the time estimate for fixing this check-to-reopen bug? It really hurts.
After renaming, it creates the file - check its logs. Check the nocreate option, it can help.
I've added the nocheck option, will monitor rotation, thank you.
It didn't help: logrotate with the above config plus the nocheck option missed log rotation:
ls -l /proc/26855/fd | grep log
l-wx------ 1 root root 64 Jul 29 04:13 3 -> /home/admin/elliptics/log/ioserv.log-2015.07.28-1438053902 (deleted)
Even worse, it starts eating memory very quickly - in minutes it eats all RAM and the server is killed by the oom-killer.
Here is the relevant trace:
Thread 1548 (Thread 0x7faf9cc16700 (LWP 26859)):
#0 0x00007fafa23d52a8 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fafa23d7a70 in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fafa2be407d in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007fafa4564e1b in allocate (this=<optimized out>, __n=<optimized out>) at /usr/include/c++/4.7/ext/new_allocator.h:94
#4 std::_Hashtable<std::string, std::pair<std::string const, blackhole::log::attribute_t>, std::allocator<std::pair<std::string const, blackhole::log::attribute_t> >, std::_Select1st<std::pair<std::string const, blackhole::log::attribute_t> >, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, true, false, true>::_M_allocate_node<std::pair<std::string const, blackhole::log::attribute_t> const&> (this=this@entry=0x7faf9cc0cfa0) at /usr/include/c++/4.7/bits/hashtable.h:625
#5 0x00007fafa4564f2a in std::_Hashtable<std::string, std::pair<std::string const, blackhole::log::attribute_t>, std::allocator<std::pair<std::string const, blackhole::log::attribute_t> >, std::_Select1st<std::pair<std::string const, blackhole::log::attribute_t> >, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, true, false, true>::_M_insert_bucket<std::pair<std::string const, blackhole::log::attribute_t> const&> (this=this@entry=0x7faf9cc0cfa0, __v=..., __n=__n@entry=15, __code=__code@entry=17805607422174315343) at /usr/include/c++/4.7/bits/hashtable.h:1297
#6 0x00007fafa45545b9 in std::_Hashtable<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, blackhole::log::attribute_t>, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, blackhole::log::attribute_t> >, std::_Select1st<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, blackhole::log::attribute_t> >, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, true, false, true>::_M_insert<std::pair<std::basic_string<char> const, blackhole::log::attribute_t> const&> (this=this@entry=0x7faf9cc0cfa0, __v=...) at /usr/include/c++/4.7/bits/hashtable.h:1337
#7 0x00007fafa456507c in insert (__v=..., this=0x7faf9cc0cfa0) at /usr/include/c++/4.7/bits/hashtable.h:545
#8 std::_Hashtable<std::string, std::pair<std::string const, blackhole::log::attribute_t>, std::allocator<std::pair<std::string const, blackhole::log::attribute_t> >, std::_Select1st<std::pair<std::string const, blackhole::log::attribute_t> >, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, true, false, true>::insert<std::__detail::_Node_const_iterator<std::pair<std::string const, blackhole::log::attribute_t>, false, true> > (this=this@entry=0x7faf9cc0cfa0, __first=..., __last=..., __last@entry=...) at /usr/include/c++/4.7/bits/hashtable.h:1419
#9 0x00007fafa4565156 in blackhole::merge (args=...) at /usr/include/blackhole/attribute.hpp:141
#10 0x00007fafa4566400 in blackhole::logger_base_t::open_record (this=this@entry=0x1599ef8, local_attributes=...) at /usr/include/blackhole/implementation/logger.ipp:131
#11 0x00007fafa4568539 in blackhole::verbose_logger_t<blackhole::defaults::severity>::open_record (this=0x1599ef8, level=blackhole::defaults::info, local=...) at /usr/include/blackhole/logger.hpp:219
#12 0x00007fafa45687cf in blackhole::wrapper_t<blackhole::verbose_logger_t<blackhole::defaults::severity> >::open_record<blackhole::defaults::severity> (this=this@entry=0x159a220, level=level@entry=blackhole::defaults::info, attributes=...) at /usr/include/blackhole/logger/wrapper.hpp:119
#13 0x00007fafa45853b3 in dnet_log_open_record (logger=0x159a220, level=blackhole::defaults::info) at /tmp/buildd/elliptics-2.26.5.5/bindings/cpp/logger.cpp:195
#14 0x00007fafa41d58d9 in dnet_cmd_backend_control_dangerous (data=0x7fa699edb958, cmd=0x7fa699edb8e0, st=0x6fed9f0) at /tmp/buildd/elliptics-2.26.5.5/library/backend.cpp:616
#15 dnet_cmd_backend_control (st=st@entry=0x6fed9f0, cmd=cmd@entry=0x7fa699edb8e0, data=data@entry=0x7fa699edb958) at /tmp/buildd/elliptics-2.26.5.5/library/backend.cpp:718
#16 0x00007fafa41c71c6 in dnet_process_cmd_without_backend_raw (data=0x7fa699edb958, cmd=0x7fa699edb8e0, st=0x6fed9f0) at /tmp/buildd/elliptics-2.26.5.5/library/dnet.c:870
...
Btw, that log file was quite big - many gigabytes. It is possible that the logger tried to copy all those logs via a plain memcpy() or something similar. If that is true, why does it start this copy? Logrotate has already moved the logfile...
Option is called nocreate, not nocheck.
No, BH never copies files, only mv; and with your watcher (move) it does nothing of the kind, it just monitors the file's existence.
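The move watcher essentially reduces to an existence check on the original path before each write; a minimal sketch under that assumption (illustrative names, not the actual blackhole::sink::rotation::watcher::move_t source):

#include <sys/stat.h>
#include <string>

struct move_watcher_t {
    std::string path;

    // True when the file is no longer at its original path, i.e. it was
    // renamed or removed and the sink should reopen it before writing.
    bool rotation_needed() const {
        struct stat st;
        return ::stat(path.c_str(), &st) != 0;
    }
};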
That backtrace shows that BH was interrupted in the middle of merging hashtables; that can't eat gigabytes of memory.
You can force log rotation by just calling logrotate manually.
I meant nocreate, of course:
/home/admin/elliptics/log/*.log /srv/elliptics/log/*.log /var/log/elliptics/*.log
{
daily
maxsize 2G
missingok
nocreate
compress
dateext
dateformat -%Y.%m.%d-%s
rotate 300
postrotate
killall -HUP backrunner
endscript
}
Hmm... If BH has failed to write a log record, it reports it to stdout/stderr - please check there.
If you use the boost::filesystem backend, there can be trouble when there is no write access - boost::filesystem has an internal buffer, which grows rapidly when it fails to write.
There is no stderr/stdout output, since elliptics closes those descriptors at startup when daemonizing. But there cannot be any permission violation, since the server runs with superuser privileges on a bare-metal server without containerisation.
I suppose that if there were a write-access problem, it would fire every time logrotate runs, but in practice this happens only occasionally on any given server - it looks more like a race condition.
Is there anything we can provide to help solve this issue? We hit this memory problem once a week or so.
Config, stdout/stderr, memory dump. I'm pretty sure it's boost filesystem collecting chunks for some reason.
"logger": {
"frontends": [
{
"formatter": {
"type": "string",
"pattern": "%(timestamp)s %(request_id)s/%(lwp)s/%(pid)s %(severity)s: %(message)s %(...L)s"
},
"sink": {
"type": "files",
"path": "/home/admin/elliptics/log/ioserv.log",
"autoflush": true,
"rotation": {
"move": 0
}
}
}
],
"level": "info"
},
A memory dump takes tens of gigabytes when the process gets killed, which is unfeasible. stdout/stderr are not available - the process closes all descriptors at the very beginning of its life.
Then I can't help until we have the same problem in our production.
Without stdout and a memory dump, the only way to understand what's going on is to attach to the process and inspect the logger's state (logger -> frontends -> sink -> writer -> internal buffer and so on).
Blackhole is a logger, and you propose to create another logger for the logger, to put the first logger's messages (written to stdout) into a different log? This is not going to work.
I've generated a core file, it takes 44 GB; compression hasn't completed yet, but it has already produced 200 MB of data. While compression is in progress, here is the relevant thread's stack:
#0 0x00007efce5fd419d in write () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007efce54c3136 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2 0x00007efce54fb35f in std::basic_filebuf<char, std::char_traits<char> >::_M_convert_to_external(char*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007efce54fbc43 in std::basic_filebuf<char, std::char_traits<char> >::overflow(int) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007efce54f974f in std::basic_filebuf<char, std::char_traits<char> >::sync() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007efce54dc00e in std::ostream::flush() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007efce6e56db0 in flush (this=<optimized out>) at /usr/include/blackhole/sink/files/backend.hpp:93
#7 flush (this=<optimized out>) at /usr/include/blackhole/sink/files/flusher.hpp:21
#8 handle (message=..., this=0xe49c68) at /usr/include/blackhole/sink/files.hpp:67
#9 blackhole::sink::files_t<blackhole::sink::files::boost_backend_t, blackhole::sink::rotator_t<blackhole::sink::files::boost_backend_t, blackhole::sink::rotation::watcher::move_t, blackhole::sink::rotation::timer_t> >::consume (this=<optimized out>, message=..., attributes=...) at /usr/include/blackhole/sink/files.hpp:120
#10 0x00007efce6e57031 in blackhole::handler::sink_t<blackhole::sink::files_t<blackhole::sink::files::boost_backend_t, blackhole::sink::rotator_t<blackhole::sink::files::boost_backend_t, blackhole::sink::rotation::watcher::move_t, blackhole::sink::rotation::timer_t> >, void>::consume<std::string, std::unordered_map<std::string, blackhole::log::attribute_t, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, blackhole::log::attribute_t> > > const&>(std::string&&, std::unordered_map<std::string, blackhole::log::attribute_t, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, blackhole::log::attribute_t> > > const&) (
this=0xe1a4d0) at /usr/include/blackhole/frontend.hpp:81
#11 0x00007efce6e570fe in blackhole::frontend_t<blackhole::formatter::string_t, blackhole::sink::files_t<blackhole::sink::files::boost_backend_t, blackhole::sink::rotator_t<blackhole::sink::files::boost_backend_t, blackhole::sink::rotation::watcher::move_t, blackhole::sink::rotation::timer_t> > >::handle (this=0xe1a4c0, record=...)
at /usr/include/blackhole/frontend/files.hpp:25
#12 0x00007efce6e26dfc in blackhole::logger_base_t::push(blackhole::log::record_t&&) const (this=0xe0fef8, record=...) at /usr/include/blackhole/implementation/logger.ipp:156
#13 0x00007efce6e48229 in push (record=..., this=<optimized out>) at /usr/include/blackhole/logger/wrapper.hpp:123
#14 dnet_log_write (logger=0xe10220, record=0x7efcdecd86c0, format=<optimized out>) at /tmp/buildd/elliptics-2.26.5.5/bindings/cpp/logger.cpp:263
#15 0x00007efce6a9890d in dnet_cmd_backend_control_dangerous (data=0x7ef2f4b8b6d8, cmd=0x7ef2f4b8b660, st=0x7efca4c19ef0) at /tmp/buildd/elliptics-2.26.5.5/library/backend.cpp:616
#16 dnet_cmd_backend_control (st=st@entry=0x7efca4c19ef0, cmd=cmd@entry=0x7ef2f4b8b660, data=data@entry=0x7ef2f4b8b6d8) at /tmp/buildd/elliptics-2.26.5.5/library/backend.cpp:718
#17 0x00007efce6a8a1c6 in dnet_process_cmd_without_backend_raw (data=0x7ef2f4b8b6d8, cmd=0x7ef2f4b8b660, st=0x7efca4c19ef0) at /tmp/buildd/elliptics-2.26.5.5/library/dnet.c:870
#18 dnet_process_cmd_raw (backend=backend@entry=0x0, st=st@entry=0x7efca4c19ef0, cmd=0x7ef2f4b8b660, data=0x7ef2f4b8b6d8, recursive=recursive@entry=0)
at /tmp/buildd/elliptics-2.26.5.5/library/dnet.c:1054
#19 0x00007efce6e11fb2 in dnet_process_recv (backend=0x0, st=st@entry=0x7efca4c19ef0, r=r@entry=0x7ef2f4b8b610) at /tmp/buildd/elliptics-2.26.5.5/library/net.c:734
#20 0x00007efce6e3c470 in dnet_io_process (data_=0xe4efc0) at /tmp/buildd/elliptics-2.26.5.5/library/pool.c:1108
#21 0x00007efce5fccb50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#22 0x00007efce4cfb95d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#23 0x0000000000000000 in ?? ()
What do you want me to check, and what commands should I run?
If boost::filesystem is unable to write/flush, it throws an exception, which is caught by BH. Since there is no other place where it can report the error, it writes it to stdout.
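In other words, the error path being described is roughly the following (a hypothetical sketch of a sink's consume wrapper, not the exact Blackhole code):

#include <iostream>
#include <stdexcept>
#include <string>

// A write/flush failure in the backend is caught and reported to stdout,
// because the logger has nowhere else to report its own errors.
template <typename Backend>
void safe_consume(Backend &backend, const std::string &message) {
    try {
        backend.write(message);   // may throw, e.g. from boost::filesystem
        backend.flush();
    } catch (const std::exception &err) {
        std::cout << "failed to write log record: " << err.what() << std::endl;
    }
}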
There should be a buffer inside basic_filebuf; its size is required. Upd: _M_buf_size, _M_buf_allocated and _M_ext_buf_size.
Looks like a normal elliptics log message, except for some Python stuff (probably gdb-related). It is stuck in the write call, the buffer size looks like 175 bytes. The log file does contain that line.
I didn't find the std::basic_filebuf pointer though.
NB: I've sent you a mail with the core file URL and the elliptics binary.
#1 0x00007efce54c3136 in (anonymous namespace)::xwrite (__fd=3,
__s=0x7efcc8815050 "2015-08-04 10:39:29.633711 ", '0' <repeats 16 times>, "/20718/20713 INFO: backend_control: received BACKEND_CONTROL: backend_id: 13, command: 2, state: addr:57587 'backend_id': 13\nime: 273 usecs, err: 0. \n"..., __n=175) at basic_file.cc:121
#2 0x00007efce54fb35f in std::basic_filebuf<char, std::char_traits<char> >::_M_convert_to_external (this=this@entry=0xe49c78,
__ibuf=0x7efcc8815050 "2015-08-04 10:39:29.633711 ", '0' <repeats 16 times>, "/20718/20713 INFO: backend_control: received BACKEND_CONTROL: backend_id: 13, command: 2, state: addr:57587 'backend_id': 13\nime: 273 usecs, err: 0. \n"..., __ilen=175)
at /home/packages/gcc/4.7/w/gcc-4.7-4.7.2/build/x86_64-linux-gnu/libstdc++-v3/include/bits/fstream.tcc:523
#3 0x00007efce54fbc43 in std::basic_filebuf<char, std::char_traits<char> >::overflow (this=0xe49c78, __c=-1)
at /home/packages/gcc/4.7/w/gcc-4.7-4.7.2/build/x86_64-linux-gnu/libstdc++-v3/include/bits/fstream.tcc:450
#4 0x00007efce54f974f in sync (this=<optimized out>) at /home/packages/gcc/4.7/w/gcc-4.7-4.7.2/build/x86_64-linux-gnu/libstdc++-v3/include/bits/fstream.tcc:902
#5 std::basic_filebuf<char, std::char_traits<char> >::sync (this=<optimized out>)
at /home/packages/gcc/4.7/w/gcc-4.7-4.7.2/build/x86_64-linux-gnu/libstdc++-v3/include/bits/fstream.tcc:894
#6 0x00007efce54dc00e in pubsync (this=<optimized out>) at /home/packages/gcc/4.7/w/gcc-4.7-4.7.2/build/x86_64-linux-gnu/libstdc++-v3/include/streambuf:269
#7 std::ostream::flush (this=0xe49c70) at /home/packages/gcc/4.7/w/gcc-4.7-4.7.2/build/x86_64-linux-gnu/libstdc++-v3/include/bits/ostream.tcc:221
#8 0x00007efce6e56db0 in flush (this=<optimized out>) at /usr/include/blackhole/sink/files/backend.hpp:93
#9 flush (this=<optimized out>) at /usr/include/blackhole/sink/files/flusher.hpp:21
#10 handle (message=
"2015-08-04 10:39:29.633711 ", '0' <repeats 16 times>, "/20718/20713 INFO: backend_control: received BACKEND_CONTROL: backend_id: 13, command: 2, state: addr:57587 'backend_id': 13", this=0xe49c68) at /usr/include/blackhole/sink/files.hpp:67
#11 blackhole::sink::files_t<blackhole::sink::files::boost_backend_t, blackhole::sink::rotator_t<blackhole::sink::files::boost_backend_t, blackhole::sink::rotation::watcher::move_t, blackhole::sink::rotation::timer_t> >::consume (this=<optimized out>, message=
Python Exception <type 'exceptions.IndexError'> list index out of range:
"2015-08-04 10:39:29.633711 ", '0' <repeats 16 times>, "/20718/20713 INFO: backend_control: received BACKEND_CONTROL: backend_id: 13, command: 2, state: addr:57587 'backend_id': 13", attributes=std::unordered_map with 8 elements) at /usr/include/blackhole/sink/files.hpp:120
Any progress on this issue? The OOM bug happens on a weekly basis.
I have no idea how to reproduce this and I doubt that it's Blackhole's fault. You definitely need to profile the heap to see what's going on.
Is this 'works-on-my-machine' all you can suggest based on a full 44 GB crash dump? No debug prints, no 'disable this and let's see what happens'?
Blackhole is the only active entity at the moment: some other threads are waiting on the Blackhole lock, and all the rest are sleeping on condition variables (not even locks, which might look like a deadlock), so it's the primary target. It would be really great if you helped prove that the bug is not in the logger.
It works on my 110 machines, so yes. Your core dump shows nothing interesting inside BH. The only thing I can suggest without direct access is to run elliptics under a heap profiler; it shows which entity eats the most memory. It's easy.
"Works on exactly the same 110 machines of mine", i.e. works on my machine - that is not enough for software to be bug-free. For example, the bug with file descriptors referring to a removed log file has existed for years, and you are quite lucky to live in an environment where it doesn't show up - on every one of your 110 machines. Probably you have the same 'good' log rotation config, or use syslog, or some other sinks.
I cannot run elliptics under valgrind: it becomes about 100+ times slower and virtually no big file-upload requests succeed. What else can I try? Will switching to syslog help?
We run with two frontends: a string formatter with a files sink and move rotation, and a JSON formatter with a UDP sink. Once there was a 3-day run with logging into a removed fd, but we've fixed that via the logrotate config.
Writing to syslog may help, because it has no internal buffering and is lock-free.
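For reference, a syslog sink ultimately reduces to the standard syslog(3) calls, which hand each record straight to the system logger, so there is no file descriptor to rotate in the process. A standalone sketch, independent of Blackhole's own syslog sink:

#include <syslog.h>

int main() {
    openlog("ioserv", LOG_PID, LOG_DAEMON);   // identify the process in syslog
    syslog(LOG_INFO, "backend_control: received BACKEND_CONTROL");
    closelog();
    return 0;
}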
Valgrind is not the only way to profile the heap - linking and running with tcmalloc or jemalloc with the appropriate environment variables provides the ability to dump heap stats without a huge performance penalty.
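If linking against gperftools is acceptable, the heap profiler can also be driven programmatically; a sketch assuming the binary is built with -ltcmalloc and the gperftools headers (not something elliptics does out of the box):

#include <gperftools/heap-profiler.h>

void profile_rotation_window() {
    HeapProfilerStart("/tmp/ioserv.heap");   // prefix for the .heap dump files
    // ... run the workload that leaks, e.g. wait for a logrotate cycle ...
    HeapProfilerDump("after-rotation");      // writes a numbered .heap snapshot
    HeapProfilerStop();
}

// The dumps can then be compared with the pprof tool against the ioserv
// binary to see which call sites hold the most memory.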
With the file sink, the only place where internal buffering may occur is inside fstream, but your core dump shows that there isn't much space allocated there.
Okay, to point you at where the potential leak may occur, I'll show how Blackhole works.
Can it block for a long enough period (either because of rotation, or writing into a rotated file which was removed, or anything else - the files.hpp#L112 link and the ->handle() call, for example) that we would fill up the queues? Like a minute or so?
It can block only on syscalls, but other threads are locked in this case.
JFYI, by default Blackhole v1 won't have rotation support. Historically there is a convenient tool named logrotate for rotating, zipping logs, etc., and it's not reasonable to copy all that logic into the logger. Instead, a logger should have the ability to reopen files when required. It can be implemented in several ways:
Blackhole v1 will provide the first method and I believe that it's the only true way. Le scheme:
We frequently see this issue in elliptics:
# ls -l /proc/19808/fd/ | grep log
l-wx------ 1 root root 64 Jul 28 01:49 3 -> /home/admin/elliptics/log/ioserv.log-2015.07.22-1437535502 (deleted)
ioserv.log - the elliptics log file - was moved by logrotate 6 days ago, but Blackhole didn't notice that and didn't reopen the logfile. Here is the relevant config section:
We use Blackhole 0.2, if that matters. Rotation has changed a little bit - additional checks were introduced, which might explain why 'size' rotation doesn't work in 0.2, but the common 'move' logic is the same, afaics.
It doesn't happen all the time, but still quite frequently.