Maybe some outsiders think that we haven't made any progress here, but that's not true. My reproduction strategy is getting better and better (slowly, but surely), and my cause-search strategy, Git bisect, will definitely find the problem's cause.
Hi @bunghi @MarkNReynolds @Napsty @mattrose @jivuvtramvae @verboEse @BastiBr @Sec42 and all others facing this bug,
We pushed several changes to our snapshot packages which affect and may mitigate the JSON-RPC crash (and another segfault related to notification result syncing). It would be of great help to us if you could test those packages in your environments and provide feedback. Please enable core dumps beforehand so that we get as much information as possible if the crash still occurs. If the crash happens "reliably" in your setup, please share as many details as possible about it.
Thanks and cheers, Eric
... and here's how you enable core dumps: https://github.com/Al2Klimov/icinga-iliketrains/blob/001c57022b825bf8f5e4cf6845d191dca7d3a847/ansible-playbook.yml#L152
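For a quick local test without the playbook, a process can also raise its own core dump limit at startup. A minimal sketch using plain POSIX setrlimit() (the playbook, ulimit -c or LimitCORE= in the systemd unit remain the proper way for the test envs):

#include <cstdio>
#include <sys/resource.h>

// Raise this process's core dump size limit (soft limit) up to the allowed
// hard limit, so a crash actually produces a core file. Where the core file
// ends up is still controlled by kernel.core_pattern.
int main()
{
    rlimit rl {};

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return 1;
    }

    rl.rlim_cur = rl.rlim_max; // often "unlimited" already

    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return 1;
    }

    return 0;
}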
Good news: The updated env crashed as expected.
Bad news: That took about 2.5 days. Better late than never.
Core dump & co.: https://nextcloud.icinga.com/index.php/s/nJ6yeAY2cFeQYyL
The crashed env was 1c79c2b34. Now I deployed e909302fd. I'll give you an update within the next 1-2.5 days.
Some notes.
It might have an influence if both endpoints attempt to connect to each other, with one side being cleared right afterwards. Or many broken pipes, e.g. the other side not closing the socket cleanly.
Other research findings; not sure if they match.
https://github.com/ceph/spawn/commits/master is a header-only fork of asio::spawn by a Red Hat developer that uses callcc instead of fcontext from the coroutine library. It removes the coroutine layer entirely.
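For context, here's a minimal boost.context sketch of what callcc looks like (just the plain library API, not the ceph/spawn fork itself), since that's the primitive the fork switches to:

#include <iostream>
#include <boost/context/continuation.hpp>

namespace ctx = boost::context;

int main()
{
    // callcc ("call with current continuation") runs the lambda on its own
    // stack; resume() switches back and forth between it and main.
    ctx::continuation c = ctx::callcc([](ctx::continuation&& caller) {
        std::cout << "inside the spawned context\n";
        caller = caller.resume();   // yield back to main
        std::cout << "resumed once more\n";
        return std::move(caller);   // finish, control goes back to main
    });

    std::cout << "back in main\n";
    c = c.resume(); // run the context to completion
}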
Might again have something to do with stack unwinding and exceptions. Worth collecting the Boost versions and OSes into a table to look for patterns?
e909302fd didn't crash, so we have a commit range: (e909302fd..1c79c2b34]
I'll continue with 794374a32 and give you an update on Monday.
Thanks!
I have another theory: a combination of stack sizes, recursion and large JSON blobs processed inside the coroutines. If the connection between agent and master is dropped uncleanly, say a broken pipe or connection reset, a still-running coroutine may leak and never terminate. Or something else corrupts the stack, or we hit the open file limit.
This may be influenced by setups where connections are established in both directions, contrary to what's written in the docs. In that case a timer kills off the secondary connection for the endpoint, on both sides (race condition territory).
Similar to the number of spawned threads, we should probably measure how many coroutines we have spawned and log that periodically with a timer, similar to the WorkQueues, plus an API stats metric.
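A minimal sketch of such a counter, using the classic Boost.Coroutine-based asio::spawn() interface we're on here (the wrapper name and the idea of exporting the value via the stats API are assumptions, not existing Icinga 2 code):

#include <atomic>
#include <cstdint>
#include <boost/asio/io_context.hpp>
#include <boost/asio/spawn.hpp>

// Hypothetical helper: count how many coroutines are currently alive so a
// periodic logger (or an API stats metric) can report the number, similar
// to the WorkQueue statistics.
static std::atomic<std::uint64_t> l_RunningCoroutines (0);

template<class Context, class Function>
void SpawnCountedCoroutine(Context& ctx, Function fn)
{
    boost::asio::spawn(ctx, [fn](boost::asio::yield_context yc) {
        ++l_RunningCoroutines;

        try {
            fn(yc);
        } catch (...) {
            --l_RunningCoroutines;
            throw;
        }

        --l_RunningCoroutines;
    });
}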
794374a32 crashed. New commit range: (e909302fd..794374a32] - https://github.com/icinga/icinga2/compare/e909302..794374a
I'll continue with 37de1a919 and give you an update on Monday.
Follow-up read on the Ceph discussion: https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/6LBFZIFUPTJQ3SNTLVKSQMVITJWVWTZ6/?sort=date
The pooled_fixedsize_stack appears to have the same 2 limitations as the current fixedsize_stack: each stack has a fixed size of 128K by default, and there's no memory protection to detect stack overflows. The only difference seems to be that freed stacks are kept in a boost::pool to be reused for later stack allocations.
The motivation for enforcing a limit on the per-connection stack size is to help scale to lots of concurrent connections. However, if the requests are mostly object gets/puts, we'll be buffering up to 16M of object data per request and that's going to dwarf these stacks.
I think the protected_fixedsize_stack gives us the most flexibility here. In addition to the overflow detection from mmap+mprotect, it lets us map a comfortable amount of memory without having to allocate pages for them up front, and essentially behaving like the pthread stacks we're used to.
I've tried implementing their fork, but it uses several C++14/17 features. Let's see if we can boil it down to just what's needed. https://github.com/Icinga/icinga2/pull/7762 - https://github.com/boostorg/asio/pull/55/files
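To illustrate what the protected stack buys us, here's a minimal boost.coroutine2 sketch with protected_fixedsize_stack; it's not what asio::spawn uses internally and not the PR's code, it just shows the mmap+mprotect guard-page allocator in isolation:

#include <iostream>
#include <boost/coroutine2/all.hpp>

using coro_t = boost::coroutines2::coroutine<int>;

int main()
{
    // protected_fixedsize_stack reserves the requested size with mmap and
    // adds a guard page via mprotect, so a stack overflow faults immediately
    // instead of silently corrupting neighbouring memory; physical pages are
    // only committed once touched.
    boost::coroutines2::protected_fixedsize_stack stack(1024 * 1024);

    coro_t::pull_type numbers(stack, [](coro_t::push_type& yield) {
        for (int i = 0; i < 3; ++i) {
            yield(i);
        }
    });

    for (int n : numbers) {
        std::cout << n << '\n';
    }
}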
Other collected thoughts.
https://github.com/boostorg/beast/issues/1055
I built a REST-based server upon the http_client_coro.cpp example which uses stackful coroutines. During one of our request handlers the app needs to reread an XML config file using boost::property_tree. In the read_xml_internal method we get an Access Violation exception. The top of the stack shows the function _chkstk() being called. Using the VS2017 core guideline checker we get a C6262 warning in the boost property tree that too much stack space is being used. We set the Linker->System->Stack Reserve Size to 4194304 but the crash still happens.
https://groups.google.com/forum/#!topic/boost-list/Lgo931jYc0E
In C and C++, each coroutine (and thread for that matter) must have a single, contiguous stack which, in addition, cannot grow. Because of this, you should 1) avoid large local variables (e.g., arrays), 2) avoid recursion of any kind, 3) avoid calling any other functions that use a lot of stack (see 1) and 2)), 4) have enough stack space for signal handling [actually, the best thing to do is to install an alternate signal stack by using sigaltstack()].
If you're running in 64-bit mode, try setting the stack size of each coroutine to something like 1MB; the system's demand-paging logic will actually allocate as much physical memory as the program needs. But the fundamental limitations listed above remain.
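Following that advice: the classic Boost.Coroutine-based asio::spawn() overloads accept an optional boost::coroutines::attributes parameter for the stack size, so a 1 MiB stack would look roughly like this (plain sketch, not the Icinga 2 wrapper):

#include <chrono>
#include <boost/asio/io_context.hpp>
#include <boost/asio/spawn.hpp>
#include <boost/asio/steady_timer.hpp>
#include <boost/coroutine/attributes.hpp>

int main()
{
    boost::asio::io_context io;

    // Ask for a 1 MiB coroutine stack; thanks to demand paging only the
    // pages actually touched get physical memory.
    boost::asio::spawn(io, [&io](boost::asio::yield_context yc) {
        boost::asio::steady_timer timer(io, std::chrono::seconds(1));
        timer.async_wait(yc);
    }, boost::coroutines::attributes(1024 * 1024));

    io.run();
}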
My last guess for now is that with the stack protection changes implemented in the kernel because of the Intel microcode fuckup, these stackful coroutines behave differently than they used to. On Windows, there was definitely some stack protection going on, corrupting the stack with guard pages.
I'd say this wouldn't crash if run on a kernel without these stack protection patches.
37de1a919 crashed. New commit range: (e909302..37de1a919]
I'll continue with 64568f596 and give you an update on Monday.
@lippserd Congrats, 1c79c2b34 crashes w/ (1) in about 3h. I'll re-test it and give you an update ASAP.
- hosts: 'icingas:!aklimov-iliketrains2-master1:!aklimov-iliketrains2-master2'
  become: yes
  become_method: sudo
  tasks:
  - copy:
      dest: /vola.pl
      owner: root
      group: root
      mode: '0755'
      content: |
        #!/usr/bin/perl
        while (1) {
          sleep((rand() * 9 + 1) * 60);
          print "OFF\n";
          `iptables -I INPUT -p tcp -m tcp --dport 5665 -j DROP`;
          `iptables -I OUTPUT -p tcp -m tcp --dport 5665 -j DROP`;
          sleep((rand() * 4 + 1) * 60);
          print "ON\n";
          `iptables -D INPUT -p tcp -m tcp --dport 5665 -j DROP`;
          `iptables -D OUTPUT -p tcp -m tcp --dport 5665 -j DROP`
        }
  - copy:
      dest: /etc/systemd/system/vola.service
      owner: root
      group: root
      mode: '0644'
      content: |
        [Unit]
        [Service]
        ExecStart=/vola.pl
        [Install]
        WantedBy=multi-user.target
  - systemd:
      daemon_reload: yes
  - service:
      name: vola
      state: started
      enabled: yes
The 2nd time it needed 24h. I'll re-test it again and give you an update ASAP.
Meanwhile, walking through the diff leads to the following assumptions:
64568f596 didn't crash. New commit range: (64568f596..37de1a919]
I'll continue with 56894bea1 and give you an update on Thursday.
56894bea1 was killed, not what I expected.
I'll continue with 02db12ae0 and give you an update on Friday.
The remaining commit range and the changes are a bit long, but some commits seem not to build. I've patched them:
for c in $(git log --format=%h --first-parent 64568f5..37de1a9); do git checkout $c; git cherry-pick 56894bea17df0f7ede19dd1df2b093033ea4190c; git push origin HEAD:refs/heads/aklimov/iliketrains4/$c; done
... and will hopefully continue with 43e9ae5f8.
I've added the network volatility to the test env spec as it crashed again in less than two days.
The master (18eb06e33) crashed, on both master nodes this time.
Just a comment from my side here, since questions are coming up: I now see the bisect strategy from @Al2Klimov as the only valid strategy to find out about the error.
Currently, I don't have further ideas for mitigating the error, except for raising the stack limit for coroutines. That would highlight a stack overflow, but such a problem is not necessarily solved by additional memory.
This involves waiting without immediate feedback, and is totally different from a scenario where you debug something and it crashes after an hour in your dev environment.
After finding the commit which introduces the problem, we will certainly need to invest development time to actually create a patch. That patch again requires tests, and only then will we consider releasing the 2.12 RC.
Currently this problem alone blocks the entire release chain for the IcingaDB release, including Web 2.8, the icingadb module, IcingaDB itself and Icinga 2.12.
After that's done, backporting the fix to 2.11, testing it again and then releasing a bugfix release is on the TODO list.
Cheers, Michael
I now see the bisect strategy from @Al2Klimov as the only valid strategy to find out about the error.
Finally somebody says it!
43e9ae5 didn't crash. New commit range: (43e9ae5..37de1a9]
I'll continue with 268c8b395 and give you an update on Monday.
Thanks. Boils down to https://github.com/icinga/icinga2/compare/43e9ae5..37de1a9
Since the Timer is something we implement ourselves, I'll investigate here for a possible mitigation/change.
268c8b395 got killed, but didn't segfault. Assuming it's not affected. New commit range: (268c8b395..37de1a919]
I'll continue with 0438c866f and give you an update on Wednesday.
Thanks a lot for the update. This removes the Timer fixes. https://github.com/icinga/icinga2/compare/268c8b3..37de1a9
Now I need to wait for further updates; from the diff, I currently can't tell what would cause the crash.
... and I don't yet know whether 268c8b3 actually is unaffected.
I've been analyzing #7771 with the boost::filesystem::remove_all call. There's a certain chance that these recursive attempts (and the failure when a file cannot be deleted) cause a stack overflow, as mentioned here: http://boost.2283326.n4.nabble.com/filesystem-remove-all-error-handling-td2603383.html
Specifically, ApiListener::ConfigUpdateHandler() is called every time we receive a cluster config sync update. I've seen this exception on both Windows in #7761 (1.69.0) and Debian 10 in #7771 (1.67.0.1) now.
[2020-01-20 15:18:38 +0100] warning/JsonRpcConnection: Error while processing message for identity 'broker01.example.com'
Error: boost::filesystem::remove: Directory not empty: "/var/lib/icinga2/api/zones-stage/"
(0) libboost_filesystem.so.1.67.0: <unknown function> (+0xe2ef) [0x7fc4bd0062ef]
(1) libboost_filesystem.so.1.67.0: boost::filesystem::detail::remove_all(boost::filesystem::path const&, boost::system::error_code*) (+0xb6) [0x7fc4bd006946]
(2) icinga2: icinga::Utility::RemoveDirRecursive(icinga::String const&) (+0x8f) [0x63967f]
(3) icinga2: icinga::ApiListener::ConfigUpdateHandler(boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&) (+0x55b) [0x753aab]
(4) icinga2: std::_Function_handler<icinga::Value (boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&), icinga::Value (*)(boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&)>::_M_invoke(std::_Any_data const&, boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&) (+0xf) [0x7a828f]
(5) icinga2: icinga::JsonRpcConnection::MessageHandler(icinga::String const&) (+0x5b6) [0x7d8d66]
(6) icinga2: icinga::JsonRpcConnection::HandleIncomingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) (+0x1b6) [0x7da926]
(7) /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2() [0x858f17]
(8) libboost_context.so.1.67.0: make_fcontext (+0x2f) [0x7fc4bd63673f]
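For reference, boost::filesystem::remove_all also has a non-throwing overload that reports failures through an error_code; a minimal sketch of that pattern (the helper name is made up, this is not the actual Utility::RemoveDirRecursive code):

#include <iostream>
#include <boost/filesystem.hpp>

// Hypothetical helper: remove a directory tree without throwing. Races like
// "Directory not empty" (files re-created while remove_all() walks the tree)
// then surface as an error_code instead of an exception that has to unwind
// through a coroutine stack.
bool RemoveDirBestEffort(const boost::filesystem::path& dir)
{
    boost::system::error_code ec;
    boost::filesystem::remove_all(dir, ec);

    if (ec) {
        std::cerr << "remove_all(" << dir << "): " << ec.message() << '\n';
        return false;
    }

    return true;
}

int main()
{
    return RemoveDirBestEffort("/var/lib/icinga2/api/zones-stage/") ? 0 : 1;
}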
I'll continue with reading and testing tomorrow.
@lippserd @gethash @Al2Klimov I had another idea while having a shower in the morning - what if the 64KB stack size for coroutines influences this behaviour when boost::filesystem::remove_all is run on a huge set of directories and files coming from zones.d? The recursion may have an influence and cause a stack overflow.
RenameFile might also be involved, since it is also used with SaveJsonFile in PKI requests. So possibly a more generic problem with boost::filesystem usage over time.
A similar pattern to what I had already analyzed with the JSON decoding.
Also interesting to read with recursion and stackful operations in mind:
0438c86 didn't crash. New commit range: (0438c86..37de1a9]
I'll continue with a630d0185 and give you an update on Friday.
Ok, thanks, then the boost::filesystem::remove_all thingy is a different one, likely sourced from user cleanup scripts or corrupt filesystems.
https://github.com/icinga/icinga2/compare/0438c86..37de1a9
For the API Event Queue, we're registering signal handlers in lib/icinga/apievents.cpp. Maybe the condition change in
if (queues.empty() && !inboxes)
    return;
now continues and fills the inboxes for some reason. I need to test-drive this locally just to be sure about it.
a630d01 crashed. First bad commit: a630d01
Not really helpful, as we've already reverted it and that revert still crashed.
@dnsmichi Please list the (merge) commits I shall apply to each first-parent commit of v2.10.0..v2.11.0 for a new bisect in addition to cb21086d6.
We had a longer offline meeting, so here's another approach:
First bad:
a630d01 - Merge pull request #6722 from Icinga/feature/notification-result (Add notification result store/sync)
Last good:
0438c86 - Merge pull request #7102 from Icinga/feature/boost-fs-7101 (Replace self-written filesystem ops with boost.filesystem)
Tasks (base commit a630d01):
git checkout -b bugfix/json-rpc-nr a630d01
git revert -m 1 a630d0185fb5e6270c0238002c03c91ac02b21fe
git cherry-pick -x 5725ba4bf37ef5e3eab718f0cab4b4f673d8d602
This requires the changes/fixes from
This is a pain to backport because of the get_io_service() and GetContext() differences. We still need the exception swallowing fixes to be sure.
Damn. It seems the one patch doesn't apply to the one end of the bisect range and the other patch doesn't apply to the other one.
@lippserd @bobapple @gethash We need a customer ASAP where we can easily reproduce the crash (the right crash!) and where our PS colleagues can install my custom packages for, let's say, one or two days (total, not per bisect step).
I'll run the setup over the weekend with the current Git master (as of right now) and GDB attached to the main process.
@dnsmichi Any tips, e.g. additional packages to install?
@dnsmichi @htriem @N-o-X @lippserd Please paste your public SSH keys here.
Forget it, I've got this one:
for u in dnsmichi htriem N-o-X lippserd; do curl -sSL https://github.com/$u.keys; done
@Al2Klimov I would prefer if you'd run the following which we agreed on yesterday.
git checkout -b bugfix/json-rpc-nr a630d01
Revert the NR PR immediately and in full, to rule out that the revert in git master is broken, or anything else; do this on the first run.
git revert -m 1 a630d0185fb5e6270c0238002c03c91ac02b21fe
If that fails, please coordinate with @lippserd on the next steps to take.
In terms of SSH keys, please get in touch with @bobapple. People may use different keys for GH and servers btw.
Icinga 2 commit | crashed? | Env. spec. |
---|---|---|
c37c40d8c (master) | yes | all w/ GDB |
1c79c2b (~ v2.11rc1) | yes | all ex. network volatility |
794374a | yes | a.e.n.v. |
37de1a9 | yes | a.e.n.v. |
a630d01 (1st bad) | yes | all |
0438c86 | no | all |
43e9ae5 | no | all |
64568f5 | no | a.e.n.v. |
e909302 | no | a.e.n.v. |
Notes:
master with GDB: WorkQueue -> Bind -> Function -> MessageOrigin destructor -> JsonRpcConnection destructor -> _free -> boom
Parallel task: thread sanitizer analysis down the rabbit hole. https://lemire.me/blog/2016/04/20/no-more-leaks-with-sanitize-flags-in-gcc-and-clang/ https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual
stack-use-after-scope wrap_memmove
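To show the error class ASan reports below in isolation, here's a tiny stack-use-after-scope reproducer unrelated to Icinga 2 itself; depending on the Clang version you may additionally need -fsanitize-address-use-after-scope:

// Compile with: clang++ -g -fsanitize=address use_after_scope.cpp
#include <iostream>

int main()
{
    int* p = nullptr;

    {
        int local = 42;
        p = &local; // pointer escapes the block
    }               // 'local' is gone, but its stack slot is still referenced

    std::cout << *p << '\n'; // ASan: stack-use-after-scope
}

Conceptually it's the same bug class as a coroutine stack being reused or freed while something still points into it.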
Damn. It seems the one patch doesn't apply to the one end of the bisect range and the other patch doesn't apply to the other one.
@lippserd @bobapple @gethash We need a customer ASAP where we can easily reproduce the crash (the right crash!) and where our PS colleagues can install my custom packages for, let's say, one or two days (total, not per bisect step).
Hi, if you need a system for testing these patches, we are available. Please send an email to Kiratech. Thanks,
Matteo
How fast does your system crash?
1 min or less
Also w/ the latest snapshot packages (as of right now)?
no snapshot package
Then please test them.
can you contact me in private? matteo.baruchello@kirateck.it
@mbaruchello77 Thanks for the offer, we'll do that!
Stacktrace generated with the -fsanitize=address Clang flag, after killing 10 JSON-RPC connections with valid endpoints:
[2020-02-03 15:38:31 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'localhost'.
[2020-02-03 15:38:31 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'localhost' in zone 'localhost'.
==54100==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 0x700002ef9000; bottom 0x000113e8e000; size: 0x6ffeef06b000 (123140722569216)
False positive error reports may follow
For details see https://github.com/google/sanitizers/issues/189
=================================================================
==54100==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x000113e8fc08 at pc 0x00010c2d4fc9 bp 0x000113e8f6e0 sp 0x000113e8ee78
READ of size 168 at 0x000113e8fc08 thread T62
[2020-02-03 15:41:07 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
[2020-02-03 15:41:07 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 9 API clients left.
[2020-02-03 15:41:08 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
[2020-02-03 15:41:08 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 8 API clients left.
#0 0x10c2d4fc8 in wrap_memmove (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x1cfc8)
0x000113e8fc08 is located 254984 bytes inside of 262144-byte region [0x000113e51800,0x000113e91800)
allocated by thread T77 here:
#0 0x10c3187d3 in wrap_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x607d3)
Thread T62 created by T0 here:
[2020-02-03 15:41:08 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
#0 0x10c3107cd in wrap_pthread_create (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x587cd)
[2020-02-03 15:41:08 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 7 API clients left.
[2020-02-03 15:41:08 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
[2020-02-03 15:41:08 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 6 API clients left.
[2020-02-03 15:41:08 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
[2020-02-03 15:41:08 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 5 API clients left.
[2020-02-03 15:41:08 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
[2020-02-03 15:41:08 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 4 API clients left.
[2020-02-03 15:41:08 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
[2020-02-03 15:41:08 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 3 API clients left.
#1 0x10095e783 in std::__1::__libcpp_thread_create(_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:328
#2 0x100e98d5e in std::__1::thread::thread<void (icinga::IoEngine::*)(), icinga::IoEngine*, void>(void (icinga::IoEngine::*&&)(), icinga::IoEngine*&&) thread:368
#3 0x100876434 in std::__1::thread::thread<void (icinga::IoEngine::*)(), icinga::IoEngine*, void>(void (icinga::IoEngine::*&&)(), icinga::IoEngine*&&) thread:360
#4 0x10087514c in icinga::IoEngine::IoEngine() io-engine.cpp:93
#5 0x100876634 in icinga::IoEngine::IoEngine() io-engine.cpp:88
#6 0x100e88f89 in icinga::IoEngine::$_64::operator()() const io-engine.cpp:75
#7 0x100e88f2e in decltype(std::__1::forward<icinga::IoEngine::$_64&>(fp)()) std::__1::__invoke<icinga::IoEngine::$_64&>(icinga::IoEngine::$_64&) type_traits:4361
#8 0x100e88ece in std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > std::__1::__invoke_void_return_wrapper<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > >::__call<icinga::IoEngine::$_64&>(icinga::IoEngine::$_64&) __functional_base:318
#9 0x100e88e8e in std::__1::__function::__alloc_func<icinga::IoEngine::$_64, std::__1::allocator<icinga::IoEngine::$_64>, std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() functional:1527
#10 0x100e8561f in std::__1::__function::__func<icinga::IoEngine::$_64, std::__1::allocator<icinga::IoEngine::$_64>, std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() functional:1651
#11 0x100e8a5fc in std::__1::__function::__value_func<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() const functional:1799
#12 0x100e8a3ab in std::__1::function<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() const functional:2347
#13 0x100874b14 in icinga::LazyInit<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > >::Get() lazy-init.hpp:56
#14 0x100873ecf in icinga::IoEngine::Get() io-engine.cpp:79
#15 0x101325e81 in icinga::ApiListener::AddListener(icinga::String const&, icinga::String const&) apilistener.cpp:364
#16 0x101323609 in icinga::ApiListener::Start(bool) apilistener.cpp:255
#17 0x1008066e3 in icinga::ConfigObject::Activate(bool, icinga::Value const&) configobject.cpp:367
#18 0x101201330 in icinga::ConfigItem::ActivateItems(icinga::WorkQueue&, std::__1::vector<boost::intrusive_ptr<icinga::ConfigItem>, std::__1::allocator<boost::intrusive_ptr<icinga::ConfigItem> > > const&, bool, bool, bool, icinga::Value const&) configitem.cpp:696
#19 0x101bbee83 in RunWorker(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, bool, icinga::String const&) daemoncommand.cpp:281
#20 0x101b5c30d in StartUnixWorker(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, bool, icinga::String const&) daemoncommand.cpp:507
#21 0x101b559fb in icinga::DaemonCommand::Run(boost::program_options::variables_map const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) const daemoncommand.cpp:697
#22 0x10074b1eb in Main() icinga.cpp:662
#23 0x10073fea0 in main icinga.cpp:959
#24 0x7fff7142b404 in start (libdyld.dylib:x86_64+0x11404)
Thread T77 created by T0 here:
#0 0x10c3107cd in wrap_pthread_create (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x587cd)
#1 0x10095e783 in std::__1::__libcpp_thread_create(_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:328
#2 0x100e98d5e in std::__1::thread::thread<void (icinga::IoEngine::*)(), icinga::IoEngine*, void>(void (icinga::IoEngine::*&&)(), icinga::IoEngine*&&) thread:368
#3 0x100876434 in std::__1::thread::thread<void (icinga::IoEngine::*)(), icinga::IoEngine*, void>(void (icinga::IoEngine::*&&)(), icinga::IoEngine*&&) thread:360
#4 0x10087514c in icinga::IoEngine::IoEngine() io-engine.cpp:93
#5 0x100876634 in icinga::IoEngine::IoEngine() io-engine.cpp:88
#6 0x100e88f89 in icinga::IoEngine::$_64::operator()() const io-engine.cpp:75
#7 0x100e88f2e in decltype(std::__1::forward<icinga::IoEngine::$_64&>(fp)()) std::__1::__invoke<icinga::IoEngine::$_64&>(icinga::IoEngine::$_64&) type_traits:4361
#8 0x100e88ece in std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > std::__1::__invoke_void_return_wrapper<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > >::__call<icinga::IoEngine::$_64&>(icinga::IoEngine::$_64&) __functional_base:318
#9 0x100e88e8e in std::__1::__function::__alloc_func<icinga::IoEngine::$_64, std::__1::allocator<icinga::IoEngine::$_64>, std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() functional:1527
#10 0x100e8561f in std::__1::__function::__func<icinga::IoEngine::$_64, std::__1::allocator<icinga::IoEngine::$_64>, std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() functional:1651
#11 0x100e8a5fc in std::__1::__function::__value_func<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() const functional:1799
#12 0x100e8a3ab in std::__1::function<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() const functional:2347
#13 0x100874b14 in icinga::LazyInit<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > >::Get() lazy-init.hpp:56
#14 0x100873ecf in icinga::IoEngine::Get() io-engine.cpp:79
#15 0x101325e81 in icinga::ApiListener::AddListener(icinga::String const&, icinga::String const&) apilistener.cpp:364
#16 0x101323609 in icinga::ApiListener::Start(bool) apilistener.cpp:255
#17 0x1008066e3 in icinga::ConfigObject::Activate(bool, icinga::Value const&) configobject.cpp:367
#18 0x101201330 in icinga::ConfigItem::ActivateItems(icinga::WorkQueue&, std::__1::vector<boost::intrusive_ptr<icinga::ConfigItem>, std::__1::allocator<boost::intrusive_ptr<icinga::ConfigItem> > > const&, bool, bool, bool, icinga::Value const&) configitem.cpp:696
#19 0x101bbee83 in RunWorker(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, bool, icinga::String const&) daemoncommand.cpp:281
#20 0x101b5c30d in StartUnixWorker(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, bool, icinga::String const&) daemoncommand.cpp:507
#21 0x101b559fb in icinga::DaemonCommand::Run(boost::program_options::variables_map const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) const daemoncommand.cpp:697
#22 0x10074b1eb in Main() icinga.cpp:662
#23 0x10073fea0 in main icinga.cpp:959
#24 0x7fff7142b404 in start (libdyld.dylib:x86_64+0x11404)
SUMMARY: AddressSanitizer: stack-buffer-underflow (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x1cfc8) in wrap_memmove
Shadow bytes around the buggy address:
0x1000227d1f30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1000227d1f40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1000227d1f50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1000227d1f60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1000227d1f70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x1000227d1f80: f1[f1]f1 f1 00 00 f3 f3 00 00 00 00 00 00 00 00
0x1000227d1f90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1000227d1fa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1000227d1fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1000227d1fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1000227d1fd0: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==54100==ABORTING
Caught SIGABRT.
Current time: 2020-02-03 15:41:10 +0100
[2020-02-03 15:41:10 +0100] critical/Application: Icinga 2 has terminated unexpectedly. Additional information can be found in '/Users/noah/i2/var/log/icinga2/crash/report.1580740870.629094'
Same case, but debugged with LLDB:
(lldb) bt
* thread #24, stop reason = Use of out-of-scope stack memory
* frame #0: 0x000000010e148880 libclang_rt.asan_osx_dynamic.dylib`__asan::AsanDie()
frame #1: 0x000000010e15f81f libclang_rt.asan_osx_dynamic.dylib`__sanitizer::Die() + 175
frame #2: 0x000000010e14676b libclang_rt.asan_osx_dynamic.dylib`__asan::ScopedInErrorReport::~ScopedInErrorReport() + 411
frame #3: 0x000000010e145f9f libclang_rt.asan_osx_dynamic.dylib`__asan::ReportGenericError(unsigned long, unsigned long, unsigned long, unsigned long, bool, unsigned long, unsigned int, bool) + 447
frame #4: 0x000000010e0fbfe9 libclang_rt.asan_osx_dynamic.dylib`wrap_memmove + 713
frame #5: 0x00007fff71665bf8 libunwind.dylib`unw_init_local + 33
frame #6: 0x00007fff716665cb libunwind.dylib`unwind_phase2 + 41
frame #7: 0x00007fff71669ed1 libunwind.dylib`_Unwind_Resume + 51
frame #8: 0x000000010322886c icinga2`icinga::JsonRpc::ReadMessage(stream=0x0000616004162008, yc=boost::asio::yield_context @ 0x0000000115bb1840, maxMessageLength=-1)(), boost::asio::executor> >, long) at jsonrpc.cpp:0:1
frame #9: 0x000000010322a252 icinga2`icinga::JsonRpcConnection::HandleIncomingMessages(this=0x0000616004161f80, yc=boost::asio::yield_context @ 0x0000000115bb1fa0)(), boost::asio::executor> >) at jsonrpcconnection.cpp:67:14
frame #10: 0x00000001038c22b3 icinga2`icinga::JsonRpcConnection::Start(this=0x00006060003e8e50, yc=boost::asio::yield_context @ 0x0000000115bb20e0)::$_67::operator()(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) const at jsonrpcconnection.cpp:55:83
frame #11: 0x00000001038c1fa0 icinga2`void icinga::IoEngine::SpawnCoroutine<boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67>(this=0x00006060003e8e50, yc=boost::asio::yield_context @ 0x0000000115bb22c0)::$_67)::'lambda'(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >)::operator()(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) const at io-engine.hpp:125:6
frame #12: 0x00000001038c1c3c icinga2`boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(), boost::asio::io_context::strand>, void icinga::IoEngine::SpawnCoroutine<boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67>(boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67)::'lambda'(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >)>::operator(this=0x00000001152485e0, ca=0x0000000115bb2490)(boost::coroutines::pull_coroutine<void>&) at spawn.hpp:337:7
frame #13: 0x00000001038c1700 icinga2`boost::coroutines::detail::push_coroutine_object<boost::coroutines::pull_coroutine<void>, void, boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(), boost::asio::io_context::strand>, void icinga::IoEngine::SpawnCoroutine<boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67>(boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67)::'lambda'(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >)>&, boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits> >::run(this=0x0000000115bb2760) at push_coroutine_object.hpp:302:11
frame #14: 0x00000001038c1359 icinga2`void boost::coroutines::detail::trampoline_push_void<boost::coroutines::detail::push_coroutine_object<boost::coroutines::pull_coroutine<void>, void, boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(), boost::asio::io_context::strand>, void icinga::IoEngine::SpawnCoroutine<boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67>(boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67)::'lambda'(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >)>&, boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits> > >(t=(fctx = 0x0000000115248400, data = 0x0000000115248440)) at trampoline_push.hpp:70:11
frame #15: 0x000000010dc5fa2f libboost_context-mt.dylib`trampoline + 3
(lldb)
Next rounds:
In addition to that, we're going further with the ASAN analysis. Yesterday we stopped at JsonDecode, where the MessageHandler() from the JSON-RPC connection read sometimes crashed when sending back the response.
Task List
Mitigations
https://github.com/Icinga/icinga2/issues/7532#issuecomment-565510038
Analysis
Related issues
#7569
#7687
#7624
#7470
ref/NC/636691 ref/NC/644339 ref/NC/644553 ref/NC/647127 ref/NC/652035 ref/NC/652071 ref/NC/652087
Original Report
The setup is a dual-master system which was upgraded to 2.11 around noon yesterday.
In the late evening, crashes started to appear and are now consistent. The system ran on 2.11-rc1 before.
The user started upgrading agents to 2.11; this may be related.
ref/NC/636691
Latest crash
Alternative crashes