Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

JSON-RPC Crashes with 2.11 #7532

Closed lazyfrosch closed 4 years ago

lazyfrosch commented 5 years ago

Task List

Mitigations

https://github.com/Icinga/icinga2/issues/7532#issuecomment-565510038

Analysis

Related issues

#7569
#7687
#7624
#7470

ref/NC/636691 ref/NC/644339 ref/NC/644553 ref/NC/647127 ref/NC/652035 ref/NC/652071 ref/NC/652087

Original Report

The setup is a dual master system which was upgraded to 2.11 around noon yesterday.

In the late evening crashes started to appear and are now consistent. The system ran on 2.11-rc1 before.

The user started upgrading agents to 2.11; this may be related.

ref/NC/636691

Latest crash

  Application version: r2.11.0-1

System information:
  Platform: Ubuntu
  Platform version: 18.04.3 LTS (Bionic Beaver)
  Kernel: Linux
  Kernel version: 4.15.0-1050-aws
  Architecture: x86_64

Build information:
  Compiler: GNU 8.3.0
  Build host: runner-LTrJQZ9N-project-298-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
Stacktrace:

        (0) libc.so.6: gsignal (+0xc7) [0x7f9a99cb4e97]
        (1) libc.so.6: abort (+0x141) [0x7f9a99cb6801]
        (2) libc.so.6: <unknown function> (+0x3039a) [0x7f9a99ca639a]
        (3) libc.so.6: <unknown function> (+0x30412) [0x7f9a99ca6412]
        (4) icinga2: <unknown function> (+0x3656c3) [0x55c11ec5d6c3]
        (5) icinga2: icinga::NotificationComponent::NotificationTimerHandler() (+0x1116) [0x55c11ec77a06]
        (6) icinga2: <unknown function> (+0x6da759) [0x55c11efd2759]
        (7) icinga2: icinga::Timer::Call() (+0x2d) [0x55c11eff151d]
        (8) icinga2: <unknown function> (+0x6f0acd) [0x55c11efe8acd]
        (9) icinga2: boost::asio::detail::executor_op<boost::asio::detail::work_dispatcher<icinga::ThreadPool::Post<std::function<void ()> >(std::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, std::allocator<void>*, boost::system::error_code const&, unsigned long) (+0xc5) [0x55c11ef9e5d5]
        (10) icinga2: <unknown function> (+0x75807b) [0x55c11f05007b]
        (11) icinga2: <unknown function> (+0x6a2555) [0x55c11ef9a555]
        (12) icinga2: boost_asio_detail_posix_thread_function (+0xf) [0x55c11f047f3f]
        (13) libpthread.so.0: <unknown function> (+0x76db) [0x7f9a98a756db]
        (14) libc.so.6: clone (+0x3f) [0x7f9a99d9788f]

***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
The program is not being run.

Alternative crashes

        (0) libc.so.6: gsignal (+0xc7) [0x7f32355c0e97]
        (1) libc.so.6: abort (+0x141) [0x7f32355c2801]
        (2) libc.so.6: <unknown function> (+0x89897) [0x7f323560b897]
        (3) libc.so.6: <unknown function> (+0x9090a) [0x7f323561290a]
        (4) libc.so.6: cfree (+0x4dc) [0x7f3235619e2c]
        (5) icinga2: <unknown function> (+0x6d90d1) [0x56119dd120d1]
        (6) icinga2: icinga::JsonRpcConnection::SendMessageInternal(boost::intrusive_ptr<icinga::Dictionary> const&) (+0x4f) [0x56119dc4870f]
        (7) icinga2: <unknown function> (+0x5f6eac) [0x56119dc2feac]
        (8) icinga2: boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (+0x75) [0x56119dc8ce85]
        (9) icinga2: <unknown function> (+0x75807b) [0x56119dd9107b]
        (10) icinga2: icinga::IoEngine::RunEventLoop() (+0x5e) [0x56119dd85b1e]
        (11) libstdc++.so.6: <unknown function> (+0xbd66f) [0x7f323349b66f]
        (12) libpthread.so.0: <unknown function> (+0x76db) [0x7f32343816db]
        (13) libc.so.6: clone (+0x3f) [0x7f32356a388f]
Al2Klimov commented 4 years ago

Maybe some of the external folks think that we haven't made any progress here, but that's not true. My reproduction strategy is getting better and better (slowly, but surely) and my cause-search strategy – Git bisect – will definitely find the problem's cause.

lippserd commented 4 years ago

Hi @bunghi @MarkNReynolds @Napsty @mattrose @jivuvtramvae @verboEse @BastiBr @Sec42 and all others facing this bug,

We pushed several changes to our snapshot packages which affect and may mitigate the JSON-RPC crash (and another segfault related to notification result syncing). It would be of great help to us if you could test those packages in your environments and provide feedback. Please enable core dumps beforehand so that we get as much information as possible if the crash still occurs. If the crash happens "reliably" in your setup, please share as many details as possible about your setup.

Thanks and cheers, Eric

Al2Klimov commented 4 years ago

... and here's how you enable core dumps: https://github.com/Al2Klimov/icinga-iliketrains/blob/001c57022b825bf8f5e4cf6845d191dca7d3a847/ansible-playbook.yml#L152
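For illustration only (an assumption of mine, independent of the linked playbook): enabling core dumps boils down to raising the RLIMIT_CORE limit for the process, which ulimit -c unlimited or systemd's LimitCORE= setting do from the outside. The same thing at the process level looks like this:

    // Sketch: allow the current process to write core dumps of unlimited size.
    // Whether a core file actually appears also depends on kernel.core_pattern
    // and filesystem permissions.
    #include <sys/resource.h>
    #include <cstdio>

    int main()
    {
        rlimit core{RLIM_INFINITY, RLIM_INFINITY};

        if (setrlimit(RLIMIT_CORE, &core) != 0) {
            perror("setrlimit(RLIMIT_CORE)");
            return 1;
        }

        return 0;
    }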

Al2Klimov commented 4 years ago

Good news: The updated env crashed as expected.

Bad news: That took about 2.5 days. Better late than never.

Al2Klimov commented 4 years ago

Core dump & co.: https://nextcloud.icinga.com/index.php/s/nJ6yeAY2cFeQYyL

Al2Klimov commented 4 years ago

The crashed env was 1c79c2b34. Now I deployed e909302fd. I'll give you an update during the next 1-2.5 days.

dnsmichi commented 4 years ago

Some notes.

It might have an influence if both endpoints attempt to connect to each other, with one side's connection being cleared right after. Or many broken pipes, e.g. the other side closing the socket not so cleanly.

Other research things, not sure if they match.

https://github.com/ceph/spawn/commits/master is a header-only fork of asio::spawn by a Red Hat developer, making use of callcc instead of fcontext from the coroutine library. It removes the coroutine layer entirely.

This might again have something to do with stack unwinding and exceptions. It may be worth collecting the Boost versions and OSes into a table to look for patterns.

Al2Klimov commented 4 years ago

e909302fd didn't crash, so we have a commit range: (e909302fd..1c79c2b34]

I'll continue with 794374a32 and give you an update on Monday.

dnsmichi commented 4 years ago

Thanks πŸ‘

I have another theory: a combination of stack sizes, recursion and large JSON blobs processed inside the coroutines. If the connection between agent and master is dropped uncleanly, say a broken pipe or connection reset, there may be a leak with a running coroutine which never terminates. Or something else corrupts the stack, or we hit the limit on the number of open files.

This may be influenced by setups where connections are established in both directions, contradicting what's written in the docs. This is where a timer kills off the secondary connection for the endpoint, on both sides (race condition time).

Similar to the number of spawned threads, we should likely measure how many coroutines we have spawned and log that somewhere with a timer, similar to the WorkQueues, plus an API stats metric.
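A minimal sketch of that metric idea (hypothetical names, not existing Icinga 2 code), assuming each spawned coroutine body creates one guard object:

    // Sketch only: count live coroutines so a periodic timer or the /v1/status
    // stats could expose the number. l_LiveCoroutines and CoroutineCounterGuard
    // are hypothetical names, not part of Icinga 2.
    #include <atomic>

    static std::atomic<long> l_LiveCoroutines (0);

    struct CoroutineCounterGuard
    {
        CoroutineCounterGuard() { ++l_LiveCoroutines; }
        ~CoroutineCounterGuard() { --l_LiveCoroutines; }
    };

    // Usage: construct one CoroutineCounterGuard at the top of every spawned
    // coroutine body; a timer handler then logs l_LiveCoroutines.load().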

Al2Klimov commented 4 years ago

794374a32 crashed. New commit range: (e909302fd..794374a32] - https://github.com/icinga/icinga2/compare/e909302..794374a

I'll continue with 37de1a919 and give you an update on Monday.

dnsmichi commented 4 years ago

Follow-up read on the Ceph discussion: https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/6LBFZIFUPTJQ3SNTLVKSQMVITJWVWTZ6/?sort=date

The pooled_fixedsize_stack appears to have the same 2 limitations as the current fixedsize_stack: each stack has a fixed size of 128K by default, and there's no memory protection to detect stack overflows. The only difference seems to be that freed stacks are kept in a boost::pool to be reused for later stack allocations.

The motivation for enforcing a limit on the per-connection stack size is to help scale to lots of concurrent connections. However, if the requests are mostly object gets/puts, we'll be buffering up to 16M of object data per request and that's going to dwarf these stacks.

I think the protected_fixedsize_stack gives us the most flexibility here. In addition to the overflow detection from mmap+mprotect, it lets us map a comfortable amount of memory without having to allocate pages for them up front, and essentially behaving like the pthread stacks we're used to.

I've tried implementing their fork, but it uses several C++14/17 features. Let's see if we can just boil it down to whatever is needed. https://github.com/Icinga/icinga2/pull/7762 - https://github.com/boostorg/asio/pull/55/files
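For illustration, a minimal sketch of the protected_fixedsize_stack mentioned above, shown here with Boost.Coroutine2 rather than the asio::spawn machinery Icinga 2 actually uses (so this only demonstrates the allocator, not our integration):

    // Sketch: a coroutine running on an mmap'ed stack with an mprotect guard page,
    // so an overflow faults immediately instead of silently corrupting memory.
    #include <boost/coroutine2/coroutine.hpp>
    #include <boost/coroutine2/protected_fixedsize_stack.hpp>
    #include <iostream>

    int main()
    {
        using coro_t = boost::coroutines2::coroutine<int>;

        // 1 MiB per-coroutine stack; physical pages are only committed on use.
        boost::coroutines2::protected_fixedsize_stack stack(1024 * 1024);

        coro_t::pull_type numbers(stack, [](coro_t::push_type& yield) {
            for (int i = 0; i < 3; ++i)
                yield(i);
        });

        for (int n : numbers)
            std::cout << n << '\n';
    }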

Other collected thoughts.

https://github.com/boostorg/beast/issues/1055

I built a REST based server built upon the http_client_coro.cpp example which uses stackful coroutines. During one of our request handlers the app needs to reread an xml config file using boost::property_tree. In the read_xml_internal method we get a Access Violation exception. The top of the stack shows the function _chkstk() being called. Using VS2017 core guideline checker we get C6262 warning in the boost property tree that too much stack space is being used. We set the Linker->System->Stack Reserve Size to 4194304 but the crash still happens.

https://groups.google.com/forum/#!topic/boost-list/Lgo931jYc0E

In C and C++, each coroutine (and thread for that matter) must have a single, contiguous stack which, in addition, cannot grow. Because of this, you should 1) avoid large local variables (e.g., arrays), 2) avoid recursion of any kind, 3) avoid calling any other functions that use a lot of stack (see 1) and 2)), 4) have enough stack space for signal handling [actually, the best thing to do is to install an alternate signal stack by using sigaltstack()].

If you're running in 64-bit mode, try setting the stack size of each coroutine to something like 1MB; the system's demand-paging logic will actually allocate as much physical memory as the program needs. But the fundamental limitations listed above remain.
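A minimal sketch of that 1 MB suggestion (illustration only, not Icinga 2's actual spawn helper), assuming the classic Boost.Coroutine-based spawn overload that accepts boost::coroutines::attributes with a stack size:

    // Sketch: spawn a stackful coroutine with a 1 MiB stack instead of the default.
    // With demand paging, physical memory is only committed as the stack grows.
    #include <boost/asio/io_context.hpp>
    #include <boost/asio/spawn.hpp>
    #include <boost/coroutine/attributes.hpp>
    #include <iostream>

    int main()
    {
        boost::asio::io_context io;

        boost::asio::spawn(io, [](boost::asio::yield_context) {
            std::cout << "running on a 1 MiB coroutine stack\n";
        }, boost::coroutines::attributes(1024 * 1024));

        io.run();
    }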

My last guess for now is that with the stack protection changes implemented in the kernel because of the Intel microcode fuckup, the behaviour of these stackful coroutines is different from what it used to be. On Windows, there was definitely some stack protection going on, corrupting the stack with guard pages.

I'd say this wouldn't crash if run on a kernel without these stack protection patches.

Al2Klimov commented 4 years ago

37de1a919 crashed. New commit range: (e909302..37de1a919]

I'll continue with 64568f596 and give you an update on Monday.

Al2Klimov commented 4 years ago

@lippserd Congrats, 1c79c2b34 crashes w/ (1) in about 3h. I'll re-test it and give you an update ASAP.


- hosts: 'icingas:!aklimov-iliketrains2-master1:!aklimov-iliketrains2-master2'
  become: yes
  become_method: sudo
  tasks:
  - copy:
      dest: /vola.pl
      owner: root
      group: root
      mode: '0755'
      content: |
        #!/usr/bin/perl
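        # Randomly block and unblock the Icinga cluster port 5665 via iptables to simulate an unstable network.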
        while (1) {
          sleep((rand() * 9 + 1) * 60);
          print "OFF\n";
          `iptables -I INPUT -p tcp -m tcp --dport 5665 -j DROP`;
          `iptables -I OUTPUT -p tcp -m tcp --dport 5665 -j DROP`;
          sleep((rand() * 4 + 1) * 60);
          print "ON\n";
          `iptables -D INPUT -p tcp -m tcp --dport 5665 -j DROP`;
          `iptables -D OUTPUT -p tcp -m tcp --dport 5665 -j DROP`
        }
  - copy:
      dest: /etc/systemd/system/vola.service
      owner: root
      group: root
      mode: '0644'
      content: |
        [Unit]
        [Service]
        ExecStart=/vola.pl
        [Install]
        WantedBy=multi-user.target
  - systemd:
      daemon_reload: yes
  - service:
      name: vola
      state: started
      enabled: yes
Al2Klimov commented 4 years ago

The 2nd time it needed 24h. I'll re-test it again and give you an update ASAP.

dnsmichi commented 4 years ago

Meanwhile, following the diff leads to the following assumptions:

Al2Klimov commented 4 years ago

64568f596 didn't crash. New commit range: (64568f596..37de1a919]

I'll continue with 56894bea1 and give you an update on Thursday.

Al2Klimov commented 4 years ago

56894bea1 was killed – not what I expected.

I'll continue with 02db12ae0 and give you an update on Friday.

Al2Klimov commented 4 years ago

The remaining commit range and the changes are a bit long, but some commits seem not to build. I've patched them:

for c in $(git log --format=%h --first-parent 64568f5..37de1a9); do git checkout $c; git cherry-pick 56894bea17df0f7ede19dd1df2b093033ea4190c; git push origin HEAD:refs/heads/aklimov/iliketrains4/$c; done

... and will hopefully continue with 43e9ae5f8.

Al2Klimov commented 4 years ago

I've added the network volatility to the test env spec as it crashed again in less than two days.

Al2Klimov commented 4 years ago

The master (18eb06e33) crashed – on both master nodes this time.

dnsmichi commented 4 years ago

Just a comment from my side here, since questions are coming up: I now see the bisect strategy from @Al2Klimov as the only valid strategy to find the cause of the error.

Currently, I don't have further ideas for mitigating the error, except for raising the stack limit for coroutines. That would highlight a stack overflow, but it's still a problem which isn't necessarily solved with additional memory.

This involves waiting and no immediate feedback, and is totally different from a scenario where you debug something and it crashes after 1 hour in your dev environment.

After finding the commit which creates the problem, we'll surely need to invest development time to actually create a patch. That patch again requires tests, and then we'll consider releasing the 2.12 RC.

Currently this problem alone blocks the entire release chain for the IcingaDB release, including Web 2.8, the icingadb module, IcingaDB itself and Icinga 2.12.

After that's done, backporting the fix to 2.11, testing it again and then releasing a bugfix release is on the TODO list.

Cheers, Michael

Al2Klimov commented 4 years ago

I now see the bisect strategy from @Al2Klimov as the only valid strategy to find out about the error.

Finally somebody says it! 😎

Al2Klimov commented 4 years ago

43e9ae5 didn't crash. New commit range: (43e9ae5..37de1a9]

I'll continue with 268c8b395 and give you an update on Monday.

dnsmichi commented 4 years ago

Thanks. Boils down to https://github.com/icinga/icinga2/compare/43e9ae5..37de1a9

Since the Timer is something we implement ourselves, I'll investigate here for a possible mitigation/change.

Al2Klimov commented 4 years ago

268c8b395 got killed, but not segv-ed. Assuming that it's not affected. New commit range: (268c8b395..37de1a919]

I'll continue with 0438c866f and give you an update on Wednesday.

dnsmichi commented 4 years ago

Thanks a lot for the update. This removes the Timer fixes. https://github.com/icinga/icinga2/compare/268c8b3..37de1a9

Now I need to wait for future updates, I currently don't know what would cause the crash from the diff.

Al2Klimov commented 4 years ago

... and I don't yet know whether 268c8b3 actually isn't affected.

dnsmichi commented 4 years ago

I've been analyzing #7771 with the boost::filesystem::remove_all call. There's a certain chance that these recursive attempts (and the failure when a file cannot be deleted) cause a stack overflow, as mentioned here: http://boost.2283326.n4.nabble.com/filesystem-remove-all-error-handling-td2603383.html
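For illustration (a sketch, not Icinga 2 code): boost::filesystem::remove_all comes in a throwing and a non-throwing flavour; the error in the log below is the throwing variant surfacing inside the JSON-RPC message handler.

    // Sketch: the two remove_all overloads. The first throws
    // boost::filesystem::filesystem_error on failure; the second reports the
    // failure via an error_code without unwinding the (coroutine) stack.
    #include <boost/filesystem.hpp>
    #include <iostream>

    int main()
    {
        namespace fs = boost::filesystem;
        const fs::path stage("zones-stage-example"); // hypothetical directory

        try {
            fs::remove_all(stage); // throwing variant
        } catch (const fs::filesystem_error& e) {
            std::cerr << "remove_all threw: " << e.what() << '\n';
        }

        boost::system::error_code ec;
        fs::remove_all(stage, ec); // non-throwing variant

        if (ec)
            std::cerr << "remove_all failed: " << ec.message() << '\n';
    }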

Specifically, ApiListener::ConfigUpdateHandler() is called every time we receive a cluster config sync update. I've seen this exception on both Windows in #7761 (1.69.0) and Debian 10 in #7771 (1.67.0.1) now.

[2020-01-20 15:18:38 +0100] warning/JsonRpcConnection: Error while processing message for identity 'broker01.example.com'
Error: boost::filesystem::remove: Directory not empty: "/var/lib/icinga2/api/zones-stage/"

        (0) libboost_filesystem.so.1.67.0: <unknown function> (+0xe2ef) [0x7fc4bd0062ef]
        (1) libboost_filesystem.so.1.67.0: boost::filesystem::detail::remove_all(boost::filesystem::path const&, boost::system::error_code*) (+0xb6) [0x7fc4bd006946]
        (2) icinga2: icinga::Utility::RemoveDirRecursive(icinga::String const&) (+0x8f) [0x63967f]
        (3) icinga2: icinga::ApiListener::ConfigUpdateHandler(boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&) (+0x55b) [0x753aab]
        (4) icinga2: std::_Function_handler<icinga::Value (boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&), icinga::Value (*)(boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&)>::_M_invoke(std::_Any_data const&, boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&) (+0xf) [0x7a828f]
        (5) icinga2: icinga::JsonRpcConnection::MessageHandler(icinga::String const&) (+0x5b6) [0x7d8d66]
        (6) icinga2: icinga::JsonRpcConnection::HandleIncomingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) (+0x1b6) [0x7da926]
        (7) /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2() [0x858f17]
        (8) libboost_context.so.1.67.0: make_fcontext (+0x2f) [0x7fc4bd63673f]

I'll continue with reading and testing tomorrow.

dnsmichi commented 4 years ago

@lippserd @gethash @Al2Klimov I had another idea while having a shower in the morning - what if the 64KB stack size for coroutines influences this behavior when boost::filesystem::remove_all is run with a huge set of directories and files, coming from zones.d? The recursion may have an influence and cause a stack overflow.

RenameFile might also be involved, since it is also used with SaveJsonFile in PKI requests. So this may be a more generic problem with boost::filesystem over time.

This is a similar pattern to what I had already analyzed with the JSON decoding.

dnsmichi commented 4 years ago

Also interesting to read with recursion and stackful operations in mind:

Al2Klimov commented 4 years ago

0438c86 didn't crash. New commit range: (0438c86..37de1a9]

I'll continue with a630d0185 and give you an update on Friday.

dnsmichi commented 4 years ago

Ok, thanks, then the boost::filesystem::remove_all thingy is a different one, likely sourced from user cleanup scripts or corrupt filesystems.

https://github.com/icinga/icinga2/compare/0438c86..37de1a9

For the API Event Queue, we're registering signal handlers in lib/icinga/apievents.cpp. Maybe the condition change in

        if (queues.empty() && !inboxes)
                return;

now continues and fills the inboxes for some reason. I need to test-drive this locally just to be sure about it.

Al2Klimov commented 4 years ago

a630d01 crashed. First bad commit: a630d01

Not really helpful as we've already reverted it and that revert still crashed. 😞

@dnsmichi Please list the (merge) commits I shall apply to each first-parent commit of v2.10.0..v2.11.0 for a new bisect in addition to cb21086d6.

dnsmichi commented 4 years ago

We had a longer offline meeting, so here's another approach:

First bad

a630d01 - Merge pull request #6722 from Icinga/feature/notification-result Add notification result store/sync

Last good

0438c86 - Merge pull request #7102 from Icinga/feature/boost-fs-7101 Replace self-written filesystem ops with boost.filesystem

Tasks:

Base commit a630d01

git checkout -b bugfix/json-rpc-nr a630d01

git revert -m 1 a630d0185fb5e6270c0238002c03c91ac02b21fe

git cherry-pick -x 5725ba4bf37ef5e3eab718f0cab4b4f673d8d602

This requires the changes/fixes from

This is a pain to backport with the get_io_service() and GetContext() differences. We still need the exception swallowing fixes to be sure.

Al2Klimov commented 4 years ago

Damn. It seems one patch doesn't apply to one end of the bisect range, and the other patch doesn't apply to the other end.

@lippserd @bobapple @gethash We urgently need a customer environment where we can easily reproduce the crash (the right crash!) and where our PS colleagues can install my custom packages for (let's say) one or two days (in total, not per bisect step).

Al2Klimov commented 4 years ago

I'll run the setup over the weekend with the current Git master (as of right now) and GDB attached to the main process.

@dnsmichi Any tips e.g. additional packages to install?

@dnsmichi @htriem @N-o-X @lippserd Please paste your public SSH keys here.

Al2Klimov commented 4 years ago

Forget it, I've got this one:

for u in dnsmichi htriem N-o-X lippserd; do curl -sSL https://github.com/$u.keys; done
dnsmichi commented 4 years ago

@Al2Klimov I would prefer it if you'd run the following, which we agreed on yesterday.

git checkout -b bugfix/json-rpc-nr a630d01

Revert the NR PR immediately and in full, to rule out that the revert in git master is broken, or anything else. Do this on the first run:

git revert -m 1 a630d0185fb5e6270c0238002c03c91ac02b21fe

If that fails, please coordinate with @lippserd on the next steps to take.

In terms of SSH keys, please get in touch with @bobapple. People may use different keys for GH and servers btw.

Al2Klimov commented 4 years ago

Bisect summary

Icinga 2 commit         Crashed?  Env. spec.
c37c40d8c (master)      yes       all, w/ GDB
1c79c2b (~ v2.11-rc1)   yes       all except network volatility
794374a                 yes       all except network volatility
37de1a9                 yes       all except network volatility
a630d01 (1st bad)       yes       all
0438c86                 no        all
43e9ae5                 no        all
64568f5                 no        all except network volatility
e909302                 no        all except network volatility
dnsmichi commented 4 years ago

Notes:

master with gdb: WorkQueue -> Bind -> Function -> MessageOrigin destructor -> JsonRpcConnection destructor -> _free -> boom

Parallel task: thread sanitizer analysis down the rabbit hole. https://lemire.me/blog/2016/04/20/no-more-leaks-with-sanitize-flags-in-gcc-and-clang/ https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual

stack-use-after-scope wrap_memmove
mbaruchello77 commented 4 years ago

Damn. It seems one patch doesn't apply to one end of the bisect range, and the other patch doesn't apply to the other end.

@lippserd @bobapple @gethash We urgently need a customer environment where we can easily reproduce the crash (the right crash!) and where our PS colleagues can install my custom packages for (let's say) one or two days (in total, not per bisect step).

Hi, if you need a system for testing these patches, we are available. Please send an email to Kiratech. Thanks

Matteo

Al2Klimov commented 4 years ago

How fast does your system crash?

mbaruchello77 commented 4 years ago

1 min or less

Al2Klimov commented 4 years ago

Also w/ the latest snapshot packages (as of right now)?

mbaruchello77 commented 4 years ago

no snapshot package

Al2Klimov commented 4 years ago

Then please test them.

mbaruchello77 commented 4 years ago

can you contact me in private? matteo.baruchello@kirateck.it

lippserd commented 4 years ago

@mbaruchello77 Thanks for the offer, we'll do that!

N-o-X commented 4 years ago

Stack trace generated with the -fsanitize=address Clang flag after killing 10 JSON-RPC connections with valid endpoints:

[2020-02-03 15:38:31 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'localhost'.
[2020-02-03 15:38:31 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'localhost' in zone 'localhost'.
==54100==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 0x700002ef9000; bottom 0x000113e8e000; size: 0x6ffeef06b000 (123140722569216)
False positive error reports may follow
For details see https://github.com/google/sanitizers/issues/189
=================================================================
==54100==ERROR: AddressSanitizer: stack-buffer-underflow on address 0x000113e8fc08 at pc 0x00010c2d4fc9 bp 0x000113e8f6e0 sp 0x000113e8ee78
READ of size 168 at 0x000113e8fc08 thread T62
[2020-02-03 15:41:07 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
[2020-02-03 15:41:07 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 9 API clients left.
[2020-02-03 15:41:08 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
[2020-02-03 15:41:08 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 8 API clients left.
    #0 0x10c2d4fc8 in wrap_memmove (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x1cfc8)

0x000113e8fc08 is located 254984 bytes inside of 262144-byte region [0x000113e51800,0x000113e91800)
allocated by thread T77 here:
    #0 0x10c3187d3 in wrap_malloc (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x607d3)

Thread T62 created by T0 here:
[2020-02-03 15:41:08 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
    #0 0x10c3107cd in wrap_pthread_create (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x587cd)
[2020-02-03 15:41:08 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 7 API clients left.
[2020-02-03 15:41:08 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
[2020-02-03 15:41:08 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 6 API clients left.
[2020-02-03 15:41:08 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
[2020-02-03 15:41:08 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 5 API clients left.
[2020-02-03 15:41:08 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
[2020-02-03 15:41:08 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 4 API clients left.
[2020-02-03 15:41:08 +0100] warning/JsonRpcConnection: API client disconnected for identity 'localhost'
[2020-02-03 15:41:08 +0100] warning/ApiListener: Removing API client for endpoint 'localhost'. 3 API clients left.
    #1 0x10095e783 in std::__1::__libcpp_thread_create(_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:328
    #2 0x100e98d5e in std::__1::thread::thread<void (icinga::IoEngine::*)(), icinga::IoEngine*, void>(void (icinga::IoEngine::*&&)(), icinga::IoEngine*&&) thread:368
    #3 0x100876434 in std::__1::thread::thread<void (icinga::IoEngine::*)(), icinga::IoEngine*, void>(void (icinga::IoEngine::*&&)(), icinga::IoEngine*&&) thread:360
    #4 0x10087514c in icinga::IoEngine::IoEngine() io-engine.cpp:93
    #5 0x100876634 in icinga::IoEngine::IoEngine() io-engine.cpp:88
    #6 0x100e88f89 in icinga::IoEngine::$_64::operator()() const io-engine.cpp:75
    #7 0x100e88f2e in decltype(std::__1::forward<icinga::IoEngine::$_64&>(fp)()) std::__1::__invoke<icinga::IoEngine::$_64&>(icinga::IoEngine::$_64&) type_traits:4361
    #8 0x100e88ece in std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > std::__1::__invoke_void_return_wrapper<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > >::__call<icinga::IoEngine::$_64&>(icinga::IoEngine::$_64&) __functional_base:318
    #9 0x100e88e8e in std::__1::__function::__alloc_func<icinga::IoEngine::$_64, std::__1::allocator<icinga::IoEngine::$_64>, std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() functional:1527
    #10 0x100e8561f in std::__1::__function::__func<icinga::IoEngine::$_64, std::__1::allocator<icinga::IoEngine::$_64>, std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() functional:1651
    #11 0x100e8a5fc in std::__1::__function::__value_func<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() const functional:1799
    #12 0x100e8a3ab in std::__1::function<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() const functional:2347
    #13 0x100874b14 in icinga::LazyInit<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > >::Get() lazy-init.hpp:56
    #14 0x100873ecf in icinga::IoEngine::Get() io-engine.cpp:79
    #15 0x101325e81 in icinga::ApiListener::AddListener(icinga::String const&, icinga::String const&) apilistener.cpp:364
    #16 0x101323609 in icinga::ApiListener::Start(bool) apilistener.cpp:255
    #17 0x1008066e3 in icinga::ConfigObject::Activate(bool, icinga::Value const&) configobject.cpp:367
    #18 0x101201330 in icinga::ConfigItem::ActivateItems(icinga::WorkQueue&, std::__1::vector<boost::intrusive_ptr<icinga::ConfigItem>, std::__1::allocator<boost::intrusive_ptr<icinga::ConfigItem> > > const&, bool, bool, bool, icinga::Value const&) configitem.cpp:696
    #19 0x101bbee83 in RunWorker(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, bool, icinga::String const&) daemoncommand.cpp:281
    #20 0x101b5c30d in StartUnixWorker(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, bool, icinga::String const&) daemoncommand.cpp:507
    #21 0x101b559fb in icinga::DaemonCommand::Run(boost::program_options::variables_map const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) const daemoncommand.cpp:697
    #22 0x10074b1eb in Main() icinga.cpp:662
    #23 0x10073fea0 in main icinga.cpp:959
    #24 0x7fff7142b404 in start (libdyld.dylib:x86_64+0x11404)

Thread T77 created by T0 here:
    #0 0x10c3107cd in wrap_pthread_create (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x587cd)
    #1 0x10095e783 in std::__1::__libcpp_thread_create(_opaque_pthread_t**, void* (*)(void*), void*) __threading_support:328
    #2 0x100e98d5e in std::__1::thread::thread<void (icinga::IoEngine::*)(), icinga::IoEngine*, void>(void (icinga::IoEngine::*&&)(), icinga::IoEngine*&&) thread:368
    #3 0x100876434 in std::__1::thread::thread<void (icinga::IoEngine::*)(), icinga::IoEngine*, void>(void (icinga::IoEngine::*&&)(), icinga::IoEngine*&&) thread:360
    #4 0x10087514c in icinga::IoEngine::IoEngine() io-engine.cpp:93
    #5 0x100876634 in icinga::IoEngine::IoEngine() io-engine.cpp:88
    #6 0x100e88f89 in icinga::IoEngine::$_64::operator()() const io-engine.cpp:75
    #7 0x100e88f2e in decltype(std::__1::forward<icinga::IoEngine::$_64&>(fp)()) std::__1::__invoke<icinga::IoEngine::$_64&>(icinga::IoEngine::$_64&) type_traits:4361
    #8 0x100e88ece in std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > std::__1::__invoke_void_return_wrapper<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > >::__call<icinga::IoEngine::$_64&>(icinga::IoEngine::$_64&) __functional_base:318
    #9 0x100e88e8e in std::__1::__function::__alloc_func<icinga::IoEngine::$_64, std::__1::allocator<icinga::IoEngine::$_64>, std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() functional:1527
    #10 0x100e8561f in std::__1::__function::__func<icinga::IoEngine::$_64, std::__1::allocator<icinga::IoEngine::$_64>, std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() functional:1651
    #11 0x100e8a5fc in std::__1::__function::__value_func<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() const functional:1799
    #12 0x100e8a3ab in std::__1::function<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > ()>::operator()() const functional:2347
    #13 0x100874b14 in icinga::LazyInit<std::__1::unique_ptr<icinga::IoEngine, std::__1::default_delete<icinga::IoEngine> > >::Get() lazy-init.hpp:56
    #14 0x100873ecf in icinga::IoEngine::Get() io-engine.cpp:79
    #15 0x101325e81 in icinga::ApiListener::AddListener(icinga::String const&, icinga::String const&) apilistener.cpp:364
    #16 0x101323609 in icinga::ApiListener::Start(bool) apilistener.cpp:255
    #17 0x1008066e3 in icinga::ConfigObject::Activate(bool, icinga::Value const&) configobject.cpp:367
    #18 0x101201330 in icinga::ConfigItem::ActivateItems(icinga::WorkQueue&, std::__1::vector<boost::intrusive_ptr<icinga::ConfigItem>, std::__1::allocator<boost::intrusive_ptr<icinga::ConfigItem> > > const&, bool, bool, bool, icinga::Value const&) configitem.cpp:696
    #19 0x101bbee83 in RunWorker(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, bool, icinga::String const&) daemoncommand.cpp:281
    #20 0x101b5c30d in StartUnixWorker(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, bool, icinga::String const&) daemoncommand.cpp:507
    #21 0x101b559fb in icinga::DaemonCommand::Run(boost::program_options::variables_map const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) const daemoncommand.cpp:697
    #22 0x10074b1eb in Main() icinga.cpp:662
    #23 0x10073fea0 in main icinga.cpp:959
    #24 0x7fff7142b404 in start (libdyld.dylib:x86_64+0x11404)

SUMMARY: AddressSanitizer: stack-buffer-underflow (libclang_rt.asan_osx_dynamic.dylib:x86_64h+0x1cfc8) in wrap_memmove
Shadow bytes around the buggy address:
  0x1000227d1f30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000227d1f40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000227d1f50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000227d1f60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000227d1f70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x1000227d1f80: f1[f1]f1 f1 00 00 f3 f3 00 00 00 00 00 00 00 00
  0x1000227d1f90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000227d1fa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000227d1fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000227d1fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1000227d1fd0: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==54100==ABORTING
Caught SIGABRT.
Current time: 2020-02-03 15:41:10 +0100

[2020-02-03 15:41:10 +0100] critical/Application: Icinga 2 has terminated unexpectedly. Additional information can be found in '/Users/noah/i2/var/log/icinga2/crash/report.1580740870.629094'

Same case, but debugged with LLDB:

(lldb) bt
* thread #24, stop reason = Use of out-of-scope stack memory
  * frame #0: 0x000000010e148880 libclang_rt.asan_osx_dynamic.dylib`__asan::AsanDie()
    frame #1: 0x000000010e15f81f libclang_rt.asan_osx_dynamic.dylib`__sanitizer::Die() + 175
    frame #2: 0x000000010e14676b libclang_rt.asan_osx_dynamic.dylib`__asan::ScopedInErrorReport::~ScopedInErrorReport() + 411
    frame #3: 0x000000010e145f9f libclang_rt.asan_osx_dynamic.dylib`__asan::ReportGenericError(unsigned long, unsigned long, unsigned long, unsigned long, bool, unsigned long, unsigned int, bool) + 447
    frame #4: 0x000000010e0fbfe9 libclang_rt.asan_osx_dynamic.dylib`wrap_memmove + 713
    frame #5: 0x00007fff71665bf8 libunwind.dylib`unw_init_local + 33
    frame #6: 0x00007fff716665cb libunwind.dylib`unwind_phase2 + 41
    frame #7: 0x00007fff71669ed1 libunwind.dylib`_Unwind_Resume + 51
    frame #8: 0x000000010322886c icinga2`icinga::JsonRpc::ReadMessage(stream=0x0000616004162008, yc=boost::asio::yield_context @ 0x0000000115bb1840, maxMessageLength=-1)(), boost::asio::executor> >, long) at jsonrpc.cpp:0:1
    frame #9: 0x000000010322a252 icinga2`icinga::JsonRpcConnection::HandleIncomingMessages(this=0x0000616004161f80, yc=boost::asio::yield_context @ 0x0000000115bb1fa0)(), boost::asio::executor> >) at jsonrpcconnection.cpp:67:14
    frame #10: 0x00000001038c22b3 icinga2`icinga::JsonRpcConnection::Start(this=0x00006060003e8e50, yc=boost::asio::yield_context @ 0x0000000115bb20e0)::$_67::operator()(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) const at jsonrpcconnection.cpp:55:83
    frame #11: 0x00000001038c1fa0 icinga2`void icinga::IoEngine::SpawnCoroutine<boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67>(this=0x00006060003e8e50, yc=boost::asio::yield_context @ 0x0000000115bb22c0)::$_67)::'lambda'(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >)::operator()(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) const at io-engine.hpp:125:6
    frame #12: 0x00000001038c1c3c icinga2`boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(), boost::asio::io_context::strand>, void icinga::IoEngine::SpawnCoroutine<boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67>(boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67)::'lambda'(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >)>::operator(this=0x00000001152485e0, ca=0x0000000115bb2490)(boost::coroutines::pull_coroutine<void>&) at spawn.hpp:337:7
    frame #13: 0x00000001038c1700 icinga2`boost::coroutines::detail::push_coroutine_object<boost::coroutines::pull_coroutine<void>, void, boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(), boost::asio::io_context::strand>, void icinga::IoEngine::SpawnCoroutine<boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67>(boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67)::'lambda'(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >)>&, boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits> >::run(this=0x0000000115bb2760) at push_coroutine_object.hpp:302:11
    frame #14: 0x00000001038c1359 icinga2`void boost::coroutines::detail::trampoline_push_void<boost::coroutines::detail::push_coroutine_object<boost::coroutines::pull_coroutine<void>, void, boost::asio::detail::coro_entry_point<boost::asio::executor_binder<void (*)(), boost::asio::io_context::strand>, void icinga::IoEngine::SpawnCoroutine<boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67>(boost::asio::io_context::strand, icinga::JsonRpcConnection::Start()::$_67)::'lambda'(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >)>&, boost::coroutines::basic_standard_stack_allocator<boost::coroutines::stack_traits> > >(t=(fctx = 0x0000000115248400, data = 0x0000000115248440)) at trampoline_push.hpp:70:11
    frame #15: 0x000000010dc5fa2f libboost_context-mt.dylib`trampoline + 3
(lldb)
dnsmichi commented 4 years ago

Next rounds:

In addition to that, we're going further with the ASAN analysis. Yesterday we stopped at JsonDecode, where the MessageHandler() from the JSON-RPC connection read sometimes crashed when sending back the response.