nielsk opened this issue 5 years ago
But the nodes do not crash - the master, which I just upgraded to 2.11.3, crashes on start-up (btw. the satellites are running Linux, the endpoints Linux or Windows)
The master runs FreeBSD. I upgraded icinga from 2.10.5 to 2.11.3 (or earlier). I restart icinga2 and it crashes on start, apparently when the satellites try to reconnect.
I just read @bsdlme's comment and the 'when all nodes run 2.11.3'. I have to think about what I'll do about that...
I don't think that this is necessary. There may still be a problem with the JSON-RPC.
Please could you test this one?
How would I do that? Build it from source and install it over the installed package (which is built on my poudriere)?
Btw. I have now upgraded all my endpoints to 2.11.3, and the master still crashes after I updated it to 2.11.3.
Please could you test this one? https://github.com/Icinga/icinga2/tree/bugfix/freebsd-7539
How would I do that? Build it from source and install it over the installed package (which is built on my poudriere)?
Exactly.
TBH I relied on what was said in the bug report at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=245985 But I'm glad that we're moving forward now. @nielsk Do you need any help building the patched version?
Yes. I am reading the documentation, but I have no clue how to build it on FreeBSD. @bsdlme
Okay, give me a minute...
Just gunzip the attached patch from @Al2Klimov's last commit and put it into net-mgmt/icinga2/files/. Then build Icinga as usual.
Thanks @bsdlme, I built it successfully, but the crash still persists.
@bsdlme Are you sure that the resulting extension (".cpp") gets picked up by the build... patch... thing?
@Al2Klimov Yes, it got picked up. From the build-log:
=======================<phase: patch-depends >============================
===========================================================================
=======================<phase: patch >============================
===> Patching for icinga2-2.11.3_1
===> Applying extra patch /distfiles/local-patches/icinga2/patch-lib_remote_jsonrpcconnection-heartbeat.cpp
===> Applying FreeBSD patches for icinga2-2.11.3_1
Suppose I've got a fresh FreeBSD 11.3 VM. Could you provide step-by-step instructions for reproducing this from scratch?
I honestly don't know, since I don't know where it breaks. It seems to break when it tries to connect to the satellites. This is a configuration with multiple zones, custom checks, satellites etc. It is not as if I installed it, added a satellite and it crashed; it is an installation that has been running for years and has been upgraded over time.
I have given the information I have. If someone can point me to what I can do to provide more information, I'd be happy to help.
Please could you generate a core dump of the crash, gzip it together with a list of the exact packages you have installed, and drop it here?
Whatever I try, no core dump is generated. Unfortunately I can only offer the output of truss -f.
Did you try attaching to the top three Icinga processes with a debugger, waiting until it crashes and letting the debugger generate the core dump?
A truss output can be found at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=240812
Did you try attaching to the top three Icinga processes with a debugger, waiting until it crashes and letting the debugger generate the core dump?
It happens so fast that I cannot attach a debugger. I am trying something else now. A complete truss output can be downloaded here: https://nextcloud.kobschaetzki.net/index.php/s/CnQLtqo9HX7j4QE
Hm... does Icinga crash if you block port 5665 via the firewall?
And apparently I cannot start it with gdb or lldb, because they do not recognize the executable.
Hm... does Icinga crash if you block port 5665 via the firewall?
Yes. IIRC it only works if I disable the API.
Yes, with the API disabled it works. Re-enabling the API makes it crash again.
I will call it a day now, revert to the working boot environment and head into the weekend (yeah, May 1st).
Please try:
So, I rebuilt my packages with the new 2.11.3.
I updated my icinga2 master; it still works just fine, as it did before.
Then I upgraded a satellite where it was broken; it is still broken.
@bsdlme Are you sure that the resulting extension (".cpp") gets picked up by the build... patch... thing?
The framework picks up patch-* files. I know, I wrote that bit (and rewrote it this afternoon).
I'm about to set up 2 Jails for you to debug, @Al2Klimov @mat813 @nielsk, anything I need to do to make it crash?
Please don't forget to include the heartbeat/m_Endpoint patch.
Building boost with debug symbols takes some time...
In my case there is a master and there are satellites, and the API needs to be enabled. The moment the master connects, it crashes.
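For reference, a master/satellite layout of the kind described here looks roughly like the following zones.conf sketch (hypothetical host names; a minimal illustration, not the actual configuration from this setup):

```
object Endpoint "master.example.org" {
}

object Zone "master" {
  endpoints = [ "master.example.org" ]
}

object Endpoint "satellite.example.org" {
  // With `host` set, the master actively connects out on port 5665,
  // which is the moment the crash reportedly happens.
  host = "satellite.example.org"
}

object Zone "satellite" {
  endpoints = [ "satellite.example.org" ]
  parent = "master"
}
```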
Best regards
Niels
On 30. Apr 2020, at 17:19, Lars E notifications@github.com wrote:
I'm about to set up 2 Jails for you to debug, @Al2Klimov @mat813 @nielsk, anything I need to do to make it crash?
I now have two jails. One Icinga Master jail running 2.11.3 with the patch and one satellite with 2.11.2. Unfortunately I could not make the master crash, yet.
Is the API configured and activated?
Yes, but maybe with a wrong configuration. I can share it later.
Note that my problem is not the master crashing; it is the satellites crashing. They are all running on i386.
That's odd, because my master crashes with or without satellites.
Briefly, I was forced to upgrade the master icinga2 machine at my site for various reasons and pitchforks. The timeline looks like this:
[2020-05-20 19:12:08 -0700] critical/cli: The daemon could not be started. See log output for details.
So with no satellites running, it crashes out of the box.
This output might be useful to someone:
# /usr/local/sbin/icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.11.3-1)
Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
System information:
Platform: Unknown
Platform version: Unknown
Kernel: FreeBSD
Kernel version: 11.3-PRERELEASE
Architecture: amd64
Build information:
Compiler: Clang 8.0.0
Build host: pkg.dream-tech.com
...
# ldd /usr/local/lib/icinga2/sbin/icinga2
/usr/local/lib/icinga2/sbin/icinga2:
libexecinfo.so.1 => /usr/local/lib/libexecinfo.so.1 (0x801250000)
libboost_context.so.1.72.0 => /usr/local/lib/libboost_context.so.1.72.0 (0x80145f000)
libboost_coroutine.so.1.72.0 => /usr/local/lib/libboost_coroutine.so.1.72.0 (0x801661000)
libboost_date_time.so.1.72.0 => /usr/local/lib/libboost_date_time.so.1.72.0 (0x801868000)
libboost_filesystem.so.1.72.0 => /usr/local/lib/libboost_filesystem.so.1.72.0 (0x801a72000)
libboost_thread.so.1.72.0 => /usr/local/lib/libboost_thread.so.1.72.0 (0x801c8d000)
libboost_system.so.1.72.0 => /usr/local/lib/libboost_system.so.1.72.0 (0x801ea5000)
libboost_program_options.so.1.72.0 => /usr/local/lib/libboost_program_options.so.1.72.0 (0x8020a6000)
libboost_regex.so.1.72.0 => /usr/local/lib/libboost_regex.so.1.72.0 (0x802304000)
libboost_chrono.so.1.72.0 => /usr/local/lib/libboost_chrono.so.1.72.0 (0x8025b5000)
libboost_atomic.so.1.72.0 => /usr/local/lib/libboost_atomic.so.1.72.0 (0x8027bd000)
libssl.so.47 => /usr/local/lib/libssl.so.47 (0x8029bf000)
libcrypto.so.45 => /usr/local/lib/libcrypto.so.45 (0x802c1b000)
libedit.so.0 => /usr/local/lib/libedit.so.0 (0x80300d000)
libncurses.so.8 => /lib/libncurses.so.8 (0x803244000)
libc++.so.1 => /usr/lib/libc++.so.1 (0x803499000)
libcxxrt.so.1 => /lib/libcxxrt.so.1 (0x803768000)
libm.so.5 => /lib/libm.so.5 (0x803987000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x803bb7000)
libthr.so.3 => /lib/libthr.so.3 (0x803dca000)
libc.so.7 => /lib/libc.so.7 (0x803ff2000)
libicudata.so.66 => /usr/local/lib/libicudata.so.66 (0x8043ad000)
libicui18n.so.66 => /usr/local/lib/libicui18n.so.66 (0x804600000)
libicuuc.so.66 => /usr/local/lib/libicuuc.so.66 (0x804b21000)
librt.so.1 => /usr/lib/librt.so.1 (0x804f19000)
# /usr/local/bin/openssl version
LibreSSL 3.0.2
Just by observation I don't think LibreSSL is the factor here, but I could be wrong.
While this particular installation of icinga2 is important to me, it's not really production. So if you want to throw patches at me, please do, as I am willing to do almost whatever it takes to get this running again. Thanks in advance.
So, I updated to 2.12.0 and it still crashes on startup.
I'm running into an issue under OpenBSD 6.7-stable which looks really similar to what is being described above. The setup is similar: icinga2-2.11.5v0 (from packages) with the API feature enabled, set up as a satellite.
When the master sends updates, it crashes. In my particular case, directly after /var/lib/icinga2/api/zones-stage/pub//_etc/generated_dbconfig_hosts.conf is copied into /var/lib/icinga2/api/zones/pub//_etc/generated_dbconfig_hosts.conf.
After creating a build with debug symbols I managed to get the following backtrace:
#0 thrkill () at -:3
#1 0x00000a94dd2ce2ae in _libc_abort () at /usr/src/lib/libc/stdlib/abort.c:61
#2 0x00000a94dd231e9c in _libc_pthread_mutex_unlock (mutexp=<optimized out>) at /usr/src/lib/libc/thread/rthread_mutex.c:265
#3 0x00000a9272acc510 in boost::posix::pthread_mutex_unlock (m=0xa927306f488 <icinga::ApiListener::m_ConfigSyncStageLock>) at /usr/local/include/boost/thread/pthread/mutex.hpp:71
#4 boost::mutex::unlock (this=0xa927306f488 <icinga::ApiListener::m_ConfigSyncStageLock>) at /usr/local/include/boost/thread/pthread/mutex.hpp:125
#5 boost::unique_lock<boost::mutex>::~unique_lock (this=<optimized out>) at /usr/local/include/boost/thread/lock_types.hpp:331
#6 icinga::intrusive_ptr_release<boost::unique_lock<boost::mutex> > (object=0xa952e854d60) at /usr/ports/pobj/icinga2-2.11.5/icinga2-2.11.5/lib/base/shared.hpp:27
#7 boost::intrusive_ptr<icinga::Shared<boost::unique_lock<boost::mutex> > >::~intrusive_ptr (this=<optimized out>) at /usr/local/include/boost/smart_ptr/intrusive_ptr.hpp:98
#8 icinga::ApiListener::AsyncTryActivateZonesStage(std::__1::vector<icinga::String, std::__1::allocator<icinga::String> > const&, boost::intrusive_ptr<icinga::Shared<boost::unique_lock<boost::mutex> > > const&)::$_34::~$_34() (this=0xa949f7244c8)
at /usr/ports/pobj/icinga2-2.11.5/icinga2-2.11.5/lib/remote/apilistener-filesync.cpp:648
#9 std::__1::__compressed_pair_elem<icinga::ApiListener::AsyncTryActivateZonesStage(std::__1::vector<icinga::String, std::__1::allocator<icinga::String> > const&, boost::intrusive_ptr<icinga::Shared<boost::unique_lock<boost::mutex> > > const&)::$_34, 0, false>::~__compressed_pair_elem() (this=0xa949f7244c8)
at /usr/include/c++/v1/memory:2134
#10 0x00000a9272acc423 in std::__1::__function::__alloc_func<icinga::ApiListener::AsyncTryActivateZonesStage(std::__1::vector<icinga::String, std::__1::allocator<icinga::String> > const&, boost::intrusive_ptr<icinga::Shared<boost::unique_lock<boost::mutex> > > const&)::$_34, std::__1::allocator<icinga::ApiListener::AsyncTryActivateZonesStage(std::__1::vector<icinga::String, std::__1::allocator<icinga::String> > const&, boost::intrusive_ptr<icinga::Shared<boost::unique_lock<boost::mutex> > > const&)::$_34>, void (icinga::ProcessResult const&)>::destroy() (this=<optimized out>) at /usr/include/c++/v1/functional:1546
#11 std::__1::__function::__func<icinga::ApiListener::AsyncTryActivateZonesStage(std::__1::vector<icinga::String, std::__1::allocator<icinga::String> > const&, boost::intrusive_ptr<icinga::Shared<boost::unique_lock<boost::mutex> > > const&)::$_34, std::__1::allocator<icinga::ApiListener::AsyncTryActivateZonesStage(std::__1::vector<icinga::String, std::__1::allocator<icinga::String> > const&, boost::intrusive_ptr<icinga::Shared<boost::unique_lock<boost::mutex> > > const&)::$_34>, void (icinga::ProcessResult const&)>::destroy_deallocate() (this=0xa949f7244c0) at /usr/include/c++/v1/functional:1643
#12 0x00000a92729bad13 in std::__1::__function::__value_func<void (icinga::ProcessResult const&)>::~__value_func() (this=<optimized out>) at /usr/include/c++/v1/functional:1758
#13 std::__1::function<void (icinga::ProcessResult const&)>::~function() (this=<optimized out>) at /usr/include/c++/v1/functional:2334
#14 std::__1::__bind<std::__1::function<void (icinga::ProcessResult const&)>&, icinga::ProcessResult&>::~__bind() (this=<optimized out>) at /usr/include/c++/v1/functional:2648
#15 std::__1::__compressed_pair_elem<std::__1::__bind<std::__1::function<void (icinga::ProcessResult const&)>&, icinga::ProcessResult&>, 0, false>::~__compressed_pair_elem() (this=<optimized out>) at /usr/include/c++/v1/memory:2134
#16 std::__1::__function::__alloc_func<std::__1::__bind<std::__1::function<void (icinga::ProcessResult const&)>&, icinga::ProcessResult&>, std::__1::allocator<std::__1::__bind<std::__1::function<void (icinga::ProcessResult const&)>&, icinga::ProcessResult&> >, void ()>::destroy() (this=<optimized out>)
at /usr/include/c++/v1/functional:1546
#17 std::__1::__function::__func<std::__1::__bind<std::__1::function<void (icinga::ProcessResult const&)>&, icinga::ProcessResult&>, std::__1::allocator<std::__1::__bind<std::__1::function<void (icinga::ProcessResult const&)>&, icinga::ProcessResult&> >, void ()>::destroy_deallocate() (this=0xa94b97a1000)
at /usr/include/c++/v1/functional:1643
#18 0x00000a92729d438d in std::__1::__function::__value_func<void ()>::~__value_func() (this=<optimized out>) at /usr/include/c++/v1/functional:1758
#19 std::__1::function<void ()>::~function() (this=<optimized out>) at /usr/include/c++/v1/functional:2334
#20 bool icinga::ThreadPool::Post<std::__1::function<void ()> >(std::__1::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}::~SchedulerPolicy() (this=<optimized out>) at /usr/ports/pobj/icinga2-2.11.5/icinga2-2.11.5/lib/base/threadpool.hpp:59
#21 boost::asio::system_executor::dispatch<bool icinga::ThreadPool::Post<std::__1::function<void ()> >(std::__1::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}, std::__1::allocator<void> >(bool icinga::ThreadPool::Post<std::__1::function<void ()> >(std::__1::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}&&, std::__1::allocator<void> const&) const (this=<optimized out>, f=<optimized out>) at /usr/local/include/boost/asio/impl/system_executor.hpp:40
#22 0x00000a92729d420c in boost::asio::detail::work_dispatcher<bool icinga::ThreadPool::Post<std::__1::function<void ()> >(std::__1::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}>::operator()() (this=<optimized out>) at /usr/local/include/boost/asio/detail/work_dispatcher.hpp:58
#23 boost::asio::asio_handler_invoke<boost::asio::detail::work_dispatcher<bool icinga::ThreadPool::Post<std::__1::function<void ()> >(std::__1::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}> >(boost::asio::detail::work_dispatcher<bool icinga::ThreadPool::Post<std::__1::function<void ()> >(std::__1::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}>&, ...) (function=...) at /usr/local/include/boost/asio/handler_invoke_hook.hpp:69
#24 boost_asio_handler_invoke_helpers::invoke<boost::asio::detail::work_dispatcher<bool icinga::ThreadPool::Post<std::__1::function<void ()> >(std::__1::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}>, bool icinga::ThreadPool::Post<std::__1::function<void ()> >(std::__1::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}>(boost::asio::detail::work_dispatcher<bool icinga::ThreadPool::Post<std::__1::function<void ()> >(std::__1::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}>&, bool icinga::ThreadPool::Post<std::__1::function<void ()> >(std::__1::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}&) (function=..., context=...) at /usr/local/include/boost/asio/detail/handler_invoke_helpers.hpp:37
#25 boost::asio::detail::executor_op<boost::asio::detail::work_dispatcher<bool icinga::ThreadPool::Post<std::__1::function<void ()> >(std::__1::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}>, std::__1::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, std::__1::allocator<void>*, boost::system::error_code const&, unsigned long) (owner=0xa94be39e300, base=0xa94b97a1e00) at /usr/local/include/boost/asio/detail/executor_op.hpp:70
#26 0x00000a927292c9c8 in boost::asio::detail::scheduler_operation::complete (this=<optimized out>, owner=0xa94be39e300, ec=..., bytes_transferred=<optimized out>) at /usr/local/include/boost/asio/detail/scheduler_operation.hpp:40
#27 boost::asio::detail::scheduler::do_run_one (this=0xa94be39e300, lock=..., this_thread=..., ec=...) at /usr/local/include/boost/asio/detail/impl/scheduler.ipp:401
#28 0x00000a927292c492 in boost::asio::detail::scheduler::run (this=0xa94be39e300, ec=...) at /usr/local/include/boost/asio/detail/impl/scheduler.ipp:154
#29 0x00000a927293d867 in boost::asio::thread_pool::thread_function::operator() (this=<optimized out>) at /usr/local/include/boost/asio/impl/thread_pool.ipp:33
#30 boost::asio::detail::posix_thread::func<boost::asio::thread_pool::thread_function>::run (this=0xa9505e37ac0) at /usr/local/include/boost/asio/detail/posix_thread.hpp:86
#31 0x00000a927293d7a5 in boost::asio::detail::boost_asio_detail_posix_thread_function (arg=0xa9505e37ac0) at /usr/local/include/boost/asio/detail/impl/posix_thread.ipp:74
#32 0x00000a95076c10d1 in _rthread_start (v=<optimized out>) at /usr/src/lib/librthread/rthread.c:96
#33 0x00000a94dd2c6c58 in __tfork_thread () at /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:77
#34 0x0000000000000000 in ?? ()
Frame 2 points to the following snippet of code:
int
pthread_mutex_unlock(pthread_mutex_t *mutexp)
{
	pthread_t self = pthread_self();
	pthread_mutex_t mutex;

	if (mutexp == NULL)
		return (EINVAL);
	if (*mutexp == NULL)
#if PTHREAD_MUTEX_DEFAULT == PTHREAD_MUTEX_ERRORCHECK
		return (EPERM);
#elif PTHREAD_MUTEX_DEFAULT == PTHREAD_MUTEX_NORMAL
		return (0);
#else
		abort();
#endif

	mutex = *mutexp;
	_rthread_debug(5, "%p: mutex_unlock %p (%p)\n", self, (void *)mutex,
	    (void *)mutex->owner);
	if (mutex->owner != self) {
		_rthread_debug(5, "%p: different owner %p (%p)\n", self,
		    (void *)mutex, (void *)mutex->owner);
		if (mutex->type == PTHREAD_MUTEX_ERRORCHECK ||
		    mutex->type == PTHREAD_MUTEX_RECURSIVE) {
			return (EPERM);
		} else {
			/*
			 * For mutex type NORMAL our undefined behavior for
			 * unlocking an unlocked mutex is to succeed without
			 * error. All other undefined behaviors are to
			 * abort() immediately.
			 */
			if (mutex->owner == NULL &&
			    mutex->type == PTHREAD_MUTEX_NORMAL)
				return (0);
			else
				abort();	/* line causing the crash */
		}
	}
Just for shits and giggles I enabled the threading debug output, and when filtering for the specific mutex I find the following:
# grep -F 0xe0c6dd24f00 /tmp/mutex_debug
0xe0ca5effa40: mutex_lock 0xe0c6dd24f00 (0x0)
0xe0ca5effa40: mutex_unlock 0xe0c6dd24f00 (0xe0ca5effa40)
0xe0be5accc40: mutex_lock 0xe0c6dd24f00 (0x0)
0xe0ca5eff640: mutex_unlock 0xe0c6dd24f00 (0xe0be5accc40)
0xe0ca5eff640: different owner 0xe0c6dd24f00 (0xe0be5accc40)
This doesn't give a lot of extra information, but it confirms that the mutex is being unlocked by a thread other than the one that acquired it. After removing the owner check in libc for testing purposes (which I don't recommend anyone do), icinga keeps running, which confirms that the issue comes down to the wrong thread releasing the mutex.
Since I'm not a C++ programmer, let alone familiar with the boost and icinga paradigms, this is basically where I got stuck. Hopefully this helps someone with more in-depth knowledge solve this issue.
confirms that this issue restricted the wrong thread releasing the mutex
Of course! We lock the mutex in one thread and hand it over to another one. And libc doesn't like it? Damn...
Please could you test #8308?
A quick test suggests that this fixes my issue. I'm going to leave it running over the weekend and report back sometime next week.
One minor side note, which probably doesn't apply to your implementation (I don't know what std::atomic_flag uses under the hood, but probably not the pthread_spin_* family): POSIX states that a pthread_spin_unlock called by a thread not owning the lock results in undefined behaviour[0] and could just as easily cause an abort, similar to what pthread_mutex_unlock does on OpenBSD.
[0] https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_spin_unlock.html
After a couple of days it still seems to run as expected.
There's one minor issue: after some time icinga fails to exit when it is sent a SIGTERM via:
pkill -T "0" -xf "/usr/local/lib/icinga2/sbin/icinga2 daemon.*"
as per OpenBSD's rc framework. This, however, seems not directly related to this diff, since I can restart icinga just after a config update has been pushed. I'll investigate further, and if I find something useful I'll put it on an appropriate ticket.
Is the patch included in the latest 2.12.1 release?
Great! I just updated the FreeBSD port. @nielsk and @mat813, can you please confirm that your setup works now?
Describe the bug
After upgrading from icinga 2.10.5 to 2.11 on FreeBSD 11.3-p3, icinga2 daemon -C shows that the configuration is correct, but the daemon starts and immediately exits when the api feature is enabled. It works without the api feature.
After re-running the api setup I got it working, but it crashed when I tried to send a notification.
output from running truss icinga2 daemon -x debug before 'api setup'
crash
Your Environment
Include as many relevant details about the environment you experienced the problem in
Version used (icinga2 --version): r2.11.0-1
Enabled features (icinga2 feature list): api checker command ido-mysql mainlog notification
Config validation (icinga2 daemon -C):
Additional context:
I opened a thread on the community discourse where I may have written more: https://community.icinga.com/t/problems-with-upgrading-icinga-2-10-5-to-2-11-on-freebsd/2325