Icinga / icinga2


Logrotate issues since 2.10.1-1 #6737

Closed - sschindel closed this issue 5 years ago

sschindel commented 5 years ago

Hey!

Since upgrading to 2.10.1-1 we have been facing issues with log rotation: the daemon sometimes crashes after the logrotate run, which did not happen before. This is the crash report:

cat /var/log/icinga2/crash/report.1540790715.016844

Application version: r2.10.1-1

System information:
  Platform: Debian GNU/Linux
  Platform version: 9 (stretch)
  Kernel: Linux
  Kernel version: 4.9.0-6-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 6.3.0
  Build host: 1880c6977367

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

Stacktrace:

(0) libc.so.6: gsignal (+0xcf) [0x7fa1362c5fff]
(1) libc.so.6: abort (+0x16a) [0x7fa1362c742a]
(2) libc.so.6: <unknown function> (+0x70c00) [0x7fa136303c00]
(3) libc.so.6: <unknown function> (+0x76fc6) [0x7fa136309fc6]
(4) libc.so.6: <unknown function> (+0x7780e) [0x7fa13630a80e]
(5) icinga2: std::_Function_base::_Base_manager<std::_Bind<std::_Mem_fn<void (icinga::Timer::*)()> (boost::intrusive_ptr<icinga::Timer>)> >::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation) (+0x38) [0x563e339d0358]
(6) icinga2: icinga::ThreadPool::WorkerThread::ThreadProc(icinga::ThreadPool::Queue&) (+0x4ca) [0x563e339c116a]
(7) libboost_thread.so.1.62.0: <unknown function> (+0x12116) [0x7fa138c1f116]
(8) libpthread.so.0: <unknown function> (+0x7494) [0x7fa137c41494]
(9) libc.so.6: clone (+0x3f) [0x7fa13637bacf]

Failed to launch GDB: No such file or directory

Elias481 commented 5 years ago

I find it hard to believe that this happens exactly at log-rotate time by coincidence, given that this well-known issue exists there and each day has only 24 hours. So I believe the original crash in #6684 that occurred at logrotate time could simply be the same issue as this one, since I have absolutely no proof of the opposite. :-) As for the other crash reported in that issue (https://github.com/Icinga/icinga2/issues/6684#issuecomment-441987060), I believe there is another bug in the icinga2 reload logic that leaves room for a similar race condition. We also observed a crash (SIGBUS) at reload time, but sadly without any information beyond the SIGBUS error in syslog, so I could not investigate it.

I don't think I will give the 2.11 snapshots a try, because a) we hit this issue rarely (possibly only once), b) I am still waiting for the three investigated segfault bugs in icinga2, two of which hit us regularly, to be fixed (all three should be simple to fix: at least for this one and for #6874 it is pretty clear what to do, and for the third one the cause is also identified and a fix is simple to implement, even though the most straightforward way to fix it is not that obvious), and finally c) we have encountered no problems with UTF-8 special characters, and apart from the things I already said I don't believe are related, I don't see how what I observe relates to #7019. I would anyway prefer to mitigate that at the source or destination (just use correct UTF-8 wherever possible, some way)...

dnsmichi commented 5 years ago

Thanks for your thoughts and input. I can only offer you to test things already available, everything else is work in progress.

I tend to disagree on "should be simple to fix": the analysis in the issue has proven that the problem exists, but it is not an easy one to reproduce. Since you're experiencing this in your environment, and obviously have deeper knowledge of C++ and signal handling, it might be a good use of that effort to patch this and send in a PR with a before-and-after analysis.

I'm not sure how the UTF8 things relate here; that's just a different issue, which has been fixed with the JSON library replacement. The main point is that memory corruption and leaks are hard to find, and by the time this crashes they may have influenced it, but the root cause lies in a different scope.

I don't believe that this is the true cause of the logrotate crash either, but testing a build with YAJL removed allows us to narrow the problem down further. The stacktrace on a crash will definitely look different, and hopefully give more insight. That is why I am asking you, and everyone else experiencing the problem, to test the snapshot packages.

If we don't find anything here in a reasonable amount of time, there are two options: a) block the 2.11 release, which means telling everyone else who's waiting for the network stack etc. to keep waiting, or b) remove this from 2.11 and leave it for later, which is bad for everyone in this issue.

Cheers, Michael

Elias481 commented 5 years ago

Regarding not wanting to test this: it is just not really testable for us, because the last crash, the abort in YAJL, is not as common in our environment as the other ones, and if it is a separate issue it therefore has lower priority. Since the other crashes are verified (including a possible fix and some other ways to mitigate them), I simply don't believe the YAJL abort at logrotate time is a YAJL fault. Accessing an already deleted object or its memory while the object is still in use (which, as verified, happens frequently) can lead to exactly this kind of memory allocation error in random other scopes, as you confirmed. So testing the YAJL replacement from #7019 could make sense for miso231, who reported in #6684 that he hits this error condition on logrotate frequently. I did not verify it, but I think it is quite likely he will then still get these errors (which point directly to the root cause), or a similar error from the other JSON parser if it does similar memory allocation. But that is only an assumption on my part, based on the correlation with logrotate time, which does not trigger or influence anything JSON-parsing related.

For the crashes reported in this issue I verified that they are related to unsafe refcounting. I made one clean fix/patch that changes as little as possible and just fixes the issue, with only a small "interface" change: the OnTimerExpired callback no longer receives a Ptr object holding a reference, but just a plain pointer, so the callback must be careful not to increment its refcount or copy it aside for later use without knowing the object is still alive at that point - however, this parameter to OnTimerExpired is not used anywhere besides the test case. Another possibility would be a local fix in the StreamLogger destructor, but that does not fix the refcounting bug in the Timer module itself, so I would not recommend it. The third option is something I would now strike through, seeing the other errors (it is a dirty solution anyway and would not avoid errors like the ones seen during JSON parsing, if those are related - it was only an example to verify the bug).
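
A minimal sketch of the interface change described above, with simplified, partly made-up names rather than the actual lib/base sources: the expiration signal hands out a plain const Timer* instead of an intrusive_ptr, so firing the callback no longer has to touch the refcount of a Timer that may already be going away.

// Sketch only - a simplified stand-in for the Timer interface, not Icinga 2 code.
#include <iostream>
#include <boost/intrusive_ptr.hpp>
#include <boost/signals2.hpp>

class Timer
{
public:
    typedef boost::intrusive_ptr<Timer> Ptr;

    // Before: the signal handed out an intrusive_ptr, so every emission built
    // a temporary Ptr from 'this' and bumped a refcount that the destructor of
    // a dying Timer might be racing against:
    //   boost::signals2::signal<void (const Timer::Ptr&)> OnTimerExpired;

    // After (the idea behind #7129): hand out a plain pointer and never touch
    // the refcount from the timer thread. Callbacks must not keep the pointer
    // beyond the call.
    boost::signals2::signal<void (const Timer*)> OnTimerExpired;

    void Call()
    {
        OnTimerExpired(this);
    }

    // The intrusive_ptr hooks (add_ref/release) are omitted because this
    // sketch never actually creates a Ptr.
};

int main()
{
    Timer flushTimer;
    flushTimer.OnTimerExpired.connect([](const Timer*) {
        std::cout << "timer expired, flush the log stream\n";
    });
    flushTimer.Call();
}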

One could prefer a bigger rework in which the timer scheduler keeps references on its Timers, to avoid the interface change described above, but that is not as simple. I could certainly do it if that were your requirement for a fix, but it is an effort I would only spend if really needed.

Sure, I can put it into a PR with a before-and-after analysis - the one I already have.

I could also try to prove whether the YAJL abort is reproducible by synchronizing that code path with the one I already synchronized for the verification, in such a way that the malloc from YAJL happens after one of the parties has finished destruction completely (freed the memory) but before the other one does the same and raises the double-free SEGV (i.e. the party that is not the first destructor waits for a notification from YAJL). That is some effort, and it has the side effect that even more extra code would have to be added, making it even harder to verify that it is showing a race condition in the original code and not an error introduced by the synchronization mechanisms themselves. So I would prefer to just implement a fix (either the small, targeted one I already did, or a bigger and more complicated one that makes the timer scheduler the owner of all Timers that are in its queue or currently being executed) and see whether more crashes remain - or, if someone hits the YAJL error more often, they could test with that version.
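
For what it's worth, the synchronization trick described above - forcing a deterministic interleaving so the refcount race shows up on every run instead of once in a while - can be sketched generically as follows. The names are invented and the delete is only simulated; this is an illustration of the idea, not the critical-section patch actually used for the verification.

// Generic illustration only. Inject an artificial delay into the refcount path
// so the race between "the last Release() decides to delete" and "another
// thread builds a new reference from a raw this pointer" fires every time.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

struct Object
{
    std::atomic<int> refs{1};
    std::atomic<bool> destroyed{false};

    void AddRefFromRawThis()
    {
        if (refs.load() == 0)
            std::cout << "AddRef after the last Release: the owner already decided "
                         "to delete (use-after-free / double free in real code)\n";
        refs.fetch_add(1);
    }

    void Release()
    {
        if (refs.fetch_sub(1) == 1) {
            // Widen the race window. The verification used a critical section
            // to line the threads up; a sleep is enough for this sketch.
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            destroyed.store(true); // stands in for 'delete this'
        }
    }
};

int main()
{
    Object o;

    // The owner drops the last reference, as a destructor would.
    std::thread owner([&o] { o.Release(); });

    // Give the owner a head start so the refcount is already zero...
    std::this_thread::sleep_for(std::chrono::milliseconds(20));

    // ...then the "timer thread" builds a new reference from a raw pointer.
    std::thread timer([&o] { o.AddRefFromRawThis(); });

    owner.join();
    timer.join();

    std::cout << "refs=" << o.refs.load()
              << " destroyed=" << o.destroyed.load() << "\n";
}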

So what is your requirement? Just a PR with what I already did? Does it need to verify that the YAJL abort is covered as well? Or is it required to provide the OnTimerExpired callback with a Ptr object holding a valid refcount while at the same time keeping the auto-destruction behaviour of the Timer objects, which would make the fix much more complex?

I don't have deep knowledge of C++, and signal handling is not something I am really deep into either. But this is not really about signal handling so far, and I do have a fairly deep understanding of programming and threading logic, which in the end, behind the curtains, is the same across all languages for the hardware we are talking about (we are not talking about quantum computers and the like, where I would expect some major differences).

As for the other issues, I will comment there (a bit more briefly).

Al2Klimov commented 5 years ago

Hello @Elias481 and thank you for the inspiration!

With #7129 (I think) I've found the root cause. Please could you test #7130?

Best, AK

Elias481 commented 5 years ago

Review done - #7130 does not change anything and does not address the root cause. The root cause is fixed with #7129.

If you want a workaround like #7130 that does not address the root cause, I posted something that does work instead in the comment above: https://github.com/Elias481/icinga2/commit/9917e3dd12a83e5253c94c4774bba5caa68dbe81

Since I am not a fan of implementing a local workaround instead of fixing the root cause, I will not make a PR for that. But you could indeed merge what is shown there (it is already tested - and for the problem observed in this issue the second change would be the minimum needed, but then one more place stays unfixed...).

dnsmichi commented 5 years ago

Hi,

thanks for the critical-section patch that forces the application crash; it really helped in analysing all the possible patches.

Here's what we've tested today.

kill -USR1 $(pidof icinga2)

No patch (plain Git master, 02db12ae024fe8704b24715402d2c475fce0efbb)

Result: Segfault.

Only initialize Flush Timer once

From #7130.

Result: Icinga doesn't crash, but the log flush runs into a lock, enforced by the critical-section patch and the different circumstances.

Timer const*

From #7129.

Result: Icinga doesn't crash and continues to log on stdout and mainlog.

Combine flush timer once with timer const*

Result: Works fine, doesn't have any other influences.

Conclusion, first part

Using a const pointer instead of an intrusive_ptr fixes the problem; still, the flush timer "renewal" is completely unnecessary and needs to be fixed as well.

lambda function instead of std::bind

From #7004.

Result: Doesn't fix, crashes too.

lambda function and flush

Result: Doesn't crash.

Conclusion, second part

Somehow std::bind is involved here with intrusive_ptr, starting with the changes in 2.9.

A const* pointer, or a lambda instead of std::bind, effectively keeps the pointer reference intact; the call continues and no application crash is involved.
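
A simplified illustration of that difference, using a toy Timer with logging refcount hooks instead of Icinga's real classes: the callable that std::bind builds over an intrusive_ptr owns its own reference (the count moves when it is created and destroyed), while a lambda over a raw pointer never touches the count at all.

// Toy demonstration, not Icinga code: watch the refcount traffic caused by a
// std::bind callable holding an intrusive_ptr versus a lambda holding a raw
// pointer. It is this kind of refcount churn in the timer thread that the
// const Timer* / lambda variants avoid.
#include <functional>
#include <iostream>
#include <boost/intrusive_ptr.hpp>

struct Timer
{
    int refs = 0;
};

void intrusive_ptr_add_ref(Timer* t)
{
    std::cout << "AddRef  -> " << ++t->refs << "\n";
}

void intrusive_ptr_release(Timer* t)
{
    std::cout << "Release -> " << --t->refs << "\n";
    if (t->refs == 0)
        delete t;
}

int main()
{
    Timer* raw = new Timer();
    boost::intrusive_ptr<Timer> owner(raw); // AddRef -> 1, held until main() ends

    {
        // The bind-based callable stores its own intrusive_ptr: building it
        // bumps the refcount, destroying it drops the count again.
        auto bound = std::bind([](const boost::intrusive_ptr<Timer>&) {},
                               boost::intrusive_ptr<Timer>(raw));
        bound();
    } // Release fires here when 'bound' is destroyed

    {
        // A lambda capturing the raw pointer performs no refcount operations;
        // it relies on someone else (here: 'owner') keeping the Timer alive.
        auto viaLambda = [raw]() { (void)raw; /* use the pointer, don't store it */ };
        viaLambda();
    } // no refcount traffic at all

    return 0; // 'owner' goes out of scope: Release -> 0, the Timer is deleted
}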

icinga2_logrotate_crash_analysis.patch.txt

dnsmichi commented 5 years ago

After some research within #7006 I've come to the following conclusion:

You can avoid the intrusive_ptr going out of scope by either making this a const raw pointer, or by passing its reference into a lambda function.

Since the Timer class explicitly uses raw pointers instead of intrusive_ptrs, I'd stay with that and combine the three patches: 1) const Timer*, 2) lambdas, 3) avoid re-newing the timer.
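
Put together, the three changes could look roughly like the sketch below. The class and member names are invented to show the shape of the approach (a flush timer that is created exactly once and connected via a lambda to a signal passing const Timer*); the actual Icinga 2 patches differ in detail.

// Shape of the combined fix, with made-up names - not the actual Icinga 2 patch.
#include <iostream>
#include <boost/signals2.hpp>

class Timer
{
public:
    // 1) The expiration signal hands out a plain const Timer*, never a Ptr.
    boost::signals2::signal<void (const Timer*)> OnTimerExpired;

    void Fire() { OnTimerExpired(this); }
};

class StreamLoggerSketch
{
public:
    void Start()
    {
        // 3) Create and wire up the flush timer exactly once instead of
        //    re-newing it on every reopen (the unnecessary "renewal" above).
        if (m_FlushTimerStarted)
            return;
        m_FlushTimerStarted = true;

        // 2) Connect with a lambda instead of std::bind; the callback only
        //    uses the raw pointer for the duration of the call.
        m_FlushTimer.OnTimerExpired.connect([this](const Timer*) { Flush(); });
    }

    void Flush() { std::cout << "flush log stream\n"; }

    Timer& GetFlushTimer() { return m_FlushTimer; }

private:
    Timer m_FlushTimer;
    bool m_FlushTimerStarted = false;
};

int main()
{
    StreamLoggerSketch logger;
    logger.Start();
    logger.Start();                // second call is a no-op: the timer exists once
    logger.GetFlushTimer().Fire(); // simulates the periodic flush tick
}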

In terms of moving away from std::bind to lambdas, follow my research here: https://github.com/Icinga/icinga2/issues/7006#issuecomment-486098608

dnsmichi commented 5 years ago

Snapshot packages for Debian/Ubuntu still need to be fixed in our build infra; the rest are available.

dnsmichi commented 5 years ago

ref/IP/13853