Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars 578 forks source link

2.11 RC1: Nessus Scan crash the Windows-Client. #7431

Closed stevie-sy closed 5 years ago

stevie-sy commented 5 years ago

Describe the bug

We tested the 2.11 RC1 against the Nessus Scan again. So it's a litte bit a follow up from #6559: the Linux Client survived on CentOS 7 - Great! But not on our Windows System. Here the service is stopped after the scan. 2.11 RC1 is installed on a Windows Server 2012 R2 Standard.

If you Need something more pls tell us.

To Reproduce

Scan with a security scanner against a Windows system

Expected behavior

The Windows Client should also not Crash like the Linux Client.

Your Environment

Include as many relevant details about the environment you experienced the problem in

stevie-sy commented 5 years ago

Addition: To be sure, we have scanned 3 times.

dnsmichi commented 5 years ago

Since we don't have access to your environment (this implies an NDA/support contract), we need all the logs, as always. Also, the entries from the windows event console. Are you 100% certain that you did restart the agent after upgrading to 2.11 RC?

stevie-sy commented 5 years ago

Yes I did also a reinstall of the windows Client to be sure. You get the logs as soon as possible.

stevie-sy commented 5 years ago

My college did the security scan yesterday between 15:31 and 15:44 for the test environment. I filtered and anonymize the icinga2.log for this period. There was also only one entry in the Windows Event log. Icinga didn't produce a crash log. We have no debug.log yet. But if this two log aren't enough, we can do - of course - an another scan to produce one. Also if you need a more detailed Windows Event log. But for that I have to "clean" the server a little bit. Because there is also other software installed and running for testing (which is also monitored by icinga) and nessus is also scanning this one at the same time of course. event_icinga2_nessus.txt

icinga2_nessus.log

For me it looks like, that the windows client has problems with the GET-requests at 15:39:21, like GET %24%7B%7B57550614+16044095%7D%7D//. After that the client stopped working.

dnsmichi commented 5 years ago

Thanks, the logs unfortunately don't tell much except for the connections and requests made. The last log line before the restart doesn't unveil much.

I've performed the HTTP GET request against my Windows 10 VM, works as expected.

$ curl -k 'https://192.168.243.157:5665/%24%7B%7B57550614+16044095%7D%7D//'
<h1>Unauthorized. Please check your user credentials.</h1>

Just our of interest - I read that Nessus is a free scanner, but obviously has enterprise modules and so on. Which version of it are you using and do you think that there is a chance to reproduce this just with using the free scanner?

stevie-sy commented 5 years ago

Ok, than I ask my college for another scan and I try to "clean" (and stop all other Services from our other software on that Server) to get good logs; inclusive debug.log. Also I ask him which version and modules we use from Nessus. What I know we use the commercial version.

stevie-sy commented 5 years ago

So this is the Information I got from my colleague about the version:

Nessus Enterprise Tenable.sc Version: 5.10.1 Nessus Free (for scanning up to 16 IPs) „Nessus Essentials“ https://www.tenable.com/products/nessus/nessus-essentials If free and Enterprise version scan in the same manner could only tell Tenable Support.

logs are coming later

stevie-sy commented 5 years ago

So my colleague from the security department and I did a lot of scans with various scenarios.

note: First of all we noticed, that we have also the same problem with the debuglog on Windows, described in #7388 . We also install our windows agents with the PS Module. And it's correct what you wrote there, the api feature is disabled. But with a manual enabled api feature we also don't get a debug.log file. So what we do now to produce this file for you is to start manually an extra daemon with icinga2 daemon --log-level debug > C:\ProgramData\icinga2\var\log\icinga2\debug.log

Scans With this knowledge we build some diffrent scans today. That means we disable/enable the api and/or the debuglog feature, start an extra daemon to get the debug log. After that we start icinga and the scan again. This resulted in this Matrix (the time is for your to find this better in the logs): image

Logs for this: icinga2.log - this is not seperated, the complete scans are in this file debug_1.log - scan with extra process and enabled api/debuglog feature debug_2.log - scan with extra process and disabled api and enabled debuglog feature debug_3.log - scan with extra process and disabled api/debuglog feature

an interesting fact from yesterday's scan (before wie read the issue with the debuglog) With the extra process for the debug log icinga survied - every process! But today not. So we don't know why and can't reproduce this. That are the logs from this scan: icinga2.log debug.log

Conclusion: Every time when the icinga daemon crashed there is an entry in the windows event log with the faulting module is C:\Program Files\ICINGA2\ucrtbase.DLL with the exeption code: 0xc0000409 (STATUS_STACK_BUFFER_OVERRUN) and the offset 0x000000000006f08e image

image

I hope this helps you to find the bug. If you need more, pls tell us.

stevie-sy commented 5 years ago

I did a little reasearch about this dll. It's from Microsoft so if this is the fault auf the crash, it's maybe a little bit difficult to fix. image

With google I found some postings in forums, that also other people has problems with this DLL. They also have crashes. They didn't write the reason. But maybe this is helpful for you:

dnsmichi commented 5 years ago

Thanks, I'll talk to you on Tuesday when back from vacation.

dnsmichi commented 5 years ago

We'll continue offline. I'll update the issue once all data is collected and whether we've found something from the analysis.

dnsmichi commented 5 years ago

Both Stefan and Philipp are doing a marvelous job here - they provided me with a full self-written PDF on how to install and configure Nessus Essentials to trigger the full scan.

Previously I tested with a macOS install, but with a wrong subnet Nessus locked me out of the trial (yeah, enterprise software which always wants you to buy something). Now I am using a CentOS 7 Vagrant box.

I can see that Icinga stops running when such a scan happens. This is good news, so I can dig deeper.

For security reasons, I am not sharing any details about Nessus and its configuration nor the exact logs for triggering things until I have found and implemented a fix.

dnsmichi commented 5 years ago

The referenced PR doesn't entirely solve the problem. We're now somewhere in the event polling loops in Boost Asio on Windows with socket cancel actions. So this definitely only is a Windows specific problem, only with the RC1 and Git Master.

stevie-sy commented 5 years ago

That are a Little bit bad news. Because that means for you more hard work to fix. But you can do that, we're sure. But on the other hand it's good to know that the Linux version is stable after the rework of the network stack. Very good work. In any case, the hard work has paid off.

dnsmichi commented 5 years ago

http::async_read_header() may return an error during liveness disconnect. For some reason, reading the header doesn't return in this case. Calling cancel() in Disconnect() does not cancel running operations, nor would it react shuttingDown. Either we're overlapping buffers here (which I highly doubt), or there's a race condition involved where Boost asio fails. Or something fails with coroutines and strands here.

Upgrade to Boost 1.71 would also be a test option for next week. Code resides in bugfix/windows-nessus-crash-http now @Al2Klimov.

Last resort: Disable IOCP on Windows and force to use select() for IO handling with Boost Asio.

The error is somewhat different, at some point the thrown error is somehow broken. Maybe there is an overlap with the error codes as well. Looked into ceph's handling here too https://github.com/ceph/ceph/blob/6171399fdedd928b4249d135b4036e3de25079aa/src/rgw/rgw_asio_frontend.cc

Screen Shot 2019-08-30 at 16 51 32

Screen Shot 2019-08-30 at 17 05 42 Screen Shot 2019-08-30 at 17 09 37

Btw, you cannot even go into the Windows specific DLLs to read source code. Only Assembler. Closed source sucks.

dnsmichi commented 5 years ago

Final decision from today's meeting with @lippserd @Al2Klimov and @dnsmichi

Analysis thus far

Environment

Nessus VM

Setup a Nessus or OpenVAS VM and do port scans as well as web security scans.

Windows

Icinga 2

Dev-Environment with an msi package installed icinga2 Windows package, api setup is enough. https://icinga.com/docs/icinga2/snapshot/doc/21-development/#windows-dev-environment

Terminals

Powershell with Tail:

Get-Content .\icinga2.log -tail 10 -wait

2nd PS with

Get-Service icinga2

Restart-Service icinga2

Debugger

Run Visual Studio and attach to the running icinga2 process.

windows_icinga_vs_attach_01 windows_icinga_vs_attach_02

Analysis

Run a Nessus scan against the IP address and wait until Visual Studio catches the exception.

windows_icinga_vs_crash_01 windows_icinga_vs_crash_02

windows_icinga_vs_crash_03

Possible Topics

Everything changed below is not needed on Linux/Unix/macOS, only Windows.

Memory access to specific error variables is unavailable in the debugger. This would indicate sort of heap corruption, caused by requests run before.

Specifically with a Nessus scan, this connects via TLS and sends a HTTP request. Icinga reads the first few bytes being the header.

https://lists.boost.org/boost-bugs/2013/05/30321.php

Http Client Disconnect

This somehow triggers the problem. Sometimes the crash is reproducible with openssl s_client and also icinga2 console.

It always needs multiple requests, in parallel or shot in fast cycles.

The memory exceptions come one by one with a crash at some point in the future. Memory usage exceeds during the scan, 2.6 GB from 4 GB.

https://stackoverflow.com/questions/49322387/boost-beast-memory-usage-for-bulk-requests

Memory Management with Context

https://www.boost.org/doc/libs/1_69_0/libs/context/doc/html/context/overview.html https://docs.microsoft.com/en-us/cpp/build/reference/safeseh-image-has-safe-exception-handlers?view=vs-2017 https://stackoverflow.com/questions/45569418/how-to-use-boost-context-correctly

https://github.com/boostorg/coroutine/issues/30 https://github.com/STEllAR-GROUP/hpx/issues/3824 https://github.com/boostorg/coroutine2/issues/19

https://developercommunity.visualstudio.com/content/problem/234121/building-a-project-using-boost-context-results-in.html

https://developercommunity.visualstudio.com/content/problem/297477/vs-2017-15741575-crash-caused-by-activated-inline.html

https://www.boost.org/doc/libs/1_64_0/libs/context/doc/html/context/ecv2.html#context.ecv2.stack_unwinding

http://boost.2283326.n4.nabble.com/context-Crash-in-case-of-exception-from-a-context-td4672953.html

https://github.com/boostorg/context/issues/116

Lifetime & Stack Size

https://stackoverflow.com/questions/56178220/boost-context-problems-with-exception-propagation

https://stackoverflow.com/questions/51110650/minimal-possible-stack-size-on-windows-when-using-c-exceptions-using-boost-co

Windows SEH and Compiler Optimizations

CMake doesn't cache them. https://cmake.org/cmake/help/latest/command/add_compile_options.html

//Flags used by the CXX compiler during all build types.
CMAKE_CXX_FLAGS:STRING=/DWIN32 /D_WINDOWS /W3 /GR /bigobj /GL- /EHs
+  # /GT - https://docs.microsoft.com/en-us/cpp/build/reference/gt-support-fiber-safe-thread-local-storage?view=vs-2019
+  # /Od - https://docs.microsoft.com/en-us/cpp/build/reference/od-disable-debug?view=vs-2019
+  # /GL- - https://docs.microsoft.com/en-us/cpp/build/reference/gl-whole-program-optimization?view=vs-2019
+  # /GS- - https://docs.microsoft.com/en-us/cpp/build/reference/gs-buffer-security-check?view=vs-2019
+  # /RTCcu - https://docs.microsoft.com/en-us/cpp/build/reference/rtc-run-time-error-checks?view=vs-2019
+  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /bigobj /GT /Od /RTCcu /GL- /GS- /EHs" CACHE STRING "Flags used by the CXX compiler during all build types." FORCE)
+  message(STATUS "Configuring compiler flags for Boost on Windows: ${CMAKE_CXX_FLAGS}")

http://www.mikemarcin.com/post/coroutine_a_million_stacks/ https://docs.microsoft.com/en-us/windows/win32/memory/memory-limits-for-windows-releases#memory_limits

https://github.com/boostorg/fiber/issues/125

https://social.msdn.microsoft.com/Forums/SECURITY/en-US/d9625f42-d9bd-47fa-95da-5ce10bacdcc3/how-to-disable-seh?forum=vcgeneral

Coroutine lifetime

https://www.reddit.com/r/cpp/comments/6vhg38/support_for_c_coroutines_in_visual_studio_is/dm103np/ http://boost.2283326.n4.nabble.com/corountine2-context-forced-unwind-exception-on-Windows-with-TDM-GCC-td4706583.html

Exceptions thrown in Asio Coroutines

Win32 puts their exceptions onto the stack by default, therefore /SAFESEH:NO is specified as linker flag. Wine 64 uses linked tables instead.

Coroutines use a stack pointer when paused. Exceptions may have modified this somehow causing corruption.

[context] Crash in case of exception from a context

http://boost.2283326.n4.nabble.com/context-Crash-in-case-of-exception-from-a-context-td4672953.html

I use boost::context to implement cooperative multitasking. At the moment I use boost 1.57 on Windows 7 with VisualStudio 2013. 

Everything works fine until I throw an exception from a context created by "boost::context::make_fcontext(.......)".  I have seen someone had the same problem and it was caused by the Windows OS, that checks the exception hierarchy in the stack and if it finds that the starting point is not the one that should come from the OS then it terminates the process. It suspects some malitious code that corrupted the stack. 

http://www.blinnov.com/en/2013/04/11/english-boostcontext-and-seh/ 

This guy metioned in his blogpost, that this problem was solved in boost 1.53 ("Boost 1.53 has updated and fixed version of boost::context library that addresses exactly this problem. "), but somehow I still have the same thing in 1.57. Is there some special preprocessor flag (#define) that I have to activate to get this fix? 

http://boost.2283326.n4.nabble.com/Can-t-catch-Boost-Asio-exceptions-inside-coroutines-td4689174.html

https://www.boost.org/doc/libs/1_67_0/doc/html/boost_asio/overview/core/spawn.html

Other Tests

Other things

Possible solution

stevie-sy commented 5 years ago

@dnsmichi We'll keep our fingers crossed, that your ideas are helpful!

dnsmichi commented 5 years ago

Dynamically link Boost DLLs

Still crashes, but at least a crash log is now written with an unhandled SEH exception.

C:\Program Files\ICINGA2\sbin> Get-Content "C:\ProgramData\icinga2\var\log\icinga2/crash/report.1567670652.985000"
  Application version: v2.11.0-rc1-93-gcd5aa4390

System information:
  Platform: Windows
  Platform version: 8
  Kernel: Windows
  Kernel version: 6.2
  Architecture: x86_64

Build information:
  Compiler: MSVC 19.22.27905.0
  Build host: WINMIF

Application information:

General paths:
  Config directory: C:\ProgramData\icinga2\etc\icinga2
  Data directory: C:\ProgramData\icinga2\var\lib\icinga2
  Log directory: C:\ProgramData\icinga2\var\log\icinga2
  Cache directory: C:\ProgramData\icinga2\var\cache\icinga2
  Spool directory: C:\ProgramData\icinga2\var\spool\icinga2
  Run directory: C:\ProgramData\icinga2\var\run\icinga2

Old paths (deprecated):
  Installation root: C:\Program Files\ICINGA2\
  Sysconf directory: C:\ProgramData\icinga2\etc
  Run directory (base): C:\ProgramData\icinga2\var\run
  Local state directory: C:\ProgramData\icinga2\var

Internal paths:
  Package data directory: C:\Program Files\ICINGA2\\share\icinga2
  State path: C:\ProgramData\icinga2\var\lib\icinga2/icinga2.state
  Modified attributes path: C:\ProgramData\icinga2\var\lib\icinga2/modified-attributes.conf
  Objects path: C:\ProgramData\icinga2\var\cache\icinga2/icinga2.debug
  Vars path: C:\ProgramData\icinga2\var\cache\icinga2/icinga2.vars
  PID path: C:\ProgramData\icinga2\var\run\icinga2/icinga2.pid
Caught unhandled SEH exception.
Current time: 2019-09-05 10:04:13 +0200

Stacktrace:

        (0): free+40
        (1): (unknown function)
        (2): (unknown function)
        (3): (unknown function)
        (4): (unknown function)
        (5): (unknown function)
        (6): (unknown function)
        (7): (unknown function)
        (8): (unknown function)
        (9): (unknown function)
        (10): (unknown function)
        (11): (unknown function)
        (12): (unknown function)
        (13): (unknown function)
        (14): (unknown function)
        (15): (unknown function)
        (16): (unknown function)
        (17): (unknown function)
        (18): (unknown function)
        (19): (unknown function)
        (20): (unknown function)
        (21): (unknown function)
        (22): (unknown function)
        (23): (unknown function)
        (24): (unknown function)
        (25): (unknown function)
        (26): (unknown function)
        (27): (unknown function)
        (28): (unknown function)
        (29): (unknown function)
        (30): (unknown function)
        (31): (unknown function)
        (32): (unknown function)
        (33): register_onexit_function+236
        (34): BaseThreadInitThunk+20
        (35): RtlUserThreadStart+33
***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***
dnsmichi commented 5 years ago

TL;DR - 2 weeks with emotional ups and downs and a lot of technical background learned. Icinga now survives a Nessus scan.

Screen Shot 2019-09-09 at 17 46 50

Analysis

Setup

Scanner Setup

Both Nessus and OpenVAS work. Thanks to Stefan & Philipp for providing a setup documentation for Nessus Essential. Won't be provided for security reasons here.

Wireshark TCP Dump

https://www.netways.de/blog/2018/07/26/ssl-geknackt-naja-fast/

Change the cipher_list, add the node's private key for following the TLS traffic, extract the sent malformed requests. This worked to mitigate the request which caused the immediate crash. Unfortunately this isn't always the same, and we found another way to trigger the crash as well.

Request Analysis

Only failing requests would lead to stack buffer underrun or memory corruption errors.

Memory Analysis

Origin

Boost Beast HTTP Parser

http::async_read_header may throw exceptions leading into crashes. In order not to populate the exception stack, it sounds reasonable to move to error_code handling anyways.

Boost ASIO Socket IO

Same strand in m_IoStrand as with other operations. No special influence here.

https://github.com/boostorg/asio/issues/154 https://www.boost.org/doc/libs/1_47_0/doc/html/boost_asio/overview/core/strands.html

http://www.crazygaze.com/blog/2016/03/17/how-strands-work-and-why-you-should-use-them/

Boost Coroutines

ASIO wraps them and makes them easier to use.

https://stackoverflow.com/questions/30557582/what-does-boostasiospawn-do

Coroutine Stack Unwinding on Complete

https://stackoverflow.com/questions/9286550/is-there-any-method-that-causes-whole-stack-frame-unwinding-in-c-except-usin

However, I've found some critical flaw. Any user code that "swallow exception" as below will completely break the assumption of cooperative scheduling.

try {
    ...
} catch(...) { // ContextClosingException 
    // do nothing, just swallow exception.
}

So I need to find other way to invoke stack unwinding (or any other way to destruct stack object in the continuation). Standard conformance way would be nice - but the continuation implementation itself is dependent on platform specific API, so non-portable way would be acceptable. (I'm using win32)

Swallowing exceptions is evil with Boost Coroutines.

Root Cause

Coroutines work in the way that they allocate their own stack. Whenever an exception is thrown, it is expected to be caught or to terminate the coroutine.

Exceptions are asynchronous and put onto the stack. This stack is also used by the Coroutine to store the function pointer when yielding the operation.

At some point, the thrown exceptions corrupt the stack. Only on Windows, only with Visual Studio as compiler.

In order to trigger this problem, either fire many Nessus Scan Attacks against the HTTP API. Or you'll go with something via curl/debug console which always triggers an exception.

The easiest way to reproduce this is to use the debug console and evaluate a script expression which is incorrect. At first glance, the generated ScriptError exception somehow works. Later iterations always lead into a crash.

C:\Program Files\ICINGA2\sbin> .\icinga2.exe console --connect 'https://root:icinga@localhost:5665/' --eval 'get_host("bla)'

Note

Rewriting the code parts which invoke std::invalid_argument do not fix this, the memory is still corrupted. Also, EnsureValidHeaders() is passed without throwing any exception.

We really need to work on the exceptions themselves.

Coroutines and Context Stack

Our (exception) stack is foobar at some point. Either Beast exceptions throw, our own ScriptError exceptions throw and something asio related itself.

One can manipulate the coroutine stack size allocated by Boost.Context. This defaults to 64KB.

Default constructor using boost::context::default_stacksize(), does unwind the stack after coroutine/generator is complete.

https://gist.github.com/chfast/bfc1e1af4176960b4722#file-context-cpp-L85

typedef ctx::simple_stack_allocator<
    8 * 1024 * 1024, // 8MB
    64 * 1024, // 64kB
    8 * 1024 // 8kB
>       stack_allocator;

https://emcl-gitlab.iwr.uni-heidelberg.de/hiflow3.org/hiflow3/blob/c7a3ea7216a9cb15a638112dfb17732e932cc265/contrib/boost_libraries/libs/context/performance/winfiber/performance.cpp

Sets the default stack size to 8 MB. 64MB. Does not immediately crash, but later.

https://www.boost.org/doc/libs/1_68_0/libs/context/doc/html/context/stack.html

Exceptions

http://boost.2283326.n4.nabble.com/corountine2-context-forced-unwind-exception-on-Windows-with-TDM-GCC-td4706583.html

forced_unwind is part of the internal plumbing of Boost.Context and can 
be thrown from any yield point when the coroutine is destroyed while 
suspended.  It's required that this exception is not swallowed and the 
whole call stack between the yield point and the "outside" of the 
coroutine is not noexcept. 

https://github.com/raptorjit/raptorjit/issues/110

https://asio-users.narkive.com/QP9g9dj9/can-t-catch-boost-asio-exceptions-inside-coroutines

Are you linking to pre-built boost DLL's provided by MXE? If so, the
DLL is probably not seeing the exception handler created by your
program, probably because the compiler is not creating the exception
handler in the way expected by the DLL. If that is the case, there
might be some compiler command line options that would fix it, or
possibly the compiler was not built correctly to support exceptions
coming from DLL's. Linking to a static boost library instead of a DLL
might fix it though either way.

http://boost.2283326.n4.nabble.com/context-Crash-in-case-of-exception-from-a-context-td4672953.html

[context] Crash in case of exception from a context

I use boost::context to implement cooperative multitasking. At the moment I use boost 1.57 on Windows 7 with VisualStudio 2013. 

Everything works fine until I throw an exception from a context created by "boost::context::make_fcontext(.......)".  I have seen
someone had the same problem and it was caused by the Windows OS, that checks the exception hierarchy in the stack and if it finds
that the starting point is not the one that should come from the OS then it terminates the process.

It suspects some malitious code that corrupted the stack. 

https://www.acodersjourney.com/top-15-c-exception-handling-mistakes-avoid/

Beyond knowing how to use the try/catch syntax, one of the fundamental concepts to know regarding C++ exception handling is the concept
of Stack Unwinding.

When an exception is thrown and control passes from a try block to a handler, the C++ run time calls destructors for all automatic objects
constructed since the beginning of the try block. This process is called stack unwinding. The automatic objects are destroyed in reverse order
of their construction. If an exception is thrown during construction of an object consisting of sub-objects or array elements, destructors are
only called for those sub-objects or array elements successfully constructed before the exception was thrown.

Why should you know this ? Because this'll help you understand the exception handling tips and tricks for making your code robust and efficient.
A full discussion of the Stack Unwinding process is beyond the scope of this article – but here is an excellent reference from msdn : 
https://msdn.microsoft.com/en-us/library/hh254939.aspx.

https://docs.microsoft.com/en-us/cpp/cpp/exceptions-and-stack-unwinding-in-cpp?view=vs-2019

Mistake # 15: Swallowing Exceptions
Swallowing critical exceptions will cause your program to do either of two things –  to fail in unexpected ways downstream or prevent the program from fulfilling it’s purpose. Sometimes programmers will catch any exception via catch(…) and then swallow them . This is usually done for exceptions that the programmer did not foresee happening. However, this can lead to downstream failure – sometimes with no obvious reason for the failure since the stacktrace disappears with the swallowed exception.

If you must swallow exceptions, make sure that you Log the exception as well as document them in code and in your documentation. 

https://groups.google.com/forum/#!topic/boost-list/6oHRdLmuQ5k

The documentation for boost context contains the following warning: "Do not jump from inside a catch block and then re-throw the exception in another continuation. "
I can attempt to answer "why" from a high-level point of view: I think is that it could be summarized as: the exception machinery is not reentrant. Throwing and catching exceptions performs a lot of messy manipulations with the data on the stack. During exception unwinding, you have to traverse the stack (a single-linked list) during which 1) a check is made at each level for whether the current stack frame has an exception handler, and if not 2) call destructors for local objects. Stack traversal is implemented by the runtime and even possibly relies on the OS' data structures (in case of Windows) and *might* use also thread-local storage (e.g., exception_ptr).

Switching the stack aborts this process without informing the run-time exception machinery and.. what happens when you resume it? What if another exceptions has been thrown and handled on the same thread before resuming a suspended exception handler? A handler which is written with the assumption that only a single active invocation can exist on any particular thread?

There are other concerns too: on Windows you can get asynchronous callbacks from the kernel (akin to unix signals, just nicer) and if you switch the stack when called from the kernel, you can really hose your process.

Incompatibility between std, boost and ScriptError exceptions?

I've found a library which wraps the coroutine exceptions from their different types.

https://github.com/niekbouman/commelec-api/blob/master/commelec-api/coroutine-exception.hpp#L35

The implementation is safe and good to use - since I've now learned that coroutines only support Boost exceptions and nothing else.

Compiler Optimization

The Visual Studio Compiler optimizes several bits in a way we as developers do not expect it. Keep in mind, the underlaying code works 100% on Linux/Unix when compiled with GCC.

In terms of Boost, we need to disable certain flags already:

# Disable optimization for Boost::context
# https://www.boost.org/doc/libs/1_69_0/libs/context/doc/html/context/overview.html
# https://docs.microsoft.com/en-us/cpp/build/reference/gl-whole-program-optimization?view=vs-2017
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} /bigobj /GL- /EHs")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /bigobj /GL- /EHs")

# detect if 32-bit target
if(CMAKE_VS_PLATFORM_NAME STREQUAL "Win32")
  # SAFESEH is not supported in Boost on Windows x86
  # maybe it is when Boost is compiled with it...
  # https://lists.boost.org/Archives/boost/2013/10/206720.php
  # https://docs.microsoft.com/en-us/cpp/build/reference/safeseh-image-has-safe-exception-handlers?view=vs-2017
  set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} /SAFESEH:NO")
endif()

These optimizations don't have much of influence here.

Instead, I've found some compiler optimizations going on to reduce the binary sizes. This includes improving the exception handling and stack unwinding operations.

https://habr.com/en/company/microsoft/blog/442996/ https://developercommunity.visualstudio.com/content/problem/273518/incorrect-stack-unwinding-after-throwing-exception.html

https://www.boost.org/doc/libs/1_68_0/libs/context/doc/html/context/ff/implementations__fcontext_t__ucontext_t_and_winfiber.html

Because the TIB (thread information block on Windows) is not fully described in the MSDN, it might be possible that not all required TIB-parts are swapped. Using WinFiber implementation migh be an alternative.

ASIO IO Threads manipulating memory?

https://stackoverflow.com/questions/14981291/boostasio-internal-threads

On Windows, defining BOOST_ASIO_DISABLE_IOCP will cause Asio to use the select-based implementation, which is slower but should not spawn additional threads. 

Tested this, but it did not fix the crash. Won't apply the patch on Windows therefore.

Windows Stack

Well, turns out that the default allocated stack size for a coroutine is 64 KB. Whenever it has exceptions inside, this stack may be too low.

https://stackoverflow.com/questions/51110650/minimal-possible-stack-size-on-windows-when-using-c-exceptions-using-boost-co

Raising the default coroutine stack on Windows actually fixes the problem.

https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=vs-2019 https://devblogs.microsoft.com/cppblog/making-cpp-exception-handling-smaller-x64/

Final Analysis

Statement: The crash only happens when an error is raised.

Whenever valid requests are made, everything is fine.

Statement: When something doesn't work, it throws an exception.

Boost Beast http::async_read_header can throw an exception. We can re-build this into error codes. Doesn't change. The debug console triggers a ScriptError raised from the config compiler. Wrong header reading triggers std::invalid_argument()

Not all of them are boost compatible exceptions.

Statement: Whenever an exception happens, the coroutine returns where called and unwinds the stack.

This includes a stack deallocate() call which calls std::free(). This crashes with pointing to memory which is corrupted.

Statement: Throwing std::invalid_argument() on Windows corrupts the exception stack.

Yes. Rewriting EnsureValidHeaders() to not use these exceptions allows the code to complete.

No fucking idea what the Visual Studio compiler renders from this code bit. It may originate from static INLINE too.

Statement: ScriptError still causes the application to crash.

The stack size with 64 KB is too low. Why? Look into what's stored inside the exception. The whole DebugInfo struct is massively bloated.

It may also be the case that exceptions inside coroutines cause an overhead which is timing related to many requests exceeding the limit.

Statement: Swallowing exceptions with catch(...) is evil.

Yes, Boost Coroutine doesn't work with the unwind exception.

Statement: Defer-Disconnect with exceptions swalled in the Destructor don't work.

Yes, these two patterns are incompatible within the new network stack. Avoid them at all cost.

Statement: It is unclear which coroutine stack size is needed for Windows and MSVC.

Tests have proven something between 8 and 64 MB. This still is too high, but mitigates the issue. With all the error handling and code alignments we are even more safe, also in regard to Linux/Unix. A windows agent typically doesn't expose much HTTP traffic (only scanners), JSON-RPC also is low (2 parent satellites).

Fixes

Ensure that Coroutine Stack unwinding is fully supported within the API.

Ensure that HTTP Disconnects don't fire I/O.

Wrap thrown exceptions into Boost compatible exceptions in boost::asio::spawn(). Windows coroutine stack size: Increase stack size to prevent crashes with MSVC.

Side Fixes

References and URLs

Everything I have had opened and shed some light into finding the cause.

https://geidav.wordpress.com/2013/03/21/anatomy-of-dynamic-stack-allocations/ https://docs.microsoft.com/en-us/windows/uwp/cpp-and-winrt-apis/concurrency-2 https://www.boost.org/doc/libs/1_62_0/libs/fiber/doc/html/fiber/stack.html https://stackoverflow.com/a/45027110/1821348 https://www.boost.org/doc/libs/1_54_0/doc/html/boost_asio/overview/core/spawn.html https://docs.microsoft.com/de-de/cpp/cpp/exception-specifications-throw-cpp?view=vs-2019 https://stackoverflow.com/questions/9286550/is-there-any-method-that-causes-whole-stack-frame-unwinding-in-c-except-usin https://gist.github.com/chfast/bfc1e1af4176960b4722#file-context-cpp-L85 https://github.com/boostorg/beast/issues/1055 https://stackoverflow.com/questions/41030285/boostproperty-treeread-xml-segfaults-in-an-asio-handler-spawned-using-boost https://www.boost.org/doc/libs/1_66_0/libs/coroutine/doc/html/coroutine/attributes.html https://github.com/boostorg/beast/issues/1209 https://www.boost.org/doc/libs/1_68_0/libs/context/doc/html/context/stack.html https://github.com/boostorg/coroutine2/issues/19 http://www.mikemarcin.com/post/coroutine_a_million_stacks/ https://docs.microsoft.com/en-us/windows/win32/memory/memory-limits-for-windows-releases#memory_limits https://github.com/mxe/mxe/issues/1559 http://www.blinnov.com/en/2013/04/11/english-boostcontext-and-seh/ http://boost.2283326.n4.nabble.com/context-Crash-in-case-of-exception-from-a-context-td4672953.html https://news.ycombinator.com/item?id=11781201 https://rethinkdb.com/blog/improving-a-large-c-project-with-coroutines/ https://github.com/facebook/folly/tree/master/folly/fibers https://stackoverflow.com/questions/46738872/weird-after-exit-error-on-coroutine-test-module-on-windows-x64 https://stackoverflow.com/questions/50429345/fast-fibers-coroutines-under-x64-windows https://www.boost.org/doc/libs/1_65_1/libs/fiber/doc/html/fiber/performance.html https://paoloseverini.wordpress.com/tag/fibers/ https://news.ycombinator.com/item?id=11781201 https://stackoverflow.com/questions/49036542/visual-c-2017-bug-compiler-optimizes-away-expression?noredirect=1#comment85079616_49036542 https://docs.microsoft.com/de-de/cpp/build/reference/f-set-stack-size?view=vs-2019 https://developercommunity.visualstudio.com/content/problem/345025/heap-corruption-in-stack-unwinding-when-inlining-f.html https://www-user.tu-chemnitz.de/~heha/viewchm.php/hs/Win32SEH.chm/

Libraries

https://github.com/owt5008137/libcopp https://github.com/yyzybb537/libgo

Google Searches

Newest first.


6.9.2019 

boost coroutine std::exception problem
boost thread default stack size windows
boost fibers coroutine stack size
windows c++ disable exception hierarchy in the stack
BOOST_ASIO_DISABLE_THREADS
boost coroutine exceptions windows
windows export linked libraries of binary
boost coroutine windows visual studio std::invalid_argument exceptions stack corruption site:stackoverflow.com
boost coroutine windows visual studio std::invalid_argument exceptions stack corruption
windows visual studio std::invalid_argument exceptions stack corruption
boost coroutine stack with exception templates windows
boost coroutine stack with exception templates
boost coroutine STATIC_POOL_SIZE
BOOST_PROPERTY_TREE_RAPIDXML_STATIC_POOL_SIZE
visual studio threadsanitizer
windows visual studio use clang++ cmake
boost context windows visual studio stack corruption
boost context windows stack corruption
windows visual studio stackful coroutines
detached coroutine boost asio
boost asio socket cancel io strand
boost asio spawn avoid exceptions windows visual studio
boost asio spawn avoid exceptions windows
coroutine do not unwind stack windows misaligned
coroutine do not unwind stack windows
boost coroutines stack unwind option
boost windows fibers stack size performance
windows visual studio optimizer couroutines stack size
windows visual studio compiler stack guards optimization
windows visual studio compiler stack optimization
cmake windows disable stack guard
fno stack protector
windows exception stack guards
windows allocate stack guards
boost::context::default_stacksize
boost::context::default_stacksize windows
boost::coroutines::attributes
boost::asio::spawn default coroutines stack size
boost coroutine exceptions windows
boost beast exception coroutine windows stack
boost asio coroutine exception stack windows
boost asio coroutine exception stack
boost asio io strand spawn
boost asio io strand spawn windows exception stack broken
visual studio optimization boost asio spawn
visual studio c++ lambda context optimized unused
visual studio lambda context optimized unused
windows crash in register_onexit_function
error C2491: 'boost::unit_test::unit_test_main': definition of dllimport function not allowed
boost::uuids BCryptOpenAlgorithmProvider
windows boost Boost_USE_STATIC_LIBS
windows boost context static lib crash
Og compiler flag
visual studio disable seh exceptions boost asio
visual studio disable seh exceptions
boost asio spawn exception memory violation windows
visual studio c++ exception types
visual studio exception types
visual studio set exception handling boost context asio
boost coroutine exceptions windows
boost coroutines basic_standard_stack_allocator deallocate memory violation
CMAKE_CXX_FLAGS cmake windows wrong cache
cmake visual studio compiler flags
cmake visual studio compile /GL-
windows msvc turn off global program optimization (/GL)
boost context stack unwound windows
visual studio boost context /Og
boost context increase stack size windows
windows c++ set stack size
windows exception execution_context_v2 in Boost Context
boost binaries context safeseh
disable compiler optimization visual studio 2017 c++
boost context visual studio 2017 crash
cmake windows override /EHsc
basic_standard_stack_allocator boost windows crash
compile boost context windows
visual studio 2017 memory profiler
windows c++ print memory usage from self
windows boost asio context out of memory
boost asio spawn shared_from_this
C++ exception: boost::wrapexcept<boost::system::system_error> at memory location
visual studio exception thrown at memory location
endless http::async_read_header
boost beast async_read_header utf8 problem
 Microsoft C++ exception: boost::system::system_error at memory location
boost asio error code
boost::asio::error::operation_aborted
BOOST_ASIO_DISABLE_IOCP
windows boost asio bad_address
windows bad_address EFAULT
boost::system::error_code bad_address windows
boost::system::error_code access violation windows
boost beast win_iocp
boost beast http::async_read_header ec null pointer
boost beast http::async_read exception
boost asio http::async_read exception windows invalid memory access
Microsoft C++ exception: boost::system::system_error at memory location

28.8.2019

lowest_layer().cancel()
Der E/A-Vorgang wurde wegen eines Threadendes oder einer Anwendungsanforderung abgebrochen
boost http::async_read_header ucrtbase crash windows
windows analyse wer dump file visual studio 2017
reset nessus You've now scanned over 16 different IP addresses over time, and Nessus will not let you scan any additi... 
stevie-sy commented 5 years ago

Thank you very much @dnsmichi, @Al2Klimov and everybody in background who helped to fix this. That's really amazaing that you find the cause of this Problem. Thank you for your hard and awesome work! You have our big respect! Chapeau!

dnsmichi commented 5 years ago

Thanks for the kind feedback, now we can start preparing the 2.11 release.

As written on Twitter already, the intention with all the details is to not only help an Icinga developer, but also everyone developing with Boost, Coroutines, etc. on Windows and debugging the same issue. Sharing knowledge is what makes open source strong, not only the source code.