Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.01k stars 577 forks source link

Improve network stack (Cluster, REST API) and use Boost ASIO, Beast, Coroutines and Context #7041

Closed dnsmichi closed 5 years ago

dnsmichi commented 5 years ago

TODOs

Will be updated upon merged PRs. Listed first for developers.

Presentation with status from CW18 / 2019

Package Tests

Snapshot packages are available for all supported platforms. Please check the instructions on how to test them: https://icinga.com/docs/icinga2/snapshot/doc/21-development/#snapshot-packages-nightly-builds

The installation documentation for the /snapshot docs also is already updated for 2.11 in case you're looking for dependencies and whatnot.

Note: SLES11 & Ubuntu 14 are EOL and not packaged.

Upgrading docs

User Feedback

The Story

TL;DR - we're rewriting a core part of Icinga 2 due to many bugs, crashes, problems. You'll likely don't see them in a small scale environment, but we aim to scale and not waste resources from embedded hardware to "super" computers these days.

This analysis leads back some years of experience, and is a collected story from GitHub issues, community forum questions, customer support tickets and personal experience.

The attempt is to solve two main things:

Known Problems

Incoming TLS Connection Handling

At first glance, every TLS connection is accepted, and this needs to be fast. With the introduced thread pool in 2.10, this became slower.

TLS IO Engine

The TLS handshake/read/writes happen in a different event engine, the full scope is described in the technical concepts in 2.10.

Currently there are 8 IO threads handling these TLS events. If you'll have 8 stalled TCP connection timeouts, or slowly lagging TLS connections, or a client whose CPU is too slow to work fast on the CPU intense TLS algorithms, this will block.

It may also occur that specific event messages are not correctly handled back to the caller (bug time).

Also, there is different event handling involved - poll() for Windows and macOS, epoll() by default on Linux. While this should work, debugging and troubleshooting gets harder. In the past, certain Ubuntu Kernels in 14.04 could not properly handle epoll() resulting in unreproducible crashes.

Overall, these code parts should work, but developers have a hard time debugging problems in there.

Outgoing TLS Connection Handling

This follows the Zone relationship, so masterA inside the master zone will actively to connect to all Endpoint objects which are/have:

Non-revised configurations in many environments do populate this by default, and as such, Icinga will attempt to regularly connect to endpoints which it cannot reach via TCP connect. This results in hanging threads waiting for the TCP connection timeout. Applications must not lower the timeout, this becomes tremendously hard to debug on the Kernel level. To conclude with, having 1000 endpoints unreachable, but Icinga attempts to reconnect every 10 seconds. Lots of wait times and blocked resources.

On the other hand, endpoints being reachable, this takes a while up until all connections are established again. This scenario happens on restart/reload when connections have been closed by the previous process - unfortunately there is no connection cycling between processes available on Linux/Unix.

The more endpoints are configured for active connection attempts, the more resources this will take. Versions before 2.10 spawned a thread for each connection, and deleted the thread after the connection was established. Spawning and dropping a thread is really resource consuming, which is why we've changed this into a global thread pool managing the required threads automatically.

This led into the problem that connection attempts were not as fast as before with just spawning a thread. Compared to incoming connections, it isn't a real bottleneck though.

TLS Handshake Timeouts

When the TLS handshake cannot succeed, this stalls this very own thread for this connection up until the timeout is reached. Versions before 2.8.2 might hang tight endlessly, a newly introduced timeout hardcoded to 10s has been made a configuration option in 2.10. Still, there may be low powered devices around, or high latency connections with packet losses involved.

Either way, Icinga should be able to perform, even when there's a lag in connection handling.

Stalled TLS read/writes

With v2.8.2 there were several attempts made to rate limit connections and avoid possible exploitation. While this seemed to happen in the upper layers, it actually influenced how TLS read/writes were handled, and caused "stalled TLS connection" handling. These patches with "cork" as identifier have been reverted in 2.10 already.

Context Switching and Resources

Threads are typically a good way to work in an asynchronous fashion. If you're not properly managing the thread count and required unique locks, you'll end up in worse performance than before. Modern kernels and CPU allow for fast context switching, but still - if everything is blocked from some hanging TCP connections, you did something wrong.

Also, having 2 CPUs and 100 threads on a steady basis is not a good idea, and possible lock waits can happen. The most visible part is when the REST API request is fired, next to already connecting to unreachable endpoints - it hangs forever, not getting any resources.

On another note, firing many requests into the HTTP server while already being overloaded may lead into unexpected behaviour and crashes even. The most "fun part" for developers is that they cannot simply reproduce this problem, and can only guess with patches. This must be changed in the future, going back to well known and tested libraries also in large scale environments.

The Problems in Issues

There are more from the past, already closed with attempted fixes of course.

Development Problems

Icinga 2's development started out in early 2012, natively on a relatively old C++ standard. If you've heard about C++11 already, this wasn't available at this time. Typically C++ code is generated into a machine binary by a compiler, gcc or clang/llvm. Both of them need to parse the code and understand the keywords given and so on - similar to what the Icinga 2 DSL config compiler does.

Having modern compilers fully supporting C++11 took a while, and with the general availability of gcc >= 4.8 on all supported platforms in 2018, we were finally able to start modernizing our code. Thing is, each distributions code should be compiled with the same compiler version, to avoid binary incompatibility. That being said, you cannot guarantee 100% compatibility with gcc7 compiled binaries linking against gcc4.8 compiled libraries.

Apart from the compiler and modern programming features, the C++ STL doesn't provide simple things like threading and concurrency on its own. This improved a bit with having std::thread since C++11, before that you would need to implement Posix threads on Linux/Unix and Windows threading on your own.

We've been using the Boost framework for that very reason, abstracting OS specifics away from the developer. Using Boost also turned out into a dependency problem - RHEL5 for example didn't support the UUID functionality (to generate unique object names, or API package names). Therefore we had to invent our own algorithms, slowly fading out now.

For the changes now going on, we focus on the Boost library more than ever, and ensure that we provide the most modern version as dependency on our own. Community packagers might need to catch up, still we also release package build sources as always.

Dependencies: Boost

Boost provides really cool features in modern versions, starting around 1.55 and 1.60. One of them is ASIO, allowing you to easily manage network connections with event listeners and IO also with TLS. This requires us to bump the required Boost version to 1.66.

Not many distributions provide this already, so we need to go the way to provide up-to-date package versions on packages.icinga.com. Our main focus is not to replace system libraries and use a custom package name and installation path. Typically these packages are maintained under the Icinga umbrella.

The involved build infrastructure is work-in-progress, following our migration from Jenkins to GitLab CI - for all supported distributions which are round about 25 at the time of writing, including Windows (where we already bundle Boost anyways).

ASIO

Asio provides the basic building blocks for C++ networking, concurrency and other kinds of I/O.

Basically everything we've built by hand up until now.

https://think-async.com/Asio/ https://www.reddit.com/r/cpp/comments/5xxv61/a_modern_c_network_library_for_developing_high/ https://dieboostcppbibliotheken.de/boost.asio-netzwerkprogrammierung

Beast

Our HTTP server, client and also the request/response handling is purely implemented from scratch starting with 2.4. This involves building custom JSON response messages, header dictionaries and custom parsing as well.

Boost Beast provides an abstracted interface for this, and works in combination with ASIO. It also has integrated buffers, and asynchronous header reading.

HTTP fields and verbs are also available as instances, and not pure strings as before. This allows to detect programming error by the compiler at first glance.

https://www.boost.org/doc/libs/1_66_0/libs/beast/doc/html/index.html

Coroutines and Context

If you're familiar with goroutines, the Boost library provides the same principle in a new library and Boost ASIO benefits from it. Typically, functions need not run til the end, but may return and continue at a later point. This benefit

https://theboostcpplibraries.com/boost.coroutine https://www.informatik-aktuell.de/entwicklung/programmiersprachen/c-17-coroutinen-in-c-mit-boostcoroutine2.html

Available architectures are listed here: https://www.boost.org/doc/libs/1_65_1/libs/context/doc/html/context/architectures.html

Deep dive blog post: https://www.netways.de/blog/2019/04/04/modern-c-programming-coroutines-with-boost/

Golang Sugar: Defer

This one takes a lambda function call, and gets called when the current scope is destroyed. This works in the same way as we use the Log class with << operators to immediately flush after the object is destroyed.

Example: Close the TLS stream when the connection method is gone, fixes for #6990.

The Changes Involved

Having analyzed the multiple problems and prepared the development environment, the main task is to rewrite Icinga 2 at its core in #7005.

Base Library Changes

This affects

Introduce changes:

Remote Library Changes

Introduce the following changes:

CLI Changes

Feature Changes

Additional Tasks

To be considered for 2.11 but not necessarily going to happen.

dnsmichi commented 5 years ago

Screen Shot 2019-03-27 at 14 51 04

dnsmichi commented 5 years ago
[2019-03-27 15:51:40 +0100] warning/ApiListener: Unexpected certificate common name while connecting to endpoint 'master1': got 'mbpmif.int.netways.de'
[2019-03-27 15:51:40 +0100] information/ApiListener: Finished reconnecting to endpoint 'master1' via host '127.0.0.1' and port '5665'
[2019-03-27 15:51:48 +0100] information/ApiListener: New client connection for identity 'master1' from [::ffff:127.0.0.1]:55843
[2019-03-27 15:51:48 +0100] information/ApiListener: Sending config updates for endpoint 'master1' in zone 'master'.

The warning above must lead to an error and disconnect. This is a matter of trust.

Al2Klimov commented 5 years ago

@dnsmichi Please try the branch aklimov/boost-asio.

Al2Klimov commented 5 years ago

I'm not quite sure about the trust problem. Let's discuss it offline.

dnsmichi commented 5 years ago

Mention that our boost packages are new, and that the old packages should be purged (if not used anywhere else).

Al2Klimov commented 5 years ago

Oh yeah... apropos follow-up tasks...

dnsmichi commented 5 years ago

ref/IP/13853 ref/IP/9745

ref/NC/588711 ref/NC/602777 ref/NC/599869 ref/NC/599801 ref/NC/597307 ref/NC/584405 ref/NC/591721 ref/NC/624569

dnsmichi commented 5 years ago

Here's a presentation I did at our Icinga guild at NETWAYS. It holds more details for those who are interested.

dnsmichi commented 5 years ago

CLI commands are finished, the hard part is the check_nscp_api plugin since NSCP doesn't follow the RFC for the status header which needs a more advanced workaround. This will allow us to remove all unused legacy code and improve our code quality overall.

This leaves this issue to

Something which was reported lately - memory consumption is higher than 2.10. This due to the fact that coroutines use a context which puts the function's execution frame onto the stack. That's the "price to pay" for fast and asynchronous network I/O.

dnsmichi commented 5 years ago

check_nscp_api is done which brings me to code cleanup stages and see what else is missing. Finally, this wrong header crap was really stressing me.

dnsmichi commented 5 years ago

@lippserd tests from a customer environment showed that snapshot packages would leave sockets in close wait. We're investigating on this and @Al2Klimov tries to reproduce this in a test environment.

dnsmichi commented 5 years ago

Browser tests against the REST API.

[2019-05-27 16:46:09 +0200] critical/ApiListener: Client TLS handshake failed (from [::1]:57376): Error: sslv3 alert certificate unknown

[2019-05-27 16:46:09 +0200] critical/ApiListener: Client TLS handshake failed (from [::1]:57377): Error: stream truncated

[2019-05-27 16:46:09 +0200] information/ApiListener: New client connection from [::1]:57378 (no client certificate)
dnsmichi commented 5 years ago

Our SSL context allows sslv3, we should strictly disallow this. https://stackoverflow.com/questions/38021693/boost-ssl-server-fails-with-sslv3-error

@@ -162,7 +162,7 @@ std::shared_ptr<boost::asio::ssl::context> MakeAsioSslContext(const String& pubk

        InitializeOpenSSL();

-       auto context (std::make_shared<ssl::context>(ssl::context::sslv23));
+       auto context (std::make_shared<ssl::context>(ssl::context::tlsv1));

        SetupSslContext(context->native_handle(), pubkey, privkey, cakey);
dnsmichi commented 5 years ago

In terms of truncated streams, this happens with Google Chrome where the browser doesn't gracefully close the connection.

https://github.com/boostorg/beast/issues/38

dnsmichi commented 5 years ago

@Al2Klimov @lippserd support for the transfer-encoding: chunked header was removed in the new implementation, we should discuss re-adding this.

dnsmichi commented 5 years ago

Ok, forget it. Beast implements that already.

dnsmichi commented 5 years ago

get_io_service() is deprecated - I am not sure which Boost version may remove this. Might be that @bsdlme sees that on FreeBSD.

https://stackoverflow.com/questions/39445636/alternative-to-deprecated-get-io-service

    auto& io = m_Listener.get_executor().context();

Edit: Yes, 1.70 removes that. https://github.com/rstudio/rstudio/issues/4636

bsdlme commented 5 years ago

I'm using boost-libs-1.70.0 on FreeBSD.

dnsmichi commented 5 years ago

We should look whether SO_ socket options can be abstracted into Boost itself.

acceptor.set_option(boost::asio::ip::tcp::acceptor::reuse_address(true));
dgoetz commented 5 years ago

Could we also disable TLS v1 and only allow v1.1 or later?

dnsmichi commented 5 years ago

Depends on the OpenSSL version support, CentOS 6 and Debian Jessie still use 1.0.1.

dnsmichi commented 5 years ago

@bsdlme I've created #7210 to address this.

dnsmichi commented 5 years ago

@dgoetz I would even go one step further and disable TLS 1.1 entirely. OpenSSL 1.0.1 supports 1.2 already. Likely this breaks clients running with OpenSSL 0.9.8 (el5) but this is out of our support scope. Opinions?

dnsmichi commented 5 years ago

TLS v1.3 support in Boost ASIO was added in 1.69, so we'll need to stick with manual flags anyways. https://www.boost.org/doc/libs/1_69_0/boost/asio/ssl/context_base.hpp

dgoetz commented 5 years ago

I would be fine with 1.2 as minimum version. Mozilla is recommending 1.2, Nist also. 1.1 is only the minimum version to mitigate some attacks on the protocol. Also 1.2 will limit the usable ciphers, but I do not think the list will be too restrictive. Losing Clients like WinXP's IE is not a problem in my opinion as our client is typically Icinga 2 itself, curl or programming languages.

We should keep in mind to adjust options for ApiListener in config and documentation, too.

dnsmichi commented 5 years ago

TLS 1.3 uses different macros to detect the minimum version, still, the current algorithm works - if >= 1.2 is specified, everything else is explicitly forbidden when building the SSL context.

We should keep in mind to adjust options for ApiListener in config and documentation, too.

That's already done while testing, thanks :) You should know me ;) PR incoming.

dnsmichi commented 5 years ago

Coming back to the error handling - browsers are really funny these days. Chrome for example fires 1-2 requests beforehand to pre-load the site, with actually sending nothing.

[2019-06-03 17:17:51 +0200] information/ApiListener: New client connection from [::1]:54277 (no client certificate)
[2019-06-03 17:17:51 +0200] information/ApiListener: No data received on new API connection from [::1]:54277. Ensure that the remote endpoints are properly configured in a cluster setup.
[2019-06-03 17:17:51 +0200] information/ApiListener: New client connection from [::1]:54278 (no client certificate)
[2019-06-03 17:17:51 +0200] information/HttpServerConnection: Request: GET /v1/objects/idomysqlconnections (from [::1]:54278), user: root, agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36).

This one already mitigates the truncated stream, which is simply ignored during TLS handshake now. See https://github.com/boostorg/beast/issues/915 for details and a linked workaround.

The problem with zero data connections is that we don't know whether we are HTTP or JSON-RPC at this point, so it is hard to not log anything for disconnected peers.

dnsmichi commented 5 years ago

Considering that eliptic curve thing from #5555 in an extra step. Except for the technical docs I'd say we're finished. I'll rewrite them in a minute.

dnsmichi commented 5 years ago

Thanks everyone for your combined effort, this is the largest rewrite after 2.0 and 2.4 👍

@lippserd this one is done, from the code til the docs.

dnsmichi commented 5 years ago

A note for the connection direction and hanging connections/double connections: I've rewritten these parts inside the distributed monitoring scenario examples while updating the client/agent terms and creating new images.

See #7342 for details.