body_limit and errors out if it is not raised.

[2019-03-27 15:51:40 +0100] warning/ApiListener: Unexpected certificate common name while connecting to endpoint 'master1': got 'mbpmif.int.netways.de'
[2019-03-27 15:51:40 +0100] information/ApiListener: Finished reconnecting to endpoint 'master1' via host '127.0.0.1' and port '5665'
[2019-03-27 15:51:48 +0100] information/ApiListener: New client connection for identity 'master1' from [::ffff:127.0.0.1]:55843
[2019-03-27 15:51:48 +0100] information/ApiListener: Sending config updates for endpoint 'master1' in zone 'master'.
The warning above must lead to an error and disconnect. This is a matter of trust.
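Regarding the body_limit mentioned above: here's a minimal sketch, assuming Boost.Beast's request parser (the function name and the limit value are illustrative, not the actual Icinga 2 code), of how the limit can be raised before reading a request:

```cpp
#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>

namespace http = boost::beast::http;

// Sketch: raise the default body limit before reading, otherwise Beast
// aborts parsing as soon as the body exceeds the limit.
template <class SyncStream>
http::request<http::string_body> ReadRequest(SyncStream& stream, boost::beast::flat_buffer& buffer)
{
	http::request_parser<http::string_body> parser;
	parser.body_limit(512 * 1024 * 1024); // illustrative value

	http::read(stream, buffer, parser);
	return parser.release();
}
```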
@dnsmichi Please try the branch aklimov/boost-asio.
I'm not quite sure about the trust problem. Let's discuss it offline.
Mention that our boost packages are new, and that the old packages should be purged (if not used anywhere else).
Oh yeah... apropos follow-up tasks...
/v1/events
ref/IP/13853 ref/IP/9745
ref/NC/588711 ref/NC/602777 ref/NC/599869 ref/NC/599801 ref/NC/597307 ref/NC/584405 ref/NC/591721 ref/NC/624569
Here's a presentation I did at our Icinga guild at NETWAYS. It holds more details for those who are interested.
CLI commands are finished; the hard part is the check_nscp_api plugin, since NSCP doesn't follow the RFC for the status header, which needs a more advanced workaround. This will allow us to remove all unused legacy code and improve our code quality overall.
This leaves this issue to
Something which was reported lately - memory consumption is higher than in 2.10. This is due to the fact that coroutines use a context which puts the function's execution frame onto the stack. That's the "price to pay" for fast and asynchronous network I/O.
check_nscp_api is done, which brings me to the code cleanup stages and checking what else is missing. Finally, this wrong header crap was really stressing me.
@lippserd Tests from a customer environment showed that snapshot packages would leave sockets in CLOSE_WAIT. We're investigating this, and @Al2Klimov is trying to reproduce it in a test environment.
Browser tests against the REST API.
[2019-05-27 16:46:09 +0200] critical/ApiListener: Client TLS handshake failed (from [::1]:57376): Error: sslv3 alert certificate unknown
[2019-05-27 16:46:09 +0200] critical/ApiListener: Client TLS handshake failed (from [::1]:57377): Error: stream truncated
[2019-05-27 16:46:09 +0200] information/ApiListener: New client connection from [::1]:57378 (no client certificate)
Our SSL context allows SSLv3; we should strictly disallow this. https://stackoverflow.com/questions/38021693/boost-ssl-server-fails-with-sslv3-error
@@ -162,7 +162,7 @@ std::shared_ptr<boost::asio::ssl::context> MakeAsioSslContext(const String& pubk
InitializeOpenSSL();
- auto context (std::make_shared<ssl::context>(ssl::context::sslv23));
+ auto context (std::make_shared<ssl::context>(ssl::context::tlsv1));
SetupSslContext(context->native_handle(), pubkey, privkey, cakey);
In terms of truncated streams, this happens with Google Chrome where the browser doesn't gracefully close the connection.
@Al2Klimov @lippserd Support for the transfer-encoding: chunked header was removed in the new implementation; we should discuss re-adding it.
Ok, forget it. Beast implements that already.
get_io_service() is deprecated - I am not sure which Boost version may remove this. Might be that @bsdlme sees that on FreeBSD.
https://stackoverflow.com/questions/39445636/alternative-to-deprecated-get-io-service
auto& io = m_Listener.get_executor().context();
Edit: Yes, 1.70 removes that. https://github.com/rstudio/rstudio/issues/4636
I'm using boost-libs-1.70.0 on FreeBSD.
We should look into whether the SO_* socket options can be abstracted into Boost itself, e.g.:
acceptor.set_option(boost::asio::ip::tcp::acceptor::reuse_address(true));
Could we also disable TLS v1 and only allow v1.1 or later?
Depends on the OpenSSL version support; CentOS 6 and Debian Jessie still use 1.0.1.
@bsdlme I've created #7210 to address this.
@dgoetz I would even go one step further and disable TLS 1.1 entirely. OpenSSL 1.0.1 supports 1.2 already. Likely this breaks clients running with OpenSSL 0.9.8 (el5) but this is out of our support scope. Opinions?
TLS v1.3 support in Boost ASIO was added in 1.69, so we'll need to stick with manual flags anyways. https://www.boost.org/doc/libs/1_69_0/boost/asio/ssl/context_base.hpp
I would be fine with 1.2 as the minimum version. Mozilla recommends 1.2, NIST does too. 1.1 is only the minimum version needed to mitigate some attacks on the protocol. 1.2 will also limit the usable ciphers, but I do not think the list will be too restrictive. Losing clients like WinXP's IE is not a problem in my opinion, as our client is typically Icinga 2 itself, curl or programming languages.
We should keep in mind to adjust options for ApiListener in config and documentation, too.
TLS 1.3 uses different macros to detect the minimum version, still, the current algorithm works - if >= 1.2 is specified, everything else is explicitly forbidden when building the SSL context.
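A minimal sketch of what "explicitly forbidden" can look like on the native OpenSSL handle (the function name is illustrative; the TLS 1.1 flag requires OpenSSL >= 1.0.1):

```cpp
#include <boost/asio/ssl/context.hpp>
#include <openssl/ssl.h>

// Sketch: forbid SSLv2/SSLv3 and TLS 1.0/1.1 so that only TLS 1.2+ remains.
static void RequireTls12(boost::asio::ssl::context& context)
{
	SSL_CTX_set_options(context.native_handle(),
		SSL_OP_NO_SSLv2 | SSL_OP_NO_SSLv3 |
		SSL_OP_NO_TLSv1 | SSL_OP_NO_TLSv1_1);
}
```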
We should keep in mind to adjust options for ApiListener in config and documentation, too.
That's already done while testing, thanks :) You should know me ;) PR incoming.
Coming back to the error handling - browsers are really funny these days. Chrome, for example, fires 1-2 requests beforehand to pre-load the site, while actually sending nothing.
[2019-06-03 17:17:51 +0200] information/ApiListener: New client connection from [::1]:54277 (no client certificate)
[2019-06-03 17:17:51 +0200] information/ApiListener: No data received on new API connection from [::1]:54277. Ensure that the remote endpoints are properly configured in a cluster setup.
[2019-06-03 17:17:51 +0200] information/ApiListener: New client connection from [::1]:54278 (no client certificate)
[2019-06-03 17:17:51 +0200] information/HttpServerConnection: Request: GET /v1/objects/idomysqlconnections (from [::1]:54278), user: root, agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36).
This one already mitigates the truncated stream, which is simply ignored during TLS handshake now. See https://github.com/boostorg/beast/issues/915 for details and a linked workaround.
The problem with zero-data connections is that we don't know whether we are HTTP or JSON-RPC at this point, so it is hard not to log anything for disconnected peers.
Considering that elliptic curve thing from #5555 in an extra step. Except for the technical docs, I'd say we're finished. I'll rewrite them in a minute.
Thanks everyone for your combined effort, this is the largest rewrite after 2.0 and 2.4 👍
@lippserd this one is done, from the code to the docs.
A note for the connection direction and hanging connections/double connections: I've rewritten these parts inside the distributed monitoring scenario examples while updating the client/agent terms and creating new images.
See #7342 for details.
TODOs
Will be updated upon merged PRs. Listed first for developers.
Presentation with status from CW18 / 2019
Package Tests
Snapshot packages are available for all supported platforms. Please check the instructions on how to test them: https://icinga.com/docs/icinga2/snapshot/doc/21-development/#snapshot-packages-nightly-builds
The installation documentation in the /snapshot docs is also already updated for 2.11, in case you're looking for dependencies and whatnot.
User Feedback
The Story
TL;DR - we're rewriting a core part of Icinga 2 due to many bugs, crashes and problems. You likely won't see them in a small-scale environment, but we aim to scale and not waste resources, from embedded hardware to "super" computers these days.
This analysis draws on several years of experience, and is a collected story from GitHub issues, community forum questions, customer support tickets and personal experience.
The attempt is to solve two main things: the known problems and the development problems described below.
Known Problems
Incoming TLS Connection Handling
At first, every TLS connection is accepted, and this needs to be fast. With the thread pool introduced in 2.10, this became slower.
TLS IO Engine
The TLS handshake/read/writes happen in a different event engine, the full scope is described in the technical concepts in 2.10.
Currently there are 8 IO threads handling these TLS events. If you have 8 stalled TCP connections waiting for their timeout, slowly lagging TLS connections, or clients whose CPUs are too slow for the CPU-intensive TLS algorithms, this will block.
It may also occur that specific event messages are not correctly handed back to the caller (bug time).
Also, there is different event handling involved - poll() on Windows and macOS, epoll() by default on Linux. While this should work, debugging and troubleshooting get harder. In the past, certain Ubuntu kernels in 14.04 could not properly handle epoll(), resulting in unreproducible crashes.
Overall, these code parts should work, but developers have a hard time debugging problems in there.
Outgoing TLS Connection Handling
This follows the Zone relationship, so masterA inside the master zone will actively connect to all Endpoint objects where:
the host attribute of the Endpoint object is set
Non-revised configurations in many environments populate this by default, and as such, Icinga will attempt to regularly connect to endpoints which it cannot reach via TCP connect. This results in hanging threads waiting for the TCP connection timeout. Applications must not lower this timeout; that becomes tremendously hard to debug on the kernel level. To conclude: with 1000 unreachable endpoints, Icinga attempts to reconnect every 10 seconds - lots of wait time and blocked resources.
On the other hand, when endpoints are reachable, it still takes a while until all connections are established again. This scenario happens on restart/reload when connections have been closed by the previous process - unfortunately there is no connection cycling between processes available on Linux/Unix.
The more endpoints are configured for active connection attempts, the more resources this will take. Versions before 2.10 spawned a thread for each connection, and deleted the thread after the connection was established. Spawning and dropping a thread is really resource consuming, which is why we've changed this into a global thread pool managing the required threads automatically.
This led to the problem that connection attempts were not as fast as before with just spawning a thread. Compared to incoming connections, it isn't a real bottleneck though.
TLS Handshake Timeouts
When the TLS handshake cannot succeed, this stalls the thread for that connection until the timeout is reached. Versions before 2.8.2 might hang endlessly; the timeout introduced then, hardcoded to 10s, was made a configuration option in 2.10. Still, there may be low-powered devices around, or high-latency connections with packet loss involved.
Either way, Icinga should be able to perform, even when there's a lag in connection handling.
Stalled TLS read/writes
With v2.8.2 there were several attempts made to rate limit connections and avoid possible exploitation. While this seemed to happen in the upper layers, it actually influenced how TLS read/writes were handled, and caused "stalled TLS connection" handling. These patches with "cork" as identifier have been reverted in 2.10 already.
Context Switching and Resources
Threads are typically a good way to work in an asynchronous fashion. If you're not properly managing the thread count and the required unique locks, you'll end up with worse performance than before. Modern kernels and CPUs allow for fast context switching, but still - if everything is blocked by some hanging TCP connections, you did something wrong.
Also, having 2 CPUs and 100 threads on a steady basis is not a good idea, and lock waits can happen. The most visible part is when a REST API request is fired while Icinga is already busy connecting to unreachable endpoints - it hangs forever, not getting any resources.
On another note, firing many requests into the HTTP server while it is already overloaded may lead to unexpected behaviour and even crashes. The most "fun" part for developers is that they cannot simply reproduce this problem, and can only guess with patches. This must be changed in the future, going back to well-known and tested libraries, also in large-scale environments.
The Problems in Issues
#6981 pki request hangs
#6973 cluster connections fail with TLS clients
#6930 REST API stalls
#6836 crashes while Saint security scanner runs
#6559 crashes while Nessus security scanner runs
#3029 TLS IO Engine cannot read from socket, bombing the log on Windows
#7075 Stale TLS connections from security scanner
#7043 Fully utilized thread pool leads to unresponsive API
There are more from the past, already closed with attempted fixes of course.
Development Problems
Icinga 2's development started in early 2012, based on a relatively old C++ standard. If you've heard about C++11 already: it wasn't available at that time. Typically, C++ code is turned into a machine binary by a compiler, gcc or clang/llvm. Both of them need to parse the code and understand the given keywords and so on - similar to what the Icinga 2 DSL config compiler does.
Having modern compilers fully support C++11 took a while, and with the general availability of gcc >= 4.8 on all supported platforms in 2018, we were finally able to start modernizing our code. Thing is, each distribution's code should be compiled with the same compiler version to avoid binary incompatibility. That being said, you cannot guarantee 100% compatibility with gcc7-compiled binaries linking against gcc4.8-compiled libraries.
Apart from the compiler and modern programming features, the C++ STL doesn't provide simple things like threading and concurrency on its own. This improved a bit with std::thread since C++11; before that, you would need to implement POSIX threads on Linux/Unix and Windows threading on your own.
We've been using the Boost framework for that very reason, abstracting OS specifics away from the developer. Using Boost also turned into a dependency problem - RHEL5 for example didn't support the UUID functionality (to generate unique object names, or API package names). Therefore we had to invent our own algorithms, which are slowly being faded out now.
For the changes now going on, we focus on the Boost library more than ever, and ensure that we provide the most modern version as a dependency ourselves. Community packagers might need to catch up; still, we also release package build sources as always.
Dependencies: Boost
Boost provides really cool features in modern versions, starting around 1.55 and 1.60. One of them is ASIO, allowing you to easily manage network connections with event listeners and IO also with TLS. This requires us to bump the required Boost version to 1.66.
Not many distributions provide this already, so we need to provide up-to-date package versions on packages.icinga.com. Our main focus is not to replace system libraries; we use a custom package name and installation path instead. Typically these packages are maintained under the Icinga umbrella.
The involved build infrastructure is work-in-progress, following our migration from Jenkins to GitLab CI - for all supported distributions, which are roughly 25 at the time of writing, including Windows (where we already bundle Boost anyway).
ASIO
ASIO abstracts basically everything we've built by hand up until now.
https://think-async.com/Asio/ https://www.reddit.com/r/cpp/comments/5xxv61/a_modern_c_network_library_for_developing_high/ https://dieboostcppbibliotheken.de/boost.asio-netzwerkprogrammierung
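To illustrate what ASIO takes off our hands, here's a minimal, self-contained sketch (not Icinga 2 code; names and port are illustrative) of an asynchronous TCP acceptor driven by a single io_context:

```cpp
#include <boost/asio.hpp>
#include <iostream>
#include <memory>

using boost::asio::ip::tcp;

// Sketch: accept connections asynchronously; the io_context event loop
// drives all handlers instead of one thread per connection.
static void StartAccept(boost::asio::io_context& io, tcp::acceptor& acceptor)
{
	auto socket = std::make_shared<tcp::socket>(io);

	acceptor.async_accept(*socket, [&io, &acceptor, socket](boost::system::error_code ec) {
		if (!ec)
			std::cout << "New connection from " << socket->remote_endpoint() << "\n";

		StartAccept(io, acceptor); // keep accepting
	});
}

int main()
{
	boost::asio::io_context io;
	tcp::acceptor acceptor(io, tcp::endpoint(tcp::v6(), 5665));

	StartAccept(io, acceptor);
	io.run();
}
```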
Beast
Our HTTP server, client and also the request/response handling are purely implemented from scratch, starting with 2.4. This involves building custom JSON response messages, header dictionaries and custom parsing as well.
Boost Beast provides an abstracted interface for this, and works in combination with ASIO. It also has integrated buffers, and asynchronous header reading.
HTTP fields and verbs are also available as typed instances, not pure strings as before. This allows the compiler to detect programming errors at first glance.
https://www.boost.org/doc/libs/1_66_0/libs/beast/doc/html/index.html
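A minimal sketch of what Beast adds on top of ASIO - typed verbs, fields and integrated buffers - assuming a plain TCP socket here, while the real implementation works on the TLS stream:

```cpp
#include <boost/asio.hpp>
#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>

namespace http = boost::beast::http;

// Sketch: read one HTTP request and answer with a JSON body.
// Verbs, status codes and header fields are typed, not hand-parsed strings.
static void HandleRequest(boost::asio::ip::tcp::socket& socket)
{
	boost::beast::flat_buffer buffer;
	http::request<http::string_body> request;
	http::read(socket, buffer, request);

	http::response<http::string_body> response;
	response.result(request.method() == http::verb::get
		? http::status::ok
		: http::status::method_not_allowed);
	response.set(http::field::content_type, "application/json");
	response.body() = "{\"results\":[]}";
	response.prepare_payload(); // sets Content-Length

	http::write(socket, response);
}
```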
Coroutines and Context
If you're familiar with goroutines, the Boost library provides the same principle in a new library, and Boost ASIO benefits from it. Typically, functions need not run to the end, but may return and continue at a later point. This benefits the asynchronous network IO handling in particular.
https://theboostcpplibraries.com/boost.coroutine https://www.informatik-aktuell.de/entwicklung/programmiersprachen/c-17-coroutinen-in-c-mit-boostcoroutine2.html
Available architectures are listed here: https://www.boost.org/doc/libs/1_65_1/libs/context/doc/html/context/architectures.html
Deep dive blog post: https://www.netways.de/blog/2019/04/04/modern-c-programming-coroutines-with-boost/
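A minimal sketch (illustrative only) of the coroutine style with boost::asio::spawn: the accept loop reads like synchronous code, but suspends and lets the io_context run other handlers while waiting. Each coroutine keeps its own stack/context, which is the memory trade-off mentioned earlier.

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <iostream>

using boost::asio::ip::tcp;

int main()
{
	boost::asio::io_context io;

	boost::asio::spawn(io, [&io](boost::asio::yield_context yield) {
		tcp::acceptor acceptor(io, tcp::endpoint(tcp::v6(), 5665));

		for (;;) {
			tcp::socket socket(io);
			boost::system::error_code ec;

			// Looks blocking, but only suspends this coroutine.
			acceptor.async_accept(socket, yield[ec]);

			if (!ec)
				std::cout << "New connection from " << socket.remote_endpoint() << "\n";
		}
	});

	io.run();
}
```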
Golang Sugar: Defer
This one takes a lambda function call, and gets called when the current scope is destroyed. This works in the same way as we use the Log class with << operators to immediately flush after the object is destroyed.
Example: Close the TLS stream when the connection method is gone; fixes for #6990.
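A minimal sketch of the Defer idea described above (just the principle; the usage at the bottom is illustrative):

```cpp
#include <functional>
#include <utility>

// Sketch: run a callback when the current scope ends, no matter how it
// ends (return, exception, ...).
class Defer
{
public:
	explicit Defer(std::function<void()> func) : m_Func(std::move(func)) { }

	~Defer()
	{
		if (m_Func)
			m_Func();
	}

	Defer(const Defer&) = delete;
	Defer& operator=(const Defer&) = delete;

private:
	std::function<void()> m_Func;
};

// Illustrative usage: make sure the stream gets shut down when the
// handler returns, even if an exception is thrown in between.
//
// void HandleConnection(Stream& stream)
// {
// 	Defer shutdown ([&stream]() { stream.Shutdown(); });
// 	/* ... handle messages, may throw ... */
// }
```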
The Changes Involved
Having analyzed the multiple problems and prepared the development environment, the main task is to rewrite Icinga 2 at its core in #7005.
Base Library Changes
This affects
Introduce changes:
Remote Library Changes
Introduce the following changes:
Use boost::asio::ip::tcp for network communication, with transparent IPv6 handling
Accept incoming connections with boost::asio::ip::tcp::acceptor
TLS handling is done with ASIO
ASIO Thread pools for
HTTP parsing and processing is handed over to ASIO and Beast
ALL URL endpoints need to be rewritten for ASIO and Beast handling
API Event streaming must keep the connection open, different IO handling
JsonRpc::SendMessage and NetString::WriteStringToStream need to use ASIO with async writes (see the sketch after this list)
JsonRpc::ReadMessage and NetString::ReadStringFromStream use ASIO
Rewrite JsonRpcConnection to use Boost ASIO
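For the JSON-RPC write item above, a minimal sketch (types and names are illustrative, not the actual JsonRpc/NetString classes) of an asynchronous netstring write with ASIO and a yield_context:

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <string>

// Sketch: frame a JSON message as a netstring ("<length>:<json>,") and
// write it asynchronously. AsyncStream stands in for the TLS stream.
template <class AsyncStream>
void WriteNetString(AsyncStream& stream, const std::string& json, boost::asio::yield_context yield)
{
	std::string frame = std::to_string(json.size()) + ":" + json + ",";

	// Suspends the calling coroutine until all bytes are written.
	boost::asio::async_write(stream, boost::asio::buffer(frame), yield);
}
```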
CLI Changes
pki request, which sends a JSON-RPC message to request a certificate (also used in node wizard and node setup)
console and check_nscp_api, implementing an HTTP client
Feature Changes
Additional Tasks
To be considered for 2.11 but not necessarily going to happen.
Implement the global thread pool using Boost ASIO (#6988) - Utility::QueueAsyncCallback()
Replace our timer class with Boost ASIO timers (https://www.boost.org/doc/libs/1_65_1/doc/html/boost_asio/tutorial/tuttimer2.html) - see the timer sketch after this list
In theory, we could use Boost Beast to implement websockets too (https://www.boost.org/doc/libs/1_66_0/libs/beast/doc/html/index.html)
With the availability of Boost 1.66, we can replace other self-implemented features too.
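For the timer idea above, a minimal sketch (names and interval are illustrative) of a repeating Boost ASIO timer driven by the same io_context as the network IO:

```cpp
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>

// Sketch: a repeating 10 second timer; re-arms itself in the handler.
static void ScheduleTick(boost::asio::steady_timer& timer)
{
	timer.expires_after(std::chrono::seconds(10));

	timer.async_wait([&timer](const boost::system::error_code& ec) {
		if (ec)
			return; // timer was cancelled

		std::cout << "tick" << std::endl;
		ScheduleTick(timer);
	});
}

int main()
{
	boost::asio::io_context io;
	boost::asio::steady_timer timer(io);

	ScheduleTick(timer);
	io.run();
}
```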