jschanz closed this issue 5 years ago.
Hello Jens,
I have the same problem. dnsmichi told me in https://github.com/Icinga/icinga2/issues/6445#issuecomment-406569958 that I should open a bug report for this, but I had no time.
I know that it might be insane in that setup, but I'd be interested in a gdb full backtrace from both running processes (master and client) at the time of such a reconnect attempt. Anything which looks suspicious, also network bandwidth peak usage, or CPU/IO changes. A full verbose look into the system, and a reproducer - be it many clients, a specific scenario with connection scenarios, etc. From a first look this can be anything and nothing.
Oh, and I'd like to see the zones.conf from both ends; especially I am interested in the connection direction and log_duration.
Found a method to reproduce this issue:
Use socat to mock the Icinga Agents:
socat OPENSSL-LISTEN:5665,certificate=icinga-certs/sat1.crt,key=icinga-certs/sat1.key,cafile=icinga-certs/ca.crt,fork - > /dev/null
Let Icinga connect to socat 7000 times:
for (i in range(7000)) {
  object Endpoint "sat" + i {
    host = "10.211.55.63"
  }

  object Zone "sat" + i {
    parent = ZoneName
    endpoints = [ name ]
  }
}
And as the last step you need to disable the hostname check on the master:
diff --git a/lib/remote/apilistener.cpp b/lib/remote/apilistener.cpp
@@ -471,1 +471,1 @@ void ApiListener::NewClientHandlerInternal
- identity = GetCertificateCN(cert);
+ identity = hostname;
This will result in:
[2018-08-09 10:45:55 +0200] critical/ApiListener: Client TLS handshake failed (to [10.211.55.63]:5665) Error: Timeout during handshake.
Context:
(0) Handling new API client connection
TlsStream::Handshake() will throw std::runtime_error("Timeout during handshake.") after 10 seconds, but this is ignored at ApiListener::NewClientHandlerInternal:448. I don't really know why it's running into that timeout and can't say if that's the one we're searching for.
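For reference, the timeout mechanics can be sketched like this (a simplified illustration with made-up names, not the actual TlsStream code): the connecting thread waits on a condition variable until the socket event thread signals handshake completion, and throws once the deadline passes.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <stdexcept>

// Hypothetical sketch: the thread handling the new client blocks here while
// the asynchronous event loop performs the handshake. If the event loop is
// too busy to signal completion in time, this throws - which matches the
// "Timeout during handshake." error seen in the log above.
void WaitForHandshake(std::condition_variable& cv, std::mutex& mtx,
                      bool& handshakeDone, std::chrono::seconds timeout)
{
    std::unique_lock<std::mutex> lock(mtx);

    if (!cv.wait_for(lock, timeout, [&handshakeDone]() { return handshakeDone; }))
        throw std::runtime_error("Timeout during handshake.");
}
```

Note that nothing here retries: if the signal never arrives within the timeout, the connection attempt is simply given up.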
Same problem here, after restart
[2018-08-17 13:47:01 +0200] warning/TlsStream: TLS stream was disconnected.
[2018-08-17 13:47:01 +0200] critical/ApiListener: Client TLS handshake failed (from [::ffff:123.123.123.123]:49778)
Context:
(0) Handling new API client connection
[2018-08-17 13:48:01 +0200] warning/TlsStream: TLS stream was disconnected.
[2018-08-17 13:48:01 +0200] critical/ApiListener: Client TLS handshake failed (from [::ffff:123.123.123.123]:49870)
Context:
(0) Handling new API client connection
I tried the following:
@dnsmichi Do you need further information for this issue? I'm back from holiday now ...
@jschanz I'm currently trying to reproduce this correctly. Do those TLS handshake failures also happen on 2.8.4, or is this only happening on 2.9.x? Because in my tests 2.8.4 and 2.9.1 both get those TLS handshake failures every now and then with 1000+ Endpoints.
@N-o-X I see these errors also in 2.8.4 and before, but not as many as in 2.9.1. I think it has something to do with a lot of connections and a timeout set somewhere. In 2.8.4 this happens sometimes on WAN clients with a bad connection. Maybe packets get lost, dropped or whatever, or the amount of data to transfer is too large for the small bandwidth. I'll try to upgrade back to 2.9.x in the next days, but before that I need to split the client connections across more icinga2 satellites to avoid further problems.
Between a lot of TLS failures, I also see messages like the following:
[2018-09-04 14:43:52 +0200] warning/JsonRpcConnection: Error while reading JSON-RPC message for identity 'randomhost.randomdomain': Error: Max data length exceeded: 64 KB
(0) icinga2: icinga::NetString::ReadStringFromStream(boost::intrusive_ptr<icinga::Stream> const&, icinga::String*, icinga::StreamReadContext&, bool, long) (+0x40c) [0x5ec1dc]
(1) icinga2: icinga::JsonRpc::ReadMessage(boost::intrusive_ptr<icinga::Stream> const&, icinga::String*, icinga::StreamReadContext&, bool, long) (+0x2e) [0x691e1e]
(2) icinga2: icinga::JsonRpcConnection::ProcessMessage() (+0x67) [0x691f17]
(3) icinga2: icinga::JsonRpcConnection::DataAvailableHandler() (+0x98) [0x6ae418]
(4) /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2() [0x618c8e]
(5) icinga2: icinga::Stream::SignalDataAvailable() (+0x3d) [0x673a8d]
(6) icinga2: icinga::TlsStream::OnEvent(int) (+0x2bc) [0x68040c]
(7) icinga2: icinga::SocketEventEngineEpoll::ThreadProc(int) (+0x3eb) [0x67b71b]
(8) libstdc++.so.6: <unknown function> (+0xb6970) [0x2b9281969970]
(9) libpthread.so.0: <unknown function> (+0x8064) [0x2b9280b8b064]
(10) libc.so.6: clone (+0x6d) [0x2b92821bd62d]
Error: Max data length exceeded: 64 KB
Ouch @N-o-X.
Mhm... that should only happen with unauthorized clients. Known endpoints should not be limited by that. Might need some investigation.
That max data length thing is a different problem which will be solved with #6595.
The rest is - after days and many hours of reading the code and debugging different changes - a mix of asynchronous parallel connection and TLS handshake handling. More insights and possible patches in the coming days.
Worth a read: http://developerweb.net/viewtopic.php?id=5154
The problem is with a master actively connecting to 17000 endpoints on startup, and later in a 60-second reconnect timer. In our test setup, we don't have that many containers. Noah created a scenario in 2 VMs which involves a master VM and a "satellite" VM which is the home for 500-800 clients built in Docker containers, where this is reproducible.
In addition to the master connections, the reported setup also includes IDO queries and Director deployments. @jschanz was so kind to provide logs and gdb traces offline (I asked him since there's no support contract or NDA signed, but we know each other in person).
The reason I've asked for gdb backtraces from the running process is simply explained - connection problems normally indicate a resource problem. The logs from Jens also show threads handling TCP connections (and later performing TLS handshakes).
At some point either the network stack gives up on 1000+ parallel connections, or the TLS handshake is so damn expensive (CPU wise) that other operations slow down. Generally speaking, threads are fighting for resources and locks will be there, and likely many context switches.
This problem isn't new at all, there's no key difference between 2.8.4 or 2.9.1 or 2.7.2 even. The possible influencers for not seeing it under specific circumstances are simple: CPU resources and fast network.
A patch from 2015/2016 moved the synchronous TLS handling into an asynchronous event loop with wakeup signals via multiplexed socket pairs. Threads are waiting for other threads to wake them up, and with the heavy socket IO from 500 to 17000 connection attempts, the kernel might be overloaded.
In our analysis, next to trying many things which may have influenced the problem, we've also started to reverse engineer the current code base and its socket IO with events to get a full picture. During the analysis we've also looked into alternative implementations since our code implements these design patterns, but does not use external code which might be more error-prone.
As we've agreed during our team weekend, we will push this knowledge into the docs underneath "19-technical-concepts.md".
In order to fully debug the problem, we've created some custom docker image builds from source (still using release builds). Then docker-compose fires up the client containers with a fixed port range. The master gets a looped zones.conf for all these endpoints to connect to (same IP, different port).
A small patch is made against Icinga 2: Disable the name checks in the certificate and the connecting CN. Otherwise we could not re-use a single certificate to verify the CA trust chain.
Another opt-in patch disables the "Liveness"-Checks which would disconnect idle connections after 60 seconds.
In addition to the above changes, we've added many log messages to see the exact actions and calls in the event loop and state machine. So many threads make it impossible to debug with gdb and variants. These log messages won't be added to release builds though, only those which help the user. One of them is https://github.com/Icinga/icinga2/pull/6602
The resources and time to build the above amount to roughly 10 work days.
* Drop the socket timeout, or make it last longer. This results in stale TLS handshakes and leaks connection threads.
* Revert the IsEof() handling, which previously fixed false positives. Doesn't solve anything.
* Deal with SSL_shutdown() not being closed cleanly, but this doesn't influence the handshake itself. https://bugs.ruby-lang.org/issues/4237
* SSL_do_handshake() sometimes never returns success and needs a repeated handshake. Our sockets are non-blocking, so a TLS handshake might need more retries. Tried it, made it worse; with some luck it sometimes solved the handshake problem, but the data processing and connection problem still exists.
* Reconnect timer: every 60 seconds, get all endpoints and connect if possible. Maybe do that more often, but only with a shard of endpoints. Currently 500 clients create 500 tcp connect() calls which may overload the system itself.
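A minimal sketch of that sharding idea (hypothetical helper, not Icinga 2 code): run the reconnect timer more often, but only attempt one shard of endpoints per tick, so a single tick never issues all connect() calls at once.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: with 6 shards and a 10s timer, every endpoint still
// gets a connect attempt every 60 seconds, but each tick only issues
// 1/6th of the connect() calls.
std::vector<std::string> ShardForTick(const std::vector<std::string>& endpoints,
                                      std::size_t tick, std::size_t shards)
{
    std::vector<std::string> result;

    if (shards == 0)
        return result;

    for (std::size_t i = 0; i < endpoints.size(); i++) {
        if (i % shards == tick % shards)
            result.push_back(endpoints[i]);
    }

    return result;
}
```

With shards = 1 this degenerates to the current behaviour of connecting to everything on every tick.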
Client Connect Queue

Tested with a WQ with infinite tasks and 16 threads, on a 16-core system with 32 GB RAM. Requires an additional patch which moves the "endpoint is connecting" flag into the same place where the asynchronous enqueuing happens, to avoid duplicated connection tasks in the WQ from an ongoing timer run.

Decreased the reconnect timer interval to 10s; not sure whether this is a good idea at this point, this definitely needs more tests.

Works pretty well, and moves the API request processing into "batches". At some point the queue is empty again (in the last log line):
michi@mbmif ~/coding/netways/elippmann/icinga2-api-stampede (master *=) $ php -d memory_limit=512M bin/stampede.php --hosts 1000 --verbose
[2018-09-11 17:33:01 +0200] information/WorkQueue: #7 (ApiListener, IncomingConnectQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2018-09-11 17:34:01 +0200] information/WorkQueue: #7 (ApiListener, IncomingConnectQueue) items: 127, rate: 14.2833/s (857/min 857/5min 857/15min);
[2018-09-11 17:33:51 +0200] information/WorkQueue: #7 (ApiListener, IncomingConnectQueue) items: 441, rate: 9.05/s (543/min 543/5min 543/15min); empty in less than 1 millisecond
[2018-09-11 17:33:41 +0200] information/WorkQueue: #7 (ApiListener, IncomingConnectQueue) items: 812, rate: 2.86667/s (172/min 172/5min 172/15min); empty in 10 seconds
[2018-09-11 17:34:10 +0200] information/ConfigItem: Activated all objects.
* Find a reasonable number for
* Create a patch and PR.
* Tests by those affected.
Below is an analysis I've been writing with our docs in mind. Will result in a separate PR.
TLS-Handshake timeouts occur if the server is busy with reconnect handling and other tasks which run in isolated threads. Icinga 2 uses threads in many ways, e.g. for timers to wake them up, wait for check results, etc.
In terms of the cluster communication, the following flow applies.
Client (initiating the connection):

* AddConnection(), whereas a Tcp::Connect() is called to create a TCP socket.
* ApiListener::NewClientHandler()
* TLS->Handshake()

Server (accepting the connection):

* ListenerThreadProc() with an endless loop.
* TCP->Accept() performs the initial socket establishment.
* ApiListener::NewClientHandler(), Role being Server.
* TLS->Handshake().

Once the TLS handshake and certificate verification is completed, the role is either Client or Server.

* TLS->WaitForData() waits for incoming messages from the remote client.

Client in this case is the instance which initiated the connection. If the master is doing this, the Icinga 2 client/agent acts as "server" which accepts incoming connections.
Everything runs through TLS, we don't use any "raw" connections nor plain message handling.
The TLS handshake and further read/write operations are not performed in a synchronous fashion in the new client's thread. Instead, all clients share an asynchronous "event pool".
The TlsStream constructor registers a new SocketEvent by calling its constructor. It binds the previously created TCP socket and itself into the created SocketEvent object.
SocketEvent::InitializeEngine() takes care of whether to use epoll (Linux) or poll (BSD, Unix, Windows) as preferred socket poll engine. epoll has proven to be faster on Linux systems.

The selected engine is stored as l_SocketIOEngine and later Start() ensures to do the following:

* dumb_socketpair, which basically is a pipe from in->out and multiplexes the TCP socket into a local Unix socket. This removes the complexity and slowness of the kernel dealing with the TCP stack and new events.
* InitializeThread() prepares epoll with epoll_create, socket descriptors and event mapping for later wakeup.
* ThreadProc as endless loop. By default, there are 8 of these worker threads.
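The socketpair wakeup trick can be demonstrated standalone (assuming Linux with epoll; an illustration only, not Icinga 2 code): writing a byte into one end of the pair makes epoll_wait() on the other end return immediately, which is how a worker thread can be woken up without touching the TCP stack.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical demo: a socketpair acts as the local "event FD". One byte
// written to fds[1] wakes up an epoll_wait() watching fds[0].
bool DemoEpollWakeup()
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return false;

    int ep = epoll_create1(0);
    if (ep < 0)
        return false;

    struct epoll_event ev = {};
    ev.events = EPOLLIN;
    ev.data.fd = fds[0]; // epoll_event carries the FD in its .data.fd member
    epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev);

    char byte = 'x';
    (void)write(fds[1], &byte, 1); // the "wakeup" signal from another context

    struct epoll_event out = {};
    int n = epoll_wait(ep, &out, 1, 1000); // returns immediately, no timeout

    close(fds[0]);
    close(fds[1]);
    close(ep);

    return n == 1 && out.data.fd == fds[0];
}
```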
In the ThreadProc loop, the following happens:

* epoll_wait gets called and provides an event whether new data is ready (via socket IO from the kernel).
* The epoll_event holds the .data.fd attribute which references the multiplexed event FD (and therefore the TCP socket FD).
* The event is then handled by calling OnEvent().
* OnEvent implements the "state machine" depending on the current desired action. By default, this is TlsActionNone.
Once TlsStream->Handshake() is called, this initializes the current action to TlsActionHandshake and performs SSL_do_handshake(). This function returns > 0 when successful; anything below needs to be dealt with separately.

If the handshake was successful, the registered condition variable m_CV gets signalled, and the thread waiting for the handshake in TlsStream->Handshake() wakes up and continues within the ApiListener::NewClientHandler() function.

Once the handshake is completed, the current action is changed to either TlsActionRead or TlsActionWrite. This happens in the beginning of the state machine when there is no action selected yet.

This also depends on the returned error codes of the SSL interface functions. Whenever SSL_ERROR_WANT_READ occurs, the event polling needs to be changed to use POLLIN, and vice versa for SSL_ERROR_WANT_WRITE and POLLOUT.
In the scenario where the master actively connects to the clients, the client will wait for data and change the event sockets to Read
once there's something coming on the sockets.
Action | Description |
---|---|
Read | Calls SSL_read() with a fixed buffer size of 64 KB. If rc > 0, the receive buffer of the TLS stream is filled and success indicated. This endless loop continues until a) SSL_pending() says no more data from remote b) Maximum bytes are read. If success is true, the condition variable notifies the thread in WaitForData to wake up. |
Write | The send buffer of the TLS stream Peek() s the first 64KB and calls SSL_write() to send them over the socket. The returned value is the number of bytes written, this is adjusted within the send buffer in the Read() call (it also optimizes the memory usage). |
Handshake | Calls SSL_do_handshake() and if successful, the condition variable wakes up the thread waiting for it in Handshake() . |
TLS error code | Description
---|---
SSL_ERROR_WANT_READ | The next event should read again, change events to POLLIN.
SSL_ERROR_WANT_WRITE | The next event should write, change events to POLLOUT.
SSL_ERROR_ZERO_RETURN | Nothing was returned, close the TLS stream and immediately return.
default | Extract the error code and log a fancy error for the user. Close the connection.
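The table above, expressed as a small sketch (hypothetical names, not the actual state machine): map the outcome of a non-blocking SSL call to the poll events the socket should wait for next.

```cpp
#include <poll.h>

// Hypothetical enum standing in for the OpenSSL error codes from the table;
// the real code would switch on SSL_get_error() results instead.
enum TlsWant { WantRead, WantWrite, WantClose };

short NextPollEvents(TlsWant want)
{
    switch (want) {
        case WantRead:
            return POLLIN;  // SSL_ERROR_WANT_READ: wait until readable, retry
        case WantWrite:
            return POLLOUT; // SSL_ERROR_WANT_WRITE: wait until writable, retry
        default:
            return 0;       // SSL_ERROR_ZERO_RETURN/default: close the stream
    }
}
```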
From this question:
With non-blocking sockets, SSL_WANT_READ means "wait for the socket to be readable, then call this function again."; conversely, SSL_WANT_WRITE means "wait for the socket to be writeable, then call this function again.". You can get either SSL_WANT_WRITE or SSL_WANT_READ from both an SSL_read() or SSL_write() call.
After a completed action, it is reset to none again. This re-evaluates the conditions upon the next event call: a pending write changes the events to POLLIN|POLLOUT, otherwise the stream keeps POLLIN to wait for data.

Once a stream has data available, it calls SignalDataAvailable(). This holds a condition variable which wakes up another thread in a handler which was previously registered, e.g. for JsonRpcConnection, HttpServerConnection or HttpClientConnection objects.
All of them read data from the stream and process the messages. At this point the string is available as JSON already and later decoded (e.g. Icinga data structures, as Dictionary).
Taken from https://www.ibm.com/developerworks/aix/library/au-libev/index.html
One of the biggest problems facing many server deployments, particularly web server deployments, is the ability to handle a large number of connections. Whether you are building cloud-based services to handle network traffic, distributing your application over IBM Amazon EC instances, or providing a high-performance component for your web site, you need to be able to handle a large number of simultaneous connections.
A good example is the recent move to more dynamic web applications, especially those using AJAX techniques. If you are deploying a system that allows many thousands of clients to update information directly within a web page, such as a system providing live monitoring of an event or issue, then the speed at which you can effectively serve the information is vital. In a grid or cloud situation, you might have permanent open connections from thousands of clients simultaneously, and you need to be able to serve the requests and responses to each client.
Before looking at how libevent and libev are able to handle multiple network connections, let's take a brief look at some of the traditional solutions for handling this type of connectivity.
### Handling multiple clients
There are a number of different traditional methods that handle multiple connections, but usually they result in an issue handling large quantities of connections, either because they use too much memory, too much CPU, or they reach an operating system limit of some kind.
The main solutions used are:
* Round-robin: The early systems use a simple solution of round-robin selection, simply iterating over a list of open network connections and determining whether there is any data to read. This is both slow (especially as the number of connections increases) and inefficient (since other connections may be sending requests and expecting responses while you are servicing the current one). The other connections have to wait while you iterate through each one. If you have 100 connections and only one has data, you still have to work through the other 99 to get to the one that needs servicing.
* poll, epoll, and variations: This uses a modification of the round-robin approach, using a structure to hold an array of each of the connections to be monitored, with a callback mechanism so that when data is identified on a network socket, the handling function is called. The problem with poll is that the size of the structure can be quite large, and modifying the structure as you add new network connections to the list can increase the load and affect performance.
* select: The select() function call uses a static structure, which had previously been hard-coded to a relatively small number (1024 connections), which makes it impractical for very large deployments.
There are other implementations on individual platforms (such as /dev/poll on Solaris, or kqueue on FreeBSD/NetBSD) that may perform better on their chosen OS, but they are not portable and don't necessarily resolve the upper level problems of handling requests.
All of the above solutions use a simple loop to wait and handle requests, before dispatching the request to a separate function to handle the actual network interaction. The key is that the loop and network sockets need a lot of management code to ensure that you are listening, updating, and controlling the different connections and interfaces.
An alternative method of handling many different connections is to make use of the multi-threading support in most modern kernels to listen and handle connections, opening a new thread for each connection. This shifts the responsibility back to the operating system directly but implies a relatively large overhead in terms of RAM and CPU, as each thread will need its own execution space. And if each thread (ergo network connection) is busy, then the context switching to each thread can be significant. Finally, many kernels are not designed to handle such a large number of active threads.
While analysing Icinga 2's socket IO event handling, the libraries and implementations below have been collected too. This thread also sheds more light in modern programming techniques.
Our main "problem" with Icinga 2 is the lack of modern compilers supporting the full C++11 feature set. Recent analysis has proven that gcc on CentOS 6 or SLES 11 is too old for modern programming techniques or anything which implements at least C++14.
Given the below projects, we are also not fans of wrapping C interfaces into C++ code in case you want to look into possible patches.
One key thing for external code is license compatibility with GPLv2.
Modified BSD and Boost licensed code can be pulled into the third-party/ directory, ideally header-only and compiled into the Icinga 2 binary.
Hello, I'll try to add some of my observations. We have about 100 hosts in 30 zones. On my system the problem starts with 2.9.0/2.9.1 on the master and is gone after a downgrade to 2.8.4 with the same config. IMHO it is related to a communication problem between the master and the satellite, due to a firewall which blocks the communication from the master to the satellite and allows the communication from the satellite to the master. The problem is only on the master system. The satellite can run 2.9.1 with no problem.
I hope this might give you another look at the problem.
I really like this:
As we've agreed during our team weekend, we will push this knowledge into the docs underneath "19-technical-concepts.md".
It is super great to have detailed technical documentation that explains how Icinga2 works! Thank you! 👍
Thanks.
For the TLS handshake timeout, I would opt for making this configurable to allow users fine tuning (at the cost of performance with lagging connections).
[2018-09-13 12:41:55 +0000] critical/ApiListener: Client TLS handshake failed (to [10.77.20.8]:10062): Error: Timeout during handshake.
(0) Handling new API client connection
Context:
(0) Handling new API client connection
@Icebird2000 thanks, but that would mean that both, your master and your satellite are trying to connect to each other, or am I mistaken here? Meaning to say, the endpoints have the host
attribute set everywhere on all involved nodes.
Yes, you are right. The host attribute is set on all involved nodes.
zones.conf on the master:

object Zone "sat1.dc.example.com" {
  endpoints = [ "sat1.dc.example.com" ]
  parent = "master"
}

object Endpoint "sat1.dc.example.com" {
  host = "XX.XX.XX.X"
}

/*
 * Global zone for templates
 */
object Zone "global-templates" {
  global = true
}
zones.conf on the satellite:

object Endpoint "mas1.dc.example.com" {
  host = "mas1.dc.example.com"
  port = "5665"
}

object Zone "master" {
  endpoints = [ "mas1.dc.example.com" ]
}

object Endpoint NodeName {
}

object Zone ZoneName {
  endpoints = [ NodeName ]
  parent = "master"
}

/*
 * Global zone for templates
 */
object Zone "global-templates" {
  global = true
}
So, this seems to be a problem with the master actively using Tcp->Connect(), but timing out on the firewall prohibiting this?
What happens in your setup, if you ensure that the master does not have the host
attribute for all satellite endpoints? (instead, the satellite nodes should have the host
attribute for the master endpoint) ... does this change something with 2.9.1?
In short, it works.
I removed the host attribute and tested it for 15 minutes, and no error occurred.
When I re-added the host attribute, the errors came back:
[2018-09-13 17:28:12 +0200] critical/ApiListener: Client TLS handshake failed (to [XX.XX.XX.XX]:5665)
Context:
(0) Handling new API client connection
[2018-09-13 17:28:12 +0200] critical/ApiListener: Client TLS handshake failed (to [XX.XX.XX.XX]:5665)
Context:
(0) Handling new API client connection
Now I run with the host attribute removed from the config.
Hm. This would indicate that a TCP->Connect() was successful, and the TLS->Handshake continues. This really is strange. Thanks for the insights, much appreciated.
@N-o-X I'm not sure if our error handling in TcpSocket::Connect() is good enough. It seems to resolve the host
attribute and then loop, setting the file descriptor which later gets checked. We should test this behaviour with many clients not being reachable.
Further tests from a possible patch will be in #6619, once this is merged you can test the snapshot packages.
The workqueue in the middle still needs discussion, especially with threads waiting endlessly until tcp/tls timeouts happen. The queue may also block pending tasks if workers are acquired. A better approach would likely be a) use a connection pool ourselves with wake-up events, but no threads waiting for tasks, with the possibility to add on-demand threads if performance doesn't scale. b) drop the socketevent implementation and move back to synchronous socket handshakes/polls.
We're reading and evaluating more on this now. http://www.kegel.com/c10k.html
Moving the TLS handshake part out of the async IO design pattern partially works - until you have to handle incoming socket events where no events are triggered anymore. Once you decide to move the remaining read/write operations out of this, you may read from the plain sockets, but there's no way to signal data availability to registered handlers later on in JSON-RPC/HTTP.
One thing I've recognized during my rewrite is that SocketEvents are statically initialized once. This doesn't happen during application startup, but once the first TlsStream object is constructed - maybe too late for the socket IO thing when 7000 socket connect/handshake events then happen.
Something like this without any connection made yet.
(lldb) thr list
Process 47793 stopped
* thread #1: tid = 0xea700b, 0x00007fff59f88d82 libsystem_kernel.dylib`__semwait_signal + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
thread #2: tid = 0xea7023, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
thread #3: tid = 0xea7024, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
thread #4: tid = 0xea7025, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
thread #5: tid = 0xea7026, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
thread #6: tid = 0xea7027, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
thread #7: tid = 0xea7028, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
thread #8: tid = 0xea7029, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
thread #9: tid = 0xea702a, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
watch -n 1 -c "curl -k -s -u root:icinga 'https://localhost:5665/v1/status/ApiListener?pretty=1' | grep num_conn_endpoints"
watch -n 1 -c "ps -T -C icinga2 | wc -l"
500 Clients. Reconnect-Interval lowered from 60 to 10 seconds.
45 secs to fully reconnect 500 clients, 400 after 10 seconds. Thread peak was at 130, then drops to 100 and now 50 once all connections have been established.
Simulated reload in between (stop, start) and in a good spot (seems that the Kernel did not drop the connection yet, or any sort of keepalive re-using). 5 Seconds is very good.
root@nhilverling2:~/icinga2# icinga2 daemon
[2018-09-24 15:21:16 +0000] information/cli: Icinga application loader (version: v2.9.1-190-g91989ff)
...
[2018-09-24 15:21:20 +0000] information/ApiListener: Finished syncing endpoint 'icinga-sat-124' in zone 'icinga-sat-124'.
Hmmm 7000 endpoints with socat crashes my macbook, since fork() fails at some point.
mbmif /usr/local/icinga2/var/lib/icinga2/certs (master *) # socat OPENSSL-LISTEN:7000,certificate=mbmif.int.netways.de.crt,key=mbmif.int.netways.de.key,cafile=ca.crt,fork - > /dev/null
Moving to bigger test VMs.
root@nhilverling2:/usr/local/var/lib/icinga2/certs# socat OPENSSL-LISTEN:7000,certificate=nhilverling2.crt,key=nhilverling2.key,cafile=ca.crt,fork - /dev/null
root@nhilverling2:~/icinga2# vim /usr/local/etc/icinga2/zones.conf
/* SOCAT */
for (i in range(5000)) {
  object Endpoint "sat" + i {
    host = "127.0.0.1"
    port = 7000
  }

  object Zone "sat" + i {
    parent = "master"
    endpoints = [ name ]
  }
}
root@nhilverling2:~# watch -n 1 -c "ps -T -C icinga2 | wc -l"
watch -n 1 -c "curl -k -s -u root:icinga 'https://localhost:5665/v1/status/ApiListener?pretty=1' | grep num_conn_endpoints"
Starts to peak at 93 threads. Starts at :16, 5000 clients are connected at :46 ... so it takes ~30 seconds to reconnect.
More tomorrow.
Takes ~82 seconds to connect 10000 endpoints. Threads stay at 93. After a while, connections drop since there's nothing returned by the dummy clients. In a real world scenario, the cluster heartbeat would be returned and clients stay connected.
One observation - a dynamic thread pool compared to spawning and destroying threads is probably the solution here. 7000 threads for connections are very expensive, and I assume that our switch from boost::thread
to std::thread
also causes troubles here (2.8 -> 2.9 development tasks for C++11). An interesting read: https://www.meetingcpp.com/blog/items/c11-and-boost.html
stampede creates 1000 hosts simultaneously, therefore it opens many connections. This leads up to 2000 threads for this second, and once all connections are established, it fires the requests. 2 out of 1000 result in a 500 error. The dynamic thread pool resizes immediately once the many tasks are finished. The problem with this test is the config object creation, which slows down Icinga 2 - it involves compiling the object and starting it into memory. Less expensive operations, like querying the object status, would return faster.
Following the design pattern for many connections on HTTP servers, it is totally fine to spawn this many threads for a moment. Since the thread pool implementation also kills off 2 threads at max, the resources for dropping idle threads are not as wasted as before - where 5000 threads were just destroyed.
Re-opening for test feedback. Triggering snapshot builds as we speak.
Hey there, I am sorry for necroposting - we have the same problem with 2 masters.
Version icinga2 - The Icinga 2 network monitoring daemon (version: r2.10.2-1)
The question is - should I try to reproduce and file a new bug or post all information here?
Thanks, Ivan
Not all endpoints can reconnect, due to "Client TLS handshake failed" errors after a reload or restart
After upgrading to Icinga-2.9.0 and 2.9.1 we ran into a huge problem with reconnecting to our endpoints.
Do a "systemctl restart icinga2", a "systemctl reload icinga2" or a reload in Icingaweb2 and Director-1.4.3. The following is happening:
Between 15:31:12 and 15:31:14 a reconnect is triggered (and logged) for all 6897 endpoints.
For a few endpoints this is successful; the connection is established and config files are synced and updated.
Suddenly the Client TLS handshake failures appear.
On the client side (10.127.11.99) the following will occur in the logfile.
Also, the API gets unresponsive now and you can't query anything. System and database load isn't higher than in 2.8.4 at this time.
A few minutes later, some systems succeed in reconnecting and syncing, others not.
Another few minutes later, about 1000 systems could do a successful reconnect without any TLS errors, but then the TLS errors in the logfile rise again. Within half an hour only ~3000 endpoints have reconnected; all others have to deal with the strange TLS handshake failures.
With 2.9.0 and 2.9.1 it isn't possible to reconnect all endpoints in a time effective manner.
A Rollback to 2.8.4 fixes this problem immediately. After approximately 5 minutes all endpoints are reconnected.
Expected Behavior
Reconnection in 2.9.0 and 2.9.1 should work as in 2.8.x and before.
Current Behavior
It's not possible to reconnect ALL endpoints. Only a few (~1500 to 2000) could do a successful reconnect. All others have to deal with TLS errors and possible timeouts.
Additionally a restart or reload creates a process in state
Possible Solution
Rollback to 2.8.4 at the moment ...
Steps to Reproduce (for bugs)
Hard to reproduce, because you need at least ~ 7000 endpoints and ~ 140.000 services connected.
Context
Your Environment

* Version used (icinga2 --version): Rollback done to 2.8.4-1
* Operating System and version:
* Enabled features (icinga2 feature list): Disabled features: compatlog debuglog elasticsearch gelf graphite livestatus opentsdb perfdata statusdata syslog
* Config validation (icinga2 daemon -C):
* If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes:

icinga2 object list --type Zone | grep Object | wc -l
6898
icinga2 object list --type Zone | grep Object | wc -l
6899