Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
2.03k stars, 578 forks

No TLS Handshake applied anymore #7805

Closed Elias481 closed 4 years ago

Elias481 commented 4 years ago

Describe the bug

It looks like (at least) the TLS handshake timeout got lost with the new 2.11 networking stack. We sometimes have restarted machines that do not come online again, for example after a restart of the agent. From the logs I can see that a "starting reconnect" is logged and about 2 hours later a connection reset is received during the TLS handshake. Immediately after that error the connection is set up successfully.

To Reproduce

  1. Prepare some way to let a TLS handshake time out (the TCP connection itself must be established first).
  2. Reconnect to such an agent. The agent does not reconnect until something in the network layer resets the connection, or until you restart the satellite/master.

Expected behavior

A meaningful timeout (the old default of 10 seconds is fine, for example)

Your Environment

Include as many relevant details about the environment you experienced the problem in

Additional context

I can find TlsHandshakeTimeout-related code in the old tlsstream.cpp, but nothing in the current version. I assume there should be something like get_lowest_layer(stream).expires_after(std::chrono::seconds(10)); in NewClientHandlerInternal before sslConn.async_handshake.

mcktr commented 4 years ago

Hi,

Prepare some way to let a TLS handshake time out (the TCP connection itself must be established first)

@Elias481 do you know a way to provoke a TLS handshake timeout? What causes the timeout in your environment?

Kind regards Michael

Elias481 commented 4 years ago

The simplest way to simulate such a situation is, for example:

  1. On a test agent connected to a test satellite or master, shut down icinga2.
  2. Open a stub server on the port that accepts incoming connections but does not answer, initiate an SSL handshake, or close the connection because of a protocol error, e.g. nc -w 3600 -v -l 5665

Observe the logs while doing that. Before the stub server is started, you can see quite a few reconnect attempts failing with connection refused. Afterwards nothing more happens once the connection attempt has started. By the way, if the connection is shut down fairly cleanly, the log states information/ApiListener: Finished reconnecting to endpoint 'xxxnn.subdom.priv.nmop.de' via host '10.101.15.209' and port '5665' without any indication that the connection was closed. In production, by contrast, we got a connection reset because the infrastructure somehow terminated the connection.

The reason is unclear. It happened directly after a reboot, for example, or after restarting the agent. One time I was called to look at it myself, and I made sure the agent was completely stopped (no stale process and no TCP connection on that port visible on the agent side). In any case, such things can happen when there is a real-world network between the servers.
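For hosts without netcat, the stub server described above can be approximated with a few lines of Python. This is only a sketch of the idea: the function name start_silent_listener is made up for illustration, and it does nothing more than accept TCP connections and then stay silent, so a connecting peer's TLS handshake never gets an answer.

```python
import socket
import threading


def start_silent_listener(host="127.0.0.1", port=0):
    """Accept TCP connections, then neither answer nor close them.

    Hypothetical helper for testing. port=0 lets the OS pick a free port;
    pass 5665 (and a reachable bind address) to stand in for a stopped agent.
    Returns the listening socket and the port that was actually bound.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen()
    held = []  # keep accepted sockets referenced so they are not closed

    def accept_loop():
        while True:
            try:
                conn, _addr = server.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)  # never read, never write, never close

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, server.getsockname()[1]
```

Calling start_silent_listener("0.0.0.0", 5665) on the agent host (with icinga2 stopped) should give the connecting side the same kind of peer as the nc command: the TCP connection is established, but no handshake data ever comes back.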

mcktr commented 4 years ago

Reproducer

Setup

Master

# sudo docker run -ti -h deb10i2m1 -p 5665:5665 debian:buster /bin/bash

# apt-get update && apt-get upgrade -y && apt-get install wget gnupg2 ca-certificates vim apt-transport-https -y && echo "deb https://packages.icinga.com/debian/ icinga-buster-snapshots main" > /etc/apt/sources.list.d/icinga.list && wget -O - https://packages.icinga.com/icinga.key | apt-key add - && apt-get update && apt-get install icinga2 monitoring-plugins -y && /usr/lib/icinga2/prepare-dirs

# icinga2 node wizard
Welcome to the Icinga 2 Setup Wizard!

We will guide you through all required configuration details.

Please specify if this is an agent/satellite setup ('n' installs a master setup) [Y/n]: n

Starting the Master setup routine...

Please specify the common name (CN) [deb10i2m1]: 
Reconfiguring Icinga...
Checking for existing certificates for common name 'deb10i2m1'...
Certificates not yet generated. Running 'api setup' now.

Generating master configuration for Icinga 2.
Enabling feature api. Make sure to restart Icinga 2 for these changes to take effect.

Master zone name [master]: 
Default global zones: global-templates director-global
Do you want to specify additional global zones? [y/N]: Please specify the API bind host/port (optional):
Bind Host []: Bind Port []: 
Do you want to disable the inclusion of the conf.d directory [Y/n]: 
Disabling the inclusion of the conf.d directory...
Checking if the api-users.conf file exists...

Done.

Now restart your Icinga 2 daemon to finish the installation!

# vim /etc/icinga2/zones.conf

object Endpoint "deb10i2m1" {
}

object Zone "master" {
    endpoints = [ "deb10i2m1" ]
}

object Zone "global-templates" {
    global = true
}

object Zone "director-global" {
    global = true
}

object Endpoint "deb10i2a1" {
    host = "172.17.0.3"
}

object Zone "deb10i2a1" {
    parent = "master"
    endpoints = [ "deb10i2a1" ]
}

Agent

# sudo docker run -ti -h deb10i2a1 debian:buster /bin/bash

# apt-get update && apt-get upgrade -y && apt-get install wget gnupg2 ca-certificates vim apt-transport-https -y && echo "deb https://packages.icinga.com/debian/ icinga-buster-snapshots main" > /etc/apt/sources.list.d/icinga.list && wget -O - https://packages.icinga.com/icinga.key | apt-key add - && apt-get update && apt-get install icinga2 monitoring-plugins netcat -y && /usr/lib/icinga2/prepare-dirs

# icinga2 node wizard
Welcome to the Icinga 2 Setup Wizard!

We will guide you through all required configuration details.

Please specify if this is an agent/satellite setup ('n' installs a master setup) [Y/n]: 

Starting the Agent/Satellite setup routine...

Please specify the common name (CN) [deb10i2a1]: 

Please specify the parent endpoint(s) (master or satellite) where this node should connect to:
Master/Satellite Common Name (CN from your master/satellite node): deb10i2m1

Do you want to establish a connection to the parent node from this node? [Y/n]: 
Please specify the master/satellite connection information:
Master/Satellite endpoint host (IP address or FQDN): 172.17.0.2
Master/Satellite endpoint port [5665]: 

Add more master/satellite endpoints? [y/N]: 
Parent certificate information:

 Subject:     CN = deb10i2m1
 Issuer:      CN = Icinga CA
 Valid From:  Feb  4 19:12:57 2020 GMT
 Valid Until: Jan 31 19:12:57 2035 GMT
 Fingerprint: 0F 31 F7 30 C3 9E E3 73 56 9E D0 CC 41 16 A6 14 77 05 37 55 

Is this information correct? [y/N]: y

Please specify the request ticket generated on your Icinga 2 master (optional).
 (Hint: # icinga2 pki ticket --cn 'deb10i2a1'): 

No ticket was specified. Please approve the certificate signing request manually
on the master (see 'icinga2 ca list' and 'icinga2 ca sign --help' for details).
Please specify the API bind host/port (optional):
Bind Host []: 
Bind Port []: 

Accept config from parent node? [y/N]: 
Accept commands from parent node? [y/N]: 

Reconfiguring Icinga...
Disabling feature notification. Make sure to restart Icinga 2 for these changes to take effect.
Enabling feature api. Make sure to restart Icinga 2 for these changes to take effect.

Local zone name [deb10i2a1]: 
Parent zone name [master]: 

Default global zones: global-templates director-global
Do you want to specify additional global zones? [y/N]: 

Do you want to disable the inclusion of the conf.d directory [Y/n]: 
Disabling the inclusion of the conf.d directory...

Done.

Now restart your Icinga 2 daemon to finish the installation!

Now sign the certificate on the master.

# icinga2 ca list
Fingerprint                                                      | Timestamp                | Signed | Subject
-----------------------------------------------------------------|--------------------------|--------|--------
4b536c1309b2161f18b6b2993a273b3ea9c84a097bb131b0df2dca4170487701 | Feb  4 19:14:41 2020 GMT |        | CN = deb10i2a1

# icinga2 ca sign 4b536c1309b2161f18b6b2993a273b3ea9c84a097bb131b0df2dca4170487701

Restart Icinga 2 on master and agent.

Problem

Ensure that master and agent communicate. Stop Icinga 2 on the agent and run nc.

nc -w 3600 -v -l -p 5665

The master tries to reconnect but the connection should run into a timeout since the TLS handshake never succeeds. In the log on the master you'll see the following line.

[2020-02-04 19:33:58 +0000] information/ApiListener: Reconnecting to endpoint 'deb10i2a1' via host '172.17.0.3' and port '5665'

The default timeout of 10 seconds is never hit.
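To check from outside Icinga 2 that the peer really stalls in the handshake phase (rather than at TCP connect), a small probe with an explicit deadline can help. The following is a minimal sketch, not Icinga code: probe_tls_handshake is a hypothetical helper, the 10 second default merely mirrors the old timeout mentioned in this issue, and certificate verification is disabled because only the handshake timing is of interest.

```python
import socket
import ssl


def probe_tls_handshake(host, port, timeout=10.0):
    """Try a TLS handshake and report "ok", "timeout" or "error: ...".

    Hypothetical diagnostic helper; not part of Icinga 2.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # probe only; never use this for real traffic
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # The socket timeout set above also limits the blocking
            # handshake performed by wrap_socket().
            with ctx.wrap_socket(raw, server_hostname=host):
                return "ok"
    except socket.timeout:
        return "timeout"
    except OSError as exc:  # refused, reset, TLS protocol errors, ...
        return f"error: {exc}"
```

Against the nc stub from the reproducer, probe_tls_handshake("172.17.0.3", 5665) would be expected to return "timeout" after roughly ten seconds, which is the behaviour the master is missing here.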

Analysis

As @Elias481 already mentioned, it seems like the TLS timeout check was not implemented in the network stack rewrite.

dnsmichi commented 4 years ago

Yep, that's a known bug, thanks for reporting. AFAIK Boost ASIO doesn't provide such a timeout interface, which is why it wasn't implemented during the rewrite. If you can come up with a patch, that would be much appreciated.

Edit: The stream timeout exists, but once set it requires all subsequent operations to complete within that timeout window. For read/write operations, this will cause problems in our stack. See https://www.boost.org/doc/libs/1_70_0/libs/beast/doc/html/beast/using_io/timeouts.html. Likely we will need our own timeout handling with timers, as done here.
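The "own timeout handling with timers" idea can be illustrated independently of Boost. The sketch below uses Python's asyncio purely as a stand-in and is not the Icinga 2 implementation: a timer is armed before one phase (the handshake), aborts that phase if it fires, and is cancelled as soon as the phase completes, so later reads and writes are not bound by it. guard_phase, HandshakeTimeout and the simulated handshakes are all hypothetical names, and cancelling the task stands in for closing the socket.

```python
import asyncio


class HandshakeTimeout(Exception):
    """Raised when the guarded phase does not finish before the timer fires."""


async def guard_phase(phase, timeout):
    """Await `phase`, aborting it if `timeout` seconds pass first.

    The timer covers only this one phase and is cancelled once it completes.
    """
    loop = asyncio.get_running_loop()
    task = asyncio.ensure_future(phase)
    expired = False

    def on_expire():
        nonlocal expired
        expired = True
        task.cancel()  # stand-in for closing the underlying socket

    timer = loop.call_later(timeout, on_expire)
    try:
        return await task
    except asyncio.CancelledError:
        if expired:
            raise HandshakeTimeout(f"phase exceeded {timeout} s") from None
        raise  # cancelled from outside, not by our timer
    finally:
        timer.cancel()  # disarm so later operations are unaffected


async def demo():
    async def fast_handshake():  # simulated peer that answers quickly
        await asyncio.sleep(0.01)
        return "connected"

    async def stalled_handshake():  # simulated peer that never answers
        await asyncio.sleep(3600)

    ok = await guard_phase(fast_handshake(), timeout=0.2)
    try:
        await guard_phase(stalled_handshake(), timeout=0.2)
        stalled = "completed"
    except HandshakeTimeout:
        stalled = "timed out"
    return ok, stalled
```

Running asyncio.run(demo()) exercises both paths. The design point is that the deadline belongs to the handshake alone, which avoids the concern above about a stream-wide timeout also applying to subsequent read/write operations.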

Elias481 commented 4 years ago

From what I understood from the documentation, it is possible to prolong the timer (set a new timeout) for the next operations or to disable it for the following ones (stream.expires_never();). But anyway, I'm fine with any solution. Thanks.

PedroMSantosD commented 4 years ago

Is there a recommendation on which Boost version to use with 2.11.3?