It just keeps happening :rage4:
Do you happen to have a log around the time of the crash? Maybe even a log from Nessus so we can see what it's doing?
I have to ask my colleagues from the security group to get it. Give us a little time to consolidate the logs from Icinga, Apache and Nessus.
We also see this issue where the Nessus security scan crashes the Icinga 2 service. I included the crash report and other information in https://github.com/Icinga/icinga2/issues/6562 (which was marked as a duplicate of this issue).
What exactly does Nessus do in this specific case? Does it just open a TCP socket, or does it do more than a TLS handshake? Any Wireshark dumps to see the packets?
A little status report: I talked with my colleagues from security. If they run a Nessus scan with 30 requests per second, Icinga crashes afterwards. When they reduce it to 5 requests per second, Icinga survives.
How does it happen?
For us it looks like Nessus opens a connection to the Icinga port 5665. Icinga tries to close it, but Nessus answers only with an ACK frame; the connection is never closed with a FIN frame, so the port stays open. At first glance it seems that Icinga survives the scan.
But when you reload the Icinga daemon (e.g. via an automatic config deployment with the Director), it happens: Icinga creates a new process with a new PID and tries to stop the old one, but that doesn't work. With `ps axu` you can see that there are two processes with two PIDs and the old one does not disappear. If you run `systemctl status icinga2`, the status is "reload" and it won't change.
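A rough way to observe this state from the shell (a sketch, assuming the default API port 5665):

```bash
# Connections left behind by the scanner on the API port (look for half-open states)
ss -tanp | grep ':5665'

# After a reload both the old and the new daemon process are visible
ps aux | grep '[i]cinga2'

# systemd stays in the "reload" state and never finishes
systemctl status icinga2
```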
Our problem is that there are no log files like a crash log, and in `journalctl` we don't find an entry for this either.
My colleagues are trying to reproduce this scenario without always having to start Nessus, but at the moment it doesn't work.
Maybe this information helps you for the moment.
@dnsmichi telepathy :-)
@dnsmichi Just thinking: is this problem simply a result of issue #6517? I read what you wrote there. To me it looks like it could be the same problem, something similar, or a consequence of it.
My colleague will check next week whether there are TLS handshakes from the Nessus server in the Icinga log.
It may be related, if the scanner doesn't close the TLS connection cleanly. That's why I want to see more logs and a tcpdump from that scanner - especially the final packets of such a connection.
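Something roughly like this on the monitored node would do (a sketch; replace <scanner-ip> with the scanner's address, 5665 is the default API port):

```bash
# Capture the whole exchange between the scanner and the Icinga API port,
# including the teardown packets (FIN/RST) at the end of each connection
tcpdump -i any -w nessus-icinga.pcap 'host <scanner-ip> and tcp port 5665'
```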
Sorry for the delay, but now we have more logs about the problem. (I am stevie-sy's colleague.)
In this use case our Windows agent "MSLI01-036" (10.1.41.224) crashes when "NESSUS" (10.1.36.101) scans it. The Icinga parent zone is called "network" and its endpoints are "zmon-satellite3" and "zmon-satellite4".
A short timeline:
13:56 - Nessus scan starts
14:04 - the Windows agent is no longer connected to "zmon-satellite4"; all services that should deliver check results to "zmon-satellite4" are UNKNOWN, while services that deliver their check results to "zmon-satellite3" are OK
14:11 - Nessus scan stops
14:14 - after manually stopping and starting Icinga on the Windows agent, the connection works again
All satellites and the agent have already been updated to Icinga 2.9.2.
I forgot to click "comment" before vacation ... thanks a lot, that's exactly what I wanted to see :)
It boils down to Nessus sending some crafted TCP packets which are interpreted as a netstring, but actually aren't one. This forces an immediate Disconnect() when parsing fails.
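For context, JSON-RPC messages on the cluster port are framed as netstrings ("<length>:<payload>,"), and anything that doesn't follow this framing fails the parse. A small illustration (the probe payload below is a made-up example, not necessarily what Nessus sends):

```bash
# A well-formed netstring: byte length, colon, payload, trailing comma
printf '13:{"hello":"x"},'

# A plain HTTP request line has no "<length>:" prefix, so the netstring parser
# bails out with "Invalid NetString (missing :)" and the connection is dropped
printf 'GET / HTTP/1.1\r\n\r\n'
```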
The majority of the scan uses HTTP requests though, and these requests are not authenticated.
[2018-09-27 14:02:50 +0200] warning/HttpServerConnection: Unauthorized request: GET /favicon.iso
[2018-09-27 14:03:02 +0200] information/ApiListener: New client connection from [::ffff:10.1.36.101]:50996 (no client certificate)
[2018-09-27 14:03:02 +0200] warning/JsonRpcConnection: Error while reading JSON-RPC message for identity '': Error: Invalid NetString (missing :)
[2018-09-27 14:03:02 +0200] warning/JsonRpcConnection: API client disconnected for identity ''
[2018-09-27 14:03:02 +0200] warning/JsonRpcConnection: API client disconnected for identity ''
[2018-09-27 14:03:04 +0200] information/HttpServerConnection: No messages for Http connection have been received in the last 10 seconds.
[2018-09-27 14:03:12 +0200] information/ApiListener: New client connection from [::ffff:10.1.36.101]:51016 (no client certificate)
[2018-09-27 14:03:12 +0200] information/HttpServerConnection: Request: GET / (from [::ffff:10.1.36.101]:51016, user: <unauthenticated>)
[2018-09-27 14:03:12 +0200] warning/HttpServerConnection: Unauthorized request: GET /
[2018-09-27 14:03:12 +0200] information/ApiListener: New client connection from [::ffff:10.1.36.101]:51018 (no client certificate)
[2018-09-27 14:03:12 +0200] information/HttpServerConnection: Request: GET /profilemanager (from [::ffff:10.1.36.101]:51018, user: <unauthenticated>)
[2018-09-27 14:03:12 +0200] warning/HttpServerConnection: Unauthorized request: GET /profilemanager
[2018-09-27 14:03:24 +0200] information/ApiListener: New client connection from [::ffff:10.1.36.101]:51042 (no client certificate)
[2018-09-27 14:03:24 +0200] information/HttpServerConnection: Request: GET / (from [::ffff:10.1.36.101]:51042, user: <unauthenticated>)
[2018-09-27 14:03:24 +0200] warning/HttpServerConnection: Unauthorized request: GET /
[2018-09-27 14:03:24 +0200] information/ApiListener: New client connection from [::ffff:10.1.36.101]:51044 (no client certificate)
[2018-09-27 14:03:24 +0200] information/HttpServerConnection: Request: POST /sdk (from [::ffff:10.1.36.101]:51044, user: <unauthenticated>)
[2018-09-27 14:03:24 +0200] information/ApiListener: New client connection from [::ffff:10.1.36.101]:51048 (no client certificate)
[2018-09-27 14:03:24 +0200] information/HttpServerConnection: Request: GET / (from [::ffff:10.1.36.101]:51048, user: <unauthenticated>)
[2018-09-27 14:03:24 +0200] warning/HttpServerConnection: Unauthorized request: GET /
[2018-09-27 14:03:26 +0200] information/ApiListener: New client connection from [::ffff:10.1.36.101]:51076 (no client certificate)
[2018-09-27 14:03:26 +0200] information/ApiListener: New client connection from [::ffff:10.1.36.101]:51082 (no client certificate)
[2018-09-27 14:03:26 +0200] information/HttpServerConnection: Request: GET / (from [::ffff:10.1.36.101]:51076, user: <unauthenticated>)
[2018-09-27 14:03:26 +0200] information/HttpServerConnection: Request: GET / (from [::ffff:10.1.36.101]:51082, user: <unauthenticated>)
[2018-09-27 14:03:26 +0200] warning/HttpServerConnection: Unauthorized request: GET /
[2018-09-27 14:03:26 +0200] warning/HttpServerConnection: Unauthorized request: GET /
In the end, it completely fails to disconnect the remaining connections and likely just stalls everything.
[2018-09-27 14:03:26 +0200] information/HttpServerConnection: Unable to disconnect Http client, I/O thread busy
OK, thanks for the answer and the explanation. Now I understand why the load increases and why Icinga crashes after an automatic deployment with the Director. We are glad that our logs are helping you. I hope you find a solution for this.
Not yet, but at least I know where to look inside the code :)
https://github.com/Icinga/icinga2/blob/master/lib/remote/httpserverconnection.cpp#L78
Maybe it is related to #6514 where connections are not properly closed upon a header request. I need to analyse later what exactly is sent in the raw pcap.
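To see how those connections are (or aren't) torn down, something like this against the capture might help (a sketch, assuming the capture file is named scan.pcap):

```bash
# Show only teardown packets (FIN/RST) on the API port, to check whether
# the scanner ever closes its connections cleanly
tshark -r scan.pcap -Y 'tcp.port == 5665 && (tcp.flags.fin == 1 || tcp.flags.reset == 1)'
```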
The fix for #6517, which uses a dynamic connection thread pool instead of spawning endless threads, likely improves the situation as well. @stevie-sy can you test the snapshot packages by chance on such a client, with Nessus scanning it?
Thank you, we'll test it as soon as possible.
Please do so with 2.10.1 too :)
Yes we will! :-) At the moment we have a lot to do and some colleagues are on vacation, so we need some more time to get a new result. But as soon as we have one, we will tell you immediately.
Did you get the chance to do so already?
Sorry, we didn't find time because of other problems we had to fix or find solutions for, e.g. as I commented here https://github.com/Icinga/icinga2/issues/6514#issuecomment-440730449. In the end we got the same result.
@Al2Klimov you've assigned this issue to me. What should we do?
If I understand the discussion correctly, you haven't tested any snapshot packages yet, have you?
Snapshots no, but we installed every update we got since we created the issue.
Afterthought: the cause seems to be related to the API problem.
Please test the snapshot packages.
@dnsmichi After my vacation, and with our new test setup, we can do this for you ;-) The same goes for the other issue with the log files that you wrote about yesterday.
But for the moment my colleague and I are a little busy :-(
This issue seems to have been addressed by #7005.
Hi @stevie-sy,
any chance you'll deploy the current snapshot packages on a test vm, and let your nessus scanner run against it?
Cheers, Michael
Hi @dnsmichi! Of course, we want to help. Which version from https://packages.icinga.com/epel/ should we test in our test environment? Stefan
Hi,
you can either use the release RPM, which allows you to enable the snapshot repo, or go directly with the snapshot RPMs located here: https://packages.icinga.com/epel/7/snapshot/x86_64/
Note: you'll need EPEL enabled, which provides Boost 1.66+.
yum -y install https://packages.icinga.com/epel/icinga-rpm-release-7-latest.noarch.rpm
yum -y install epel-release
yum makecache
yum install --enablerepo=icinga-snapshot-builds icinga2
Outputs something like this:
======================================================================================================================================================
Package Arch Version Repository Size
======================================================================================================================================================
Installing:
icinga2 x86_64 2.10.4.517.g6a29861-0.2019.04.06+1.el7.icinga icinga-snapshot-builds 29 k
Installing for dependencies:
boost169-chrono x86_64 1.69.0-1.el7 epel 17 k
boost169-context x86_64 1.69.0-1.el7 epel 16 k
boost169-coroutine x86_64 1.69.0-1.el7 epel 16 k
boost169-date-time x86_64 1.69.0-1.el7 epel 21 k
boost169-program-options x86_64 1.69.0-1.el7 epel 125 k
boost169-regex x86_64 1.69.0-1.el7 epel 261 k
boost169-system x86_64 1.69.0-1.el7 epel 7.4 k
boost169-thread x86_64 1.69.0-1.el7 epel 44 k
icinga2-bin x86_64 2.10.4.517.g6a29861-0.2019.04.06+1.el7.icinga icinga-snapshot-builds 3.7 M
icinga2-common x86_64 2.10.4.517.g6a29861-0.2019.04.06+1.el7.icinga icinga-snapshot-builds 142 k
libedit x86_64 3.0-12.20121213cvs.el7 base 92 k
libicu x86_64 50.1.2-17.el7 base 6.9 M
Transaction Summary
======================================================================================================================================================
Install 1 Package (+12 Dependent packages)
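Once installed, a quick sanity check before running the scan could look like this (assuming a systemd host):

```bash
# Confirm the snapshot build is in place, then restart and verify the daemon
icinga2 --version
systemctl restart icinga2
systemctl status icinga2
```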
Note: snapshot builds run every night, whenever we've pushed to git master during the day.
Cheers, Michael
Our colleagues from security have scheduled the scan for the weekend. On Monday we will know more. The tension is rising :-)
A first overview of the scan: after deployment with the Director on the config master, every node survived except the master2 node. But I have to check the logs, because something is puzzling me a little: it looks like the last state is from last Friday, after I updated to the latest snapshot, yet there are a lot of entries in the icinga2 log since then.
This is from the master1/config-master:
The restarts are from deployments or from the Icinga 2 update.
BTW: Logstash is also running with the icinga output plugin. I send a test SNMP trap every hour, and here too everything is fine.
So at first glance: you did a great job.
We did another test with today's snapshot. Everything was fine during the scan and Icinga is still running. So thumbs up! Great job! Congratulations! Bravo!
Many thanks for the test and the kind feedback, this helps a lot and strengthens our decision to move onwards with Boost Asio, Coroutine and Beast :-)
You're welcome. If it helps, we could also test another future version before you release 2.11. Just let us know ;-)
Thanks, I'll get back to you once everything is implemented and merged :-)
Current Behavior
We use Icinga r2.9.1-1 in an HA setup. When our security department scans our IT infrastructure with the Nessus vulnerability scanner, the Icinga nodes crash. systemctl reports the status "reload" and Icinga Web 2 loses its connection. We configured the service daemon for automatic reload as suggested in the documentation, but it seems that didn't help. Our old setup with version r2.8.4-1 without HA survives the scan.
It looks like the now-closed issue for Windows: #6097
At the moment my colleagues from the security department are slowing Nessus down a little, so Icinga survived the last scan. But I don't think slowing down a security scanner (e.g. to fewer requests per second) is a real solution.
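For illustration, an automatic-restart override of the kind we mean looks roughly like this (a sketch; the exact directives are assumptions, not necessarily the documented tip):

```bash
# systemd drop-in that restarts icinga2 automatically if the service fails
mkdir -p /etc/systemd/system/icinga2.service.d
cat <<'EOF' > /etc/systemd/system/icinga2.service.d/override.conf
[Service]
Restart=on-failure
RestartSec=5
EOF
systemctl daemon-reload
```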
Your Environment
Director version (System - About): Git master 71ad855
Icinga Web 2 version and modules (System - About): 2.6.1
Icinga 2 version (icinga2 --version): r2.9.1-1
Operating System and version: CentOS 7
Webserver, PHP versions: Apache 2.4.6-80.el7, rh-php 7.1.8-1.el7
How did you slow Nessus down, which parameters did you change? Can you let me know, because we are facing similar issues, and since the new version of Icinga is not released yet, it's creating trouble for us.
@tushyjw In the end it didn't really help. My colleague found some options while creating new scans (e.g. not to send so many requests per second). We are still waiting for 2.11. So for the moment you have these options:
We are seeing a similar/same problem. We are able to deal with the master by stopping before and restarting after the scan.
My question is about the clients. They are running r2.10.1-1 (the master is r2.10.5-1). I have seen the suggestion that r2.8.2-1 does not have the problem. Can I simply install r2.8.2-1 in place of r2.10.1-1?
Thanks for any clues, GlenG
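For reference, our stop/start stopgap on the master boils down to something like this (a sketch; the times are placeholders, not our actual schedule):

```bash
# Stop Icinga 2 shortly before the scan window and start it again afterwards
# (placeholder: assume the scan runs Saturday between 01:00 and 03:00)
cat <<'EOF' > /etc/cron.d/icinga2-scan-window
55 0 * * 6 root systemctl stop icinga2
15 3 * * 6 root systemctl start icinga2
EOF
```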
2.8.2 has different problems. I would suggest waiting for the 2.11 release.