External output anomalies

remydb commented 10 years ago

Hello,

We have recently updated to Sagan 1.0.0 (we've tried both RC1 and RC2) and have encountered some anomalies in the external output. We call a php script using the 'parsable' output. This PHP script doesn't do much except parsing the output to an array and feeding it to our message queue. After updating to 1.0 we've noticed that some alerts get sent to our PHP script with information that isn't correct. This is the information that our PHP script receives, when parsed to an array (the event_id is an md5 hash we generate and add to the array):

Array
(
    [event_id] => a419d78c299132979c4e72a0a1721a23
    [id] => 5000406
    [message] => [OPENSSH] Accepted publickey
    [classification] => successful-user
    [drop] => False
    [priority] => 1
    [date] => 2014-02-28
    [time] => 13:00:59
    [source] => <srcip>
    [source_port] => 514
    [destination] => <dstip>
    [destination_port] => 22
    [facility] => auth
    [syslog_priority] => info
    [reference] => http://wiki.quadrantsec.com/bin/view/Main/5001126
    [syslog_message] =>  Accepted publickey for webdev from <srcip> port 55327 ssh2
)

For debugging purposes we started writing this array to a log file for every received alert. We then received an FTP bruteforce attack which started logging, creating many instances of the following alert:

Array
(
    [event_id] => bbd5d1156dc1e317b762d72e7080f67f
    [id] => 5000081
    [message] => [PROFTPD] Login failed accessing the FTP server
    [classification] => unsuccessful-user
    [drop] => False
    [priority] => 1
    [date] => 2014-02-28
    [time] => 12:19:45
    [source] => <srcip>
    [source_port] => 514
    [destination] => <dstip>
    [destination_port] => 21
    [facility] => authpriv
    [syslog_priority] => notice
    [reference] => http://wiki.quadrantsec.com/bin/view/Main/5001126
    [syslog_message] =>  0.0.0.0 (<srcip>[<srcip>]) - USER <username> (Login failed): Incorrect password.
)

This is to be expected. However, one of these messages looked different. To illustrate:

cat logphp.txt | grep -B 17 -A 1 <username>
Array
(
    [event_id] => 36781c515bf518fe8df01ecc8fc7313f
    [id] => 5000081
    [message] => [PROFTPD] Login failed accessing the FTP server
    [classification] => unsuccessful-user
    [drop] => False
    [priority] => 1
    [date] => 2014-02-28
    [time] => 12:26:04
    [source] => <srcip>
    [source_port] => 514
    [destination] => <dstip>
    [destination_port] => 21
    [facility] => authpriv
    [syslog_priority] => notice
    [reference] => http://wiki.quadrantsec.com/bin/view/Main/5001126
    [syslog_message] =>  0.0.0.0 (<srcip>[<srcip>]) - USER <username> (Login failed): Incorrect password.
)
--
Array
(
    [event_id] => a8afac0861a260dbd1b182526ad81fcc
    [id] => 5000081
    [message] => [OPENSSH] Accepted publickey
    [classification] => successful-user
    [drop] => False
    [priority] => 1
    [date] => 2014-02-28
    [time] => 12:26:06
    [source] => <srcip>
    [source_port] => 514
    [destination] => <dstip>
    [destination_port] => 21
    [facility] => auth
    [syslog_priority] => info
    [reference] => http://wiki.quadrantsec.com/bin/view/Main/5001126
    [syslog_message] =>  0.0.0.0 (<srcip>[<srcip>]) - USER <username> (Login failed): Incorrect password.
)
--
Array
(
    [event_id] => add431ca07fe5c94f4c726b2415c8e66
    [id] => 5000081
    [message] => [PROFTPD] Login failed accessing the FTP server
    [classification] => unsuccessful-user
    [drop] => False
    [priority] => 1
    [date] => 2014-02-28
    [time] => 12:26:06
    [source] => <srcip>
    [source_port] => 514
    [destination] => <dstip>
    [destination_port] => 21
    [facility] => authpriv
    [syslog_priority] => notice
    [reference] => http://wiki.quadrantsec.com/bin/view/Main/5001126
    [syslog_message] =>  0.0.0.0 (<srcip>[<srcip>]) - USER <username> (Login failed): Incorrect password.
)

As you can see the second alert contains exactly the same syslog message, however the message, classification, facility and syslog priority are incorrect when compared to the other two. Over the last few days we've seen this happening multiple times. So there seems to be a bug in the output being sent to the external program.

fsdaniel commented 10 years ago

Some additional information

Linux log01 3.2.0-25-generic #40-Ubuntu SMP Wed May 23 20:30:51 UTC 2012 x86_64

Distributor ID: Ubuntu Description: Ubuntu 12.04.4 LTS Release: 12.04 Codename: precise

liblognorm (log normalization) support is included.
libdnet (for unified2) support is included.
Compiled on Feb 26 2014 at 15:24:13.

sagan.conf

var FIFO /var/run/sagan/sagan.fifo
var RULE_PATH /logwatch/sagan-rules-1.0
var LOCKFILE /var/run/sagan/sagan.pid
var ALERTLOG /var/log/sagan/alert
var SAGANLOGPATH /var/log/sagan

var SSH_PORT 22
var HTTP_PORT 80
var HTTPS_PORT 443
var TELNET_PORT 23
var DNS_PORT 53
var SNMP_PORT 161
var POP3_PORT 110
var IMAP_PORT 143
var SMTP_PORT 25
var MYSQL_PORT 3306
var MSSQL_PORT 1433
var NTP_PORT 123
var OPENVPN_PORT 1194
var PPTP_PORT 1723
var FTP_PORT 21
var RSYNC_PORT 873
var SQUID_PORT 3128

sagan_host <removed>
sagan_port 514
max_processor_threads 800
ignore_list: /logwatch/ignore.txt
output external: /logwatch/feed-me.php parsable

include $RULE_PATH/classification.config
include $RULE_PATH/reference.config
include $RULE_PATH/gen-msg.map
include $RULE_PATH/protocol.map

normalize: cisco, $RULE_PATH/cisco-normalize.rulebase
normalize: openssh, $RULE_PATH/openssh-normalize.rulebase
normalize: smtp, $RULE_PATH/smtp-normalize.rulebase
normalize: dns, $RULE_PATH/dns-normalize.rulebase
normalize: imap, $RULE_PATH/imap-normalize.rulebase
normalize: su, $RULE_PATH/su-normalize.rulebase
normalize: vmware,  $RULE_PATH/vmware-normalize.rulebase
normalize: linux-kernel, $RULE_PATH/linux-kernel-normalize.rulebase
normalize: windows, $RULE_PATH/windows-normalize.rulebase
normalize: proftpd, $RULE_PATH/proftpd-normalize.rulebase

include $RULE_PATH/adtran.rules
include $RULE_PATH/apache.rules
include $RULE_PATH/apc-emu.rules
include $RULE_PATH/arp.rules
include $RULE_PATH/asterisk.rules
include $RULE_PATH/attack.rules
include $RULE_PATH/bash.rules
include $RULE_PATH/bonding.rules
include $RULE_PATH/bro-ids.rules
include $RULE_PATH/cacti-thold.rules
include $RULE_PATH/citrix.rules
include $RULE_PATH/courier.rules
include $RULE_PATH/digitalpersona.rules
include $RULE_PATH/fortinet-malware.rules
include $RULE_PATH/fortinet.rules
include $RULE_PATH/ftpd.rules
include $RULE_PATH/grsec.rules
include $RULE_PATH/honeyd.rules
include $RULE_PATH/hordeimp.rules
include $RULE_PATH/hostapd.rules
include $RULE_PATH/huawei.rules
include $RULE_PATH/imapd.rules
include $RULE_PATH/ipop3d.rules
include $RULE_PATH/kismet.rules
include $RULE_PATH/knockd.rules
include $RULE_PATH/linux-kernel.rules
include $RULE_PATH/milter.rules
include $RULE_PATH/mongodb.rules
include $RULE_PATH/mysql.rules
include $RULE_PATH/nfcapd-malware.rules
include $RULE_PATH/nfcapd.rules
include $RULE_PATH/nginx.rules
include $RULE_PATH/ntp.rules
include $RULE_PATH/openssh.rules
include $RULE_PATH/openvpn.rules
include $RULE_PATH/oracle.rules
include $RULE_PATH/ossec-mi.rules
include $RULE_PATH/ossec.rules
include $RULE_PATH/php.rules
include $RULE_PATH/postfix.rules
include $RULE_PATH/postgresql.rules
include $RULE_PATH/pptp.rules
include $RULE_PATH/procurve.rules
include $RULE_PATH/proftpd.rules
include $RULE_PATH/proxy-malware.rules
include $RULE_PATH/pure-ftpd.rules
include $RULE_PATH/racoon.rules
include $RULE_PATH/roundcube.rules
include $RULE_PATH/rsync.rules
include $RULE_PATH/samba.rules
include $RULE_PATH/sendmail.rules
include $RULE_PATH/snort.rules
include $RULE_PATH/solaris.rules
include $RULE_PATH/sonicwall.rules
include $RULE_PATH/squid.rules
include $RULE_PATH/ssh-tectia-server.rules
include $RULE_PATH/su.rules
include $RULE_PATH/symantec-ems.rules
include $RULE_PATH/syslog.rules
include $RULE_PATH/tcp.rules
include $RULE_PATH/telnet.rules
include $RULE_PATH/tripwire.rules
include $RULE_PATH/vmpop3d.rules
include $RULE_PATH/vmware.rules
include $RULE_PATH/vpopmail.rules
include $RULE_PATH/vsftpd.rules
include $RULE_PATH/web-attack.rules
include $RULE_PATH/weblabrinth.rules
include $RULE_PATH/windows-auth.rules
include $RULE_PATH/windows-malware.rules
include $RULE_PATH/windows-misc.rules
include $RULE_PATH/windows-mssql.rules
include $RULE_PATH/windows.rules
include $RULE_PATH/wordpress.rules
include $RULE_PATH/xinetd.rules
include $RULE_PATH/zeus.rules
include $RULE_PATH/true-custom.rules

beave commented 10 years ago

Thanks for the info. It looks like the output threat for external is getting over written with new data (ie - improper mutex? lack of mutex?). I'll take a look and see what I can figure out. Will let you know the results ASAP.

fsdaniel commented 10 years ago

We tried updating from the current git master and updated liblognorm to 1.0.1, we keep seeing the problem.

beave commented 10 years ago

Any tips on reproducing the problem? I tried locally, but failed to reproduce. I'll re-try to reproduce the results and get back with you ASAP.

beave commented 10 years ago

I don't think this is a liblognorm issue. I think it's related to the sagan-output.c / output-plugins/sagan-external.c. I'll look this afternoon and see if I can't narrow it down.

fsdaniel commented 10 years ago

We are running with 300 messages per second on average, perhaps you have some ideas on what data we could provide to test this better.

beave commented 10 years ago

I might have located the issue. I misplaced mutex initialization. I just pushed the fix up. Can you download the now RC3 and see if this corrects the issue? Thank you!

fsdaniel commented 10 years ago

Thanks! We have RC3 running right now and will let you know by the end of the day.

fsdaniel commented 10 years ago

Quick update; sadly this did not solve it.

beave commented 10 years ago

Ok. I got something else to try. Let me get to the office and I'll push it up. Stay tuned.

On April 16, 2014 8:18:37 AM EDT, Daniel Koopmans notifications@github.com wrote:

Quick update; sadly this did not solve it.

Reply to this email directly or view it on GitHub: https://github.com/beave/sagan/issues/33#issuecomment-40591893

Champ Clark III Quadrant Information Security

beave commented 10 years ago

Ok. Just pushed up another update. Can you give it a try. It's not a permanent fix, but if it works, it will at least point me in the right direction.

fsdaniel commented 10 years ago

Alright, we are now running the update will keep you posted.

fsdaniel commented 10 years ago

Sadly the changes had no effect, we are still seeing the issue.

beave commented 10 years ago

You downloaded and recompiled? I've essentially made it a single theaded operation so now I'm really confused on the corruption. Can you send me more examples of the corruption?

Thank you.

On April 17, 2014 4:37:26 AM EDT, Daniel Koopmans notifications@github.com wrote:

Sadly the changes had no effect, we are still seeing the issue.

Reply to this email directly or view it on GitHub: https://github.com/beave/sagan/issues/33#issuecomment-40692816

Champ Clark III Quadrant Information Security

fsdaniel commented 10 years ago

Correct, we made some adjustments this morning to run a unique mutex around the output and not use the same one defined. We are waiting for the results on that one also.

Mean while we will think up some adjustments provide you with more samples of data.

Thank you so far again for all the thought you put into this.

beave commented 10 years ago

I'll see if I can't replicate this on our servers. This is bugging the hell outta me. :)

fsdaniel commented 10 years ago

We are currently trying with our best C (very poor) to track the events within sagan so we can catch them going out both at the external processor and the alertlog, adding the id value in both will make it alot easier to find how or why things are out of sync i think.

fsdaniel commented 10 years ago

In addition the mutex change, has kept the issue at bay for now normally we see this happen every hour atleast, its been two hours now. But we rather wait till tomorrow to rule it out completely.

// / External program support / /**/ pthread_mutex_t SaganOutputNonThreadMutex2=PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&SaganOutputNonThreadMutex2); if ( config->sagan_ext_flag ) sagan_ext_thread( Event ); pthread_mutex_unlock(&SaganOutputNonThreadMutex2);

beave commented 10 years ago

I was able to replicate:

ID:1:5000116 Message:[SENDMAIL] Relaying denied Syslog message: I/O error occurred while writing; fd='9', error='No space left on device (28)'

[] [1:5000116] [SENDMAIL] Relaying denied [] [Classification: suspicious-traffic] [Priority: 2] 2014-04-17 10:07:36 10.X.X.X:514 -> 10.X.X.X:514 mail notice Message: I/O error occurred while writing; fd='9', error='No space left on device (28)'

beave commented 10 years ago

I need to test more, but I found where "struct _Sagan_Processor_Info *processor_info_engine = NULL;" was being defined as a global in processors/sagan-engine.c. I moved that to a local variable and have been retesting. So far, this seems to fix this issue, but it certainly needs more testing before I can say 100% . It was definitely an issue either way. I just pushed up "external fix #3". Let me know how it works and I'll continue testing in house.

If it affects external and the ASCII alert log, it likely affects unified2. Which make this a much more serious problem.

fsdaniel commented 10 years ago

We updated our setup with your new build and will keep you posted.

remydb commented 10 years ago

The latest version has been running for a good number of hours now and so far we haven't come across any bad alerts, so it seems to be working properly so far. We'll keep a close eye on it for the time being.

beave commented 10 years ago

How is this looking? We've been running solid since yesterday. I'm interested in knowing your results! I'd love to close this ticket :)

beave commented 10 years ago

I'm closing this out. From our testing, and what remydb has relayed, this sounds resolved. If it rears it's ugly head again, we'll just open another ticket.

fsdaniel commented 10 years ago

My team has reported indeed that the issue has been solved, that is fantastic. Thanks alot!

beave commented 10 years ago

Hello. On a side note, the fix for this also caused a memory leak. I've pushed the fix for the memory leak up to github. Just FYI.

fsdaniel commented 10 years ago

Thank you, we did notice the changes you made last few days and we update the moment new things come out.

We are currently looking into an issue with getting excessive zombie processes from sagan when running external calls.

We will submit a new issue if we think this is something we cannot fix on our end.

sagan 16734 18.2 1.7 8686880 421140 pts/0 Sl+ 10:01 41:38 ./sagan -f /etc/sagan.conf sagan 25531 0.0 0.0 0 0 pts/0 Z+ 10:21 0:00 [sagan-to-mq.php] sagan 25532 0.0 0.3 8676992 79932 pts/0 S+ 10:21 0:00 ./sagan -f /etc/sagan.conf sagan 25730 0.0 0.0 0 0 pts/0 Z+ 10:22 0:00 [sagan-to-mq.php] sagan 25731 0.0 0.0 0 0 pts/0 Z+ 10:22 0:00 [sagan-to-mq.php] sagan 25732 0.0 0.3 8676992 81452 pts/0 S+ 10:22 0:00 ./sagan -f /etc/sagan.conf sagan 26609 0.0 0.0 0 0 pts/0 Z+ 10:25 0:00 [sagan-to-mq.php] sagan 26610 0.0 0.3 8677124 85124 pts/0 S+ 10:25 0:00 ./sagan -f /etc/sagan.conf sagan 31780 0.0 0.0 0 0 pts/0 Z+ 10:38 0:00 [sagan-to-mq.php] sagan 31782 0.0 0.4 8677648 103416 pts/0 S+ 10:38 0:00 ./sagan -f /etc/sagan.conf sagan 3586 0.0 0.0 0 0 pts/0 Z+ 10:50 0:00 [sagan-to-mq.php] sagan 3588 0.0 0.0 0 0 pts/0 Z+ 10:50 0:00 [sagan-to-mq.php] sagan 3589 0.0 0.4 8678176 119216 pts/0 S+ 10:50 0:00 ./sagan -f /etc/sagan.conf sagan 10392 0.0 0.0 0 0 pts/0 Z+ 11:06 0:00 [sagan-to-mq.php] sagan 10394 0.0 0.0 0 0 pts/0 Z+ 11:06 0:00 [sagan-to-mq.php] sagan 10395 0.0 0.5 8678836 141056 pts/0 S+ 11:06 0:00 ./sagan -f /etc/sagan.conf sagan 10428 0.0 0.0 0 0 pts/0 Z+ 11:06 0:00 [sagan-to-mq.php] sagan 10434 0.0 0.5 8678836 141084 pts/0 S+ 11:06 0:00 ./sagan -f /etc/sagan.conf sagan 11437 0.0 0.0 0 0 pts/0 Z+ 11:09 0:00 [sagan-to-mq.php] sagan 11440 0.0 0.0 0 0 pts/0 Z+ 11:09 0:00 [sagan-to-mq.php]

beave commented 9 years ago

I'll have to look deeper into this. I can replicate. I just need to determine why it's happening. Sorry for the delay.

beave commented 9 years ago

This has been fixed as part of https://github.com/beave/sagan/issues/36. This was largely due to a misplaced mutex_lock.

--<snip from issue #36>--

This was happening due to a misplaced mutex_lock(). I've tested the fix for this by "overloading" Sagan with events (10k events at a time) and monitoring to see if any "zombie" processes came up. The fix appeared to "stick". Please test when you have a chance and sorry for the delay. I'm going to close this. I'll re-open if you still have the same issue. I'm pretty certain it is fixed.

beave / sagan

External output anomalies #33