MarkBaggett / domain_stats


Issue when processing data through logstash #22

Open The1WhoPrtNocks opened 3 years ago

The1WhoPrtNocks commented 3 years ago

Hi,

I am trying to send my network data to both domain_stats and freq_server for analysis and enrichment. I am sending 20 queries per second on average.

When this data is sent just to freq_server, the VM idles and processes the data like a dream. However, as soon as I also send that same data to domain_stats, the ingestion of the processed data into ELK drops off a cliff.

We have investigated CPU, memory, and disk I/O, and it all seems reasonable on both the VM running domain_stats and on ELK. Judging by the stats query, the memory allocated for caching is consistently under-utilized. The local logging for both domain_stats and Logstash appears to be up to date.

Has anyone else come across this issue? I have tried multiple configs (RDAP vs. ISC).
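
For anyone debugging something similar, a minimal benchmark sketch like the one below can help isolate whether the bottleneck is the lookup service itself or the Logstash pipeline. The host names, ports, and endpoint paths here are placeholders rather than the services' documented APIs; substitute the URLs from your own deployment.

```python
# Minimal latency benchmark for an HTTP enrichment service.
# Host names, ports, and endpoint paths below are placeholders --
# substitute the real domain_stats / freq_server URLs you use in Logstash.
import time
import urllib.request

def benchmark(url_template, samples, n=100):
    """Time n sequential lookups and report events/second."""
    start = time.monotonic()
    for i in range(n):
        domain = samples[i % len(samples)]
        with urllib.request.urlopen(url_template.format(domain), timeout=10) as resp:
            resp.read()
    return n / (time.monotonic() - start)

domains = ["example.com", "google.com", "wikipedia.org"]
# Hypothetical URL patterns -- check your servers' APIs for the real ones.
print("domain_stats e/s:", benchmark("http://domain-stats:5730/{}", domains))
print("freq_server  e/s:", benchmark("http://freq-server:10004/measure/{}", domains))
```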

The1WhoPrtNocks commented 3 years ago

This seems to be time related: both the DB and the cache need to build up their relevant data. I will come back and close this over the weekend if that turns out to be the case.

The1WhoPrtNocks commented 3 years ago

As you can see below, we are running at 95% established and only 6.9% RDAP lookups. However, there is still a ~15 min delay before data is stored (the picture was taken at 17:00 UTC). This delay has reduced over time. The delay is not present at all for a freq_server lookup, which uses the same Logstash config.

I appreciate this is "only" a 15 min delay; however, I was planning on using this domain-creation data to alert and potentially even sinkhole requests.

[Screenshot: domain_stats stats output showing 95% established and 6.9% RDAP lookups, taken at 17:00 UTC]

MarkBaggett commented 3 years ago

Hi @The1WhoPrtNocks. This is useful data. I'd like to see what I can do to identify and resolve the problem. Can you provide the data from the http://domain-stats:port/stats page? I'm curious what your database and cache hit rates are. It would also be useful to have some information from /showcache. I understand that is likely considered sensitive data. I really don't need the domains themselves, BUT the stats that show how many domains you have and what the hit rate per domain is would be useful. If you are willing to either share that or summarize it for me, my email is mbaggett at sans dot org.
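
A quick way to grab both pages for sharing might look like the sketch below. The host and port are placeholders, and any parsing of /showcache is left to the reader, since the page's exact format isn't shown here; strip the domain names from the output before sending if it includes them.

```python
# Fetch the /stats and /showcache pages for sharing. Host/port are
# placeholders -- adapt to your deployment. Remove domain names from
# the /showcache output before sharing; only counts/hit rates are needed.
import urllib.request

BASE = "http://domain-stats:5730"  # adjust to your server

for page in ("/stats", "/showcache"):
    with urllib.request.urlopen(BASE + page) as resp:
        body = resp.read().decode()
    print(f"===== {page} =====")
    print(body)
```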
Thanks.

The1WhoPrtNocks commented 3 years ago

Email sent

The1WhoPrtNocks commented 3 years ago

Hi Mark,

Thank you for your time over the past few days. The cause was unclear at first; it seems the domain_stats lookup was simply the straw that broke the camel's back. By removing the other lookups and keeping just freq and domain_stats, all is well.

So it seems that domain_stats tops out at ~80 e/s (events per second), sometimes spiking to 100-120 e/s, whereas running against just freq_server it was able to spike to 2k e/s to clear the backlog.

MarkBaggett commented 3 years ago

Reopening while we try to fix this speed issue. Here are some notes:

I've been publishing some updates to try and improve the speed. @The1WhoPrtNocks initially saw a consistent 80 e/s being processed.

I disabled session logging and added a cache to a slow function (reduce_domain), and saw an improvement to 500 e/s with bursts up to 1k.
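
(For readers following along, the technique here is ordinary memoization. Below is a minimal sketch under the assumption that reduce_domain is a pure string-reduction function; it is not the actual patch.)

```python
# Illustrative sketch: memoizing a slow, pure function such as
# reduce_domain with functools.lru_cache. Not the actual patch --
# just the general technique it applies.
from functools import lru_cache

@lru_cache(maxsize=65536)
def reduce_domain(fqdn: str) -> str:
    """Reduce a host name to its registrable domain (naive sketch;
    real code must consult the public-suffix list)."""
    parts = fqdn.lower().rstrip(".").split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else fqdn

# Repeated lookups for hosts under the same domain now hit the cache:
assert reduce_domain("www.example.com") == "example.com"
```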

I've created a multiprocessing fork to support multiple cores and a distributed cache across processes. @The1WhoPrtNocks has graciously agreed to test this alpha version of the code. We are working in the multiprocessed branch.
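
(The general pattern, as a minimal sketch rather than the branch's actual code: several worker processes consulting one shared cache, so a domain already resolved by any worker never triggers a second slow lookup.)

```python
# Minimal sketch of the pattern only -- not the multiprocessed branch's
# code: worker processes share one cache via a Manager dict, so a
# cache hit skips the slow (e.g. RDAP) lookup entirely.
from multiprocessing import Manager, Pool

def lookup(args):
    cache, domain = args
    if domain in cache:                 # shared-cache hit: no slow lookup
        return cache[domain]
    result = f"rdap-result-for-{domain}"  # placeholder for the slow query
    cache[domain] = result              # (a real version would handle races)
    return result

if __name__ == "__main__":
    with Manager() as mgr:
        cache = mgr.dict()
        work = [(cache, d) for d in ["a.com", "b.com", "a.com", "c.com"]]
        with Pool(processes=4) as pool:
            print(pool.map(lookup, work))
```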

The1WhoPrtNocks commented 3 years ago

There is a drastic improvement running the multiprocessed branch. During configuration I allocated 9 cores with 12 threads each, which created 9 workers, as expected.

To clear a large backlog, it started and stayed pretty steady at 1k e/s. It then grew (not spiked) gradually to about 1.8k e/s. This testing was conducted with a blank database, which should account for the growth.

I will leave this to run for a while to update the database + cache, then conduct another test of clearing a large back log of events.

It seems the 5-8 min climb to 1.8k e/s was where it topped out naturally; on the second stress test it went straight to 1.8k e/s and stayed there.

MarkBaggett commented 3 years ago

Ok, I'm calling that a win. 1.8k is amazingly close to freq's 2k, considering all of the network latency involved in those RDAP queries.

I cleaned up the code a bit and added some features. I'd appreciate your help getting this one tested, and then I will kill the old version in favor of the multiprocessing one. Here are the instructions for installing the multiprocessing version:

https://github.com/MarkBaggett/domain_stats/blob/multiprocessed/README.md

The new command "domain-stats-settings" lets you configure/reconfigure the daemons. You can change settings such as the number of days for the "established" tag. You can also enable freq_scores and set alert thresholds for each of the values:

```
Set value for ip_address. Default=0.0.0.0 Current=0.0.0.0 (Enter to keep Current):
Set value for local_port. Default=5730 Current=5730 (Enter to keep Current):
Set value for workers. Default=3 Current=3 (Enter to keep Current):
Set value for threads_per_worker. Default=3 Current=3 (Enter to keep Current):
Set value for timezone_offset. Default=0 Current=0 (Enter to keep Current):
Set value for established_days_age. Default=730 Current=730 (Enter to keep Current):
Set value for enable_freq_scores. Default=True Current=True (Enter to keep Current):
Set value for mode. Default=rdap Current=rdap (Enter to keep Current):
Set value for freq_avg_alert. Default=5.0 Current=5.0 (Enter to keep Current):
Set value for freq_word_alert. Default=4.0 Current=4.0 (Enter to keep Current):
Set value for log_detail. Default=0 Current=0 (Enter to keep Current):
```

The other new command, "domain-stats", launches the server. Both commands require you to pass the path to a directory where the configs and database can be stored.
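
For example, assuming /opt/domain_stats as the storage directory (the path itself is your choice):

```
domain-stats-settings /opt/domain_stats
domain-stats /opt/domain_stats
```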