csirtgadgets / massive-octo-spice

DEPRECATED - USE v3 (bearded-avenger)
https://github.com/csirtgadgets/bearded-avenger-deploymentkit/wiki

timeout on bambenek dga-feed #341

giovino closed this issue 8 years ago

giovino commented 9 years ago

Testing CIFv2 pre-rc3 on an AWS m4.xlarge instance (4c, 16gb ram)... after successfully parsing all the shipped feeds, I enabled the bambenek dga-feed, it was humming along and then timed out on submission:

/opt/cif/bin/cif-smrt --testmode -c -d -r /etc/cif/rules/default/bambenekconsulting_com.yml -f dga-feed               
[2015-09-01T15:27:58,798Z][16901][INFO][main:234]: staring up...
[2015-09-01T15:27:58,800Z][16901][DEBUG][main:291]: running pid: 16927
[2015-09-01T15:27:58,800Z][16927][DEBUG][main:373]: cleaning up tmp: /var/smrt/cache
[2015-09-01T15:27:58,800Z][16927][INFO][main:379]: removing: /var/smrt/cache/20150901.log
[2015-09-01T15:27:58,814Z][16927][INFO][main:310]: checking for router...
[2015-09-01T15:27:59,520Z][16927][DEBUG][main:324]: osint.bambenekconsulting.com - dga-feed
[2015-09-01T15:27:59,520Z][16927][INFO][main:326]: processing: -r /etc/cif/rules/default/bambenekconsulting_com.yml -f dga-feed
[2015-09-01T15:27:59,522Z][16927][DEBUG][CIF::Smrt:99]: starting at: 2015-09-01T00:00:00Z
[2015-09-01T15:27:59,522Z][16927][DEBUG][CIF::Smrt:104]: fetching...
[2015-09-01T15:27:59,522Z][16927][DEBUG][CIF::Smrt::Fetcher:84]: pulling: http://osint.bambenekconsulting.com/feeds/dga-feed.txt
[2015-09-01T15:28:00,662Z][16927][DEBUG][CIF::Smrt::Fetcher:103]: status: 200 OK
[2015-09-01T15:28:00,733Z][16927][DEBUG][CIF::Smrt:111]: cache: /var/smrt/cache/osint.bambenekconsulting.com-dga-feed
[2015-09-01T15:28:00,733Z][16927][DEBUG][CIF::Smrt:114]: decoding..
[2015-09-01T15:28:00,837Z][16927][DEBUG][CIF::Smrt:84]: data is of type: application/octet-stream
[2015-09-01T15:28:00,870Z][16927][DEBUG][CIF::Smrt:118]: parsing...
[2015-09-01T15:28:09,424Z][16927][DEBUG][CIF::Smrt:124]: checking journal
[2015-09-01T15:28:09,427Z][16927][DEBUG][CIF::Smrt:226]: using log: /var/smrt/cache/20150901.log
[2015-09-01T15:30:16,881Z][16927][DEBUG][CIF::Smrt:127]: writing journal...
[2015-09-01T15:32:26,003Z][16927][DEBUG][CIF::Smrt:131]: processing events: 656416
[2015-09-01T15:42:29,121Z][16927][INFO][CIF::Smrt:175]: processed events: 625462
[2015-09-01T15:42:29,191Z][16927][DEBUG][main:355]: sending 625462 observables...
status: 500 at /opt/cif/bin/cif-smrt line 357.
        main::_main() called at /opt/cif/bin/cif-smrt line 236
[2015-09-01T15:47:10,001Z][16901][INFO][main:257]: shutting down...

if we choose not to support feeds of this size, we should probably note that in the bambenek feed config file.
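For example, a note in /etc/cif/rules/default/bambenekconsulting_com.yml could look roughly like this (a sketch only; the surrounding YAML keys are illustrative and may not match the shipped rule file exactly):

    # sketch only -- exact keys may differ from the shipped rule
    feeds:
      dga-feed:
        remote: http://osint.bambenekconsulting.com/feeds/dga-feed.txt
        # WARNING: very large feed (hundreds of thousands of lines); processing
        # it has been observed to need 16G+ of RAM. Enable with care.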

giovino commented 9 years ago

it looks like some errors got logged a little later... ~5min

cif-worker:

[2015-09-01T15:36:45,356Z][30937][INFO]: 500 messages sent to workers...
[2015-09-01T15:37:01,768Z][30937][INFO]: 500 messages sent to workers...
[2015-09-01T15:47:03,536Z][30942][ERROR]: cif-router timeout...
[2015-09-01T15:47:03,548Z][30944][ERROR]: cif-router timeout...
[2015-09-01T15:47:03,708Z][30947][ERROR]: cif-router timeout...
[2015-09-01T15:47:03,710Z][30943][ERROR]: cif-router timeout...
[2015-09-01T15:47:04,072Z][30945][ERROR]: cif-router timeout...
[2015-09-01T15:47:04,100Z][30946][ERROR]: cif-router timeout...
[2015-09-01T15:47:04,656Z][30949][ERROR]: cif-router timeout...
[2015-09-01T15:47:26,561Z][30937][INFO]: 500 messages sent to workers...
[2015-09-01T15:48:04,598Z][30937][INFO]: 500 messages sent to workers...

cif-starman:

[2015-09-01T15:47:04,117Z][2957][INFO]: rate: ~315.621031606845 o/s
[2015-09-01T15:47:07,091Z][31002][INFO]: generating ping request...
[2015-09-01T15:47:07,103Z][31002][INFO]: sending ping...
[Tue Sep  1 15:47:07 2015] [error] fork(): Cannot allocate memory at /opt/cif/bin/../lib/perl5/CIF/REST/Observables.pm line 107.

[2015-09-01T15:47:12,555Z][7865][INFO]: starting CIF::REST
[2015-09-01T15:47:16,817Z][2760][INFO]: sending: 3516
[2015-09-01T15:47:18,930Z][2760][INFO]: took: ~2.849019
[2015-09-01T15:47:18,931Z][2760][INFO]: rate: ~1234.10900383606 o/s
wesyoung commented 9 years ago

either that or make a default-large dir with a comment in the readme that says "these feeds are good, stable, etc, but require at least 32g ram to process..." (?)
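As a rough sketch of that layout (hypothetical paths, extending the existing /etc/cif/rules/default/ convention):

    /etc/cif/rules/default/          # feeds processed out of the box
    /etc/cif/rules/default-large/    # large feeds such as the bambenek dga-feed;
                                     # README note: good/stable, but plan on at
                                     # least 32G of RAM to process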

wesyoung commented 9 years ago

@giovino thoughts? closed?

furiel commented 9 years ago

Could you please give me some clue as to why we need 32G of RAM to parse the feed file?

I cannot see why the size of the feed file matters. I was thinking that if we can parse smaller files with much less memory, then we could, for example, split large files into smaller ones first and then deal with the individual chunks serially. Since these feed files are parsed line by line, they could be split easily. That is why I think the file size should be irrelevant in terms of memory consumption here, or you would have already implemented such an easy trick. At which point of parsing do we need such a huge amount of RAM? Is there any way we can optimize this?
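To make the idea concrete, here is a minimal sketch of the splitting step, assuming the raw feed has already been fetched to disk (cif-smrt fetches feeds itself, so the process_chunk step here is purely hypothetical):

    # split the large feed into 50k-line pieces and handle them one at a time,
    # so only one chunk's worth of observables is ever held in memory
    split -l 50000 dga-feed.txt chunk_
    for f in chunk_*; do
        process_chunk "$f"   # hypothetical per-chunk parse/submit step
    done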

The only thing I can think of is some kind of memory leak problem that we have here, which slowly accumulates over time. Is this why we need to restart cif-workers and cif-router once a week via a cronjob?

Why do we get the memory allocation error during fork() in @giovino's example?

Any thoughts would be appreciated!

giovino commented 9 years ago

FYI. I tested this feed on a CIF all-in-one host with 32 GB of ram.

  1. I saw total RAM usage spike to ~14GB when cif-starman was processing the dga feed.
  2. I saw total RAM usage spike to ~18GB when cif-smrt ran during subsequent processing.

Looking through the cif-router logs, it still looks angry:

$ ls -lh /var/log
-rw-r--r--  1 cif           cif           373M Sep 28 13:41 cif-router.log
[2015-09-28T13:41:44,735Z][18533][ERROR]: [Cxn] ** [http://localhost:9200]-[599] Could not write to socket: 'Connection reset by peer', called from sub Search::Elasticsearch::Transport::try {...}  at /usr/share/perl5
/Try/Tiny.pm line 81. With vars: {'status_code' => 599,'request' => {'ignore' => [],'qs' => {'refresh' => 1},'path' => '/cif.observables-2015.09.28/observables/_bulk','serialize' => 'bulk','method' => 'POST','body' =
> ['{"index":{"_id":"696a2c6a740b7d08f1067cf83a2abce483c7178876109af4e42af1a0f514b893"}}','
...
wesyoung commented 9 years ago

@furiel most of this has nothing to do with the actual parsing, it's what happens when cif-worker gets hold of the data and starts "resolving" things (domain --> ip, asn, bgp prefix, spamhaus lookups, etc). things get messy from there because you can go from 800,000 domains to [sometimes] 10x that much data to be processed, and because these don't always get "batched" properly [so we've found out], elasticsearch can start hogging more resources as it tries to keep up (hence the errors).
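As a rough, standalone illustration of that fan-out (using dig and the Team Cymru whois service rather than CIF's own resolver code), a single domain observable can expand into several derived records:

    d=badexample.com                         # hypothetical domain from the feed
    dig +short "$d" A                        # domain -> one or more A records
    for ip in $(dig +short "$d" A); do
        whois -h whois.cymru.com " -v $ip"   # IP -> ASN, BGP prefix, allocation info
    done
    # each derived record becomes another observable to enrich and index, which
    # is how ~800k domains can turn into roughly 10x that much data downstream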

within that scope there are some mem leaks we weren't easily able to lock down due to perl and due to some original architecture designs (that and ES LOVES memory generally speaking). at some point you flood elasticsearch (and disk io) with too much data and they both fall apart.

that said, with every version we almost always run into some upper limit with a feed, and those lessons learned become original use-cases for the next version:

https://github.com/csirtgadgets/bearded-avenger

which takes into account not only the scale of these feeds, but some of the design choices that limited things like this in v2 (ie: you're right, it probably shouldn't take 32g ram or more to do this kind of work, knowing what we know now...).

it's not a great answer, but it should shed some light on where we're at and how we plan to address it moving forward...

furiel commented 9 years ago

I was unaware of v3. Now things are getting even more exciting. I looked briefly into the v3 code, and I see you have rewritten the whole project in python! It must have been hard work!

What is your plan for these versions? Is v2 a maintenance release while you focus on v3? Or will both releases live in parallel? If someone would like to contribute to the project, which release do you suggest joining: v2 or v3? If you plan to abandon v2 once v3 is ready, then I would explore v3 instead.

giovino commented 9 years ago

@furiel

I created this to help illuminate the various CIF services that consume resources.

Building a CIF Server https://github.com/csirtgadgets/massive-octo-spice/wiki/Building-a-CIF-Server

This doesn't mean that CIF couldn't/shouldn't be optimized at every corner, but at some point each one of those services needs resources, and a multi-user system with heavy utilization (frequent and large queries) will need more resources.

furiel commented 9 years ago

Thanks @giovino. Nice hint on where to look! I successfully fired up the server in a docker instance. Though I have only 8G ram (minus OS) available, I think that with a fake feed server serving a reduced copy of that large feed I can create a test environment to play around with the code a bit.

wesyoung commented 9 years ago

@furiel typically what's happened is that as we move from beta to RC status, things go into maint mode. what that really means is "unless you have a really good reason to patch something that's broken, start in on the next codebase where things are still taking shape".

generally, we gladly accept PR's for both, but now that we're into the RC stages, we're hesitant to touch anything "architecturally" short of a small patch with a reproducible test case for a generally accepted problem. not for any reason other than, within the scope of this journey, it's just easier to fix those in the next release (and force ourselves to release faster...).

i spent a chunk of time getting the v3 codebase mostly "packageable", so it's really just coming down to iterating on the guts and fixing some of these "we know this is botched in v2 for a number of reasons [poor planning, perl, ...], let's address them in v3" items.

make sense? (i'd go for the v3 codebase in this case).

furiel commented 9 years ago

Thanks, all clear now! I might have questions later on, but I will open new threads in that case. You can close this ticket.

furiel commented 9 years ago

One question for @giovino. I managed to get hold of a server with 32G of ram and 8 cores (somewhere between a small and a large instance) and ran your command, but I could not reproduce the high memory usage.

I used EasyButton for a clean install and executed the installation tests provided in the guide. Then I enabled the large dga feed by uncommenting the relevant lines in the config file and executed the command /opt/cif/bin/cif-smrt --testmode -c -d -r /etc/cif/rules/default/bambenekconsulting_com.yml -f dga-feed, the same one you used above. I logged memory consumption by running the free -m command every second. There was an increase in total memory usage from 6G to 8G, but I did not see the 16G peak we were previously talking about.
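A minimal sketch of that kind of per-second sampling (one of many ways to log it; the log file name is just an example):

    # append a used-memory sample (in MB) once per second while cif-smrt runs
    while true; do
        printf '%s ' "$(date -u +%FT%TZ)"
        free -m | awk '/^Mem:/ {print $3 " MB used"}'
        sleep 1
    done >> mem-usage.log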

I tried to reproduce it 3 times, leaving some time (12 hours) between the executions: once right after the installation tests ran yesterday, once this morning, and once this evening, but none of them showed such a peak, just a ~2G increase, which dropped back slowly after the command finished.

Do you have any clue what might be the difference between your execution and mine? Are there any extra steps I am missing?

giovino commented 9 years ago

@furiel

I do not know why you were not able to replicate what I saw. A few possibilities:

  1. (unlikely) the size of the feed was smaller when you tested, today it seems bigger:

    grep -v '#' dga-feed.txt | wc -l
    656328
  2. One of the updates since I last tested changed things? I browsed through the updates since Sept 1st, nothing stuck out as obvious.
furiel commented 9 years ago

@giovino I checked this topic again and I suspect it somehow depends on the feed file. I got high memory usage (20G, up from 8G) by passing --not-before "40 days ago" to the command. There is no such peak if I do not provide the --not-before option. Checking the code, I see the default value for the option is my $notbefore = 'today';
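For reference, that is the same invocation as above with the extra option added:

    /opt/cif/bin/cif-smrt --testmode -c -d \
        -r /etc/cif/rules/default/bambenekconsulting_com.yml -f dga-feed \
        --not-before "40 days ago"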

Based on that I would expect there to be considerably more old entries in the parsed file, which could make such a difference. But quickly counting the lines, I see the majority of entries are from the current day, and older entries are far fewer in number. So the situation is still unclear.

    $ cat dga-feed.txt | wc -l
    656342
    $ cat dga-feed.txt | grep -e 2015-10-18 | wc -l
    578943
    $ cat dga-feed.txt | grep -e 2015-10-17 | wc -l
    15477
    $ date "+%Y-%m-%d"
    2015-10-18