airbnb / binaryalert

BinaryAlert: Serverless, Real-time & Retroactive Malware Detection.
https://binaryalert.io
Apache License 2.0

Replace batcher with S3 inventory #131

Closed · austinbyers closed this 5 years ago

austinbyers commented 5 years ago

to: @ryandeivert
cc: @airbnb/binaryalert-maintainers
size: large
resolves: #18, #46, #120

Background

The batcher function for retroactive analysis is error-prone (timeouts in particular), can run for a very long time, and can be invoked multiple times, effectively DoS-ing your own BinaryAlert deployment.

Changes

- Lambda Functions
- Terraform
- CLI
- Tests

Testing

$ ./manage.py --help
usage: manage.py [-h] [--version] command

positional arguments:
  command     apply          Apply any configuration/package changes with Terraform
              build          Build Lambda packages (saves *.zip files in terraform/)
              cb_copy_all    Copy all binaries from CarbonBlack Response into BinaryAlert
              clone_rules    Clone YARA rules from other open-source projects
              compile_rules  Compile all of the YARA rules into a single binary file
              configure      Update basic configuration, including region, prefix, and downloader settings
              deploy         Deploy BinaryAlert (equivalent to unit_test + build + apply)
              destroy        Teardown all of the BinaryAlert infrastructure
              live_test      Upload test files to BinaryAlert which should trigger YARA matches
              purge_queue    Purge the analysis SQS queue (e.g. to stop a retroactive scan)
              retro_fast     Enumerate the most recent S3 inventory for fast retroactive analysis
              retro_slow     Enumerate the entire S3 bucket for slow retroactive analysis
              unit_test      Run unit tests (*_test.py)

$ ./manage.py configure

$ ./manage.py deploy

$ ./manage.py live_test

$ time ./manage.py retro_fast
Reading inventory/.../EntireBucketDaily/2018-08-13T08-00Z/manifest.json
94679: requirements_top_level.txt
Done!

real    0m20.067s
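For context, retro_fast works off the S3 Inventory report referenced in the manifest path printed above: the manifest lists gzipped CSV data files, each row of which names an object in the bucket, and every key gets enqueued for the analyzer. A minimal sketch of that flow (not the PR's actual code; the bucket name, queue URL, and message format are placeholder assumptions):

```python
"""Sketch only: enqueue every key listed in an S3 Inventory manifest.
BUCKET, MANIFEST_KEY, and QUEUE_URL are hypothetical placeholders."""
import csv
import gzip
import io
import json

import boto3

BUCKET = 'your-binaryalert-binaries'          # hypothetical source bucket
MANIFEST_KEY = 'inventory/.../manifest.json'  # path like the one printed above
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/analyzer_queue'  # hypothetical

s3 = boto3.client('s3')
sqs = boto3.client('sqs')

# The manifest points at one or more gzipped CSV data files, each row of which
# is "bucket","key",... for an object in the inventoried bucket.
manifest = json.loads(s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)['Body'].read())

for data_file in manifest['files']:
    # Inventory data files may live in a separate destination bucket; the same
    # bucket is assumed here to keep the sketch short.
    body = s3.get_object(Bucket=BUCKET, Key=data_file['key'])['Body'].read()
    for row in csv.reader(io.StringIO(gzip.decompress(body).decode('utf-8'))):
        object_key = row[1]
        # The deployed message format is whatever the analyzer Lambda expects;
        # a bare key is used here purely for illustration.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=object_key)
```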

$ time ./manage.py retro_slow
94682: requirements_top_level.txt
Done!

real    1m10.056s
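retro_slow takes the brute-force route instead: page through every key in the bucket with the S3 list API and enqueue as you go. A rough sketch under the same placeholder names (not the PR's implementation):

```python
"""Sketch only: the retro_slow idea - list every object in the bucket and
enqueue it. Same placeholder BUCKET / QUEUE_URL as the sketch above."""
import boto3

BUCKET = 'your-binaryalert-binaries'
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/analyzer_queue'

s3 = boto3.client('s3')
sqs = boto3.client('sqs')

# list_objects_v2 returns at most 1,000 keys per call, so a multi-million-object
# bucket needs thousands of sequential list calls - hence the slower runtime.
count = 0
for page in s3.get_paginator('list_objects_v2').paginate(Bucket=BUCKET):
    for obj in page.get('Contents', []):
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=obj['Key'])
        count += 1

print('Enqueued {} objects'.format(count))
```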

$ ./manage.py cb_copy_all

$ ./manage.py purge_queue

Note that reading from the inventory (retro_fast) enqueues objects many times faster than enumerating them manually. It takes about 80 seconds to enumerate a million objects (with 32 processes on my laptop). This means a multi-million-object bucket will take a few minutes to enqueue for retroactive analysis, but IMO this is much better (and cheaper) than running the batcher Lambda function for several hours.
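The roughly 80-seconds-per-million-objects figure comes from fanning the enqueue work out across worker processes and sending SQS messages in batches of 10 (the SendMessageBatch maximum). A hedged sketch of that pattern, with the queue URL and 32-process default as assumptions rather than the PR's exact code:

```python
"""Sketch only: batch keys into groups of 10 and send them from a process pool.
QUEUE_URL and the 32-process default are assumptions, not the PR's values."""
from multiprocessing import Pool

import boto3

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/analyzer_queue'


def _send_batch(keys):
    """Send up to 10 keys in one SQS SendMessageBatch call."""
    sqs = boto3.client('sqs')  # real code would reuse one client per worker
    sqs.send_message_batch(
        QueueUrl=QUEUE_URL,
        Entries=[{'Id': str(i), 'MessageBody': key} for i, key in enumerate(keys)]
    )


def enqueue_all(keys, processes=32):
    """Fan batches of keys out to a pool of worker processes."""
    batches = [keys[i:i + 10] for i in range(0, len(keys), 10)]
    with Pool(processes) as pool:
        pool.map(_send_batch, batches)
```

In practice, something like enqueue_all would be fed the keys parsed from the inventory CSVs (retro_fast) or from the bucket listing (retro_slow) shown above.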

Reviewers

Apologies: this change is bigger than I intended; the CLI was becoming painfully difficult to manage. Most of cli/config.py and cli/manager.py (and their unit tests) are unchanged, aside from the added inventory and queueing logic.

coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.5%) to 92.189% when pulling 12692fdc361ab2b613b16f492a2caa11bd5da474 on austin-remove-batcher into ca049c589c6a27abad867a5240d131dbe2b829a5 on master.