to: @ryandeivert
cc: @airbnb/binaryalert-maintainers
size: large
resolves: #18
resolves: #46
resolves: #120
Background
The batcher Lambda function for retroactive analysis is error-prone (timeouts in particular), can run for a very long time, and can be invoked multiple times, effectively DoSing your own BinaryAlert deployment.
Changes
Lambda Functions
Remove the batcher Lambda function entirely
Build Lambda functions as a proper Python package to remove the hacky if __package__ import logic
Reduce the S3 connection timeout - the tail end of binary download latencies approaches the 60 second default timeout, but there's no need to wait that long before retrying the connection
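For illustration, the lowered connection timeout might look roughly like the sketch below; the specific values and the client-building helper are my own assumptions, not the code from this PR:

```python
# Illustrative timeout settings (NOT the exact values from this PR):
# connection attempts should fail fast and be retried, while reads of
# large binaries are still given time to finish.
S3_CLIENT_KWARGS = {
    "connect_timeout": 5,            # well under the 60-second default
    "read_timeout": 60,              # large binary downloads can be slow
    "retries": {"max_attempts": 3},  # retry quickly-failed connections
}

def make_s3_client():
    """Build an S3 client that retries slow connection attempts quickly."""
    import boto3
    from botocore.config import Config
    return boto3.client("s3", config=Config(**S3_CLIENT_KWARGS))
```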
Terraform
Enable S3 inventory on the binary S3 bucket (Terraform support for this was only recently added)
Remove BatchEnqueueFailures alarm, since the batcher is gone
Remove the throttle alarms - throttles are more common when invoking Lambda via SQS and are automatically retried
Set a concurrency limit for both Lambda functions (analyzer and downloader). This prevents the whole account from running out of concurrency if there are millions of objects in the queue
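The PR sets these limits in Terraform; purely for illustration, the equivalent API call could be sketched as below (the function name and limit are hypothetical):

```python
def cap_concurrency(lambda_client, function_name: str, limit: int) -> None:
    """Reserve a fixed concurrency slice for one function so that millions
    of queued objects cannot exhaust the account-wide concurrency pool."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )

# Usage (hypothetical names and limit):
#   cap_concurrency(boto3.client("lambda"), "binaryalert_analyzer", 100)
```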
CLI
There are 3 new CLI commands:
purge_queue: Purge the analyzer queue, immediately stopping any retroactive analysis
retro_fast: Add all objects from the latest S3 inventory manifest onto the analysis queue
retro_slow: Enumerate the bucket manually (like the batcher did before)
Retroactive scans use multiple processes in parallel to send messages to SQS
The deploy command no longer starts a retroactive scan
The monolithic manage.py script has been separated into different components in cli/
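The parallel enqueue described above might be sketched roughly as follows; the queue URL, message format, and helper names are my own assumptions, not the PR's actual code:

```python
import itertools
from multiprocessing import Pool

SQS_BATCH_SIZE = 10  # SQS SendMessageBatch accepts at most 10 messages

def batches(keys, size=SQS_BATCH_SIZE):
    """Group S3 object keys into SQS-sized batches."""
    iterator = iter(keys)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk

def _send_batch(chunk):
    """Worker: send one batch of keys to the analyzer queue (hypothetical URL)."""
    import boto3
    boto3.client("sqs").send_message_batch(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/analyzer_queue",
        Entries=[{"Id": str(i), "MessageBody": key} for i, key in enumerate(chunk)],
    )

def enqueue_all(keys, processes=32):
    """Fan batches out across worker processes, as retro_fast/retro_slow do."""
    with Pool(processes) as pool:
        pool.map(_send_batch, batches(keys))
```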
Tests
The individual test commands in .travis.yml have been moved to a standalone script, tests/ci_tests.sh. This makes it easier for contributors to test their changes in exactly the same way that Travis will.
Remove tests/ from coverage measurement - counting the test files themselves artificially inflated the coverage number with extra lines of code.
Testing
$ ./manage.py --help
usage: manage.py [-h] [--version] command
positional arguments:
  command
    apply            Apply any configuration/package changes with Terraform
    build            Build Lambda packages (saves *.zip files in terraform/)
    cb_copy_all      Copy all binaries from CarbonBlack Response into BinaryAlert
    clone_rules      Clone YARA rules from other open-source projects
    compile_rules    Compile all of the YARA rules into a single binary file
    configure        Update basic configuration, including region, prefix, and downloader settings
    deploy           Deploy BinaryAlert (equivalent to unit_test + build + apply)
    destroy          Teardown all of the BinaryAlert infrastructure
    live_test        Upload test files to BinaryAlert which should trigger YARA matches
    purge_queue      Purge the analysis SQS queue (e.g. to stop a retroactive scan)
    retro_fast       Enumerate the most recent S3 inventory for fast retroactive analysis
    retro_slow       Enumerate the entire S3 bucket for slow retroactive analysis
    unit_test        Run unit tests (*_test.py)
$ ./manage.py configure
$ ./manage.py deploy
$ ./manage.py live_test
$ time ./manage.py retro_fast
Reading inventory/.../EntireBucketDaily/2018-08-13T08-00Z/manifest.json
94679: requirements_top_level.txt
Done!
real 0m20.067s
$ time ./manage.py retro_slow
94682: requirements_top_level.txt
Done!
real 1m10.056s
$ ./manage.py cb_copy_all
$ ./manage.py purge_queue
Note that reading from the inventory (retro_fast) enqueues objects many times faster than enumerating them manually. It takes about 80 seconds to enumerate a million objects (with 32 processes on my laptop). This means a multi-million-object bucket will take a few minutes to enqueue for retroactive analysis, but IMO this is much better (and cheaper) than running the batcher Lambda function for several hours.
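As a rough back-of-envelope check of the numbers above (the bucket size is chosen arbitrarily for illustration):

```python
# ~1,000,000 objects enqueued in ~80 seconds with 32 processes
objects_per_second = 1_000_000 / 80   # ~12,500 objects/second

# A hypothetical 5-million-object bucket:
minutes = 5_000_000 / objects_per_second / 60
# => about 6-7 minutes of enqueueing, versus hours for the old batcher
```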
Reviewers
Apologies: this change is bigger than I intended - the CLI was becoming painfully difficult to manage. Most of cli/config.py and cli/manager.py (and their unit tests) are unchanged, except for the addition of inventory / queueing logic.
Coverage increased (+0.5%) to 92.189% when pulling 12692fdc361ab2b613b16f492a2caa11bd5da474 on austin-remove-batcher into ca049c589c6a27abad867a5240d131dbe2b829a5 on master.