to: @ryandeivert
cc: @airbnb/binaryalert-maintainers
size: large
resolves: #18
resolves: #46
resolves: #120
Background
The batcher Lambda function for retroactive analysis is error-prone (timeouts in particular), can run for a very long time, and can be invoked multiple times, effectively DoSing your own BinaryAlert deployment.
Changes
Lambda Functions
Remove the batcher Lambda function entirely
Build Lambda functions as a proper Python package to remove the hacky if __package__ import logic
Reduce the S3 connection timeout - the tail end of binary download latencies approaches the 60 second default timeout, but there's no need to wait that long before retrying the connection
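For illustration, the lowered connection timeout might look roughly like the sketch below; the specific values and the client-building helper are my own assumptions, not the code from this PR:

```python
# Illustrative timeout settings (NOT the exact values from this PR):
# connection attempts should fail fast and be retried, while reads of
# large binaries are still given time to finish.
S3_CLIENT_KWARGS = {
    "connect_timeout": 5,            # well under the 60-second default
    "read_timeout": 60,              # large binary downloads can be slow
    "retries": {"max_attempts": 3},  # retry quickly-failed connections
}

def make_s3_client():
    """Build an S3 client that retries slow connection attempts quickly."""
    import boto3
    from botocore.config import Config
    return boto3.client("s3", config=Config(**S3_CLIENT_KWARGS))
```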
Terraform
Enable S3 inventory on the binary S3 bucket (Terraform support for this was only recently added)
Remove BatchEnqueueFailures alarm, since the batcher is gone
Remove the throttle alarms - throttles are more common when invoking Lambda via SQS and are automatically retried
Set a concurrency limit for both Lambda functions (analyzer and downloader). This prevents the whole account from running out of concurrency if there are millions of objects in the queue
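The PR sets these limits in Terraform; purely for illustration, the equivalent API call could be sketched as below (the function name and limit are hypothetical):

```python
def cap_concurrency(lambda_client, function_name: str, limit: int) -> None:
    """Reserve a fixed concurrency slice for one function so that millions
    of queued objects cannot exhaust the account-wide concurrency pool."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )

# Usage (hypothetical names and limit):
#   cap_concurrency(boto3.client("lambda"), "binaryalert_analyzer", 100)
```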
CLI
There are 3 new CLI commands:
purge_queue: Purge the analyzer queue, immediately stopping any retroactive analysis
retro_fast: Add all objects from the latest S3 inventory manifest onto the analysis queue
retro_slow: Enumerate the bucket manually (like the batcher did before)
Retroactive scans use multiple processes in parallel to send messages to SQS
The deploy command no longer starts a retroactive scan
The monolithic manage.py script has been separated into different components in cli/
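The parallel enqueue described above might be sketched roughly as follows; the queue URL, message format, and helper names are my own assumptions, not the PR's actual code:

```python
import itertools
from multiprocessing import Pool

SQS_BATCH_SIZE = 10  # SQS SendMessageBatch accepts at most 10 messages

def batches(keys, size=SQS_BATCH_SIZE):
    """Group S3 object keys into SQS-sized batches."""
    iterator = iter(keys)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk

def _send_batch(chunk):
    """Worker: send one batch of keys to the analyzer queue (hypothetical URL)."""
    import boto3
    boto3.client("sqs").send_message_batch(
        QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/analyzer_queue",
        Entries=[{"Id": str(i), "MessageBody": key} for i, key in enumerate(chunk)],
    )

def enqueue_all(keys, processes=32):
    """Fan batches out across worker processes, as retro_fast/retro_slow do."""
    with Pool(processes) as pool:
        pool.map(_send_batch, batches(keys))
```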
Tests
The individual test commands in .travis.yml have been moved to a standalone script, tests/ci_tests.sh. This makes it easier for contributors to test their changes in exactly the same way that Travis will.
Remove tests/ from coverage measurement - counting the test files themselves artificially inflated the coverage number with extra lines of code.
Testing
$ ./manage.py --help
usage: manage.py [-h] [--version] command
positional arguments:
  command
    apply            Apply any configuration/package changes with Terraform
    build            Build Lambda packages (saves *.zip files in terraform/)
    cb_copy_all      Copy all binaries from CarbonBlack Response into BinaryAlert
    clone_rules      Clone YARA rules from other open-source projects
    compile_rules    Compile all of the YARA rules into a single binary file
    configure        Update basic configuration, including region, prefix, and downloader settings
    deploy           Deploy BinaryAlert (equivalent to unit_test + build + apply)
    destroy          Teardown all of the BinaryAlert infrastructure
    live_test        Upload test files to BinaryAlert which should trigger YARA matches
    purge_queue      Purge the analysis SQS queue (e.g. to stop a retroactive scan)
    retro_fast       Enumerate the most recent S3 inventory for fast retroactive analysis
    retro_slow       Enumerate the entire S3 bucket for slow retroactive analysis
    unit_test        Run unit tests (*_test.py)
$ ./manage.py configure
$ ./manage.py deploy
$ ./manage.py live_test
$ time ./manage.py retro_fast
Reading inventory/.../EntireBucketDaily/2018-08-13T08-00Z/manifest.json
94679: requirements_top_level.txt
Done!
real 0m20.067s
$ time ./manage.py retro_slow
94682: requirements_top_level.txt
Done!
real 1m10.056s
$ ./manage.py cb_copy_all
$ ./manage.py purge_queue
Note that reading from the inventory (retro_fast) enqueues objects many times faster than enumerating them manually. It takes about 80 seconds to enumerate a million objects (with 32 processes on my laptop). This means a multi-million-object bucket will take a few minutes to enqueue for retroactive analysis, but IMO this is much better (and cheaper) than running the batcher Lambda function for several hours.
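As a rough back-of-envelope check of the numbers above (the bucket size is chosen arbitrarily for illustration):

```python
# ~1,000,000 objects enqueued in ~80 seconds with 32 processes
objects_per_second = 1_000_000 / 80   # ~12,500 objects/second

# A hypothetical 5-million-object bucket:
minutes = 5_000_000 / objects_per_second / 60
# => about 6-7 minutes of enqueueing, versus hours for the old batcher
```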
Reviewers
Apologies: this change is bigger than I intended - the CLI was becoming painfully difficult to manage. Most of cli/config.py and cli/manager.py (and their unit tests) are unchanged, except for the addition of inventory / queueing logic.
Coverage increased (+0.5%) to 92.189% when pulling 12692fdc361ab2b613b16f492a2caa11bd5da474 on austin-remove-batcher into ca049c589c6a27abad867a5240d131dbe2b829a5 on master.