to: @chunyong-lin
cc: @ryandeivert @airbnb/binaryalert-maintainers
size: large
resolves #20
Background
The CarbonBlack downloader Lambda function can time out fairly frequently while waiting for the Response server. Even with Lambda's automatic retries, it's possible to completely lose data. To improve the reliability of the downloader, events can now be directed to a queue instead of to the downloader directly.
For example, if StreamAlert is used to notify the BinaryAlert downloader Lambda, it can now publish to the BinaryAlert downloader queue instead.
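As a rough illustration, publishing a download request to the new queue might look like the sketch below. The queue URL and message shape are hypothetical - the real values come from the Terraform configuration:

```python
import json


def build_download_message(md5: str) -> str:
    """Serialize a CarbonBlack binary MD5 into an SQS message body (assumed shape)."""
    return json.dumps({'md5': md5})


def publish_md5(queue_url: str, md5: str) -> None:
    """Publish a single binary MD5 to the downloader SQS queue."""
    import boto3  # imported here so the pure helper above is usable without AWS credentials
    boto3.client('sqs').send_message(
        QueueUrl=queue_url, MessageBody=build_download_message(md5))
```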
Changes
Terraform and CLI
Create a new SQS queue for buffering downloader events
A nice property of this is that the ./manage.py cb_copy_all command, which copies all binaries from CarbonBlack into BinaryAlert, is now infinitely simpler - it simply publishes messages to the queue.
Events which repeatedly fail to be processed from the downloader queue are sent to a new dead letter queue for debugging.
Add new metric alarms for the downloader queue age and for deliveries to the dead letter queue
Upgrade requirements (including the CarbonBlack API to version 1.3.6)
cbapi is no longer bundled as a .zip file in the repo - instead, it is pip installed during deployment
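The queue-to-DLQ wiring is managed by Terraform, but the redrive relationship can be sketched in Python to show what gets configured. Queue names and the receive count below are illustrative, not the actual values:

```python
import json


def redrive_policy(dead_letter_arn: str, max_receives: int = 5) -> str:
    """Build an SQS RedrivePolicy: after max_receives failed receives,
    SQS moves the message to the dead letter queue for debugging."""
    return json.dumps({
        'deadLetterTargetArn': dead_letter_arn,
        'maxReceiveCount': max_receives,
    })


def create_downloader_queues(prefix: str) -> str:
    """Sketch of pairing the downloader queue with its DLQ via boto3."""
    import boto3  # lazy import: the policy helper above needs no AWS access
    sqs = boto3.client('sqs')
    dlq_url = sqs.create_queue(QueueName=prefix + '_downloader_dead_letter')['QueueUrl']
    dlq_arn = sqs.get_queue_attributes(
        QueueUrl=dlq_url, AttributeNames=['QueueArn'])['Attributes']['QueueArn']
    return sqs.create_queue(
        QueueName=prefix + '_downloader_queue',
        Attributes={'RedrivePolicy': redrive_policy(dlq_arn)},
    )['QueueUrl']
```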
Lambda Functions
The dispatcher is now completely generic - it rotates among a list of SQS queues, polling each one and forwarding the received messages to its configured Lambda function. For BinaryAlert, the dispatcher alternates between polling the analyzer queue and the downloader queue, but the same dispatcher function could be re-used for other projects.
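The rotation described above can be sketched as follows. This is a hypothetical illustration, not the actual dispatcher: the payload shape, batch size, and delete-after-forward behavior are all assumptions:

```python
from itertools import cycle, islice
from typing import Dict, List


def polling_order(queue_urls: List[str], num_polls: int) -> List[str]:
    """Round-robin schedule: which queue to poll on each iteration."""
    return list(islice(cycle(queue_urls), num_polls))


def dispatch(queue_urls: List[str], targets: Dict[str, str], num_polls: int) -> None:
    """Poll each queue in turn, forwarding received messages to its Lambda target."""
    import json

    import boto3  # lazy import: polling_order() above is pure and testable offline
    sqs = boto3.client('sqs')
    lambda_client = boto3.client('lambda')
    for queue_url in polling_order(queue_urls, num_polls):
        messages = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10).get('Messages', [])
        if not messages:
            continue
        lambda_client.invoke(
            FunctionName=targets[queue_url],
            InvocationType='Event',  # async invoke
            Payload=json.dumps({'messages': [msg['Body'] for msg in messages]}),
        )
        sqs.delete_message_batch(
            QueueUrl=queue_url,
            Entries=[{'Id': str(i), 'ReceiptHandle': msg['ReceiptHandle']}
                     for i, msg in enumerate(messages)],
        )
```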
The analyzer supports both invocation with an S3 event and from the dispatcher. This way, users can still connect the analyzer directly to S3 bucket(s) if they so choose.
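Handling both invocation paths comes down to parsing two event shapes. A minimal sketch, assuming the dispatcher wraps SQS message bodies (each a JSON-encoded S3 event) under a `messages` key:

```python
import json
from typing import Any, Dict, Iterator, Tuple


def s3_objects(event: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield (bucket, key) for each object, whether the event came straight
    from S3 or was batched by the dispatcher (payload shape is assumed)."""
    if 'Records' in event:
        events = [event]  # direct S3 notification
    else:
        # Dispatcher batch: each message body is itself a JSON-encoded S3 event
        events = [json.loads(body) for body in event.get('messages', [])]
    for inner in events:
        for record in inner.get('Records', []):
            yield record['s3']['bucket']['name'], record['s3']['object']['key']
```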
The analyzer now raises a FileDownloadError (which is caught and logged) if a file can't be downloaded from S3 due to a 4XX error. This way, the analyzer won't keep retrying to scan files which don't exist, for example.
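The 4XX handling might look roughly like this (a sketch - the function name and error message are illustrative, not the actual implementation):

```python
class FileDownloadError(Exception):
    """Raised when S3 rejects a download with a 4XX (client) error."""


def is_client_error(status: int) -> bool:
    """4XX errors (e.g. missing object, access denied) should not be retried."""
    return 400 <= status < 500


def download_binary(bucket: str, key: str, local_path: str) -> None:
    """Download a binary from S3, converting 4XX failures into FileDownloadError."""
    import boto3  # lazy imports: the helpers above are pure
    from botocore.exceptions import ClientError
    try:
        boto3.client('s3').download_file(bucket, key, local_path)
    except ClientError as error:
        status = error.response['ResponseMetadata']['HTTPStatusCode']
        if is_client_error(status):
            raise FileDownloadError(
                '{} error downloading s3://{}/{}'.format(status, bucket, key))
        raise  # 5XX and throttling errors still propagate so Lambda retries
```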
CI Improvements
Static type-checking (mypy) now enforces type annotations for all function and variable definitions (where they can't be inferred)
mypy skips the tests folder
Pylint runs in a single process, which is better in containerized CI (and allows it to catch more things)
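The mypy enforcement above roughly corresponds to a config like the following (section names and file layout are assumptions; the actual settings live in the repo's mypy config):

```ini
[mypy]
# Require type annotations on all function definitions
disallow_untyped_defs = True

[mypy-tests.*]
# Skip the tests folder
ignore_errors = True
```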
Testing
Deploy to a test account (no downloader)
Enable the downloader and re-deploy: all resources update correctly (e.g. the dashboard is automatically updated to include downloader metrics)
Run ./manage.py cb_copy_all and ./manage.py analyze_all
Coverage decreased (-0.0002%) to 93.017% when pulling c5d5464d768b162c33f6e594a4714b1d932d80fe on austin-generic-dispatcher into ff6bdb3d99a0ba8a9187b42bcfe1f97cdf604c59 on master.