airbnb / binaryalert

BinaryAlert: Serverless, Real-time & Retroactive Malware Detection.
https://binaryalert.io
Apache License 2.0

Update remote rules nightly #95

Closed — ghost closed this issue 6 years ago

ghost commented 6 years ago

What would it take to run a nightly job to update the remote rule sets? Adding sources to REMOTE_RULE_SOURCES makes unit_test fail.

/opt/binaryalert/rules/clone_rules.py

REMOTE_RULE_SOURCES = {
    'https://github.com/Neo23x0/signature-base.git': ['yara'],
    'https://github.com/YARA-Rules/rules.git': ['CVE_Rules'],
    'https://github.com/SupportIntelligence/Icewater.git': ['']
}
$ ./manage.py unit_test
.........................................................................F
======================================================================
FAIL: test_update_rules (tests.rules.update_rules_test.UpdateRulesTest)
Verify which rules files were saved and deleted.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python3.6/unittest/mock.py", line 1179, in patched
    return func(*args, **keywargs)
  File "/opt/binaryalert/tests/rules/update_rules_test.py", line 52, in test_update_rules
    self.assertEqual(expected_files, set(compile_rules._find_yara_files()))
AssertionError: Items in the second set but not the first:
'github.com/SupportIntelligence/Icewater.git/CVE_Rules/cloned.yara'

----------------------------------------------------------------------
Ran 74 tests in 18.957s

FAILED (failures=1)
TEST FAILED: Unit tests failed
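
One way to keep the unit test from breaking when sources change is to derive the expected file set from REMOTE_RULE_SOURCES itself instead of hardcoding paths. A rough sketch, not BinaryAlert's actual test code — the helper name and the cloned.yara placeholder (mirroring the mocked clone in the failure above) are hypothetical:

```python
# Copied from the clone_rules.py configuration above.
REMOTE_RULE_SOURCES = {
    'https://github.com/Neo23x0/signature-base.git': ['yara'],
    'https://github.com/YARA-Rules/rules.git': ['CVE_Rules'],
    'https://github.com/SupportIntelligence/Icewater.git': ['']
}


def expected_rule_files(sources, placeholder='cloned.yara'):
    """Build the set of relative rule paths a mocked clone of `sources` should produce."""
    expected = set()
    for url, subfolders in sources.items():
        # 'https://github.com/Neo23x0/signature-base.git' -> 'github.com/Neo23x0/signature-base.git'
        repo_dir = url.split('//', 1)[-1]
        for sub in subfolders:
            parts = [repo_dir] + ([sub] if sub else []) + [placeholder]
            expected.add('/'.join(parts))
    return expected
```

The test could then compare this derived set against `_find_yara_files()` rather than a literal set of paths.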

Need a way to make sure all of the python libraries are available for the rules.

/opt/binaryalert/rules/clone_rules.py

REMOTE_RULE_SOURCES = {
    'https://github.com/Neo23x0/signature-base.git': ['yara'],
    'https://github.com/YARA-Rules/rules.git': [''],
    'https://github.com/SupportIntelligence/Icewater.git': ['']
}
$ ./manage.py compile_rules
Traceback (most recent call last):
  File "./manage.py", line 495, in <module>
    main()
  File "./manage.py", line 491, in main
    manager.run(args.command)
  File "./manage.py", line 352, in run
    getattr(self, command)()  # Command validation already happened in the ArgumentParser.
  File "./manage.py", line 421, in compile_rules
    compile_rules.compile_rules(COMPILED_RULES_FILENAME)
  File "/opt/binaryalert/rules/compile_rules.py", line 36, in compile_rules
    externals={'extension': '', 'filename': '', 'filepath': '', 'filetype': ''})
yara.SyntaxError: ./Mobile_Malware/Android_FakeApps.yar(101): invalid field name "app_name"

Would it be better to remove the rules or install the missing python libraries?

/opt/binaryalert/rules/compile_rules.py

    # Prune any cloned rule file which fails to compile on its own (e.g. rules
    # that need the androguard module) instead of aborting the whole run.
    # os and yara are already imported at the top of compile_rules.py; pass the
    # same externals that the real compile step uses so rules referencing them
    # aren't removed by mistake.
    for relative_path in _find_yara_files():
        rule_path = os.path.join(RULES_DIR, relative_path)
        try:
            yara.compile(filepath=rule_path,
                         externals={'extension': '', 'filename': '',
                                    'filepath': '', 'filetype': ''})
        except yara.Error:
            os.remove(rule_path)

    yara_filepaths = {relative_path: os.path.join(RULES_DIR, relative_path)
                      for relative_path in _find_yara_files()}

Compilation requires enough memory to complete; these rule sets needed at least a t2.small to build.

$ ./manage.py compile_rules
Traceback (most recent call last):
  File "./manage.py", line 495, in <module>
    main()
  File "./manage.py", line 491, in main
    manager.run(args.command)
  File "./manage.py", line 352, in run
    getattr(self, command)()  # Command validation already happened in the ArgumentParser.
  File "./manage.py", line 421, in compile_rules
    compile_rules.compile_rules(COMPILED_RULES_FILENAME)
  File "/opt/binaryalert/rules/compile_rules.py", line 45, in compile_rules
    externals={'extension': '', 'filename': '', 'filepath': '', 'filetype': ''})
MemoryError

Only a certain number of rules can be applied before receiving this error.

$ ./manage.py apply
Traceback (most recent call last):
  File "./manage.py", line 495, in <module>
    main()
  File "./manage.py", line 491, in main
    manager.run(args.command)
  File "./manage.py", line 352, in run
    getattr(self, command)()  # Command validation already happened in the ArgumentParser.
  File "./manage.py", line 382, in apply
    subprocess.check_call(['terraform', 'apply', '-auto-approve=false'])
  File "/usr/lib64/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['terraform', 'apply', '-auto-approve=false']' returned non-zero exit status 1
austinbyers commented 6 years ago

Thanks for all the feedback! We've been wanting to improve the entire rule sourcing process for a while now, and I'm excited to get some feedback in that regard. Before going into the specific issues you've brought up, let me ask this:

Would it be helpful if rules were stored separately? For example, if YARA rules were just stored in an S3 bucket, then they could be updated without re-deploying BinaryAlert. This adds some latency to the analyzers (which have to download/decompress the compiled rules file once per Lambda container), but then you would not need a terraform apply to update YARA rules. This is something that has come up before, and we'd love any discussion about this approach.
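
To illustrate the latency trade-off: the download/decompress cost is paid once per Lambda container, not once per invocation, if the rules blob is cached at module level. A minimal sketch under assumed names — the bucket, key, and function are hypothetical, and `s3_client` is injected (anything with a boto3-style `get_object`) so it can be faked in tests:

```python
_CACHED_RULES = None  # Survives across invocations within one Lambda container.


def get_compiled_rules(s3_client, bucket='my-binaryalert-rules',
                       key='compiled_yara_rules.bin'):
    """Download the compiled rules blob once per container and cache it.

    `s3_client` is anything with a boto3-style get_object(Bucket=..., Key=...);
    in a real deployment it would be boto3.client('s3').
    """
    global _CACHED_RULES
    if _CACHED_RULES is None:
        response = s3_client.get_object(Bucket=bucket, Key=key)
        _CACHED_RULES = response['Body'].read()
    return _CACHED_RULES
```

Warm invocations then skip S3 entirely, so the added latency only shows up on cold starts.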

Another benefit of storing YARA rules in S3 is that a Lambda function could run on a regular interval (e.g. daily) to automatically update the rules, obviating the need for an EC2 instance entirely (serverless FTW!). The downside is that this automatically bundles untrusted rules files from the Internet. Malicious YARA rules could, for example, cause buffer overflows in YARA to gain control over the analyzer Lambda execution, which has access to the S3 bucket with your files. Thoughts?

Now, let's go into some specific issues you've encountered:

  1. Changing rule sources should not break unit tests. This can be easily fixed, which I'm happy to do.
  2. The androguard library is not included, which breaks some mobile YARA rules. This is a YARA library, not a Python library, and unfortunately, androguard has a somewhat tricky install process which involves modifying YARA source files and including the cuckoo library. We can add this to the backlog, but this leads to:
  3. Rule source declaration should be more expressive. In particular, there should be a way to ignore specific rule files or even specific rules. Right now, you can only specify a top-level repo and a set of subfolders to clone. Since only .yar and .yara files are bundled by BinaryAlert, you can manually rename rule files to something like .yara.DISABLED to ignore them, but it would be great if they were just removed during the clone process.
  4. Rule compilation takes a fair amount of memory. I'm not sure of an easy way around this one. Pre-compiling the rules saves the analyzers a significant amount of time (and therefore $$), and so far we've been able to compile even tens of thousands of rules with only a few GB of RAM. I'll be sure to update the documentation to list the memory requirement (and minimal EC2 machine size).
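
The clone-time removal from point 3 could look something like the following prune pass. This is a hypothetical helper, not part of BinaryAlert; the skip patterns are illustrative (the Mobile_Malware entry matches the androguard-dependent rules from the traceback earlier):

```python
import fnmatch
import os

# Hypothetical skip list: patterns are relative to the rules directory.
SKIP_PATTERNS = [
    'github.com/YARA-Rules/rules.git/Mobile_Malware/*',  # needs androguard
    '*.yara.DISABLED',
]


def prune_rules(rules_dir, skip_patterns=SKIP_PATTERNS):
    """Delete cloned rule files matching a skip pattern; return what was removed."""
    removed = []
    for root, _, files in os.walk(rules_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), rules_dir)
            if any(fnmatch.fnmatch(rel, pattern) for pattern in skip_patterns):
                os.remove(os.path.join(root, name))
                removed.append(rel)
    return sorted(removed)
```

Running this at the end of clone_rules would keep the disabled rules out of the compile step without manual renaming.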

Next week, I'll open individual issues to address the points above, as well as potentially separating the rules.

ghost commented 6 years ago

My vote would go toward storing the YARA rules in an S3 bucket, making BinaryAlert fully serverless. Providing a method for collaboration could also remove the need for automatic updates; YARA rule management might use https://github.com/PUNCH-Cyber/YaraGuardian, for example. Upon upload of a YARA rule to the S3 bucket, would it be possible to validate the required Python libraries and the memory impact to Lambda? Happy to test anything needed!!

The question seems to be whether users will be scanning large S3 buckets or running BinaryAlert as an analysis workflow, which is my use case.

crobo1337 commented 6 years ago

TL;DR: using git, cron, and deploy.py, I'm able to manage rulesets via relatively simple git commits. Large-scale "feed BA everything" is my flavor of deployment.

I agree that making YARA rules more fluid is a needed feature. I've been able to implement a hacky rule sync using an EC2 instance, basic cron scheduling, and git: check commit hashes; if new, sync from git locally and terraform deploy, else exit. This allows my analysts to merge any rule changes they need without any interaction from engineers, and because the deploy errors out on unit test failures, it's been relatively robust in my testing.
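
The commit-hash check in that cron flow can be sketched as a small helper. Everything here is illustrative of the approach, not code from this repo; a cron wrapper would follow a True result with `git pull` and `./manage.py deploy`:

```python
import subprocess


def repo_has_updates(repo_dir):
    """Return True when the remote's HEAD differs from the local checkout.

    Uses `git ls-remote` to read the remote HEAD, so no fetch is needed and
    no local refs are modified just to poll for changes.
    """
    remote_head = subprocess.check_output(
        ['git', 'ls-remote', 'origin', 'HEAD'], cwd=repo_dir
    ).split()[0].decode()
    local_head = subprocess.check_output(
        ['git', 'rev-parse', 'HEAD'], cwd=repo_dir
    ).strip().decode()
    return remote_head != local_head
```

Exiting early when the hashes match keeps the nightly job cheap: terraform only runs when the rules actually changed.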

I think the added costs and latency associated with direct S3 rule storage might be a bit prohibitive, but I haven't done the exact math to back that opinion up. Some middle ground might be a Lambda that checks the git repo/S3 bucket/whatever rule source directly for changes, pulls down the ruleset, compiles/terraforms all of the rules, and re-deploys the environment. So the rules are still stored 'locally' with the Lambdas as compiled files, and as uncompiled source in whichever source control you choose.

Also, re: scanning large S3 buckets vs. an analysis workflow: I'm installing BA as a detective control inline with our other controls. I'm basically trying to turn the analyst workflow from "look at all of these alerts, choose the ones that might be interesting, run YARA against them to determine 'flavor', extract malware, detonate, analyze, mitigate" into "write YARA rules to catch sketchy executions and emails, and let automation handle the extract, detonate, and mitigate steps." So my use case definitely tends toward 'large S3 buckets', but I can see the want for both.

austinbyers commented 6 years ago

Closing this as part of the more expressive YARA rule cloning from #98 and #99, but feel free to open another issue about any specific follow-ups!

For example, if loading YARA rules from S3 is something you would want, go ahead and open an issue for it. We've found the local rules in the repo to be the most effective so far.