TheTechromancer opened 2 months ago
Took a while to track down this bug, but it appears to be caused by having a lot of scan targets. Every target `DNS_NAME` (if it's not a subdomain of any other target) gets converted into a regex (`Scanner.dns_regexes`), which is then used by excavate to extract subdomains from `HTTP_RESPONSE`s, etc.:
```
[DBUG] excavate.finished: False
[DBUG] running: True
[DBUG] tasks:
[DBUG] - excavate.handle_event(HTTP_RESPONSE("{'url': 'https://evilcorp.com/file/pdf/newsletters/gratitud...", module=httpx, tags={'ip-66-150-222-9', 'endpoint', 'status-200', 'in-scope', 'extension-pdf'})) running for 51 minutes, 22 seconds:
[DBUG] incoming_queue_size: 1516
[DBUG] outgoing_queue_size: 0
```
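For context, here is a rough sketch of how a per-target regex might be built and applied. This is hypothetical; BBOT's actual `Scanner.dns_regexes` implementation may differ:

```python
import re

# Hypothetical sketch: build a regex from a target DNS name that matches
# any subdomain of that target. Not BBOT's actual implementation.
def dns_name_to_regex(target: str) -> re.Pattern:
    escaped = re.escape(target)
    # one or more "label." segments followed by the target itself
    return re.compile(r"((?:[\w-]+\.)+" + escaped + r")", re.IGNORECASE)

text = "see https://api.dev.evilcorp.com/login and mail.evilcorp.com"
regex = dns_name_to_regex("evilcorp.com")
print(regex.findall(text))  # ['api.dev.evilcorp.com', 'mail.evilcorp.com']
```

With hundreds of targets, excavate ends up running hundreds of these regexes against every `HTTP_RESPONSE`, which is where the slowdown compounds.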
In BBOT 2.0, we need to make sure these regexes are parallelized as much as possible. Right now, even with the asyncification of regexes, the individual regexes within a single `excavate.handle_event()` still run sequentially back-to-back, which results in the same slowdown.
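One way to fan the per-event regexes out instead of awaiting them one at a time is `asyncio.gather()` over a thread pool. Note this is a sketch, not BBOT's code, and it doesn't sidestep the GIL (Python's `re` holds it while matching); it mainly stops a single slow `handle_event()` from running its regexes strictly back-to-back on the event loop:

```python
import asyncio
import re

# Sketch: run every regex for one event concurrently in the default
# thread pool, rather than sequentially back-to-back.
async def search_all(regexes, data):
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, r.findall, data) for r in regexes]
    return await asyncio.gather(*tasks)

regexes = [
    re.compile(r"[\w-]+\.evilcorp\.com"),
    re.compile(r"[\w-]+\.evilcorp\.net"),
]
results = asyncio.run(search_all(regexes, "mail.evilcorp.com x.evilcorp.net"))
print(results)  # [['mail.evilcorp.com'], ['x.evilcorp.net']]
```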
@liquidsec
After learning that Aho-Corasick was the inspiration for the original `fgrep`, we might consider integrating with `grep` or a similar tool instead of trying to implement this in Python. These tools are stable, written in C, and don't require adding another Python dependency.
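Shelling out might look something like the following (a sketch only, assuming `grep` is on `PATH`; the pattern and helper name are made up for illustration):

```python
import subprocess
import tempfile

# Sketch: feed all patterns to a single grep invocation via a pattern
# file (-f), using extended regexes (-E) and printing only the matching
# parts (-o). One process handles every pattern at once.
def grep_extract(patterns, data: bytes):
    with tempfile.NamedTemporaryFile("w", suffix=".txt") as f:
        f.write("\n".join(patterns))
        f.flush()
        proc = subprocess.run(
            ["grep", "-E", "-o", "-f", f.name],
            input=data,
            stdout=subprocess.PIPE,
        )
        return proc.stdout.decode().splitlines()

print(grep_extract([r"[a-z.-]+\.evilcorp\.com"], b"visit mail.evilcorp.com today"))
```

The downside is process-spawn overhead per scan and having to marshal data through a pipe, which may or may not beat in-process matching for small inputs.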
Another alternative, which is much faster than grep, is sift.
A much better alternative might be Yara. Yara is written in C, has an official Python library, and appears to release the GIL:
```
pip install yara-python
```

```python
import yara
import concurrent.futures
import time

# Function to generate random data for testing
def generate_random_data(size):
    with open('/dev/urandom', 'rb') as f:
        return f.read(size)

# Function to perform a Yara scan
def yara_scan(data, rules):
    matches = rules.match(data=data)
    return matches

# Yara rules as a string
yara_rules = """
rule PatternMatching
{
    strings:
        $pattern1 = "example1"
        $pattern2 = /example2/
        $pattern3 = /example3/
        $pattern4 = /a{3}/
        $pattern5 = /b{4}/
        $pattern6 = /c{4}/
        $pattern7 = /d{4}/
        $pattern8 = /e{4}/
        $pattern9 = /f{4}/
    condition:
        any of them
}
"""

# Compile Yara rules from string
rules = yara.compile(source=yara_rules)

# Generate random data for testing
data_size = 10**8  # 100 MB of random data
test_data = [generate_random_data(data_size) for _ in range(10)]

# Measure the time taken for serial execution
print('testing serial')
start_time = time.time()
for data in test_data:
    yara_scan(data, rules)
end_time = time.time()
print(f"Serial execution time: {end_time - start_time:.2f} seconds")

# Measure the time taken for parallel execution using ThreadPoolExecutor
print('testing threaded')
start_time = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(yara_scan, data, rules) for data in test_data]
    concurrent.futures.wait(futures)
end_time = time.time()
print(f"Parallel execution time: {end_time - start_time:.2f} seconds")

# Get the extracted strings
for future in futures:
    matches = future.result()
    for match in matches:
        for s in match.strings:
            print(f"{s.identifier}: {s.instances}")
```
Yara vs Python regex benchmarks for reference (number of patterns against input size):

```
yara regexes:

1,000 against 100M   - 0.43 seconds
2,000 against 100M   - 0.51 seconds
5,000 against 100M   - 0.58 seconds
10,000 against 100M  - 0.67 seconds
1,000 against 1000M  - 4.45 seconds
1 against 100M       - 0.18 seconds
1 against 1000M      - 1.74 seconds

yara strings:

1,000 against 100M   - 0.39 seconds
5,000 against 100M   - 0.43 seconds
10,000 against 100M  - 0.45 seconds
1,000 against 1000M  - 4.11 seconds

python regexes:

1,000 against 1M     - 11.85 seconds
100 against 10M      - 12.92 seconds
1,000 against 10M    - 127.75 seconds
1 against 100M       - 1.85 seconds
1 against 1000M      - 19.13 seconds
```
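The Python-side numbers could have come from a loop like this (an assumed harness; the original pure-Python benchmark script isn't shown in the issue, and the pattern shapes here are made up):

```python
import re
import time

# Sketch of a comparable pure-Python benchmark: compile N patterns and
# run each one over the same input serially, as excavate currently does.
def bench_python_regexes(num_patterns, data_size):
    patterns = [re.compile(rf"pattern{i}[a-f0-9]{{4}}") for i in range(num_patterns)]
    data = "x" * data_size  # no matches, so every pattern scans the full input
    start = time.monotonic()
    for p in patterns:
        p.findall(data)
    return time.monotonic() - start

elapsed = bench_python_regexes(10, 10**6)
print(f"{elapsed:.2f} seconds")
```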
There appear to be some slow regexes in excavate: