TheTechromancer opened 2 months ago
Took a while to track down this bug, but it appears to be caused by having a lot of scan targets. Every target `DNS_NAME` (if it's not a subdomain of any other target) gets converted into a regex (`Scanner.dns_regexes`), which is then used by excavate to extract subdomains from `HTTP_RESPONSE`s, etc.:
```
[DBUG] excavate.finished: False
[DBUG] running: True
[DBUG] tasks:
[DBUG] - excavate.handle_event(HTTP_RESPONSE("{'url': 'https://evilcorp.com/file/pdf/newsletters/gratitud...", module=httpx, tags={'ip-66-150-222-9', 'endpoint', 'status-200', 'in-scope', 'extension-pdf'})) running for 51 minutes, 22 seconds:
[DBUG] incoming_queue_size: 1516
[DBUG] outgoing_queue_size: 0
```
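For context, here is a rough sketch of how a per-target regex might be built and applied. This is hypothetical; BBOT's actual `Scanner.dns_regexes` implementation may differ:

```python
import re

# Hypothetical sketch: build a regex from a target DNS name that matches
# any subdomain of that target. Not BBOT's actual implementation.
def dns_name_to_regex(target: str) -> re.Pattern:
    escaped = re.escape(target)
    # one or more "label." segments followed by the target itself
    return re.compile(r"((?:[\w-]+\.)+" + escaped + r")", re.IGNORECASE)

text = "see https://api.dev.evilcorp.com/login and mail.evilcorp.com"
regex = dns_name_to_regex("evilcorp.com")
print(regex.findall(text))  # ['api.dev.evilcorp.com', 'mail.evilcorp.com']
```

With hundreds of targets, excavate ends up running hundreds of these regexes against every `HTTP_RESPONSE`, which is where the slowdown compounds.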
In BBOT 2.0, we need to make sure these regexes are parallelized as much as possible. Right now, even with the asyncification of regexes, the individual regexes within a single `excavate.handle_event()` still run sequentially back-to-back, which results in the same slowdown.
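One way to fan the per-event regexes out instead of awaiting them one at a time is `asyncio.gather()` over a thread pool. Note this is a sketch, not BBOT's code, and it doesn't sidestep the GIL (Python's `re` holds it while matching); it mainly stops a single slow `handle_event()` from running its regexes strictly back-to-back on the event loop:

```python
import asyncio
import re

# Sketch: run every regex for one event concurrently in the default
# thread pool, rather than sequentially back-to-back.
async def search_all(regexes, data):
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, r.findall, data) for r in regexes]
    return await asyncio.gather(*tasks)

regexes = [
    re.compile(r"[\w-]+\.evilcorp\.com"),
    re.compile(r"[\w-]+\.evilcorp\.net"),
]
results = asyncio.run(search_all(regexes, "mail.evilcorp.com x.evilcorp.net"))
print(results)  # [['mail.evilcorp.com'], ['x.evilcorp.net']]
```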
@liquidsec
After learning that Aho-Corasick was the inspiration for the original `fgrep`, we might consider integrating with `grep` or a similar tool instead of trying to implement this in Python. These tools are stable, written in C, and don't require adding another Python dependency.
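Shelling out might look something like the following (a sketch only, assuming `grep` is on `PATH`; the pattern and helper name are made up for illustration):

```python
import subprocess
import tempfile

# Sketch: feed all patterns to a single grep invocation via a pattern
# file (-f), using extended regexes (-E) and printing only the matching
# parts (-o). One process handles every pattern at once.
def grep_extract(patterns, data: bytes):
    with tempfile.NamedTemporaryFile("w", suffix=".txt") as f:
        f.write("\n".join(patterns))
        f.flush()
        proc = subprocess.run(
            ["grep", "-E", "-o", "-f", f.name],
            input=data,
            stdout=subprocess.PIPE,
        )
        return proc.stdout.decode().splitlines()

print(grep_extract([r"[a-z.-]+\.evilcorp\.com"], b"visit mail.evilcorp.com today"))
```

The downside is process-spawn overhead per scan and having to marshal data through a pipe, which may or may not beat in-process matching for small inputs.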
Another alternative, which is much faster than grep, is sift.
A much better alternative might be Yara. Yara is written in C, has an official Python library, and appears to release the GIL:
```
pip install yara-python
```

```python
import yara
import concurrent.futures
import time

# Function to generate random data for testing
def generate_random_data(size):
    with open('/dev/urandom', 'rb') as f:
        return f.read(size)

# Function to perform a Yara scan
def yara_scan(data, rules):
    matches = rules.match(data=data)
    return matches

# Yara rules as a string
yara_rules = """
rule PatternMatching
{
    strings:
        $pattern1 = "example1"
        $pattern2 = /example2/
        $pattern3 = /example3/
        $pattern4 = /a{3}/
        $pattern5 = /b{4}/
        $pattern6 = /c{4}/
        $pattern7 = /d{4}/
        $pattern8 = /e{4}/
        $pattern9 = /f{4}/
    condition:
        any of them
}
"""

# Compile Yara rules from string
rules = yara.compile(source=yara_rules)

# Generate random data for testing
data_size = 10**8  # 100 MB of random data
test_data = [generate_random_data(data_size) for _ in range(10)]

# Measure the time taken for serial execution
print('testing serial')
start_time = time.time()
for data in test_data:
    yara_scan(data, rules)
end_time = time.time()
print(f"Serial execution time: {end_time - start_time:.2f} seconds")

# Measure the time taken for parallel execution using ThreadPoolExecutor
print('testing threaded')
start_time = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(yara_scan, data, rules) for data in test_data]
    concurrent.futures.wait(futures)
end_time = time.time()
print(f"Parallel execution time: {end_time - start_time:.2f} seconds")

# Get the extracted strings
for future in futures:
    matches = future.result()
    for match in matches:
        for s in match.strings:
            print(f"{s.identifier}: {s.instances}")
```
Yara vs Python regex benchmarks for reference (number of patterns against input size):

```
yara regexes:

1,000 against 100M   - 0.43 seconds
2,000 against 100M   - 0.51 seconds
5,000 against 100M   - 0.58 seconds
10,000 against 100M  - 0.67 seconds
1,000 against 1000M  - 4.45 seconds
1 against 100M       - 0.18 seconds
1 against 1000M      - 1.74 seconds

yara strings:

1,000 against 100M   - 0.39 seconds
5,000 against 100M   - 0.43 seconds
10,000 against 100M  - 0.45 seconds
1,000 against 1000M  - 4.11 seconds

python regexes:

1,000 against 1M     - 11.85 seconds
100 against 10M      - 12.92 seconds
1,000 against 10M    - 127.75 seconds
1 against 100M       - 1.85 seconds
1 against 1000M      - 19.13 seconds
```
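The Python-side numbers could have come from a loop like this (an assumed harness; the original pure-Python benchmark script isn't shown in the issue, and the pattern shapes here are made up):

```python
import re
import time

# Sketch of a comparable pure-Python benchmark: compile N patterns and
# run each one over the same input serially, as excavate currently does.
def bench_python_regexes(num_patterns, data_size):
    patterns = [re.compile(rf"pattern{i}[a-f0-9]{{4}}") for i in range(num_patterns)]
    data = "x" * data_size  # no matches, so every pattern scans the full input
    start = time.monotonic()
    for p in patterns:
        p.findall(data)
    return time.monotonic() - start

elapsed = bench_python_regexes(10, 10**6)
print(f"{elapsed:.2f} seconds")
```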
There appear to be some slow regexes in excavate: