Open TheTechromancer opened 1 month ago
I was able to reproduce this with debug enabled. The culprit doesn't appear to be a module:
2024-08-14 23:20:49,341 [DEBUG] bbot.scanner scanner.py:1029 MODULE MEMORY USAGE:
2024-08-14 23:20:49,342 [DEBUG] bbot.scanner scanner.py:1029 - bucket_amazon: 2.50MB
2024-08-14 23:20:49,342 [DEBUG] bbot.scanner scanner.py:1029 - bucket_azure: 2.50MB
2024-08-14 23:20:49,342 [DEBUG] bbot.scanner scanner.py:1029 - bucket_firebase: 2.50MB
2024-08-14 23:20:49,342 [DEBUG] bbot.scanner scanner.py:1029 - bucket_google: 2.50MB
2024-08-14 23:20:49,342 [DEBUG] bbot.scanner scanner.py:1029 - json: 2.00MB
2024-08-14 23:20:49,342 [DEBUG] bbot.scanner scanner.py:1029 - dnsbrute_mutations: 933.06KB
Doing a second run with memray.
Memray reports pretty normal memory usage:
So it seems like either massdns/httpx is responsible, or else one of the engines is misbehaving. Running again with a ps monitor.
Looks like a memory leak in the web helper:
----------------------------------------
Timestamp: 2024-08-16 04:12:45
Top Memory Usage Processes:
PID USER %MEM MB COMMAND
459384 root 13.6% 1083.85 MB BBOT WebHelper
459376 root 5.3% 428.63 MB /root/.cache/pypoetry/virtualenvs/bbot-aKeTQkGo-py3.12/bin/python -c import sys; from importlib import import_module; sys.argv = ['/root/.cache/pypoetry/virtualenvs/bbot-aKeTQkGo-py3.12/bin/bbot', '-t', 'dell.com', '-p', 'subdomain-enum', 'web-basic']; sys.exit(import_module('bbot.cli').main())
5710 root 1.4% 117.17 MB tmux
459392 root 1.1% 87.38 MB BBOT DNSHelper
459432 root 1.0% 80.30 MB bbot process pool
459437 root 0.9% 72.26 MB bbot process pool
459480 root 0.8% 71.32 MB bbot process pool
459669 nobody 0.6% 50.22 MB massdns -r /root/.bbot/cache/25152d64f6205b681a2b170969063ff7030fd01c -s 1000 -t A -o J -q
459670 root 0.5% 40.71 MB httpx -silent -json -include-response -threads 50 -timeout 5 -retries 1 -header User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.2151.97 -response-size-to-read 5242880 -r 1.1.1.1,1.0.0.1
399 root 0.4% 37.25 MB /usr/bin/containerd
----------------------------------------
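For anyone wanting to reproduce this kind of monitoring, here is a minimal sketch of a ps-based monitor. It is not the script used above; the function names `parse_ps_output` and `monitor` and the 60-second interval are assumptions for illustration, and it relies on the procps `ps -eo` column syntax found on Linux.

```python
import subprocess
import time
from datetime import datetime


def parse_ps_output(text, limit=10):
    """Parse `ps -eo pid,user,%mem,rss,args` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue
        pid, user, mem_percent, rss_kb, command = parts
        rows.append({
            "pid": int(pid),
            "user": user,
            "mem_percent": float(mem_percent),
            "rss_mb": int(rss_kb) / 1024,  # ps reports RSS in KiB
            "command": command,
        })
    # Largest resident set first
    rows.sort(key=lambda row: row["rss_mb"], reverse=True)
    return rows[:limit]


def monitor(interval=60):
    """Print the top memory consumers every `interval` seconds (runs forever)."""
    while True:
        output = subprocess.run(
            ["ps", "-eo", "pid,user,%mem,rss,args"],
            capture_output=True, text=True, check=True,
        ).stdout
        print("-" * 40)
        print(f"Timestamp: {datetime.now():%Y-%m-%d %H:%M:%S}")
        print("Top Memory Usage Processes:")
        for row in parse_ps_output(output):
            print(f"{row['pid']} {row['user']} {row['mem_percent']}% "
                  f"{row['rss_mb']:.2f} MB {row['command']}")
        time.sleep(interval)
```

Calling `monitor()` in a separate terminal while the scan runs gives a rolling view of which process is actually growing.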
The culprit seems to be httpx:
...although this might actually be an extremely large response, i.e. from wayback.
It really does look like a memory leak in httpx:
There is a discussion for a similar issue: https://github.com/encode/httpx/discussions/2785.
Still happening even after the fix. Very mysterious. I took all the URLs from the offending scan (the one that was killed for memory reasons) and ran them through this script:
import sys
import ssl
import asyncio
import httpx
import time
from datetime import datetime


async def fetch_url(client, url):
    start_time = time.time()
    try:
        response = await client.get(url)
        return url, response.status_code, time.time() - start_time
    except Exception as e:
        return url, str(e), time.time() - start_time


async def process_urls(urls, client, active_tasks, max_concurrent=20):
    url_iter = iter(urls)
    # Initial population of the task pool
    for _ in range(min(max_concurrent, len(urls))):
        url = next(url_iter)
        task = asyncio.create_task(fetch_url(client, url))
        active_tasks[task] = (url, time.time())
    # Process tasks and replace them as they complete
    while active_tasks:
        done, pending = await asyncio.wait(active_tasks.keys(), return_when=asyncio.FIRST_COMPLETED)
        for completed_task in done:
            yield await completed_task
            del active_tasks[completed_task]
            try:
                url = next(url_iter)
                new_task = asyncio.create_task(fetch_url(client, url))
                active_tasks[new_task] = (url, time.time())
            except StopIteration:
                pass  # No more URLs to process


async def print_status(active_tasks):
    while active_tasks:
        print(f"\nStatus at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}:")
        for task, (url, start_time) in active_tasks.items():
            duration = time.time() - start_time
            print(f"Task for URL: {url}, Running for {duration:.2f} seconds")
        await asyncio.sleep(5)


async def main():
    with open(sys.argv[-1], 'r') as file:
        urls = [line.strip() for line in file]
    ssl_context_noverify = ssl.create_default_context()
    ssl_context_noverify.check_hostname = False
    ssl_context_noverify.verify_mode = ssl.CERT_NONE
    ssl_context_noverify.options &= ~ssl.OP_NO_SSLv2 & ~ssl.OP_NO_SSLv3
    ssl_context_noverify.set_ciphers("ALL:@SECLEVEL=0")
    ssl_context_noverify.options |= 0x4  # Add the OP_LEGACY_SERVER_CONNECT option
    transport = httpx.AsyncHTTPTransport(retries=0, verify=ssl_context_noverify)
    async with httpx.AsyncClient(transport=transport) as client:
        active_tasks = {}
        status_printer = asyncio.create_task(print_status(active_tasks))
        async for url, status, duration in process_urls(urls, client, active_tasks):
            print(f"Completed: URL: {url}, Status: {status}, Duration: {duration:.2f} seconds")
        status_printer.cancel()


if __name__ == "__main__":
    asyncio.run(main())
Not a single issue; memory usage didn't peak above 100MB. This probably means we can rule out the possibility of any of those URLs being a single huge file that would crash the process.
I bet if I can attach to the WebHelper when it's misbehaving, I can figure out the cause.
I thought it might be wayback, but the memory still leaks even with wayback disabled. Could it be cookies taking up all that space?
Next thing to test: run all the requests with their original args/kwargs.
Is it because we're forgetting to clean out self.client_tasks?
Fixed in 49c3c7b.
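For context, the general pattern behind this kind of leak is a container of asyncio tasks that is never pruned, so finished tasks and their results stay referenced. The sketch below shows one common way to avoid that. It is not the actual change in 49c3c7b; `TaskTracker`, `spawn`, and `demo` are hypothetical names used only for illustration.

```python
import asyncio


class TaskTracker:
    """Hypothetical helper: tracks tasks and forgets them once they finish."""

    def __init__(self):
        self.tasks = set()

    def spawn(self, coro):
        task = asyncio.create_task(coro)
        self.tasks.add(task)
        # Without this callback, finished tasks (and their results) would
        # stay referenced by self.tasks for the lifetime of the tracker.
        task.add_done_callback(self.tasks.discard)
        return task


async def demo():
    tracker = TaskTracker()
    for i in range(100):
        tracker.spawn(asyncio.sleep(0, result=i))
    tracked_before = len(tracker.tasks)
    await asyncio.gather(*tracker.tasks)
    await asyncio.sleep(0)  # let any remaining done callbacks run
    return tracked_before, len(tracker.tasks)
```

Running `asyncio.run(demo())` returns the number of tracked tasks before and after completion; the second number dropping to zero is what you'd expect if cleanup is working.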
Reopening as this still seems to be an issue.
Today I ran a long-running scan, waited until it hit 1GB memory, then attached to it with gdb and went hunting for the http engine. Inside its thread, I found only one task:
{<Task pending name='Task-62' coro=<EngineServer.worker() running at /root/bbot/bbot/core/engine.py:489> wait_for=<Future pending cb=[_AsyncSocket._remove_finished_future(event_list=deque([_FutureEvent(...re._NoTimer'>)]), event=_FutureEvent(...re._NoTimer'>))() at /root/.cache/pypoetry/virtualenvs/bbot-aKeTQkGo-py3.12/lib/python3.12/site-packages/zmq/_future.py:475, Task.task_wakeup()]> cb=[_run_until_complete_cb() at /usr/lib/python3.12/asyncio/base_events.py:182]>}
It's the main worker task:
File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
self._bootstrap_inner()
File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
self.run()
File "/root/bbot/bbot/core/helpers/process.py", line 27, in run
super().run()
File "/usr/lib/python3.12/threading.py", line 1010, in run
self._target(*self._args, **self._kwargs)
File "/root/bbot/bbot/core/engine.py", line 302, in server_process
asyncio.run(engine_server.worker())
File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/usr/lib/python3.12/asyncio/base_events.py", line 674, in run_until_complete
self.run_forever()
File "/usr/lib/python3.12/asyncio/base_events.py", line 641, in run_forever
self._run_once()
File "/usr/lib/python3.12/asyncio/base_events.py", line 1949, in _run_once
event_list = self._selector.select(timeout)
File "/usr/lib/python3.12/selectors.py", line 468, in select
fd_event_list = self._selector.poll(timeout, max_ev)
File "/root/bbot/bbot/core/engine.py", line 489, in worker
client_id, binary = await self.socket.recv_multipart()
Which was idle, and waiting for the next command:
{'self': <bbot.core.helpers.web.engine.HTTPEngine object at 0x73784c875610>, 'client_id': b'\x00\x13\x7fkQ', 'binary': b'\x80\x04\x95?\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x01c\x94K\x00\x8c\x01a\x94\x8c+https://internetdb.shodan.io/143.166.196.21\x94\x85\x94u.', 'message': {'c': 0, 'a': ('https://internetdb.shodan.io/143.166.196.21',)}, 'cmd': 0, 'args': ('https://internetdb.shodan.io/143.166.196.21',), 'kwargs': {}, 'command_name': 'request', 'command_fn': <bound method HTTPEngine.request of <bbot.core.helpers.web.engine.HTTPEngine object at 0x73784c875610>>, 'coroutine': <coroutine object EngineServer.run_and_return at 0x737838669b40>, 'task': <Task finished name='Task-479548' coro=<EngineServer.run_and_return() done, defined at /root/bbot/bbot/core/engine.py:410> result=None>}
There were no current tasks or child tasks:
(Pdb) httpengine = tasks[0].get_coro().cr_frame.f_locals["self"]
(Pdb) httpengine
<bbot.core.helpers.web.engine.HTTPEngine object at 0x73784c875610>
(Pdb) len(httpengine.tasks)
0
(Pdb) len(httpengine.child_tasks)
0
This debunks the theory that our task tracking is causing the memory leak.
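The trick used in the Pdb session above (pulling `self` out of a suspended coroutine's frame) can be tried in isolation. This is a minimal sketch under stated assumptions: `Engine`, `engine_from_task`, and `demo` are hypothetical stand-ins, not BBOT code, and the worker here just parks on a future rather than reading from a socket.

```python
import asyncio


class Engine:
    """Hypothetical stand-in for an engine with a long-running worker."""

    def __init__(self):
        self.tasks = []

    async def worker(self):
        # Park forever, like a worker waiting on its next command.
        await asyncio.get_running_loop().create_future()


def engine_from_task(task):
    """Recover the `self` bound to a suspended coroutine method."""
    frame = task.get_coro().cr_frame  # None once the coroutine has finished
    return frame.f_locals["self"] if frame is not None else None


async def demo():
    engine = Engine()
    worker = asyncio.create_task(engine.worker())
    await asyncio.sleep(0)  # let the worker start and suspend
    found = [engine_from_task(t) for t in asyncio.all_tasks() if t is worker]
    worker.cancel()
    return found[0] is engine
```

Once you have the object, attributes like `tasks` can be inspected directly, which is how the empty task lists above were confirmed.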
So what is the cause?
First, let's look at the httpx clients:
(Pdb) httpengine.web_client
<bbot.core.helpers.web.client.BBOTAsyncClient object at 0x73784c23b950>
(Pdb) import sys
(Pdb) sys.getsizeof(httpengine.web_client)
48
(Pdb) httpengine.web_client._transport
<httpx.AsyncHTTPTransport object at 0x7378556910a0>
(Pdb) httpengine.web_client._transport._pool
<AsyncConnectionPool [Requests: 0 active, 0 queued | Connections: 0 active, 0 idle]>
(Pdb) httpengine.web_clients
{1: <bbot.core.helpers.web.client.BBOTAsyncClient object at 0x73784c23b950>}
Nothing out of the ordinary there.
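One caveat about the 48-byte figure: `sys.getsizeof` is shallow and only counts the object itself, not anything it references, so it can't rule out a large object graph hanging off the client. A rough deep-size estimate can be sketched with `gc.get_referents`. The helper name `deep_getsizeof` is an assumption, and the result is only an approximation (shared objects are counted once, and memory held by C extensions is not visible).

```python
import gc
import sys
from types import FunctionType, ModuleType


def deep_getsizeof(obj):
    """Approximate total size of obj plus everything reachable from it."""
    seen = set()
    total = 0
    stack = [obj]
    while stack:
        current = stack.pop()
        # Skip classes, modules and functions so the walk doesn't wander
        # into the whole interpreter, and skip anything already counted.
        if id(current) in seen or isinstance(current, (type, ModuleType, FunctionType)):
            continue
        seen.add(id(current))
        total += sys.getsizeof(current)
        stack.extend(gc.get_referents(current))
    return total
```

In a session like the one above, `deep_getsizeof(httpengine.web_client)` would give a more meaningful number than the shallow size, though it may be slow on large graphs.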
So far we have two seemingly conflicting reports of where the leak is. Both of them are inside httpx: the first is in httpcore._async.http11._receive_response_body(), the other is in httpcore._backends.anyio.streams.tls.do_handshake().
In any case, both of these point to a bug deep inside the httpcore library.
Can we learn anything from the garbage collector?
Garbage collector:
(Pdb) import gc
(Pdb) gc.isenabled()
True
(Pdb) gc.get_threshold()
(700, 10, 10)
(Pdb) gc.collect()
19626
(Pdb) gc.get_count()
(27, 0, 0)
(Pdb) len(gc.garbage)
0
(Pdb) len(gc.get_objects())
1082490
According to Claude, this means:
- The garbage collector is working (it's enabled and just collected a significant number of objects).
- There are no objects that the garbage collector considers uncollectable at this moment.
- Despite the recent collection, there's still a large number of objects (over 1 million) being tracked by the garbage collector.
Counted the types:
219378 <class 'set'>
217845 <class 'dict'>
98791 <class 'radixtarget.tree.base.RadixTreeNode'>
97228 <class 'list'>
71061 <class 'ipaddress.IPv4Address'>
52571 <class 'ipaddress.IPv4Network'>
41906 <class 'tuple'>
31085 <class 'bbot.core.event.base.DNS_NAME'>
27189 <class 'function'>
24253 <class 'dns.rdtypes.IN.A.A'>
22596 <class 'dns.name.Name'>
20753 <class 'bbot.core.event.base.IP_ADDRESS'>
17194 <class 'tldextract.tldextract.Trie'>
8827 <class 'dns.rdtypes.ANY.PTR.PTR'>
8654 <class 'weakref.ReferenceType'>
8263 <class 'ipaddress.IPv6Address'>
7600 <class 'urllib.parse.ParseResult'>
7494 <class 'ipaddress.IPv6Network'>
7110 <class 'cell'>
7054 <class 'dns.rdtypes.ANY.CNAME.CNAME'>
6297 <class 'dns.rdtypes.IN.AAAA.AAAA'>
5537 <class 'bbot.core.event.base.URL_UNVERIFIED'>
5461 <class 'builtin_function_or_method'>
3789 <class 'collections.defaultdict'>
3539 <class 'getset_descriptor'>
3486 <class 'dns.rrset.RRset'>
3378 <class 'type'>
3211 <class 'method'>
2695 <class 'omegaconf.nodes.AnyNode'>
2695 <class 'omegaconf.base.Metadata'>
2544 <class 'bbot.core.helpers.async_helpers._Lock'>
2191 <class 'bbot.core.event.base.OPEN_TCP_PORT'>
1979 <class 'bbot.core.event.base.URL'>
1925 <class 'wrapper_descriptor'>
1626 <class 'method_descriptor'>
1480 <class 'dns.message.QueryMessage'>
1478 <class 'dns.resolver.Answer'>
1478 <class 'dns.message.ChainingResult'>
1387 <class 'member_descriptor'>
1237 <class 'module'>
1208 <class '_frozen_importlib.ModuleSpec'>
1148 <class 'property'>
1122 <class 'dns.rdtypes.ANY.NS.NS'>
1104 <class '_frozen_importlib_external.SourceFileLoader'>
1055 <class 'omegaconf.base.ContainerMetadata'>
992 <class 'bbot.core.event.base.HTTP_RESPONSE'>
949 <class 'frozenset'>
942 <class 'omegaconf.dictconfig.DictConfig'>
791 <class 'typing.ForwardRef'>
747 <class 'classmethod'>
600 <class 'collections.deque'>
563 <class 'antlr4.atn.Transition.EpsilonTransition'>
557 <class 'typing._UnionGenericAlias'>
538 <class 'typing._GenericAlias'>
531 <class 'staticmethod'>
521 <class 'collections._tuplegetter'>
492 <class 'dns.rdtypes.ANY.TXT.TXT'>
472 <class '_abc._abc_data'>
428 <class 'coroutine'>
411 <class 're.Pattern'>
402 <class 'abc.ABCMeta'>
357 <class 'enum.property'>
350 <class 'dns.rdtypes.ANY.MX.MX'>
341 <class 'pycparser.ply.yacc.MiniProduction'>
318 <class 'antlr4.atn.ATNState.BasicState'>
317 <class '_cython_3_0_10.cython_function_or_method'>
305 <class '_contextvars.Context'>
274 <class 'dataclasses.Field'>
256 <class 'cffi.model.PointerType'>
251 <class 'async_generator'>
199 <class 'pathlib.PosixPath'>
199 <class 'asyncio.locks.Lock'>
191 <class 'rich.style.Style'>
189 <class 'soupsieve.css_types.SelectorList'>
186 <class 'contextlib._AsyncGeneratorContextManager'>
176 <class '_cffi_backend.CType'>
162 <class 'inspect.Parameter'>
152 <class 'pkg_resources.DistInfoDistribution'>
152 <class 'pkg_resources.PathMetadata'>
151 <class 'logging.Logger'>
149 <class '_frozen_importlib_external.FileFinder'>
136 <class 'soupsieve.css_types.Selector'>
132 <class '_asyncio.Task'>
132 <class 'asyncio.locks.Condition'>
130 <class 'dns.rdtypes.ANY.SOA.SOA'>
128 <class 'enum.EnumType'>
128 <class 'antlr4.IntervalSet.IntervalSet'>
128 <class 'urllib.parse.SplitResult'>
124 <class '_asyncio.Future'>
123 <class '_asyncio.FutureIter'>
117 <class 'typing.TypeVar'>
114 <enum 'SocketOption'>
113 <class 'omegaconf.listconfig.ListConfig'>
107 <class 'cffi.model.FunctionPtrType'>
104 <class 'baddns.lib.signature.BadDNSSignature'>
102 <class 'antlr4.atn.Transition.AtomTransition'>
Here I'm assuming 98791 <class 'radixtarget.tree.base.RadixTreeNode'> is cloudcheck. 219378 <class 'set'> and 217845 <class 'dict'> do seem pretty high. But there's absolutely nothing here to indicate a leak in httpx, or at least one that's related to garbage collection.
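A type count like the one above can be produced with a few lines of Python. This is a sketch of one way to do it, not necessarily the exact commands used; `count_object_types` and `print_type_counts` are hypothetical names. Note that `gc.get_objects()` only returns objects tracked by the collector, so plain strings and bytes won't show up in the totals.

```python
import gc
from collections import Counter


def count_object_types(limit=20):
    """Return the most common types among GC-tracked objects."""
    counts = Counter(type(obj) for obj in gc.get_objects())
    return counts.most_common(limit)


def print_type_counts(limit=20):
    """Print counts in the same 'count type' layout as the listing above."""
    for obj_type, count in count_object_types(limit):
        print(f"{count} {obj_type}")
```

Comparing two such snapshots taken some time apart is usually more telling than a single one, since it shows which types are growing.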
Could this mean a memory leak in the C-level code?
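One rough way to probe that question is to compare what Python's own allocator has handed out against the process's resident memory. `tracemalloc` only sees allocations made through Python's memory allocators, so a peak RSS that keeps climbing while traced memory stays flat would hint at native allocations (or fragmentation). The sketch below assumes a Unix system (the `resource` module is not available on Windows); `memory_snapshot` and `demo` are hypothetical names.

```python
import resource
import sys
import tracemalloc


def memory_snapshot():
    """Compare Python-level allocations with the process's peak RSS."""
    traced_current, traced_peak = tracemalloc.get_traced_memory()
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    if sys.platform != "darwin":
        peak_rss *= 1024
    return {
        "traced_bytes": traced_current,
        "traced_peak_bytes": traced_peak,
        "peak_rss_bytes": peak_rss,
    }


def demo():
    tracemalloc.start()
    payload = [bytes(1024) for _ in range(10_000)]  # roughly 10 MB of Python objects
    snapshot = memory_snapshot()
    tracemalloc.stop()
    del payload
    return snapshot
```

This is only a coarse signal: RSS includes the interpreter, loaded libraries, and allocator overhead, so the two numbers are never expected to match exactly.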
After running a big scan on dell.com, memory usage was only at 1.1GB after 2.5 hours. By this point there had been over 100K web requests, and the memray results suggest there's not actually a memory leak at all. 1.1GB is very modest for a target as big as dell.com, and based on the memray output, it was mostly events, which aligns with the numbers I was seeing in the queues:
[INFO] premature_gregory: Modules running (incoming:processing:outgoing) speculate(74,548:0:997), baddns(71,136:8:0), bucket_azure(44,807:1:0), bucket_amazon(44,659:1:0), bucket_firebase(44,587:1:0), bucket_google(44,155:1:0), httpx(12,151:501:0), sslcert(5,840:25:0), asn(326:1:0), iis_shortnames(174:8:0), git(134:1:0), robots(126:1:0), dnsresolve(10:25:0)
[INFO] premature_gregory: Events produced so far: DNS_NAME: 46032, IP_ADDRESS: 7112, URL: 1394, OPEN_TCP_PORT: 1300, TECHNOLOGY: 910, ASN: 265, URL_HINT: 242, CODE_REPOSITORY: 150, FINDING: 48, SOCIAL: 47, STORAGE_BUCKET: 42, VULNERABILITY: 31, EMAIL_ADDRESS: 13, SCAN: 1, ORG_STUB: 1, AZURE_TENANT: 1, PROTOCOL: 1
[INFO] premature_gregory: 997 events in queue (3,738 processed in the past 15 seconds)
This scan was run with threads only, so I'm going to run it again with processes and see if the results are any different.
This was the command:
poetry run memray run --aggregate -o round4.bin /root/.cache/pypoetry/virtualenvs/bbot-aKeTQkGo-py3.12/bin/bbot -t dell.com -p subdomain-enum web-basic -c web.debug=true engine.debug=true -em wayback -ef aggressive
Confirmed, there is no memory leak. @aconite33 10K targets is just a lot. Unless the targets are related or you're trying to discover relationships between them, they should be scanned separately.
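If it helps, splitting a large target list into separate scans can be scripted. This is only a sketch: the chunk size of 500, the `target_chunks` directory, and the helper names are assumptions, and the module flags are copied from the reproduction command in this thread rather than being a recommendation. It builds the commands without running them.

```python
from pathlib import Path


def chunk_targets(targets, chunk_size):
    """Split a list of targets into consecutive chunks of at most chunk_size."""
    return [targets[i:i + chunk_size] for i in range(0, len(targets), chunk_size)]


def build_commands(targets_file, chunk_size=500, out_dir="target_chunks"):
    """Write one target file per chunk and return a bbot command for each."""
    lines = Path(targets_file).read_text().splitlines()
    targets = [line.strip() for line in lines if line.strip()]
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    commands = []
    for index, chunk in enumerate(chunk_targets(targets, chunk_size)):
        chunk_file = out_path / f"hosts_{index:04d}.txt"
        chunk_file.write_text("\n".join(chunk) + "\n")
        # Flags mirror the command posted in this issue; adjust as needed.
        commands.append(["bbot", "-t", str(chunk_file), "-m", "telerik",
                         "portscan", "-y", "--allow-deadly"])
    return commands
```

Each returned command could then be passed to `subprocess.run` one at a time, so memory is released between scans.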
Command to reproduce?
bbot -t hosts.txt -m telerik portscan -y --allow-deadly
Memory usage gets really high during some scans. Until we get a debug.log for one of these scans, it's a mystery as to why. @amiremami