blacklanternsecurity / bbot

A recursive internet scanner for hackers.
https://www.blacklanternsecurity.com/bbot/
GNU General Public License v3.0

Memory Bug #1615

Open · TheTechromancer opened this issue 1 month ago

TheTechromancer commented 1 month ago

Memory usage gets really high during some scans. Until we get a debug.log for one of these scans, it's a mystery as to why.

(screenshot: signal-2024-08-02-08-16-49-709.png)

@amiremami

TheTechromancer commented 1 month ago

I was able to reproduce this with debug enabled. The culprit doesn't appear to be a module:

2024-08-14 23:20:49,341 [DEBUG] bbot.scanner scanner.py:1029 MODULE MEMORY USAGE:
2024-08-14 23:20:49,342 [DEBUG] bbot.scanner scanner.py:1029     - bucket_amazon: 2.50MB
2024-08-14 23:20:49,342 [DEBUG] bbot.scanner scanner.py:1029     - bucket_azure: 2.50MB
2024-08-14 23:20:49,342 [DEBUG] bbot.scanner scanner.py:1029     - bucket_firebase: 2.50MB
2024-08-14 23:20:49,342 [DEBUG] bbot.scanner scanner.py:1029     - bucket_google: 2.50MB
2024-08-14 23:20:49,342 [DEBUG] bbot.scanner scanner.py:1029     - json: 2.00MB
2024-08-14 23:20:49,342 [DEBUG] bbot.scanner scanner.py:1029     - dnsbrute_mutations: 933.06KB

Doing a second run with memray.
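
For reference, memray can also be attached programmatically instead of via the CLI (the actual command used is shown in a later comment); a minimal sketch with an illustrative output path:

import memray

def do_work():
    # stand-in for the work being profiled
    return sum(len(bytes(1024)) for _ in range(10_000))

# allocation records go to a capture file, which can then be rendered with
# e.g. `memray flamegraph profile.bin` (the output path here is illustrative)
with memray.Tracker("profile.bin"):
    do_work()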

TheTechromancer commented 1 month ago

Memray reports pretty normal memory usage:

(screenshot)

So it seems like either massdns/httpx is responsible, or else one of the engines is misbehaving. Running again with a ps monitor.
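
The monitor is just a periodic top-N-by-memory dump; roughly equivalent to this psutil sketch (interval, column formatting, and the top-10 cutoff are arbitrary):

import time
import psutil

# print the top processes by resident memory every 30 seconds
while True:
    procs = sorted(
        psutil.process_iter(["pid", "username", "memory_percent", "memory_info", "cmdline"]),
        key=lambda p: p.info["memory_percent"] or 0,
        reverse=True,
    )
    print(f"\nTimestamp: {time.strftime('%Y-%m-%d %H:%M:%S')}")
    print("PID USER %MEM MB COMMAND")
    for p in procs[:10]:
        mem = p.info["memory_info"]
        if mem is None:
            continue
        cmd = " ".join(p.info["cmdline"] or [])[:100]
        print(f"{p.info['pid']} {p.info['username']} {(p.info['memory_percent'] or 0):.1f}% {mem.rss / 1048576:.2f} MB {cmd}")
    time.sleep(30)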

TheTechromancer commented 1 month ago

Looks like a memory leak in the web helper:

----------------------------------------
Timestamp: 2024-08-16 04:12:45
Top Memory Usage Processes:
PID USER %MEM MB COMMAND
459384 root 13.6% 1083.85 MB BBOT WebHelper 
459376 root 5.3% 428.63 MB /root/.cache/pypoetry/virtualenvs/bbot-aKeTQkGo-py3.12/bin/python -c import sys; from importlib import import_module; sys.argv = ['/root/.cache/pypoetry/virtualenvs/bbot-aKeTQkGo-py3.12/bin/bbot', '-t', 'dell.com', '-p', 'subdomain-enum', 'web-basic']; sys.exit(import_module('bbot.cli').main()) 
5710 root 1.4% 117.17 MB tmux 
459392 root 1.1% 87.38 MB BBOT DNSHelper 
459432 root 1.0% 80.30 MB bbot process pool 
459437 root 0.9% 72.26 MB bbot process pool 
459480 root 0.8% 71.32 MB bbot process pool 
459669 nobody 0.6% 50.22 MB massdns -r /root/.bbot/cache/25152d64f6205b681a2b170969063ff7030fd01c -s 1000 -t A -o J -q 
459670 root 0.5% 40.71 MB httpx -silent -json -include-response -threads 50 -timeout 5 -retries 1 -header User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.2151.97 -response-size-to-read 5242880 -r 1.1.1.1,1.0.0.1 
399 root 0.4% 37.25 MB /usr/bin/containerd 
----------------------------------------

TheTechromancer commented 1 month ago

The culprit seems to be httpx:

(screenshot)

...although this might actually just be an extremely large response, e.g. from wayback.
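
If a single huge response were the problem, one mitigation would be to stream the body and cap how much gets read; a sketch only, not the fix applied here:

import httpx

MAX_BYTES = 5 * 1024 * 1024  # arbitrary cap for illustration

async def capped_get(client: httpx.AsyncClient, url: str) -> bytes:
    # stream the response and stop reading once the cap is hit,
    # so one enormous body can't balloon memory
    body = b""
    async with client.stream("GET", url) as response:
        async for chunk in response.aiter_bytes():
            body += chunk
            if len(body) >= MAX_BYTES:
                break
    return body[:MAX_BYTES]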

TheTechromancer commented 1 month ago

It really does look like a memory leak in httpx:

(screenshot)

TheTechromancer commented 1 month ago

There is a discussion for a similar issue: https://github.com/encode/httpx/discussions/2785.

TheTechromancer commented 1 month ago

Fixed in https://github.com/blacklanternsecurity/bbot/pull/1667.

TheTechromancer commented 3 weeks ago

Still happening even after the fix. Very mysterious. I took all the URLs from the offending scan (which was killed for memory reasons) and ran them through this script:

import sys
import asyncio
import httpx
import time
from datetime import datetime

async def fetch_url(client, url):
    start_time = time.time()
    try:
        response = await client.get(url)
        return url, response.status_code, time.time() - start_time
    except Exception as e:
        return url, str(e), time.time() - start_time

async def process_urls(urls, client, active_tasks, max_concurrent=20):
    url_iter = iter(urls)

    # Initial population of the task pool
    for _ in range(min(max_concurrent, len(urls))):
        url = next(url_iter)
        task = asyncio.create_task(fetch_url(client, url))
        active_tasks[task] = (url, time.time())

    # Process tasks and replace them as they complete
    while active_tasks:
        done, pending = await asyncio.wait(active_tasks.keys(), return_when=asyncio.FIRST_COMPLETED)

        for completed_task in done:
            yield await completed_task
            del active_tasks[completed_task]

            try:
                url = next(url_iter)
                new_task = asyncio.create_task(fetch_url(client, url))
                active_tasks[new_task] = (url, time.time())
            except StopIteration:
                pass  # No more URLs to process

async def print_status(active_tasks):
    while active_tasks:
        print(f"\nStatus at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}:")
        for task, (url, start_time) in active_tasks.items():
            duration = time.time() - start_time
            print(f"Task for URL: {url}, Running for {duration:.2f} seconds")
        await asyncio.sleep(5)

async def main():
    with open(sys.argv[-1], 'r') as file:
        urls = [line.strip() for line in file]

    import ssl

    ssl_context_noverify = ssl.create_default_context()
    ssl_context_noverify.check_hostname = False
    ssl_context_noverify.verify_mode = ssl.CERT_NONE
    ssl_context_noverify.options &= ~ssl.OP_NO_SSLv2 & ~ssl.OP_NO_SSLv3
    ssl_context_noverify.set_ciphers("ALL:@SECLEVEL=0")
    ssl_context_noverify.options |= 0x4  # Add the OP_LEGACY_SERVER_CONNECT option

    transport = httpx.AsyncHTTPTransport(retries=0, verify=ssl_context_noverify)

    async with httpx.AsyncClient(transport=transport) as client:
        active_tasks = {}
        status_printer = asyncio.create_task(print_status(active_tasks))

        async for url, status, duration in process_urls(urls, client, active_tasks):
            print(f"Completed: URL: {url}, Status: {status}, Duration: {duration:.2f} seconds")

        status_printer.cancel()

if __name__ == "__main__":
    asyncio.run(main())

Not a single issue; memory usage didn't peak above 100MB. This probably means we can rule out the possibility of any of those URLs being a single huge file that would crash the process.

TheTechromancer commented 3 weeks ago

I bet if I can attach to the WebHelper when it's misbehaving, I can figure out the cause.
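
Once attached (e.g. via gdb into the engine's event-loop thread, as in a later comment), the live tasks can be enumerated with something like this; the helper below is illustrative, not BBOT code:

import asyncio

def dump_tasks(loop: asyncio.AbstractEventLoop) -> None:
    # list every task still alive on the engine's loop, plus where it is suspended
    for task in asyncio.all_tasks(loop):
        coro = task.get_coro()
        frame = getattr(coro, "cr_frame", None)
        location = f"{frame.f_code.co_filename}:{frame.f_lineno}" if frame else "finished"
        print(task.get_name(), location)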

TheTechromancer commented 3 weeks ago

I thought it might be wayback, but the memory still leaks even with wayback disabled. Could it be cookies taking up all that space?
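
The cookie theory is easy to check from an attached session; a minimal sketch, assuming the engine's httpx client is reachable:

import httpx

def cookie_jar_footprint(client: httpx.AsyncClient) -> None:
    # httpx.Cookies wraps a stdlib http.cookiejar.CookieJar
    jar = client.cookies.jar
    total = sum(len(c.name) + len(c.value or "") for c in jar)
    print(f"{len(jar)} cookies, ~{total} bytes of names/values")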

Next thing to test: run all the requests with their original args/kwargs.

TheTechromancer commented 3 weeks ago

Is it because we're forgetting to clean out self.client_tasks?
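
Purely as an illustration of that pattern (not the actual BBOT code): a task registry that is never pruned keeps a strong reference to every finished task, its frame, and its result, so it grows for the lifetime of the engine.

import asyncio

client_tasks = {}  # hypothetical registry, keyed by task

async def tracked(coro, client_id):
    task = asyncio.create_task(coro)
    client_tasks[task] = client_id
    # without a cleanup like this, the dict only ever grows:
    task.add_done_callback(lambda t: client_tasks.pop(t, None))
    return await task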

TheTechromancer commented 3 weeks ago

Fixed in 49c3c7b.

TheTechromancer commented 1 week ago

Reopening as this still seems to be an issue.

Today I ran a long-running scan, waited until it hit 1GB memory, then attached to it with gdb and went hunting for the http engine. Inside its thread, I found only one task:

{<Task pending name='Task-62' coro=<EngineServer.worker() running at /root/bbot/bbot/core/engine.py:489> wait_for=<Future pending cb=[_AsyncSocket._remove_finished_future(event_list=deque([_FutureEvent(...re._NoTimer'>)]), event=_FutureEvent(...re._NoTimer'>))() at /root/.cache/pypoetry/virtualenvs/bbot-aKeTQkGo-py3.12/lib/python3.12/site-packages/zmq/_future.py:475, Task.task_wakeup()]> cb=[_run_until_complete_cb() at /usr/lib/python3.12/asyncio/base_events.py:182]>}

It's the main worker task:

  File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/root/bbot/bbot/core/helpers/process.py", line 27, in run
    super().run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/root/bbot/bbot/core/engine.py", line 302, in server_process
    asyncio.run(engine_server.worker())
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/usr/lib/python3.12/asyncio/base_events.py", line 674, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.12/asyncio/base_events.py", line 641, in run_forever
    self._run_once()
  File "/usr/lib/python3.12/asyncio/base_events.py", line 1949, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/lib/python3.12/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
  File "/root/bbot/bbot/core/engine.py", line 489, in worker
    client_id, binary = await self.socket.recv_multipart()

It was idle, waiting for the next command:

{'self': <bbot.core.helpers.web.engine.HTTPEngine object at 0x73784c875610>, 'client_id': b'\x00\x13\x7fkQ', 'binary': b'\x80\x04\x95?\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x01c\x94K\x00\x8c\x01a\x94\x8c+https://internetdb.shodan.io/143.166.196.21\x94\x85\x94u.', 'message': {'c': 0, 'a': ('https://internetdb.shodan.io/143.166.196.21',)}, 'cmd': 0, 'args': ('https://internetdb.shodan.io/143.166.196.21',), 'kwargs': {}, 'command_name': 'request', 'command_fn': <bound method HTTPEngine.request of <bbot.core.helpers.web.engine.HTTPEngine object at 0x73784c875610>>, 'coroutine': <coroutine object EngineServer.run_and_return at 0x737838669b40>, 'task': <Task finished name='Task-479548' coro=<EngineServer.run_and_return() done, defined at /root/bbot/bbot/core/engine.py:410> result=None>}

There were no current tasks or child tasks:

(Pdb) httpengine = tasks[0].get_coro().cr_frame.f_locals["self"]
(Pdb) httpengine
<bbot.core.helpers.web.engine.HTTPEngine object at 0x73784c875610>
(Pdb) len(httpengine.tasks)
0
(Pdb) len(httpengine.child_tasks)
0

This debunks the theory that our task tracking is causing the memory leak.

So what is the cause?

TheTechromancer commented 1 week ago

First, let's look at the httpx clients:

(Pdb) httpengine.web_client
<bbot.core.helpers.web.client.BBOTAsyncClient object at 0x73784c23b950>
(Pdb) import sys
(Pdb) sys.getsizeof(httpengine.web_client)
48
(Pdb) httpengine.web_client._transport
<httpx.AsyncHTTPTransport object at 0x7378556910a0>
(Pdb) httpengine.web_client._transport._pool
<AsyncConnectionPool [Requests: 0 active, 0 queued | Connections: 0 active, 0 idle]>
(Pdb) httpengine.web_clients
{1: <bbot.core.helpers.web.client.BBOTAsyncClient object at 0x73784c23b950>}

Nothing out of the ordinary there.

So far we have two seemingly conflicting reports of where the leak is. Both are inside httpx: the first is in httpcore._async.http11._receive_response_body(), the other is in httpcore._backends.anyio.streams.tls.do_handshake().

In any case, both of these point to a bug deep inside the httpcore library.
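
One way to localize those allocations independently of memray would be tracemalloc snapshots taken inside the engine process; a sketch of that approach, not something run here:

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation

# ... let the suspected leak run for a while, then:
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("traceback")[:5]:
    print(f"{stat.size / 1024:.1f} KiB in {stat.count} blocks")
    for line in stat.traceback.format():
        print("   ", line)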

Can we learn anything from the garbage collector?

TheTechromancer commented 1 week ago

Garbage collector:

(Pdb) import gc
(Pdb) gc.isenabled()
True
(Pdb) gc.get_threshold()
(700, 10, 10)
(Pdb) gc.collect()
19626
(Pdb) gc.get_count()
(27, 0, 0)
(Pdb) len(gc.garbage)
0
(Pdb) len(gc.get_objects())
1082490

According to Claude, this means:

  1. The garbage collector is working (it's enabled and just collected a significant number of objects).
  2. There are no objects that the garbage collector considers uncollectable at this moment.
  3. Despite the recent collection, there's still a large number of objects (over 1 million) being tracked by the garbage collector.

TheTechromancer commented 1 week ago

Counted the types:

219378  <class 'set'>
217845  <class 'dict'>
98791   <class 'radixtarget.tree.base.RadixTreeNode'>
97228   <class 'list'>
71061   <class 'ipaddress.IPv4Address'>
52571   <class 'ipaddress.IPv4Network'>
41906   <class 'tuple'>
31085   <class 'bbot.core.event.base.DNS_NAME'>
27189   <class 'function'>
24253   <class 'dns.rdtypes.IN.A.A'>
22596   <class 'dns.name.Name'>
20753   <class 'bbot.core.event.base.IP_ADDRESS'>
17194   <class 'tldextract.tldextract.Trie'>
8827    <class 'dns.rdtypes.ANY.PTR.PTR'>
8654    <class 'weakref.ReferenceType'>
8263    <class 'ipaddress.IPv6Address'>
7600    <class 'urllib.parse.ParseResult'>
7494    <class 'ipaddress.IPv6Network'>
7110    <class 'cell'>
7054    <class 'dns.rdtypes.ANY.CNAME.CNAME'>
6297    <class 'dns.rdtypes.IN.AAAA.AAAA'>
5537    <class 'bbot.core.event.base.URL_UNVERIFIED'>
5461    <class 'builtin_function_or_method'>
3789    <class 'collections.defaultdict'>
3539    <class 'getset_descriptor'>
3486    <class 'dns.rrset.RRset'>
3378    <class 'type'>
3211    <class 'method'>
2695    <class 'omegaconf.nodes.AnyNode'>
2695    <class 'omegaconf.base.Metadata'>
2544    <class 'bbot.core.helpers.async_helpers._Lock'>
2191    <class 'bbot.core.event.base.OPEN_TCP_PORT'>
1979    <class 'bbot.core.event.base.URL'>
1925    <class 'wrapper_descriptor'>
1626    <class 'method_descriptor'>
1480    <class 'dns.message.QueryMessage'>
1478    <class 'dns.resolver.Answer'>
1478    <class 'dns.message.ChainingResult'>
1387    <class 'member_descriptor'>
1237    <class 'module'>
1208    <class '_frozen_importlib.ModuleSpec'>
1148    <class 'property'>
1122    <class 'dns.rdtypes.ANY.NS.NS'>
1104    <class '_frozen_importlib_external.SourceFileLoader'>
1055    <class 'omegaconf.base.ContainerMetadata'>
992 <class 'bbot.core.event.base.HTTP_RESPONSE'>
949 <class 'frozenset'>
942 <class 'omegaconf.dictconfig.DictConfig'>
791 <class 'typing.ForwardRef'>
747 <class 'classmethod'>
600 <class 'collections.deque'>
563 <class 'antlr4.atn.Transition.EpsilonTransition'>
557 <class 'typing._UnionGenericAlias'>
538 <class 'typing._GenericAlias'>
531 <class 'staticmethod'>
521 <class 'collections._tuplegetter'>
492 <class 'dns.rdtypes.ANY.TXT.TXT'>
472 <class '_abc._abc_data'>
428 <class 'coroutine'>
411 <class 're.Pattern'>
402 <class 'abc.ABCMeta'>
357 <class 'enum.property'>
350 <class 'dns.rdtypes.ANY.MX.MX'>
341 <class 'pycparser.ply.yacc.MiniProduction'>
318 <class 'antlr4.atn.ATNState.BasicState'>
317 <class '_cython_3_0_10.cython_function_or_method'>
305 <class '_contextvars.Context'>
274 <class 'dataclasses.Field'>
256 <class 'cffi.model.PointerType'>
251 <class 'async_generator'>
199 <class 'pathlib.PosixPath'>
199 <class 'asyncio.locks.Lock'>
191 <class 'rich.style.Style'>
189 <class 'soupsieve.css_types.SelectorList'>
186 <class 'contextlib._AsyncGeneratorContextManager'>
176 <class '_cffi_backend.CType'>
162 <class 'inspect.Parameter'>
152 <class 'pkg_resources.DistInfoDistribution'>
152 <class 'pkg_resources.PathMetadata'>
151 <class 'logging.Logger'>
149 <class '_frozen_importlib_external.FileFinder'>
136 <class 'soupsieve.css_types.Selector'>
132 <class '_asyncio.Task'>
132 <class 'asyncio.locks.Condition'>
130 <class 'dns.rdtypes.ANY.SOA.SOA'>
128 <class 'enum.EnumType'>
128 <class 'antlr4.IntervalSet.IntervalSet'>
128 <class 'urllib.parse.SplitResult'>
124 <class '_asyncio.Future'>
123 <class '_asyncio.FutureIter'>
117 <class 'typing.TypeVar'>
114 <enum 'SocketOption'>
113 <class 'omegaconf.listconfig.ListConfig'>
107 <class 'cffi.model.FunctionPtrType'>
104 <class 'baddns.lib.signature.BadDNSSignature'>
102 <class 'antlr4.atn.Transition.AtomTransition'>
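
A breakdown like the above can be produced from the same pdb session with a quick sweep over the gc-tracked objects; a sketch, not necessarily the exact command used:

import gc
from collections import Counter

# tally every object the garbage collector is tracking, by type
counts = Counter(type(obj) for obj in gc.get_objects())
for typ, count in counts.most_common(100):
    print(count, typ)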

TheTechromancer commented 1 week ago

Here I'm assuming 98791 <class 'radixtarget.tree.base.RadixTreeNode'> is cloudcheck. 219378 <class 'set'> and 217845 <class 'dict'> do seem pretty high. But there's absolutely nothing here to indicate a leak in httpx, or at least one that's related to garbage collection.

Could this mean a memory leak in the C-level code?

TheTechromancer commented 1 week ago

After 2.5 hours of a big scan against dell.com, memory usage was only at 1.1GB. By that point there had been over 100K web requests, and the memray results suggest there's not actually a memory leak at all. 1.1GB is very modest for a target as big as dell.com, and based on the memray output, most of it was events, which aligns with the numbers I was seeing in the queues:

[INFO] premature_gregory: Modules running (incoming:processing:outgoing) speculate(74,548:0:997), baddns(71,136:8:0), bucket_azure(44,807:1:0), bucket_amazon(44,659:1:0), bucket_firebase(44,587:1:0), bucket_google(44,155:1:0), httpx(12,151:501:0), sslcert(5,840:25:0), asn(326:1:0), iis_shortnames(174:8:0), git(134:1:0), robots(126:1:0), dnsresolve(10:25:0)
[INFO] premature_gregory: Events produced so far: DNS_NAME: 46032, IP_ADDRESS: 7112, URL: 1394, OPEN_TCP_PORT: 1300, TECHNOLOGY: 910, ASN: 265, URL_HINT: 242, CODE_REPOSITORY: 150, FINDING: 48, SOCIAL: 47, STORAGE_BUCKET: 42, VULNERABILITY: 31, EMAIL_ADDRESS: 13, SCAN: 1, ORG_STUB: 1, AZURE_TENANT: 1, PROTOCOL: 1
[INFO] premature_gregory: 997 events in queue (3,738 processed in the past 15 seconds)

(screenshot)

This scan was run with threads only, so I'm going to run it again with processes and see if the results are any different.

This was the command:

poetry run memray run --aggregate -o round4.bin /root/.cache/pypoetry/virtualenvs/bbot-aKeTQkGo-py3.12/bin/bbot -t dell.com -p subdomain-enum web-basic -c web.debug=true engine.debug=true -em wayback -ef aggressive

TheTechromancer commented 1 week ago

Confirmed: there is no memory leak. @aconite33, 10K targets is just a lot. Unless the targets are related or you're trying to discover relationships between them, they should be scanned separately.

TheTechromancer commented 6 days ago

Command to reproduce?

bbot -t hosts.txt -m telerik portscan -y --allow-deadly