libcdb: improve the search speed of `search_by_symbol_offsets`

the-soloist commented 1 month ago

While using search_by_symbol_offsets, I found that searching with the build_id hash type was significantly slower than with the other hash types.

# https://github.com/Gallopsled/pwntools/blob/dev/pwnlib/libcdb.py#L26-L30
HASHES = {
    'build_id': lambda path: enhex(ELF(path, checksec=False).buildid or b''),
    'sha1': sha1filehex,
    'sha256': sha256filehex,
    'md5': md5filehex,
}

The reason is that ELF loads far more than is needed just to read the build ID. I tried replacing it with ELFFile, which noticeably improved the speed, but it introduces redundant functionality. I couldn't think of a simple way to implement that cleanly, so I added a hash_type parameter to search_by_symbol_offsets instead. It defaults to md5, which speeds up search_by_symbol_offsets, and it lets users choose the hash type themselves.

I tested with the following script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
from elftools.elf.elffile import ELFFile
from pwn import *

context.log_level = "info"
context.local_libcdb = "/path/to/libc-database"

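def _buildid(path):
    # Parse only the ELF structure with pyelftools instead of building a full pwnlib ELF.
    with open(path, "rb") as f:
        section = ELFFile(f).get_section_by_name('.note.gnu.build-id')
        if section:
            # Skip the 16-byte note header (namesz, descsz, type, "GNU\0").
            return enhex(section.data()[16:])
    # Return an empty str, matching what the original HASHES['build_id'] yields.
    return ""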

log.waitfor("searching build_id")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_build_id("70a4c953a01ddc232969c27031e7f948338ca137", offline_only=True, unstrip=False)
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)

log.waitfor("searching symbol offsets (build_id)")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_symbol_offsets({'puts': 0xa30, 'printf': 0x8f0}, offline_only=True, unstrip=False, hash_type="build_id")
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)

log.waitfor("searching symbol offsets (md5)")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_symbol_offsets({'puts': 0xa30, 'printf': 0x8f0}, offline_only=True, unstrip=False, hash_type="md5")
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)

log.success("patch libcdb.HASHES")
libcdb.HASHES["build_id"] = _buildid

log.waitfor("searching build_id (with ELFFile)")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_build_id("70a4c953a01ddc232969c27031e7f948338ca137", offline_only=True, unstrip=False)
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)

log.waitfor("searching symbol offsets (build_id with ELFFile)")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_symbol_offsets({'puts': 0xa30, 'printf': 0x8f0}, offline_only=True, unstrip=False, hash_type="build_id")
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)

While testing, I also ran into another issue: https://github.com/Gallopsled/pwntools/issues/2414

peace-maker commented 1 month ago

I think we can instead avoid walking the local database directory a second time in the first place. When we find a match in the local libc-database, we already know the id, and thus the filename, of the libc we want to return. Maybe allow the id to be searched in search_by_hash and special-case it in the local_database provider.
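
A rough sketch of that idea, not the pwntools implementation: once a `.symbols` file in the local database matches, its id already names the libc file, so no second pass that hashes every libc is needed. The helper name `find_local_libc_by_symbols` is hypothetical. The layout (`db/<id>.symbols` with `name hexaddr` lines next to `db/<id>.so`) and the suffix match on the hex address are assumptions about libc-database, chosen so that page-offset values such as 0xa30 still match.

```python
import os


def find_local_libc_by_symbols(db_root, symbols):
    """Hypothetical helper: return (id, path) of the first libc whose
    .symbols file matches every requested symbol, or None."""
    db_dir = os.path.join(db_root, "db")  # assumed libc-database layout
    for filename in sorted(os.listdir(db_dir)):
        if not filename.endswith(".symbols"):
            continue
        libc_id = filename[:-len(".symbols")]

        # Assumed file format: one "name hexaddr" pair per line.
        known = {}
        with open(os.path.join(db_dir, filename)) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2:
                    known[parts[0]] = parts[1].lower()

        # Suffix match on the hex digits (an assumption of this sketch).
        if all(known.get(name, "").endswith(format(addr, "x"))
               for name, addr in symbols.items()):
            # The id gives us the filename directly; nothing to hash.
            return libc_id, os.path.join(db_dir, libc_id + ".so")
    return None
```

With the offsets from the test script above, this would be called as `find_local_libc_by_symbols("/path/to/libc-database", {'puts': 0xa30, 'printf': 0x8f0})`.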

the-soloist commented 1 month ago

I agree that handling id separately within the providers is a good approach, since it allows us to use libcdb's caching feature. However, it means some variable names lose their original meaning (an id is not a hash type). I've tried writing some code; could you give me some suggestions?

Arusekk commented 3 weeks ago

I'm not sure I like hash_type="id" (maybe hash_type="filename" would be better?). I think the build ID should be the default; it should just be parsed quicker. Maybe we can have a separate function for extracting the build ID (at C speed, ideally). But come on, reading only the first page of a file should be quicker than reading all of it, especially on HDDs. Also, the build ID does not change if you strip/unstrip the file or move it around. If our ELF implementation is a bottleneck, we can resort to implementing separate functionality just for turbofast build-id extraction.
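
One way such a dedicated extractor could look, as a minimal sketch using only the standard library: read the ELF header, walk the program headers, and parse the PT_NOTE segments for the GNU build-id note. The name `fast_build_id` is hypothetical and not part of pwntools. The sketch assumes notes are padded to 4 bytes and does no validation beyond the ELF magic, so it is not hardened against malformed files.

```python
import binascii
import struct

PT_NOTE = 4            # program header type for note segments
NT_GNU_BUILD_ID = 3    # note type of the GNU build ID


def _build_id_from_notes(data, endian):
    """Scan a raw PT_NOTE segment and return the build-id bytes, or b''."""
    pos = 0
    while pos + 12 <= len(data):
        namesz, descsz, ntype = struct.unpack_from(endian + "III", data, pos)
        pos += 12
        name = data[pos:pos + namesz]
        pos += (namesz + 3) & ~3   # assumed 4-byte padding
        desc = data[pos:pos + descsz]
        pos += (descsz + 3) & ~3
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\x00":
            return desc
    return b""


def fast_build_id(path):
    """Hypothetical helper: hex build ID of an ELF file, or '' if none."""
    with open(path, "rb") as f:
        ident = f.read(16)
        if len(ident) < 16 or ident[:4] != b"\x7fELF":
            return ""
        is64 = ident[4:5] == b"\x02"                   # EI_CLASS
        endian = "<" if ident[5:6] == b"\x01" else ">"  # EI_DATA
        # ELF header fields from e_type up to and including e_phnum.
        hdr_fmt = endian + ("HHIQQQIHHH" if is64 else "HHIIIIIHHH")
        ph_fmt = endian + ("IIQQQQQQ" if is64 else "IIIIIIII")
        hdr = f.read(struct.calcsize(hdr_fmt))
        if len(hdr) < struct.calcsize(hdr_fmt):
            return ""
        fields = struct.unpack(hdr_fmt, hdr)
        e_phoff, e_phentsize, e_phnum = fields[4], fields[8], fields[9]

        for i in range(e_phnum):
            f.seek(e_phoff + i * e_phentsize)
            raw = f.read(struct.calcsize(ph_fmt))
            if len(raw) < struct.calcsize(ph_fmt):
                return ""
            ph = struct.unpack(ph_fmt, raw)
            if ph[0] != PT_NOTE:
                continue
            # Field order differs between the 32-bit and 64-bit layouts.
            p_offset, p_filesz = (ph[2], ph[5]) if is64 else (ph[1], ph[4])
            f.seek(p_offset)
            build_id = _build_id_from_notes(f.read(p_filesz), endian)
            if build_id:
                return binascii.hexlify(build_id).decode()
    return ""
```

For comparison with the test script above, it could be swapped in with `libcdb.HASHES["build_id"] = fast_build_id`. Whether this is actually faster than the ELFFile variant would need to be measured.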

libcdb: improve the search speed of `search_by_symbol_offsets` #2413