getsentry / pdb

A parser for Microsoft PDB (Program Database) debugging information
https://docs.rs/pdb/
Apache License 2.0

How to get unmangled function names? #107

Closed nico-abram closed 1 year ago

nico-abram commented 2 years ago

Hi

I'm trying to get all the functions in a PDB file, along with their lengths and their unmangled names (I believe the term used in PDBs might be "unique names"), for the cargo-bloat tool.

This crate's ProcedureSymbol type does not have unmangled names. From what I've seen reading the LLVM docs on PDB files and using llvm-pdbutil, they're not actually included in symbol records. Is there a recommended or reliable way of getting them? Right now I first collect all PublicSymbols and then, for each procedure, try to find a matching public symbol. But, at least for rustc/cargo-generated PDBs, this seems to miss a lot of functions that have ProcedureSymbol records but no matching PublicSymbol record.

Is this approach fine, in which case I should file an issue with Rust to try to get it to generate better PDBs? Or is there some other way I can already use this crate to get these unmangled names, or something that could be added to this crate?
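The matching step described above can be sketched in isolation. This is a minimal, standard-library-only illustration and not the code from this issue: the `Addr` alias and the `find_public` helper are hypothetical stand-ins for the crate's section-offset type and for the lookup, and the symbol names are placeholders. It treats a procedure as the half-open range `[start, start + len)`, so a public symbol sitting exactly at the end of one procedure is attributed to the next one.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a (section, offset) address inside a PDB.
type Addr = (u16, u32);

/// Look up a public symbol whose address lies in the half-open range
/// [start, start + len). `publics` must already be sorted by address.
fn find_public<'a>(publics: &'a [(Addr, String)], start: Addr, len: u32) -> Option<&'a str> {
    // Exclusive end of the procedure, in the same section as its start.
    let end = (start.0, start.1.saturating_add(len));
    publics
        .binary_search_by(|(addr, _)| {
            if *addr < start {
                Ordering::Less
            } else if *addr >= end {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .ok()
        .map(|index| publics[index].1.as_str())
}

fn main() {
    // Placeholder names; real public symbols would carry mangled names.
    let mut publics = vec![
        ((1, 0x40), "bar".to_string()),
        ((1, 0x10), "foo".to_string()),
    ];
    publics.sort_unstable_by(|a, b| a.0.cmp(&b.0));

    // A procedure that ends exactly where the next one begins.
    println!("{:?}", find_public(&publics, (1, 0x10), 0x30));
    // A procedure with no public symbol inside its range.
    println!("{:?}", find_public(&publics, (1, 0x80), 0x10));
}
```

If several public symbols fall inside one procedure, `binary_search_by` may return any of them, so this sketch assumes at most one public symbol per procedure.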

The code I'm using follows:


```rust
use pdb::FallibleIterator;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dir = std::path::Path::new("D:\\your\\path\\to\\pdb\\folder");
    let file_name = "cargo-bloat";

    let exe_path = dir.join(file_name).with_extension("exe");
    let exe_size = std::fs::metadata(&exe_path)?.len();
    let (_, text_size) = binfarce::pe::parse(&std::fs::read(&exe_path)?)?.symbols()?;

    let pdb_path = dir.join(file_name.replace("-", "_")).with_extension("pdb");
    let file = std::fs::File::open(&pdb_path)?;
    let mut pdb = pdb::PDB::open(file)?;

    let dbi = pdb.debug_information()?;
    let symbol_table = pdb.global_symbols()?;

    let mut total_parsed_size = 0usize;
    let mut demangled_total_parsed_size = 0usize;
    let mut out_symbols = vec![];

    // Collect the PublicSymbols
    let mut public_symbols = vec![];

    // First from the global symbol table
    let mut symbols = symbol_table.iter();
    while let Ok(Some(symbol)) = symbols.next() {
        match symbol.parse() {
            Ok(pdb::SymbolData::Public(data)) => {
                if data.code || data.function {
                    public_symbols.push((data.offset, data.name.to_string().into_owned()));
                }
                if data.name.to_string().contains("try_small_punycode_decode") {
                    dbg!(&data);
                }
            }
            _ => {}
        }
    }

    // Then from the symbol stream of every module
    let mut modules = dbi.modules()?;
    while let Some(module) = modules.next()? {
        let info = match pdb.module_info(&module)? {
            Some(info) => info,
            None => continue,
        };
        let mut symbols = info.symbols()?;
        while let Some(symbol) = symbols.next()? {
            if let Ok(pdb::SymbolData::Public(data)) = symbol.parse() {
                if data.code || data.function {
                    public_symbols.push((data.offset, data.name.to_string().into_owned()));
                }
                if data.name.to_string().contains("try_small_punycode_decode") {
                    dbg!(&data);
                }
            }
        }
    }

    // Sort by (section, offset) so the list can be binary searched
    let cmp_offsets = |a: &pdb::PdbInternalSectionOffset, b: &pdb::PdbInternalSectionOffset| {
        a.section.cmp(&b.section).then(a.offset.cmp(&b.offset))
    };
    public_symbols.sort_unstable_by(|a, b| cmp_offsets(&a.0, &b.0));

    // Now find the Procedure symbols in all modules
    // and if possible the matching PublicSymbol record with the mangled name
    let mut handle_proc = |proc: pdb::ProcedureSymbol| {
        let mangled_symbol = public_symbols
            .binary_search_by(|probe| {
                let low = cmp_offsets(&probe.0, &proc.offset);
                let high = cmp_offsets(&probe.0, &(proc.offset + proc.len));

                use std::cmp::Ordering::*;
                match (low, high) {
                    // Before the start of the procedure -> less
                    (Less, _) => Less,
                    // At or past the exclusive end of the procedure -> greater
                    // (a symbol at offset + len belongs to the next procedure)
                    (_, Greater) | (_, Equal) => Greater,
                    // Inside [offset, offset + len)
                    _ => Equal,
                }
            })
            .ok()
            .map(|x| &public_symbols[x]);
        // Uncomment to verify binary search isn't screwing up anything
        /*
        let mangled_symbol = public_symbols
            .iter()
            .filter(|probe| probe.0 >= proc.offset && probe.0 < (proc.offset + proc.len))
            .take(1)
            .next();
        */

        let demangled_name = proc.name.to_string().into_owned();
        out_symbols.push((proc.len as usize, demangled_name, mangled_symbol));

        total_parsed_size += proc.len as usize;
        if mangled_symbol.is_some() {
            demangled_total_parsed_size += proc.len as usize;
        }
    };

    // Procedures in the global symbol table
    let mut symbols = symbol_table.iter();
    while let Ok(Some(symbol)) = symbols.next() {
        if let Ok(pdb::SymbolData::Procedure(proc)) = symbol.parse() {
            handle_proc(proc);
        }
    }
    // Procedures in every module
    let mut modules = dbi.modules()?;
    while let Some(module) = modules.next()? {
        let info = match pdb.module_info(&module)? {
            Some(info) => info,
            None => continue,
        };

        let mut symbols = info.symbols()?;

        while let Some(symbol) = symbols.next()? {
            if let Ok(pdb::SymbolData::Procedure(proc)) = symbol.parse() {
                handle_proc(proc);
            }
        }
    }

    println!(
        "exe size:{}\ntext size:{}\nsize of fns found: {}\nratio:{}\nsize of fns with mangles found: {}\nratio:{}",
        exe_size,
        text_size,
        total_parsed_size,
        total_parsed_size as f32 / text_size as f32,
        demangled_total_parsed_size,
        demangled_total_parsed_size as f32 / text_size as f32
    );

    Ok(())
}
```

mitsuhiko commented 2 years ago

By "unmangled" you mean "not mangled", I presume? In PDB the situation is a bit odd, as inline and non-inline symbols are quite different. You can look at what symbolic does. For inlines we're resolving this around here: https://github.com/getsentry/symbolic/blob/c03080a1d75bf66bcbee6b2a9c9df84266d7a581/symbolic-debuginfo/src/pdb.rs#L1052-L1057

For actually mangled names, we demangle on the fly later as these are known to massively blow up in size: https://github.com/getsentry/symbolic/blob/50a4d2eff93a4b529bd5120c47924dcbc8a4275c/symbolic-demangle/src/lib.rs#L163-L178 (uses msvc_demangler).

mstange commented 2 years ago

To demangle "decorated" global symbols, use msvc_demangler. To emit function arguments for procedures, use pdb_addr2line::TypeFormatter::format_function. To emit namespaces and function arguments for inlines, use pdb_addr2line::TypeFormatter::format_id. To get function names for code addresses, use pdb_addr2line::Context::find_frames.

nico-abram commented 1 year ago

Yes, I meant "not mangled". I presume msvc_demangler is only useful when working with C/C++ symbols generated by MSVC (or a compatible compiler like clang-cl)? I was working with a rustc-generated PDB when I opened this, in which case I don't think the symbols in question use that mangling scheme.

My problem was that I did not know whether a given PDB was using the V0 or the legacy Rust mangling scheme, so I couldn't reliably demangle. I was hoping the PDB itself contained the undecorated/not-mangled names, but that doesn't seem to be the case. So I just assumed the V0 Rust mangling scheme and demangled them, which should work fine the vast majority of the time.
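For what it's worth, the scheme can often be guessed per symbol from the name itself instead of per PDB. Below is a minimal sketch using only the standard library. The `Mangling` enum and the `guess_mangling` and `is_rust_legacy` helpers are hypothetical names, and the sample symbol strings are only illustrative. It assumes V0 names start with `_R`, legacy Rust names look like `_ZN...17h<16 hex digits>E`, and MSVC-decorated names start with `?`. Platform-specific variants, such as extra or missing leading underscores, are not handled.

```rust
#[derive(Debug, PartialEq)]
enum Mangling {
    RustV0,
    RustLegacy,
    Msvc,
    Unknown,
}

/// Guess the mangling scheme of a single symbol name from its shape.
fn guess_mangling(name: &str) -> Mangling {
    if name.starts_with("_R") {
        Mangling::RustV0
    } else if name.starts_with('?') {
        Mangling::Msvc
    } else if is_rust_legacy(name) {
        Mangling::RustLegacy
    } else {
        Mangling::Unknown
    }
}

/// Legacy Rust names reuse the Itanium `_ZN...E` shape, so also require the
/// trailing hash component: `17h` followed by 16 hex digits.
fn is_rust_legacy(name: &str) -> bool {
    let body = match name.strip_prefix("_ZN").and_then(|s| s.strip_suffix('E')) {
        Some(body) => body.as_bytes(),
        None => return false,
    };
    if body.len() < 19 {
        return false;
    }
    let tail = &body[body.len() - 19..];
    tail.starts_with(b"17h") && tail[3..].iter().all(|b| b.is_ascii_hexdigit())
}

fn main() {
    for name in [
        "_RNvCs1234_7mycrate3foo",
        "_ZN4core3fmt5write17h0123456789abcdefE",
        "?foo@@YAHXZ",
        "main",
    ] {
        println!("{name}: {:?}", guess_mangling(name));
    }
}
```

As far as I know, the rustc-demangle crate detects both Rust schemes on its own, so a check like this is mainly useful for deciding whether to hand a name to a Rust demangler or to msvc_demangler.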

Closing this, thanks for the help.