getsentry / pdb

A parser for Microsoft PDB (Program Database) debugging information
https://docs.rs/pdb/
Apache License 2.0

Add OMAP-based address translation #17

Closed Shaddy closed 5 years ago

Shaddy commented 6 years ago

Parsing the PDB for the Windows 7 kernel binary gives wrong results, apparently related to offsets. With a Windows 10 ntoskrnl.exe everything is OK.

Parsing NtWaitForSingleObject reports offset 0x4aeb0 in section C (12), which is wrong. The offset should be 0x000ac8c0. I've tested other symbols with the same result.

Here are the attached files so you can test it:

windows7kernel.zip

I will try to figure out what's happening, but I'm not that familiar with the PDB internals.

willglynn commented 6 years ago

Hmm. Dissecting the symbol record by hand and using only the microsoft-pdb headers for reference, I get:

0062b4bc  22 00 0e 11 02 00 00 00  b0 ae 04 00 0c 00 4e 74  |".............Nt|
0062b4cc  57 61 69 74 46 6f 72 53  69 6e 67 6c 65 4f 62 6a  |WaitForSingleObj|
0062b4dc  65 63 74 00                                       |ect.|

This agrees with the pdb crate:

$ cargo run --example pdb_symbols  ntoskrnl.pdb | grep NtWaitForSingleObject
…
c   4aeb0   function    NtWaitForSingleObject

I conclude that the PDB really does say that NtWaitForSingleObject is at 000c:0004aeb0.
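
The hand dissection above can be sketched in code. This is a minimal illustration rather than the pdb crate's parser: it assumes the record layout visible in the hex dump (a u16 length, a u16 kind of 0x110e for S_PUB32, u32 flags, u32 offset, u16 segment, then a NUL-terminated name), and the names `PublicSymbol` and `parse_pub32` are made up for this example.

```rust
/// Fields of an S_PUB32 public symbol record (kind 0x110e).
/// Hypothetical type for illustration; not the pdb crate's own struct.
#[derive(Debug, PartialEq)]
struct PublicSymbol {
    flags: u32,
    offset: u32,
    segment: u16,
    name: String,
}

/// Read a little-endian u16 at byte index `i`, if in bounds.
fn u16_at(bytes: &[u8], i: usize) -> Option<u16> {
    let s = bytes.get(i..i + 2)?;
    Some(u16::from_le_bytes([s[0], s[1]]))
}

/// Read a little-endian u32 at byte index `i`, if in bounds.
fn u32_at(bytes: &[u8], i: usize) -> Option<u32> {
    let s = bytes.get(i..i + 4)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Decode one length-prefixed symbol record.
/// Returns `None` if the record is truncated or is not an S_PUB32.
fn parse_pub32(record: &[u8]) -> Option<PublicSymbol> {
    // The length field counts the bytes that follow it.
    let len = u16_at(record, 0)? as usize;
    let body = record.get(2..2 + len)?;
    if u16_at(body, 0)? != 0x110e {
        return None;
    }
    let flags = u32_at(body, 2)?;
    let offset = u32_at(body, 6)?;
    let segment = u16_at(body, 10)?;
    // The name runs from byte 12 up to the first NUL.
    let name_bytes = body.get(12..)?;
    let end = name_bytes.iter().position(|&b| b == 0)?;
    let name = String::from_utf8(name_bytes[..end].to_vec()).ok()?;
    Some(PublicSymbol { flags, offset, segment, name })
}
```

Fed the 36 bytes from the dump, this yields segment 0x000c, offset 0x0004aeb0 and flags 2 (the function bit), matching the pdb_symbols output.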

Trying to match this PDB against that executable, I note that they both have the same GUID – 3844dbb9-2017-4967-be7a-a4a2c20430fa – but the PDB's PDBInformation has { signature 1290245416, age 5 }, while the executable's ImageDebugDirectory has { time_date_stamp 1290245402 } and CodeviewPDB70DebugInfo has { age 2 }. I'm not certain that's significant, but… maybe this PDB doesn't exactly correspond to that executable?

Shaddy commented 6 years ago

It could be, but I've dumped the symbol using another parser (rabin2 from radare2 framework):

$ rabin2 -P ntoskrnl.pdb | grep NtWaitForSingle
0x003768c0  2  PAGE  NtWaitForSingleObject

Which corresponds to the real offset of the function. Also, I forced IDA to use that PDB and the addresses were recognized correctly. There is something really weird behind the scenes :/

willglynn commented 6 years ago

Oh. Oh.

The DBI stream in this PDB has a debug_header_size of 22, which corresponds to a DbgDataHdr struct, an array of 11 u16 stream indices. This PDB has two sets of section headers – dbgtypeSectionHdr = 0x000a and dbgtypeSectionHdrOrig = 0x0007 – with the data contained in streams dbgtypeOmapToSrc = 0x0008 and dbgtypeOmapFromSrc = 0x0009 to assist in translation.
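
A rough sketch of reading those 22 bytes follows. The field order is taken from the DBIExtraStreams dump that appears later in this thread; treating 0xFFFF (65535 in that dump) as "not present" and tolerating a shorter header are assumptions of this sketch, and `ExtraStreams` and `parse_extra_streams` are illustrative names, not the crate's API.

```rust
/// Stream indices from the DBI optional debug header (DbgDataHdr).
/// `None` stands in for the 0xFFFF "not present" marker.
#[derive(Debug, PartialEq)]
struct ExtraStreams {
    fpo: Option<u16>,
    exception: Option<u16>,
    fixup: Option<u16>,
    omap_to_src: Option<u16>,
    omap_from_src: Option<u16>,
    section_headers: Option<u16>,
    token_rid_map: Option<u16>,
    xdata: Option<u16>,
    pdata: Option<u16>,
    new_fpo: Option<u16>,
    original_section_headers: Option<u16>,
}

fn parse_extra_streams(bytes: &[u8]) -> ExtraStreams {
    // Decode up to eleven little-endian u16 slots; slots past the end of a
    // shorter header simply stay `None`.
    let mut slots: [Option<u16>; 11] = [None; 11];
    for (slot, chunk) in slots.iter_mut().zip(bytes.chunks_exact(2)) {
        let index = u16::from_le_bytes([chunk[0], chunk[1]]);
        *slot = if index == 0xFFFF { None } else { Some(index) };
    }
    ExtraStreams {
        fpo: slots[0],
        exception: slots[1],
        fixup: slots[2],
        omap_to_src: slots[3],
        omap_from_src: slots[4],
        section_headers: slots[5],
        token_rid_map: slots[6],
        xdata: slots[7],
        pdata: slots[8],
        new_fpo: slots[9],
        original_section_headers: slots[10],
    }
}
```

Using `Option<u16>` rather than a raw index keeps callers from accidentally opening stream 65535.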

Okay, so: turning the 000c:0004aeb0 from the symbol record into an address isn't just a matter of adding a base address. Instead, for this PDB, one must translate it into an RVA using the original section headers, search the dbgtypeOmapFromSrc stream for an appropriate OMAP_DATA element covering that address, then remap the original RVA into the address space described by the new section headers by using the offset from that OMAP_DATA entry.
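
The three steps just described might look roughly like this. It is a sketch under stated assumptions, not the crate's implementation: the OMAP table is assumed sorted by source RVA, each entry is assumed to cover addresses up to the next entry, and a target of 0 is assumed to mean "no mapping". `OmapEntry`, `omap_lookup` and `translate` are hypothetical names.

```rust
/// One OMAP_DATA element: addresses from `source` up to the next entry's
/// `source` map to `target` plus the distance from `source`.
#[derive(Clone, Copy)]
struct OmapEntry {
    source: u32,
    target: u32,
}

/// Find the last entry whose `source` is <= `rva` and remap through it.
/// Assumes `table` is sorted by `source`.
fn omap_lookup(table: &[OmapEntry], rva: u32) -> Option<u32> {
    // Number of entries starting at or before `rva`.
    let count = table.partition_point(|e| e.source <= rva);
    let entry = table.get(count.checked_sub(1)?)?;
    if entry.target == 0 {
        // Assumption: a zero target marks an unmapped range.
        return None;
    }
    entry.target.checked_add(rva - entry.source)
}

/// Translate segment:offset into an RVA in the final image.
/// `original_section_vas[i]` is the virtual address of original section i + 1.
fn translate(
    original_section_vas: &[u32],
    omap_from_src: &[OmapEntry],
    segment: u16,
    offset: u32,
) -> Option<u32> {
    // Segments are 1-based; segment 0 has no section.
    let index = (segment as usize).checked_sub(1)?;
    let original_rva = original_section_vas.get(index)?.checked_add(offset)?;
    omap_lookup(omap_from_src, original_rva)
}
```

For the symbol in this issue, 000c:0004aeb0 becomes original RVA 0x30eeb0 when original section 12 starts at 0x2c4000 (a value derived here from the numbers in this thread, not read from the PDB), and an entry 0x30eeb0 => 0x3768c0 then yields 0x3768c0.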

The LLVM PDB DBI page says:

Original Section Header Data - DbgStreamArray[10]. Assumed to be similar to DbgStreamArray[5], but has not been observed in practice.

…whelp, it's observed now.

Shaddy commented 6 years ago

Wow, at some point I suspected there had to be an intermediate conversion. Are those structs available through pdb.rs, or should I just parse them myself?

willglynn commented 6 years ago

This is all news to me, so pdb doesn't yet parse this. Action items would be:

I'm thinking we want an AddressTranslator: something that can convert segment + offset into RVA and/or file position. I've written code to do this, but I used the PE section headers from the executable; this issue illustrates both that this approach is not strictly correct, and that this functionality definitely belongs in the pdb crate.

AddressTranslator would need a &DebugInformation to determine which operating mode to use. In the usual case, it would open the section header stream and translate segment + offset into RVA or file position with basic arithmetic. In the case of this PDB, AddressTranslator would need to open the original section header stream and translate using the forward OMAP stream instead.
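
The mode decision described above could be sketched like this. It is only an illustration of the idea, not the API that eventually shipped: `Mode` and `choose_mode` are hypothetical, and the rule "use OMAP only when both the original section headers and the forward OMAP stream are present" is an assumption of the sketch.

```rust
/// How an address translator would operate for a given PDB.
/// Each variant carries the stream indices it needs to open.
#[derive(Debug, PartialEq)]
enum Mode {
    /// Usual case: section headers plus basic arithmetic.
    Direct { section_headers: u16 },
    /// Rearranged image: original section headers plus the forward OMAP stream.
    Omap {
        original_section_headers: u16,
        omap_from_src: u16,
    },
}

/// Pick a mode from the optional stream indices in the DBI debug header.
/// Returns `None` when no section headers are available at all.
fn choose_mode(
    section_headers: Option<u16>,
    original_section_headers: Option<u16>,
    omap_from_src: Option<u16>,
) -> Option<Mode> {
    match (original_section_headers, omap_from_src) {
        (Some(original), Some(omap)) => Some(Mode::Omap {
            original_section_headers: original,
            omap_from_src: omap,
        }),
        // Without both pieces, fall back to the plain section headers.
        _ => section_headers.map(|index| Mode::Direct {
            section_headers: index,
        }),
    }
}
```

With the indices from this PDB (section headers 10, original section headers 7, forward OMAP 9) the sketch selects the OMAP mode.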

Shaddy commented 6 years ago

This sounds like more work than I thought. Regarding the translation, I thought exactly the same while using PDB: depending on an external parser for the executable to extract information that is already available in the PDB felt weird.

I'll be reading about PDB and this project just in case I could help in some way. Thanks for your time and your fast answers.

willglynn commented 6 years ago

  • Parse DbgDataHdr in the pdb::dbi module,

pdb::dbi::DebugInformation now knows about DBIExtraStreams in the new omap branch:

DebugInformation {
  stream: Stream { source_view: ReadView(1295079 bytes) },
  header: Header { … },
  header_len: 64,
  extra_streams: DBIExtraStreams {
    fpo: 65535,
    exception: 65535,
    fixup: 65535,
    omap_to_src: 8,
    omap_from_src: 9,
    section_headers: 10,
    token_rid_map: 65535,
    xdata: 5,
    pdata: 6,
    new_fpo: 65535,
    original_section_headers: 7
  }
}

willglynn commented 6 years ago

Add a pdb::pe module that can parse PE IMAGE_SECTION_HEADERs (streams 7 & 10),

IMAGE_SECTION_HEADER parsing landed in the omap branch as 7d1684638ed1eceb6516136b9e99bdd1ada9dc7e.
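
For readers following along, a stream of section headers is a packed array of 40-byte PE IMAGE_SECTION_HEADER records. The sketch below decodes the fields that matter for address translation; it is not the code from the commit above, and `SectionHeader` and `parse_section_headers` are names invented for this example.

```rust
/// Size of one PE IMAGE_SECTION_HEADER record in bytes.
const SECTION_HEADER_SIZE: usize = 40;

/// Subset of IMAGE_SECTION_HEADER fields; relocation and line-number
/// fields (bytes 24..36) are skipped.
#[derive(Debug, PartialEq)]
struct SectionHeader {
    name: String,
    virtual_size: u32,
    virtual_address: u32,
    size_of_raw_data: u32,
    pointer_to_raw_data: u32,
    characteristics: u32,
}

/// Decode every complete 40-byte record in `stream`; trailing bytes that do
/// not form a whole record are ignored.
fn parse_section_headers(stream: &[u8]) -> Vec<SectionHeader> {
    stream
        .chunks_exact(SECTION_HEADER_SIZE)
        .map(|raw| {
            let u32_at =
                |i: usize| u32::from_le_bytes([raw[i], raw[i + 1], raw[i + 2], raw[i + 3]]);
            // The name is at most 8 bytes, NUL-padded when shorter.
            let name_len = raw[..8].iter().position(|&b| b == 0).unwrap_or(8);
            SectionHeader {
                name: String::from_utf8_lossy(&raw[..name_len]).into_owned(),
                virtual_size: u32_at(8),
                virtual_address: u32_at(12),
                size_of_raw_data: u32_at(16),
                pointer_to_raw_data: u32_at(20),
                characteristics: u32_at(36),
            }
        })
        .collect()
}
```

Only `virtual_address` is needed to turn segment + offset into an RVA; `pointer_to_raw_data` would be used for file positions.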

willglynn commented 6 years ago

Everything is gross, but I am pleased to report that this code --

    assert_eq!(pubsym.segment, 0x000c);
    assert_eq!(pubsym.offset, 0x0004aeb0);

    let addr = sections[pubsym.segment as usize - 1].virtual_address + pubsym.offset;
    eprintln!("{:#x} => {:#x}", addr, table.lookup(addr));

-- combines the symbol table entry and original section headers, and passes the result into a binary search on the OMAP table in stream 9 --

index 177836: 0x30ee1f
index 266754: 0x438b3b
index 222295: 0x3a8940
index 200065: 0x3612e7
index 188950: 0x337783
index 183393: 0x322e74
index 180614: 0x319136
index 179225: 0x313fb9
index 178530: 0x3114ed
index 178183: 0x30ff65
index 178009: 0x30f67c
index 177922: 0x30f238
index 177879: 0x30f02e
index 177857: 0x30eef7
index 177846: 0x30ee6f
index 177851: 0x30ee9b
index 177853: 0x30eeb0
index 177853: 0x30eeb0 => 0x3768c0
0x30eeb0 => 0x3768c0

-- which ultimately returns the same RVA that you reported upthread.

I need to clean this up and pack it into an AddressTranslator, but :tada:, it works.

luser commented 5 years ago

FYI there's code in Breakpad that deals with OMAP tables if you want to compare notes: https://chromium.googlesource.com/breakpad/breakpad/+/master/src/common/windows/omap.cc

luser commented 5 years ago

Google's syzygy tool also has code for handling OMAP tables, and a pile of other PDB-reading code that doesn't use the DIA SDK (which might be useful for reference): https://github.com/google/syzygy/blob/master/syzygy/pdb/omap.cc

I believe syzygy can be used to rewrite PE binaries and generate OMAP tables, although I can't find a succinct example.

jan-auer commented 5 years ago

OMAP address translation has been released with 0.2.2; this issue can be closed now.

CR3Swapper commented 3 months ago

Do you know if DIA/dbghelp.dll uses any other parts of the pdb to do these omap translations? I'm going to reference this repo heavily as I recreate the omap streams. We have a binary rewriting/transformation framework and I want to rebuild these omap streams for people so that transformed/obfuscated binaries can still use a pdb to debug.

Are there any other components of the pdb involved with omap translation besides these streams?

JustasMasiulis commented 3 months ago

Do you know if DIA/dbghelp.dll uses any other parts of the pdb to do these omap translations?

Section/omf map is another one that's used in address translation.

I'm going to reference this repo heavily as I recreate the omap streams

If you want full correctness, you'll likely want to reverse DIA. I don't think that there is any open source code that goes out of its way to do the address translation like Microsoft/DIA does.

There are tons of branches and edge cases handled in DIA code, here is a snippet from my personal attempts to do it correctly from a few years ago (with a couple of safety checks removed):

std::optional<uint32_t> translate_address( uint32_t segment, uint32_t offset ) {
    const auto segment_index = segment - 1;
    const auto frame         = _segment_frame( segment_index ); // go through section/omf map if present
    if ( omap_from ) {
        if ( frame ) {
            if ( original_section_headers ) {
                offset += original_section_headers[frame - 1].virtual_address;
            } else {
                // use section map or else new section headers
                offset += _synthesize_image_offset( segment_index );
            }
        }

        // my/DIA logic differs from PDB crate in OMAP entry search as well.
        return _resolve_trough_omap( omap_from, num_omap_from, offset, false );
    } else {
        if ( frame )
            // section map or else 0
            offset += _segment_offset( segment_index )
                // use new section headers or else section map
                + _synthesize_section_va( section_headers, frame - 1 );

        return offset;
    }
}

Have fun!

CR3Swapper commented 3 months ago

Do you know if DIA/dbghelp.dll uses any other parts of the pdb to do these omap translations?

Section/omf map is another one that's used in address translation.

Brutal. Is that the same as the "section headers stream" from DBIExtraStreams?

https://llvm.org/docs/PDB/DbiStream.html#optional-debug-header-stream

image

Looks like I have a hot date with IDA... I've never seen anything more over-engineered than this file format.

Edit:

Looks like they are two separate streams entirely. Virtual insanity. Job security through obscurity....

image

Anyway... do you know if this section map works like this ^, where logical entries point back into the section map itself to reach the actual descriptor? Can I just remove all entries in the section map and thus force DIA to use section headers to do address translation, or would that break other shit?

luser commented 3 months ago

Do you know if DIA/dbghelp.dll uses any other parts of the pdb to do these omap translations? I'm going to reference this repo heavily as I recreate the omap streams. We have a binary rewriting/transformation framework and I want to rebuild these omap streams for people so that transformed/obfuscated binaries can still use a pdb to debug.

If I were you I would look at syzygy (linked in a previous comment), which does binary rewriting and is already similar to what you're trying to achieve.

CR3Swapper commented 3 months ago

syzygy is indeed a great reference; they have good comments for the omap streams.

Sadly I don't think they recreate these streams though. Maybe I'm wrong, but I can't seem to see where they write that information back into the pdb.

They have this pdb mutator concept:

https://github.com/google/syzygy/blob/master/syzygy/pdb/pdb_mutator.cc

and all the mutators are in here:

https://github.com/google/syzygy/tree/master/syzygy/pdb/mutators

Edit:

Google's Crashpad also has good comments. I think that was linked before. Sadly, as @JustasMasiulis mentioned, DIA does use other components of the pdb during translation. Going to go paul walker mode on these components.

JustasMasiulis commented 3 months ago

Can I just remove all entries in the section map and thus force DIA to use section headers to do address translation, or would that break other shit?

Depends... DIA/PDB has a lot of redundancy, and deleting the section/OMF segment map would likely have no impact for 99.99% of binaries. That's kind of evident from the fact that a bunch of open-source PDB parsing codebases all do address translation differently and it kind of works for everyone.

I would suggest spending some time reversing the AddressMap class in DIA binaries and writing an extensive test harness with MSDIA instead of an open source library first, to make sure that your transformations are correct.

CR3Swapper commented 3 months ago

After having a very pleasant dinner date with ms Ida, I can say that the omap streams are being rebuilt correctly and that DIA can resolve the addresses. The Visual Studio debugger and x64dbg both display correct symbol information. Also, ms Ida has her own PDB parser.

For a demo I moved the first function to some padding bytes. I nuked the section map substream; if that becomes a problem in the future, I'll have another date with ms Ida.
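
As an illustration of what rebuilding a forward OMAP table for such a move could involve, here is a minimal sketch. It is not the framework described above and has not been checked against DIA: it assumes non-overlapping moves that all lie at or after `image_start`, assumes identity mapping everywhere else, and takes a nonzero `image_start` so that no identity entry ends up with a target of 0 (which some parsers read as "unmapped"). `Move` and `build_omap_from_src` are hypothetical names.

```rust
/// One relocated range: `len` bytes that lived at `old_rva` in the original
/// image now live at `new_rva` in the rewritten image.
#[derive(Clone, Copy)]
struct Move {
    old_rva: u32,
    new_rva: u32,
    len: u32,
}

/// Build (source, target) pairs for a forward OMAP table, sorted by source.
/// Everything that was not moved maps to itself.
fn build_omap_from_src(image_start: u32, moves: &[Move]) -> Vec<(u32, u32)> {
    let mut sorted = moves.to_vec();
    sorted.sort_by_key(|m| m.old_rva);

    let mut table = Vec::new();
    // Source RVA at which identity mapping should resume.
    let mut resume = image_start;
    for m in &sorted {
        if m.old_rva > resume {
            // Identity entry covering the gap before this move.
            table.push((resume, resume));
        }
        table.push((m.old_rva, m.new_rva));
        resume = m.old_rva + m.len;
    }
    // Identity mapping for everything after the last move.
    table.push((resume, resume));
    table
}
```

Moving 0x40 bytes from 0x2000 to 0x5000 in an image starting at 0x1000 gives three entries: identity from 0x1000, the move at 0x2000, and identity again from 0x2040. The reverse table (dbgtypeOmapToSrc) would need the inverse treatment, which this sketch leaves out.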

image

I would just like to take a moment to say thank you to @JustasMasiulis @luser for coming back to this issue 5 years after it was closed.

image