Implement OMAP-based address translation

getsentry / pdb

A parser for Microsoft PDB (Program Database) debugging information

https://docs.rs/pdb/

Apache License 2.0

375 stars 67 forks source link

Implement OMAP-based address translation #35

Closed jan-auer closed 5 years ago

jan-auer commented 5 years ago

This picks up on the work done and discussed at https://github.com/willglynn/pdb/issues/17 and adds the missing AddressTranslator.

Additionally, this PR slightly restructures when and how the optional debug header ("extra streams") are read. This is now performed lazily only when needed to speed up parsing of the initial PDB struct. While parsing the optional header is not a ton of work, it's still avoidable if one is only interested in some of the other streams.

There are also a bunch of new public methods on the PDB, namely: omap_from_src, omap_to_src, original_sections. The use of AddressTranslator is still preferred, but in case someone wants to use those directly, they get a handle to this. I suppose it might be cleaner to put all these methods in a different place, but since they need to read and modify the inner msf, this was most straight forward for now.

I'm still not a big fan that reading sections and original_sections is not done in a zero-copy way. We should probably pick a solution before merging this in:

Read them only once and store them in the PDB, just like the extra streams. Then, just return a slice.
Change ImageSectionHeader into a view on the original data (slice), similar to RawString and add methods for all currently public fields.

jan-auer commented 5 years ago

Also, I'm thinking that address translation (and consequently OMAPTable) could return Option<u32> to make failed lookups more explicit.

jan-auer commented 5 years ago

@willglynn Where did you get the idea about rvaTo == 0 being special? I saw the same logic in IDA, but it is never mentioned in any documentation: https://github.com/ax330d/ida_pdb_loader/blob/90ffc5a2b2cb559153699897c2a2694742f1c3af/pdbparse/omap.py#L27-L30

Neither syzygy nor Breakpad seem to make a difference. Also, I think the current handling of zeros is confusing: The interface suggests that zero can be a valid RVA (rightfully so), but the lookup does not rebase if it is zero. I would propose to remove the check for zero. I might be bikeshedding here though, since probably no symbol will ever point into the PE header.

willglynn commented 5 years ago

@jan-auer I'm not sure exactly what I was reading, but I think that came from exploring the OMAP-containing .pdb fixture rather than reading alternate implementations. When I unconditionally performed the usual arithmetic with the 0 entries, symbol lookups were returning nonsense RVAs inside or near the PE header, so I guess I made it return 0 if the table contained 0? In that case, I disagree with my comment and now think that this really should be an Option; a 0 in the table is a different kind of result which we should force the caller to interpret.

jan-auer commented 5 years ago

Alright, let's change this (and the AddressTranslator) to Option<u32> then 👍

jan-auer commented 5 years ago

I like the idea of having a type for section offsets and RVAs, was actually already considering that. Whether or not virtual addresses before OMAP translation should be represented, I'm not so sure. You're making an excellent point by viewing OMAPs as implementation detail, but in that case we could (or even should) hide them completely.

It might be a simpler API to just offer one translator that can handle both directions and abstracts over the decision whether or not to use OMAPs. Still, it needs to return Options (or results) since the lookups in the section tables might fail.

So probably this could do:

struct Rva(pub u32);
struct SectionOffset { pub offset: u32, pub section: u16 } // note the order coincides with struct layouts
// convenience methods fn {to,from}_{rva,segment_offset}(&AddressTranslator<'_>)

impl AddressTranslator<'_> {
    fn to_rva(offset: SectionOffset) -> Option<Rva> { ... }
    fn from_rva(rva: Rva) -> Option<SectionOffset> { ... }
}

struct ThreadStorageSymbol {
    pub global: bool,
    pub type_index: TypeIndex,
    pub offset: SegmentOffset, // <- instead of separate components 
} 
// no more  fn ThreadStorageSymbol::rva(), this is on SectionOffset now

willglynn commented 5 years ago

Yeah! I think we're onto something.

I agree we can discard post-OMAP section offsets. If the user cares about that, they probably have their own mapping of PE sections already – in which case the RVA returned from the OMAP table is plenty useful and they don't need our help mapping an RVA back to a PE section. Pre-OMAP RVAs are similar; if someone managed to get such an object file, they'd have their own sections and could make do with the pre-OMAP section offsets. That leaves us two types instead of four.

Section offset -> RVA depends on original_section_headers and omap_from_src, while RVA -> section offset depends on section_headers and omap_to_src. These sets are disjoint, which makes me wonder if it's appropriate to stuff them all into a single AddressTranslator. Are there significant use cases where the user only needs to convert in one direction? If so, is it worth having two objects instead of one, specifically to optimize where the PDB uses OMAP and the user has such a use case?

Typing out that question: no, let's not. AddressTranslator should handle both translations in all cases. We determine while creating an AddressTranslator if we're using OMAP. If no, we load just the section_headers; if yes, we load both OMAP tables and both sets of section headers.

In all cases, the only correct way to cross-reference PDB section offsets to the object file is via AddressTranslator, since arbitrary PDBs might use OMAP. This suggests we should name SectionOffset something scary: InternalSectionOffset? IncorrectSectionOffset? MisleadingSectionOffset? I want to make it difficult for the user to assume that the section offsets inside the PDB can be cross-referenced to the sections in their PE file. The type name and the field names should attempt to dispel that notion, and the docs for this type should describe the issue.

jan-auer commented 5 years ago

Section offset -> RVA depends on original_section_headers and omap_from_src, while RVA -> section offset depends on section_headers and omap_to_src

This should be symmetrical under any circumstance, to avoid confusion. But, reading on...

We determine while creating an AddressTranslator if we're using OMAP. [...] In all cases, the only correct way to cross-reference PDB section offsets to the object file is via AddressTranslator

Yes! That sounds good, and it's also how AddressTranslator is built already. The only thing missing is the conversion back to internal offsets. The internal offset struct will still have public members, so anyone can use the raw values if need be, but the docstring will refer to AddressTranslator as a single source of truth.

As for the name -- agreed that might need a better name. However, at this point I'm still a tiny bit confused. If I understand correctly, then the headers in original_section_headers are not the ones in the final PE file, after it has been rearranged. I'll have to double check that. Throwing in OriginalOffset and PdbOffset. Let me contemplate for a while, I'll try to put something expressive in the next commit.

As a final remark: Given all these arguments, I'd vouch for keeping OMAPs exposed in the public interface for anyone who wants to do funky stuff with them. Doesn't harm to have them there.

willglynn commented 5 years ago

My understanding:

                          ┌──────────────────────┐
                ┌─────────│     omap_to_src      │◀────────┐
                │         └──────────────────────┘         │
                ▼                                          │
┌──────────────────────────────┐           ┌──────────────────────────────┐
│         Original PE          │           │        Rearranged PE         │
├──────────────────────────────┤           ├──────────────────────────────┤
│   original_section_headers   │           │       section_headers        │
└──────────────────────────────┘           └──────────────────────────────┘
                │                                          ▲
                │        ┌──────────────────────┐          │
                └───────▶│    omap_from_src     │──────────┘
                         └──────────────────────┘

The original compiler produced object files with embedded debugging information, and the original linker combined them into a PE and a PDB. Then magic occurred, and the original PE was transformed into a rearranged PE. This all works at the RVA level. The rearranged PE has sections just like the original PE, but they're not guaranteed to have any relation to each other, just as bytes in the original PE are not guaranteed to exist in the rearranged PE or vice-versa.

The magic process breaks all the references to the PE from the PDB. Rather than fixing the PDB, they added OMAP tables which explain how to cross-reference RVAs in the original PE to and from RVAs in the rearranged PE. It's very likely that the original PE file was viewed as an intermediate result of the build process and discarded, which is fine except that the PDB still refers to section offsets inside that original PE, and the OMAP tables only work in terms of RVAs. So, the PDB gets a copy of the original PE's section headers too.

Correctly relating PDB-specified section offsets to the final PE thus requires 1) translating the section offset to an RVA using the original PE's section headers, then 2) translating that original (pre-OMAP) RVA to a rearranged (post-OMAP) RVA through the omap_from_src table. If we want a section offset, we then have to transform that rearranged RVA using the rearranged PE section headers.

The pdb crate didn't involve itself in section offset <-> RVA translations up to this point because I wasn't aware of OMAP. (Of course the offsets inside the PDB point to sections in the executable, anything else would be lunacy!) Except the PDB does not describe the user's executable, there's two sets of section headers and two kinds of section offsets, there's two kinds of RVAs, translations are fallible…

willglynn commented 5 years ago

Revisiting my earlier statement:

I agree we can discard post-OMAP section offsets. If the user cares about that, they probably have their own mapping of PE sections already – in which case the RVA returned from the OMAP table is plenty useful and they don't need our help mapping an RVA back to a PE section.

There's utility in modeling post-OMAP section offsets for the common case where the PDB doesn't use OMAP. The OMAP path has to go original_section_headers -> omap_from_src -> section_headers, but the non-OMAP path is a no-op. If we only supported going from PDB-internal section offsets to RVAs, we'd convert using section_headers and force the user to convert back instead of just doing nothing.

Is this a compelling reason to have two section offset types?

Another question: are we sure the PDB only references pre-OMAP section offsets? I know most of them reference the original PE and must be translated, but are all of them like that? Any exception would also warrant having both pre- and post-OMAP section offset types.

willglynn commented 5 years ago

Let's say we have four types:

PdbInternalSectionOffset
PdbInternalRva
Rva
SectionOffset

Expanding my diagram to include these types:

           PdbInternalRva ┌──────────────────────┐       Rva
                ┌─────────│     omap_to_src      │◀────────┐
                │         └──────────────────────┘         │
                ▼                                          │
┌──────────────────────────────┐           ┌──────────────────────────────┐
│         Original PE          │           │        Rearranged PE         │
├──────────────────────────────┤           ├──────────────────────────────┤
│   original_section_headers   │           │       section_headers        │
│   PdbInternalSectionOffset   │           │        SectionOffset         │
└──────────────────────────────┘           └──────────────────────────────┘
                │                                          ▲
                │        ┌──────────────────────┐          │
                └───────▶│    omap_from_src     │──────────┘
          PdbInternalRva └──────────────────────┘        Rva

We definitely need to expose PdbInternalSectionOffset since that's what all the section and offset values in the PDB actually represent. We definitely need to expose Rva since that's the most direct output from the omap_from_src lookup. This comment is me exploring whether or not we should also expose the real SectionOffset type. In any case, I can't make a cogent argument for exposing PdbInternalRva except for symmetry.

For completeness' sake, it's also worth considering how these same types would fit into the case where there's a single PE with one set of section_headers and no OMAP tables:

           PdbInternalRva ┌──────────────────────┐       Rva
                ┌─────────│        no-op         │◀────────┐
                │         └──────────────────────┘         │
                ▼                                          │
┌─────────────────────────────────────────────────────────────────────────┐
│                                   PE                                    │
├──────────────────────────────┬───────────┬──────────────────────────────┤
│       section_headers        │           │       section_headers        │
│   PdbInternalSectionOffset   │◀──no-op──▶│        SectionOffset         │
└──────────────────────────────┘           └──────────────────────────────┘
                │                                          ▲
                │                                          │
                │        ┌──────────────────────┐          │
                └───────▶│        no-op         │──────────┘
          PdbInternalRva └──────────────────────┘        Rva

I find this diagram very frustrating, but it illustrates how easy it is to write incorrect code which usually works. At least this model of four types – two or more of which are exported and enter the users' awareness – correctly captures the inherent complexity here.

jan-auer commented 5 years ago

Let's maybe establish some terminology:

Original PE and address space: This is the initial target as it was built by the compiler. Since the PDB is generated during this phase, all sections and offsets refer to this address space.
Transformed PE and address space: This is the optimized version, produced by something like project Vulcan, with sections rearranged. section_headers are moved to original_section_headers, and the rearranged sections along some OMAPs have been added. omap_from_src maps from original to transformed address space, and omap_to_src the other direction.

So yes, I agree with your observervation that there would theoretically be four types in total: Offsets and RVAs in the original and transformed address space. Since only Microsoft and maybe Syzygy seem to perform this optimization, in most cases both address spaces will be equivalent.

However, in an abstract way, we can say that:

(offset, segment) pairs only ever refer to the original address space (section headers), since the contents of the debug streams (etc.) are not rewritten. We can be pretty certain about that.
For debugging, one is always interested in the transformed addresses since those are the only ones occurring during execution. As you pointed out earlier, the original PE merely is an intermediary build artefact which happens to leave behind traces in the debug information.

These were the points I was confused about -- might anyone ever care about the original address space? After thinking about it a little longer, I think no. However, I realized that someone might be interested in the transformed offsets.

Now, I've played around with this for a while, and arrived at a similar point like your proposal above:

struct Rva(pub u32);
struct SectionOffset { pub offset: u32, pub section: u16 }
struct OriginalRva(pub u32);
struct OriginalSectionOffset { pub offset: u32, pub section: u16 }
// + conversions between all of these via AddressTranslator

However, I also noticed this complicates the whole system (and implementation) quite a bit without providing sufficient benefits other than symmetry. Also, I'm still convinced that it's a good idea to hide omaps and the transformation as much as possible, which is why I'm now gravitating towards:

Only keep Rva (transformed address space) and OriginalSectionOffset (exposed by structs like PublicSymbol) for now. AddressTranslator allows to convert between the two.
As soon as there is a clear use case for SectionOffset, we can add this type and provide conversions to and from Rva. For PDBs without OMAPs these conversions would be identical to OriginalOffset.
Kill OriginalRva completely. From the perspective of the final binary, this is a fully useless value.

willglynn commented 5 years ago

:+1:

Let's do this with two types: an RVA in the user's space and a section offset in the original space. We can always expand the interface later as needs warrant – if real-world usage could be more convenient, if the no-ops in the non-OMAP case fix a hotspot, if we find some new PDB stream which references one of the other two types, or whatever.

Bikeshedding: I don't find OriginalSectionOffset scary enough, particularly since the user is probably not aware that their executable is not the original. From the user's perspective, they have the only executable; the original section headers and all the references to those sections only exist inside the PDB. That's what led me to PdbInternalSectionOffset – these values really are meaningless outside the PDB. I'd even like to rename original_section_headers to pdb_internal_section_headers for consistency.

I realized that if the user has a pre-OMAP PE, they also have a pre-OMAP PDB, since both were produced together. In other words, the pdb user either has a non-transformed executable and basic PDB, or a post-OMAP executable and PDB with OMAP tables. There's no gap where the PDB has OMAP tables and the original executable still matters to the user in any way. We can try to pretend that such an executable never existed, label all pre-OMAP things as "PDB internal", and maybe even keep original_section_headers private within the crate.

jan-auer commented 5 years ago

So I ended up adding all types again. I don't know, feels more complete that way and I needed them internally anyway. Small change to make some of them pub(crate) - I'll leave that decision up to you.

FWIW: Everyone dealing with PDBs where the original and actual addresses/offsets differ will also know about the difference because they had to go through the transformation process. I think the rest can be covered with documentation. InternalPdbOffset sounds kind of wrong, because it's not an offset into the PDB, nor has it anything to do with the PDB per se. The same goes for section headers.

What do you mean by pre-OMAP -- from context I suppose that's "original"? Anyway, I have had the same observation about the correlation between original headers and OMAPs and the address translator has always made this assumption if you look closely. It is now a little more explicit tho, while still being abstracted away.

Now, I would propose to just keep the methods for OMAPs and section headers / original headers for anyone who wants to go that deep and deal with the raw values. Who knows what other use cases there are. Yet, the API with OriginalSectionOffset, Rva and AddressMap should be bullet-proof enough not to be misused.

Thoughts?

willglynn commented 5 years ago

FWIW: Everyone dealing with PDBs where the original and actual addresses/offsets differ will also know about the difference because they had to go through the transformation process.

I didn't. My introduction to OMAP was kernel debugging. I had no idea that the Windows kernel had been transformed; it was just an opaque blob shipped by Microsoft for which I had a PDB. The PDB didn't match the offsets in the file and I didn't know why.

Even if the user did trigger their own transformation, I think it's ideal to hide the transformation. If the user wants to debug their original executable, they can debug it using the original PDB: we won't have OMAP tables, we'll have one set of section headers, and pdb can expose the executable's section_headers and matching RVAs. If the user wants to debug their transformed executable, they can debug it with the transformed PDB: we will have OMAP tables, we'll have two sets of section headers, and pdb can expose the executable's section_headers and matching RVAs. The original_section_headers are needed for address translation but nothing more; they're just a detail of how the PDB was affected by the transformation process.

InternalPdbOffset sounds kind of wrong, because it's not an offset into the PDB, nor has it anything to do with the PDB per se. The same goes for section headers.

What do you mean by pre-OMAP -- from context I suppose that's "original"?

Yeah, exactly. Here "original" is "the executable that was destroyed by transformation, but to which this PDB still refers". That's why I favor PdbInternal as a label: there was an original executable, but it's gone now, and the only evidence that remains are the references inside the PDB.

jan-auer commented 5 years ago

Alright, I don't have too strong opinions about the names. Just keep in mind that other software dealing with PDBs is still around -- including the microsoft-pdb code base -- using different nomenclature then. This goes especially for fields in the original structs (e.g. original_section_headers or omap_from_src) which we should leave. The abstractions can have whatever name makes them more expressive.

Before I make the final touches, anything else that you'd like to see changed based on the latest change set? I believe the current API sufficiently hides the OMAPS in default usage while allowing users to go deeper if need be. See the example in the documentation (of course, still with the old names):

willglynn commented 5 years ago

This goes especially for fields in the original structs (e.g. original_section_headers or omap_from_src) which we should leave.

Agreed; the primary goal in naming should always be clarity. There's a cost for renaming things other people have already named, and we're much more free to structure and name new things.

Before I make the final touches, anything else that you'd like to see changed based on the latest change set? I believe the current API sufficiently hides the OMAPS in default usage while allowing users to go deeper if need be. See the example in the documentation (of course, still with the old names):

The example code looks great!

I hesitate to expose the OMAP tables in the crate's public interface. Everything I can think of requires comparing addresses in one space or another (which can be convenient via impl Ord), pointer arithmetic (which can be convenient via impl Add<isize>), and converting (which is convenient via AddressMap). The tables themselves feel like an implementation detail.

There might be use cases that need the OMAP tables, but I don't know what they are, and I think I'd like to wait for someone to ask for that feature. It might be that exposing the tables directly is the right solution, or it might be that we're missing some other abstraction instead.

jan-auer commented 5 years ago

There's a cost for renaming things other people have already named, and we're much more free to structure and name new things.

Cool beans, I'll go with your suggestion of PdbInternalRva and PdbInternalSectionOffset (quite a long name 🤣). Those values then correspond to original_sections, which are not exposed at the moment, though, until there is a stronger use case for them.

I hesitate to expose the OMAP tables in the crate's public interface. [...] There might be use cases that need the OMAP tables

This crate should probably offer both low- and high-level access to structures in the PDB, just like goblin for other formats. In that sense, it might be worth just having it there for completeness and also for education. After all, the docs of this crate are also good docs on the PDB format in general.

It's your call, though, and I see the point that OMAPs are somewhat implementation details. Let's hide it then, for now. 😄 I can move documentation to the AddressMap type and expose it top-level.

In the end, if we find out that they should be exposed, documentation can still establish that PdbInternal* belongs to original_sections.

jan-auer commented 5 years ago

@willglynn Could we get this merged?

willglynn commented 5 years ago

Merged, thanks!

CR3Swapper commented 3 months ago

Just want to say thank you guys, 5 years later this thread has saved me a lot of time and effort trying to understand the omap stream format.