Closed jan-auer closed 5 years ago
Also, I'm thinking that address translation (and consequently OMAPTable
) could return Option<u32>
to make failed lookups more explicit.
@willglynn Where did you get the idea about rvaTo == 0
being special? I saw the same logic in IDA, but it is never mentioned in any documentation:
https://github.com/ax330d/ida_pdb_loader/blob/90ffc5a2b2cb559153699897c2a2694742f1c3af/pdbparse/omap.py#L27-L30
Neither syzygy nor Breakpad seem to make a difference. Also, I think the current handling of zeros is confusing: The interface suggests that zero can be a valid RVA (rightfully so), but the lookup does not rebase if it is zero. I would propose to remove the check for zero. I might be bikeshedding here though, since probably no symbol will ever point into the PE header.
@jan-auer I'm not sure exactly what I was reading, but I think that came from exploring the OMAP-containing .pdb
fixture rather than reading alternate implementations. When I unconditionally performed the usual arithmetic with the 0 entries, symbol lookups were returning nonsense RVAs inside or near the PE header, so I guess I made it return 0 if the table contained 0? In that case, I disagree with my comment and now think that this really should be an Option
; a 0 in the table is a different kind of result which we should force the caller to interpret.
Alright, let's change this (and the AddressTranslator
) to Option<u32>
then π
I like the idea of having a type for section offsets and RVAs, was actually already considering that. Whether or not virtual addresses before OMAP translation should be represented, I'm not so sure. You're making an excellent point by viewing OMAPs as implementation detail, but in that case we could (or even should) hide them completely.
It might be a simpler API to just offer one translator that can handle both directions and abstracts over the decision whether or not to use OMAPs. Still, it needs to return Option
s (or results) since the lookups in the section tables might fail.
So probably this could do:
struct Rva(pub u32);
struct SectionOffset { pub offset: u32, pub section: u16 } // note the order coincides with struct layouts
// convenience methods fn {to,from}_{rva,segment_offset}(&AddressTranslator<'_>)
impl AddressTranslator<'_> {
fn to_rva(offset: SectionOffset) -> Option<Rva> { ... }
fn from_rva(rva: Rva) -> Option<SectionOffset> { ... }
}
struct ThreadStorageSymbol {
pub global: bool,
pub type_index: TypeIndex,
pub offset: SegmentOffset, // <- instead of separate components
}
// no more fn ThreadStorageSymbol::rva(), this is on SectionOffset now
Yeah! I think we're onto something.
I agree we can discard post-OMAP section offsets. If the user cares about that, they probably have their own mapping of PE sections already β in which case the RVA returned from the OMAP table is plenty useful and they don't need our help mapping an RVA back to a PE section. Pre-OMAP RVAs are similar; if someone managed to get such an object file, they'd have their own sections and could make do with the pre-OMAP section offsets. That leaves us two types instead of four.
Section offset -> RVA depends on original_section_headers
and omap_from_src
, while RVA -> section offset depends on section_headers
and omap_to_src
. These sets are disjoint, which makes me wonder if it's appropriate to stuff them all into a single AddressTranslator
. Are there significant use cases where the user only needs to convert in one direction? If so, is it worth having two objects instead of one, specifically to optimize where the PDB uses OMAP and the user has such a use case?
Typing out that question: no, let's not. AddressTranslator
should handle both translations in all cases. We determine while creating an AddressTranslator
if we're using OMAP. If no, we load just the section_headers
; if yes, we load both OMAP tables and both sets of section headers.
In all cases, the only correct way to cross-reference PDB section offsets to the object file is via AddressTranslator
, since arbitrary PDBs might use OMAP. This suggests we should name SectionOffset
something scary: InternalSectionOffset
? IncorrectSectionOffset
? MisleadingSectionOffset
? I want to make it difficult for the user to assume that the section offsets inside the PDB can be cross-referenced to the sections in their PE file. The type name and the field names should attempt to dispel that notion, and the docs for this type should describe the issue.
Section offset -> RVA depends on
original_section_headers
andomap_from_src
, while RVA -> section offset depends onsection_headers
andomap_to_src
This should be symmetrical under any circumstance, to avoid confusion. But, reading on...
We determine while creating an
AddressTranslator
if we're using OMAP. [...] In all cases, the only correct way to cross-reference PDB section offsets to the object file is viaAddressTranslator
Yes! That sounds good, and it's also how AddressTranslator
is built already. The only thing missing is the conversion back to internal offsets. The internal offset struct will still have public members, so anyone can use the raw values if need be, but the docstring will refer to AddressTranslator
as a single source of truth.
As for the name -- agreed that might need a better name. However, at this point I'm still a tiny bit confused. If I understand correctly, then the headers in original_section_headers
are not the ones in the final PE file, after it has been rearranged. I'll have to double check that. Throwing in OriginalOffset
and PdbOffset
. Let me contemplate for a while, I'll try to put something expressive in the next commit.
As a final remark: Given all these arguments, I'd vouch for keeping OMAPs exposed in the public interface for anyone who wants to do funky stuff with them. Doesn't harm to have them there.
My understanding:
ββββββββββββββββββββββββ
βββββββββββ omap_to_src βββββββββββ
β ββββββββββββββββββββββββ β
βΌ β
ββββββββββββββββββββββββββββββββ ββββββββββββββββββββββββββββββββ
β Original PE β β Rearranged PE β
ββββββββββββββββββββββββββββββββ€ ββββββββββββββββββββββββββββββββ€
β original_section_headers β β section_headers β
ββββββββββββββββββββββββββββββββ ββββββββββββββββββββββββββββββββ
β β²
β ββββββββββββββββββββββββ β
βββββββββΆβ omap_from_src ββββββββββββ
ββββββββββββββββββββββββ
The original compiler produced object files with embedded debugging information, and the original linker combined them into a PE and a PDB. Then magic occurred, and the original PE was transformed into a rearranged PE. This all works at the RVA level. The rearranged PE has sections just like the original PE, but they're not guaranteed to have any relation to each other, just as bytes in the original PE are not guaranteed to exist in the rearranged PE or vice-versa.
The magic process breaks all the references to the PE from the PDB. Rather than fixing the PDB, they added OMAP tables which explain how to cross-reference RVAs in the original PE to and from RVAs in the rearranged PE. It's very likely that the original PE file was viewed as an intermediate result of the build process and discarded, which is fine except that the PDB still refers to section offsets inside that original PE, and the OMAP tables only work in terms of RVAs. So, the PDB gets a copy of the original PE's section headers too.
Correctly relating PDB-specified section offsets to the final PE thus requires 1) translating the section offset to an RVA using the original PE's section headers, then 2) translating that original (pre-OMAP) RVA to a rearranged (post-OMAP) RVA through the omap_from_src
table. If we want a section offset, we then have to transform that rearranged RVA using the rearranged PE section headers.
The pdb
crate didn't involve itself in section offset <-> RVA translations up to this point because I wasn't aware of OMAP. (Of course the offsets inside the PDB point to sections in the executable, anything else would be lunacy!) Except the PDB does not describe the user's executable, there's two sets of section headers and two kinds of section offsets, there's two kinds of RVAs, translations are fallibleβ¦
Revisiting my earlier statement:
I agree we can discard post-OMAP section offsets. If the user cares about that, they probably have their own mapping of PE sections already β in which case the RVA returned from the OMAP table is plenty useful and they don't need our help mapping an RVA back to a PE section.
There's utility in modeling post-OMAP section offsets for the common case where the PDB doesn't use OMAP. The OMAP path has to go original_section_headers
-> omap_from_src
-> section_headers
, but the non-OMAP path is a no-op. If we only supported going from PDB-internal section offsets to RVAs, we'd convert using section_headers
and force the user to convert back instead of just doing nothing.
Is this a compelling reason to have two section offset types?
Another question: are we sure the PDB only references pre-OMAP section offsets? I know most of them reference the original PE and must be translated, but are all of them like that? Any exception would also warrant having both pre- and post-OMAP section offset types.
Let's say we have four types:
PdbInternalSectionOffset
PdbInternalRva
Rva
SectionOffset
Expanding my diagram to include these types:
PdbInternalRva ββββββββββββββββββββββββ Rva
βββββββββββ omap_to_src βββββββββββ
β ββββββββββββββββββββββββ β
βΌ β
ββββββββββββββββββββββββββββββββ ββββββββββββββββββββββββββββββββ
β Original PE β β Rearranged PE β
ββββββββββββββββββββββββββββββββ€ ββββββββββββββββββββββββββββββββ€
β original_section_headers β β section_headers β
β PdbInternalSectionOffset β β SectionOffset β
ββββββββββββββββββββββββββββββββ ββββββββββββββββββββββββββββββββ
β β²
β ββββββββββββββββββββββββ β
βββββββββΆβ omap_from_src ββββββββββββ
PdbInternalRva ββββββββββββββββββββββββ Rva
We definitely need to expose PdbInternalSectionOffset
since that's what all the section
and offset
values in the PDB actually represent. We definitely need to expose Rva
since that's the most direct output from the omap_from_src
lookup. This comment is me exploring whether or not we should also expose the real SectionOffset
type. In any case, I can't make a cogent argument for exposing PdbInternalRva
except for symmetry.
For completeness' sake, it's also worth considering how these same types would fit into the case where there's a single PE with one set of section_headers
and no OMAP tables:
PdbInternalRva ββββββββββββββββββββββββ Rva
βββββββββββ no-op βββββββββββ
β ββββββββββββββββββββββββ β
βΌ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β PE β
ββββββββββββββββββββββββββββββββ¬ββββββββββββ¬βββββββββββββββββββββββββββββββ€
β section_headers β β section_headers β
β PdbInternalSectionOffset ββββno-opβββΆβ SectionOffset β
ββββββββββββββββββββββββββββββββ ββββββββββββββββββββββββββββββββ
β β²
β β
β ββββββββββββββββββββββββ β
βββββββββΆβ no-op ββββββββββββ
PdbInternalRva ββββββββββββββββββββββββ Rva
I find this diagram very frustrating, but it illustrates how easy it is to write incorrect code which usually works. At least this model of four types β two or more of which are exported and enter the users' awareness β correctly captures the inherent complexity here.
Let's maybe establish some terminology:
section_headers
are moved to original_section_headers
, and the rearranged sections along some OMAPs have been added. omap_from_src
maps from original to transformed address space, and omap_to_src
the other direction.So yes, I agree with your observervation that there would theoretically be four types in total: Offsets and RVAs in the original and transformed address space. Since only Microsoft and maybe Syzygy seem to perform this optimization, in most cases both address spaces will be equivalent.
However, in an abstract way, we can say that:
(offset, segment)
pairs only ever refer to the original address space (section headers), since the contents of the debug streams (etc.) are not rewritten. We can be pretty certain about that.These were the points I was confused about -- might anyone ever care about the original address space? After thinking about it a little longer, I think no. However, I realized that someone might be interested in the transformed offsets.
Now, I've played around with this for a while, and arrived at a similar point like your proposal above:
struct Rva(pub u32);
struct SectionOffset { pub offset: u32, pub section: u16 }
struct OriginalRva(pub u32);
struct OriginalSectionOffset { pub offset: u32, pub section: u16 }
// + conversions between all of these via AddressTranslator
However, I also noticed this complicates the whole system (and implementation) quite a bit without providing sufficient benefits other than symmetry. Also, I'm still convinced that it's a good idea to hide omaps and the transformation as much as possible, which is why I'm now gravitating towards:
Rva
(transformed address space) and OriginalSectionOffset
(exposed by structs like PublicSymbol
) for now. AddressTranslator
allows to convert between the two.SectionOffset
, we can add this type and provide conversions to and from Rva
. For PDBs without OMAPs these conversions would be identical to OriginalOffset
.OriginalRva
completely. From the perspective of the final binary, this is a fully useless value.:+1:
Let's do this with two types: an RVA in the user's space and a section offset in the original space. We can always expand the interface later as needs warrant β if real-world usage could be more convenient, if the no-ops in the non-OMAP case fix a hotspot, if we find some new PDB stream which references one of the other two types, or whatever.
Bikeshedding: I don't find OriginalSectionOffset
scary enough, particularly since the user is probably not aware that their executable is not the original. From the user's perspective, they have the only executable; the original section headers and all the references to those sections only exist inside the PDB. That's what led me to PdbInternalSectionOffset
βΒ these values really are meaningless outside the PDB. I'd even like to rename original_section_headers
to pdb_internal_section_headers
for consistency.
I realized that if the user has a pre-OMAP PE, they also have a pre-OMAP PDB, since both were produced together. In other words, the pdb
user either has a non-transformed executable and basic PDB, or a post-OMAP executable and PDB with OMAP tables. There's no gap where the PDB has OMAP tables and the original executable still matters to the user in any way. We can try to pretend that such an executable never existed, label all pre-OMAP things as "PDB internal", and maybe even keep original_section_headers
private within the crate.
So I ended up adding all types again. I don't know, feels more complete that way and I needed them internally anyway. Small change to make some of them pub(crate)
- I'll leave that decision up to you.
FWIW: Everyone dealing with PDBs where the original and actual addresses/offsets differ will also know about the difference because they had to go through the transformation process. I think the rest can be covered with documentation. InternalPdbOffset
sounds kind of wrong, because it's not an offset into the PDB, nor has it anything to do with the PDB per se. The same goes for section headers.
What do you mean by pre-OMAP -- from context I suppose that's "original"? Anyway, I have had the same observation about the correlation between original headers and OMAPs and the address translator has always made this assumption if you look closely. It is now a little more explicit tho, while still being abstracted away.
Now, I would propose to just keep the methods for OMAPs and section headers / original headers for anyone who wants to go that deep and deal with the raw values. Who knows what other use cases there are. Yet, the API with OriginalSectionOffset
, Rva
and AddressMap
should be bullet-proof enough not to be misused.
Thoughts?
FWIW: Everyone dealing with PDBs where the original and actual addresses/offsets differ will also know about the difference because they had to go through the transformation process.
I didn't. My introduction to OMAP was kernel debugging. I had no idea that the Windows kernel had been transformed; it was just an opaque blob shipped by Microsoft for which I had a PDB. The PDB didn't match the offsets in the file and I didn't know why.
Even if the user did trigger their own transformation, I think it's ideal to hide the transformation. If the user wants to debug their original executable, they can debug it using the original PDB: we won't have OMAP tables, we'll have one set of section headers, and pdb
can expose the executable's section_headers
and matching RVAs. If the user wants to debug their transformed executable, they can debug it with the transformed PDB: we will have OMAP tables, we'll have two sets of section headers, and pdb
can expose the executable's section_headers
and matching RVAs. The original_section_headers
are needed for address translation but nothing more; they're just a detail of how the PDB was affected by the transformation process.
InternalPdbOffset
sounds kind of wrong, because it's not an offset into the PDB, nor has it anything to do with the PDB per se. The same goes for section headers.What do you mean by pre-OMAP -- from context I suppose that's "original"?
Yeah, exactly. Here "original" is "the executable that was destroyed by transformation, but to which this PDB still refers". That's why I favor PdbInternal
as a label: there was an original executable, but it's gone now, and the only evidence that remains are the references inside the PDB.
Alright, I don't have too strong opinions about the names. Just keep in mind that other software dealing with PDBs is still around -- including the microsoft-pdb code base -- using different nomenclature then. This goes especially for fields in the original structs (e.g. original_section_headers
or omap_from_src
) which we should leave. The abstractions can have whatever name makes them more expressive.
Before I make the final touches, anything else that you'd like to see changed based on the latest change set? I believe the current API sufficiently hides the OMAPS in default usage while allowing users to go deeper if need be. See the example in the documentation (of course, still with the old names):
This goes especially for fields in the original structs (e.g.
original_section_headers
oromap_from_src
) which we should leave.
Agreed; the primary goal in naming should always be clarity. There's a cost for renaming things other people have already named, and we're much more free to structure and name new things.
Before I make the final touches, anything else that you'd like to see changed based on the latest change set? I believe the current API sufficiently hides the OMAPS in default usage while allowing users to go deeper if need be. See the example in the documentation (of course, still with the old names):
The example code looks great!
I hesitate to expose the OMAP tables in the crate's public interface. Everything I can think of requires comparing addresses in one space or another (which can be convenient via impl Ord
), pointer arithmetic (which can be convenient via impl Add<isize>
), and converting (which is convenient via AddressMap
). The tables themselves feel like an implementation detail.
There might be use cases that need the OMAP tables, but I don't know what they are, and I think I'd like to wait for someone to ask for that feature. It might be that exposing the tables directly is the right solution, or it might be that we're missing some other abstraction instead.
There's a cost for renaming things other people have already named, and we're much more free to structure and name new things.
Cool beans, I'll go with your suggestion of PdbInternalRva
and PdbInternalSectionOffset
(quite a long name π€£). Those values then correspond to original_sections
, which are not exposed at the moment, though, until there is a stronger use case for them.
I hesitate to expose the OMAP tables in the crate's public interface. [...] There might be use cases that need the OMAP tables
This crate should probably offer both low- and high-level access to structures in the PDB, just like goblin
for other formats. In that sense, it might be worth just having it there for completeness and also for education. After all, the docs of this crate are also good docs on the PDB format in general.
It's your call, though, and I see the point that OMAPs are somewhat implementation details. Let's hide it then, for now. π I can move documentation to the AddressMap
type and expose it top-level.
In the end, if we find out that they should be exposed, documentation can still establish that PdbInternal*
belongs to original_sections
.
@willglynn Could we get this merged?
Merged, thanks!
Just want to say thank you guys, 5 years later this thread has saved me a lot of time and effort trying to understand the omap stream format.
This picks up on the work done and discussed at https://github.com/willglynn/pdb/issues/17 and adds the missing
AddressTranslator
.Additionally, this PR slightly restructures when and how the optional debug header ("extra streams") are read. This is now performed lazily only when needed to speed up parsing of the initial PDB struct. While parsing the optional header is not a ton of work, it's still avoidable if one is only interested in some of the other streams.
There are also a bunch of new public methods on the PDB, namely:
omap_from_src
,omap_to_src
,original_sections
. The use ofAddressTranslator
is still preferred, but in case someone wants to use those directly, they get a handle to this. I suppose it might be cleaner to put all these methods in a different place, but since they need to read and modify the innermsf
, this was most straight forward for now.I'm still not a big fan that reading
sections
andoriginal_sections
is not done in a zero-copy way. We should probably pick a solution before merging this in:ImageSectionHeader
into a view on the original data (slice), similar toRawString
and add methods for all currently public fields.