Washi1337 / AsmResolver

A library for creating, reading and editing PE files and .NET modules.
https://docs.washi.dev/asmresolver/
MIT License

Read/write support PDB files #59

Open Washi1337 opened 4 years ago

Washi1337 commented 4 years ago

Related #91

Washi1337 commented 3 years ago

Formats

There seem to be multiple PDB formats in use.

PDB 7

Uses the MSF format and seems to be the one used most by compilers such as VC++ and Clang/LLVM. Supporting it will require a lot of new models to be added to AsmResolver and is probably worth a separate package (perhaps called AsmResolver.Pdb).

The official spec seems to be very lacking. Microsoft only "published" parts of the definitions. LLVM has some docs on it as well, which seem more useful than Microsoft's. Wikipedia's docs are also very sparse (https://en.wikipedia.org/wiki/Program_database).

We might be able to look for reference implementations such as pdbparse.
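As a rough illustration of what the lowest layer of a PDB 7 reader has to deal with, here is a minimal sketch of parsing the MSF superblock. The field layout (a 32-byte magic followed by six little-endian 32-bit integers) follows the LLVM documentation; the Python names `MsfSuperBlock` and `parse_msf_superblock` are made up for this example and are not part of AsmResolver.

```python
import struct
from dataclasses import dataclass

# 32-byte magic at the start of an MSF 7.00 container (per the LLVM docs).
MSF7_MAGIC = b"Microsoft C/C++ MSF 7.00\r\n\x1aDS\x00\x00\x00"


@dataclass
class MsfSuperBlock:
    """Hypothetical model of the MSF superblock; names are illustrative."""
    block_size: int            # size of one block in bytes
    free_block_map_block: int  # index of the active free block map
    num_blocks: int            # total number of blocks in the file
    num_directory_bytes: int   # size of the stream directory in bytes
    unknown: int               # reserved field
    block_map_addr: int        # block index of the directory block map


def parse_msf_superblock(data: bytes) -> MsfSuperBlock:
    """Parse the superblock found at offset 0 of an MSF 7.00 file."""
    header_size = len(MSF7_MAGIC) + 6 * 4
    if len(data) < header_size:
        raise ValueError("buffer too small for an MSF superblock")
    if data[:len(MSF7_MAGIC)] != MSF7_MAGIC:
        raise ValueError("not an MSF 7.00 file")
    # Six unsigned little-endian 32-bit integers follow the magic.
    fields = struct.unpack_from("<6I", data, len(MSF7_MAGIC))
    return MsfSuperBlock(*fields)
```

Everything beyond this header (the stream directory, the PDB/DBI/TPI streams, and so on) is where the bulk of the new models would be needed.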

PDB 2

Very little information seems to be available about this format. Wikipedia (https://en.wikipedia.org/wiki/Program_database) mentions its existence and gives some details, but not a lot. It seems to resemble parts of PDB v7, but we will need samples to confirm this.

pdbparse implements this format as well.

Portable PDB format

This seems to be emitted by the Roslyn compilers from the new .NET SDK, and closely resembles the .NET metadata directory, with the addition of a #Pdb stream as well as some extra tables in the #~ stream. Official spec here. Lots of the existing .NET metadata models in AsmResolver can probably be reused.

Portable CILDB

Partition V of ECMA-335 specifies another format called CILDB. This documentation is good but I am not sure which compilers emit these types of files as I have not seen any sample with a PDB file like this.
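Whatever package layout we end up with, something will have to tell these formats apart. A minimal sketch of sniffing the format from the first bytes of a file is shown below. The PDB 7 magic is taken from the LLVM docs, the PDB 2 magic is the one pdbparse checks for (to the best of my knowledge), and Portable PDB files start with the regular .NET metadata signature `BSJB`. CILDB is left out because I have no sample to verify against. The function name and the returned labels are made up for this example.

```python
# Leading magic bytes of each container format.
PDB7_MAGIC = b"Microsoft C/C++ MSF 7.00\r\n\x1aDS\x00\x00\x00"
# Assumption: this is the signature used by pdbparse for PDB 2 files.
PDB2_MAGIC = b"Microsoft C/C++ program database 2.00\r\n\x1aJG\x00\x00"
# Portable PDBs begin with the standard .NET metadata root signature.
PORTABLE_PDB_MAGIC = b"BSJB"


def detect_pdb_format(data: bytes) -> str:
    """Return a hypothetical label for the PDB flavor found in `data`."""
    if data.startswith(PDB7_MAGIC):
        return "pdb7"
    if data.startswith(PDB2_MAGIC):
        return "pdb2"
    if data.startswith(PORTABLE_PDB_MAGIC):
        return "portable"
    # CILDB and anything else is not recognized by this sketch.
    return "unknown"
```

A dispatcher like this could live in a very small package that only decides which format-specific reader to hand the file to.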

Design choices

Given the complexity of these formats, it might be best to introduce new packages that can handle these types of files. A couple of important design choices need to be made fairly quickly, however. These mainly concern where the implementations of these formats should live.

For Portable PDBs, given that the format is really just an extension of the already existing .NET metadata file format, it would probably make sense to add the raw metadata table row structs to the AsmResolver.PE.DotNet.Metadata namespace (located in the AsmResolver.PE package) to stay consistent with the rest of the metadata table models. However, the higher-level interpretation of these tables (e.g. interpretation of blobs and name indices) would make more sense either in AsmResolver.DotNet or in a separate package.

We could introduce a separate package called AsmResolver.Pdb. This makes sense given the complexity of PDB v2 and v7 (the format implements a file system). However, if we introduce such a package, its name may be confusing, as it suggests support for all of the PDB formats, including the Portable PDB file format. If we include support for Portable PDB in this new package, that might result in a dependency on AsmResolver.PE or even AsmResolver.DotNet. The latter in particular is not desirable, since it would mean that users who are only interested in reading native PEs with symbols would also have to reference AsmResolver.DotNet, which adds another 300kb worth of code that they will never use.

Another idea is to introduce multiple PDB related packages instead. It could perhaps look like:

There might also be the possibility to merge the PDB2 and PDB7 versions into one single package as these formats seem to resemble each other somewhat.

The great benefit of this approach is that it follows the modular design style of AsmResolver more closely, as AsmResolver.PortablePdb and AsmResolver.CilDb would be able to depend on AsmResolver.DotNet without the others also needing to do so. The obvious downside is that the number of new packages increases a lot, and users of AsmResolver might not like that.
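To make the dependency argument concrete, here is a small sketch that models a possible package graph and computes what each package would pull in transitively. The names AsmResolver.Pdb2 and AsmResolver.Pdb7 are hypothetical, as is the assumption that they would only need the core AsmResolver package; the edges between the existing packages are a simplified view of the current layout.

```python
# Direct dependencies per package (simplified; partly hypothetical).
DEPENDENCIES = {
    "AsmResolver": [],
    "AsmResolver.PE.File": ["AsmResolver"],
    "AsmResolver.PE": ["AsmResolver.PE.File"],
    "AsmResolver.DotNet": ["AsmResolver.PE"],
    "AsmResolver.Pdb2": ["AsmResolver"],           # hypothetical package
    "AsmResolver.Pdb7": ["AsmResolver"],           # hypothetical package
    "AsmResolver.PortablePdb": ["AsmResolver.DotNet"],
    "AsmResolver.CilDb": ["AsmResolver.DotNet"],
}


def transitive_dependencies(package: str) -> set:
    """Collect every package reachable from `package` via dependencies."""
    seen = set()
    stack = list(DEPENDENCIES[package])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(DEPENDENCIES[dep])
    return seen
```

Under these assumptions, `transitive_dependencies("AsmResolver.Pdb7")` never reaches AsmResolver.DotNet, while `transitive_dependencies("AsmResolver.PortablePdb")` does, which is exactly the separation that a single AsmResolver.Pdb package could not offer.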

zziger commented 2 years ago

Any news regarding implementation of that?

Washi1337 commented 2 years ago

Unfortunately, no concrete implementations yet. As it is right now, other features and bug reports have taken precedence over completely new features such as PDB file support. If there is demand for PDB support, however, I may move this feature up the backlog.

ds5678 commented 2 years ago

I would be interested in helping with this. My interest lies mostly in the PDB 7 format and a little in the Portable PDB format. Even though it increases the package count, I am in favor of your suggestion to do 4 or 5 packages for this. It seems cleaner, and most software is published as a single file anyway.

Washi1337 commented 2 years ago

@ds5678, thanks for taking an interest. Besides the package design, a couple of additional big questions still need to be answered, and the answers will probably also indirectly determine which new packages we finally end up with. Some thoughts below:

This feature will probably require quite a bit of prep work before actual coding and integration can take place. One big aspect we need to figure out is whether it is possible to find some kind of unifying API design that lifts at least some parts of every format into a higher-level abstraction. At least for read support this would be preferable, as it would simplify usage of the packages a lot. However, too much simplification can also lead to certain features of some formats being forgotten or hidden, which we need to be careful about. One thing I can predict already is that writers are most likely going to have to define their own contracts for their respective formats within their respective packages. I don't think we can (or want to) find a unifying API for this, given the vast differences between these formats.
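To give an idea of what such a unifying read API could look like, here is a minimal sketch. All names (`SymbolReader`, `LocalVariableSymbol`, the two stub readers and their members) are hypothetical and only serve to illustrate the shape: a small shared contract for what every format can answer, with format-specific features staying on the concrete readers rather than being forced into the abstraction.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LocalVariableSymbol:
    """Format-independent view of a local variable symbol (hypothetical)."""
    index: int
    name: str


class SymbolReader(ABC):
    """Shared read-only contract that every format would implement."""

    @abstractmethod
    def get_locals(self, method_token: int) -> List[LocalVariableSymbol]:
        """Return the local variable symbols of the given method."""


class StubPortablePdbReader(SymbolReader):
    """Stand-in for a Portable PDB reader; data is supplied in memory."""

    def __init__(self, locals_by_method: Dict[int, List[str]]):
        self._locals = locals_by_method

    def get_locals(self, method_token: int) -> List[LocalVariableSymbol]:
        names = self._locals.get(method_token, [])
        return [LocalVariableSymbol(i, n) for i, n in enumerate(names)]


class StubPdb7Reader(StubPortablePdbReader):
    """Stand-in for a PDB 7 reader with one format-specific extra."""

    def __init__(self, locals_by_method: Dict[int, List[str]], block_size: int):
        super().__init__(locals_by_method)
        # Only meaningful for MSF-based files, so it is not on the contract.
        self.block_size = block_size
```

Consumers that only need names can program against `SymbolReader`, while code that cares about format details keeps a reference to the concrete reader. Writers would not share such a base at all in this sketch.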

Pdb7 and PortablePdb seem to be the ones used the most nowadays, so this is definitely where we should focus first. Another question is how (if at all) these packages should be integrated into AsmResolver.DotNet, especially given the fact that .NET uses both Pdb7 (legacy .NET Framework) and PortablePdb (.NET Core / .NET). For example, the names of local variables of method bodies are stored in these symbols. Other libraries (such as dnlib and Cecil) provide a Name property on their class representing local variables that pulls data from the PDB. Do we want something similar as well, or will this inevitably lead to tighter coupling of the packages? That is something I think we really should try to avoid.
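One way to get a Name-like property without a hard dependency on any PDB package would be to let the local variable model accept an optional resolver callback. The sketch below is only an assumption about how that could look; the class `LocalVariable`, the `NameResolver` alias and the `V_<index>` fallback naming are all invented for this example and do not reflect existing AsmResolver APIs.

```python
from typing import Callable, Optional

# Maps (method token, local index) to a name, or None if no symbol exists.
NameResolver = Callable[[int, int], Optional[str]]


class LocalVariable:
    """Hypothetical local variable model that knows nothing about PDBs."""

    def __init__(self, method_token: int, index: int,
                 resolver: Optional[NameResolver] = None):
        self.method_token = method_token
        self.index = index
        self._resolver = resolver

    @property
    def name(self) -> str:
        # Ask the optional resolver first; any symbol package can supply one.
        if self._resolver is not None:
            resolved = self._resolver(self.method_token, self.index)
            if resolved is not None:
                return resolved
        # Fall back to a synthetic name when no symbols are available.
        return f"V_{self.index}"
```

With this shape, the package that models method bodies only depends on a function signature, and each PDB package can provide its own resolver, so the coupling stays one-directional.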