Open Washi1337 opened 4 years ago
There seem to be multiple PDB formats in use.
PDB v7 uses the MSF format and seems to be the most widely used, emitted by compilers such as VC++ and Clang/LLVM. Supporting it will require a lot of new models to be added to AsmResolver and is probably worth a separate package (perhaps called AsmResolver.Pdb).
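For orientation, here is a minimal sketch of recognizing an MSF 7.00 container and reading its superblock. The field layout follows the LLVM documentation of the MSF format; the names `MsfSuperBlock` and `parse_msf_superblock` are made up for this illustration and are not part of AsmResolver.

```python
import struct
from typing import NamedTuple

# 32-byte magic at the start of an MSF 7.00 file (per the LLVM MSF docs).
MSF7_MAGIC = b"Microsoft C/C++ MSF 7.00\r\n\x1aDS\x00\x00\x00"


class MsfSuperBlock(NamedTuple):
    """Hypothetical model of the six little-endian uint32 fields after the magic."""
    block_size: int
    free_block_map_block: int
    num_blocks: int
    num_directory_bytes: int
    unknown: int
    block_map_addr: int


def parse_msf_superblock(data: bytes) -> MsfSuperBlock:
    """Validate the magic and decode the superblock fields that follow it."""
    if len(data) < len(MSF7_MAGIC) + 6 * 4:
        raise ValueError("input too short for an MSF superblock")
    if not data.startswith(MSF7_MAGIC):
        raise ValueError("not an MSF 7.00 file")
    fields = struct.unpack_from("<6I", data, len(MSF7_MAGIC))
    return MsfSuperBlock(*fields)
```

The directory of streams (the "file system" part) is then located through `block_map_addr`, which this sketch does not follow.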
Official spec seems to be very lacking. Microsoft only "published" parts of the definitions. LLVM has some docs on it as well which seem more useful than Microsoft's. Wikipedia's docs are also very sparse (https://en.wikipedia.org/wiki/Program_database).
We might be able to look for reference implementations such as pdbparse.
Very little information seems to be available about the PDB v2 format. Wikipedia (https://en.wikipedia.org/wiki/Program_database) mentions its existence and some details, but not a lot. It seems to resemble PDB v7 in some respects, but we will need samples to confirm this.
pdbparse implements this format as well.
Portable PDB seems to be emitted by the Roslyn compilers from the new .NET SDK. It closely resembles the .NET metadata directory, extended with a #Pdb stream as well as some extra tables in the #~ stream. Official spec here. Lots of the existing .NET metadata models in AsmResolver can probably be reused.
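To illustrate how close this is to ordinary .NET metadata, the sketch below walks the metadata root (the ECMA-335 "BSJB" header) and lists the stream headers, so a reader can check for a #Pdb stream. The function names are hypothetical and only the header is parsed, not the streams themselves.

```python
import struct
from typing import Dict, Tuple

METADATA_SIGNATURE = 0x424A5342  # "BSJB" read as a little-endian uint32


def read_stream_headers(data: bytes) -> Dict[str, Tuple[int, int]]:
    """Return {stream name: (offset, size)} from a .NET metadata root."""
    signature, _major, _minor, _reserved, version_length = struct.unpack_from(
        "<IHHII", data, 0
    )
    if signature != METADATA_SIGNATURE:
        raise ValueError("missing BSJB metadata signature")

    # Skip the version string; round up to 4 bytes in case it is not padded.
    offset = 16 + ((version_length + 3) & ~3)
    _flags, stream_count = struct.unpack_from("<HH", data, offset)
    offset += 4

    streams: Dict[str, Tuple[int, int]] = {}
    for _ in range(stream_count):
        stream_offset, stream_size = struct.unpack_from("<II", data, offset)
        offset += 8
        end = data.index(b"\x00", offset)
        name = data[offset:end].decode("ascii")
        # The name plus its null terminator is padded to a 4-byte boundary.
        offset += (end - offset + 1 + 3) & ~3
        streams[name] = (stream_offset, stream_size)
    return streams


def looks_like_portable_pdb(data: bytes) -> bool:
    """Heuristic: a metadata root that contains a #Pdb stream."""
    return "#Pdb" in read_stream_headers(data)
```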
Partition V of ECMA-335 specifies another format called CILDB. This documentation is good but I am not sure which compilers emit these types of files as I have not seen any sample with a PDB file like this.
Given the complexity of these formats, it might be best to introduce new packages that can handle these types of files. A couple of important design choices need to be made fairly quickly, however. These are mainly related to where the implementations of these formats live.
For portable PDBs, given that it is really just an extension to the already existing .NET metadata file format, it would probably make sense to add the raw metadata table row structs to the AsmResolver.PE.DotNet.Metadata namespace (located in the AsmResolver.PE package) to stay consistent with the rest of the metadata table models. However, higher-level interpretation of these tables (e.g. interpretation of blobs and name indices) would make more sense either in AsmResolver.DotNet or in a separate package.
We could introduce a separate package called AsmResolver.Pdb. This makes sense given the complexity of PDB v2 and v7 (it implements a file system). However, if we introduce such a package, its name may be confusing, as it suggests support for any of the PDB formats, including the Portable PDB file format. If we include support for Portable PDB in this new package, that might result in a dependency on AsmResolver.PE or even AsmResolver.DotNet. The latter in particular is not desirable, since users who are only interested in reading native PEs with symbols would also have to reference AsmResolver.DotNet, which adds another 300 KB of code that they will never use.
Another idea is to introduce multiple PDB-related packages instead. It could perhaps look like:
- AsmResolver.Pdb: for common PDB file format abstractions
- AsmResolver.Pdb.Pdb2: for PDB2
- AsmResolver.Pdb.Pdb7: for PDB7
- AsmResolver.Pdb.PortablePdb: for portable PDBs
- AsmResolver.Pdb.CilDb: for CILDB

There might also be the possibility to merge the PDB2 and PDB7 versions into one single package, as these formats seem to resemble each other somewhat.
The great benefit of this approach is that it follows the modular design style of AsmResolver more closely, as AsmResolver.PortablePdb and AsmResolver.CilDb will be able to depend on AsmResolver.DotNet without the others also needing to do that. The obvious downside is that the number of new packages increases a lot, and users of AsmResolver might not like that.
Any news regarding implementation of that?
Unfortunately, no concrete implementations yet. As it is right now, other features and bug reports have taken precedence over completely new features such as PDB file support. If there is demand for PDB support, however, I may move this feature up the backlog.
I would be interested in helping with this. My interest lies mostly in the PDB 7 format and a little in the Portable PDB format. Even though it increases the package count, I am in favor of your suggestion to do 4 or 5 packages for this. It seems cleaner and most software publishes as a single file.
@ds5678, thanks for taking an interest. Besides the package design, a couple of additional big questions still need to be answered, which will probably also indirectly determine which new packages we finally end up with. Some thoughts below:
This feature will probably require quite a bit of prep work before actual coding and integration can take place. One big aspect we need to figure out is whether it is possible to find some kind of unifying API design that abstracts at least some parts of every format into a higher-level model. At least for read support this would be preferable, as it would simplify usage of the packages a lot. However, too much simplification can also lead to certain features of some formats being forgotten or hidden, which we need to be careful about. One thing I can predict already is that writers are most likely going to have to define their own contracts for their respective formats within their respective packages. I don't think we can (or want to) find a unifying API for this, given the vast differences between these formats.
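As a rough illustration of that trade-off, the sketch below pairs a small shared read-only contract with format-specific readers that still expose their own extras. Every name here (`SymbolReader`, `Pdb7Reader`, `PortablePdbReader`, `describe`) is hypothetical, and the in-memory constructors merely stand in for real parsing.

```python
from abc import ABC, abstractmethod
from typing import List


class SymbolReader(ABC):
    """Hypothetical read-only contract that every format could implement."""

    @abstractmethod
    def format_name(self) -> str: ...

    @abstractmethod
    def source_files(self) -> List[str]: ...


class Pdb7Reader(SymbolReader):
    """Stand-in for an MSF-based reader; keeps raw stream access available."""

    def __init__(self, source_files: List[str], streams: List[bytes]):
        self._source_files = list(source_files)
        self._streams = list(streams)

    def format_name(self) -> str:
        return "PDB v7"

    def source_files(self) -> List[str]:
        return list(self._source_files)

    def stream(self, index: int) -> bytes:
        # Format-specific feature that the shared contract does not hide.
        return self._streams[index]


class PortablePdbReader(SymbolReader):
    """Stand-in for a metadata-based reader."""

    def __init__(self, source_files: List[str]):
        self._source_files = list(source_files)

    def format_name(self) -> str:
        return "Portable PDB"

    def source_files(self) -> List[str]:
        return list(self._source_files)


def describe(reader: SymbolReader) -> str:
    """Consumer code that only needs the shared contract."""
    return f"{reader.format_name()}: {len(reader.source_files())} source file(s)"
```

Code such as `describe` works against any reader, while callers that need MSF streams can still depend on the concrete type. Writers are left out on purpose, matching the expectation that each format defines its own writing contract.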
Pdb7 and PortablePdb seem to be the ones used the most nowadays, so this is definitely where we should focus first. Another question is how (if at all) these packages should be integrated into AsmResolver.DotNet, especially given the fact that .NET uses both Pdb7 (legacy .NET Framework) and PortablePdb (.NET Core / .NET). For example, the names of local variables of method bodies are stored in these symbols. Other libraries (such as dnlib and Cecil) provide a Name property on the class that represents local variables, which pulls its data from the PDB. Do we want something similar as well, or will this inevitably lead to tighter coupling of the packages, something I think we really should try to avoid?
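One way to offer such a name without a hard dependency is to let the local variable model accept an optional provider defined as a small interface. This is only a sketch under that assumption: `LocalNameProvider`, `LocalVariable` and `DictNameProvider` are invented for illustration, and the `V_<index>` fallback is just a placeholder convention.

```python
from typing import Dict, Optional, Protocol, Tuple


class LocalNameProvider(Protocol):
    """Hypothetical interface that a symbols package could implement."""

    def get_local_name(self, method_token: int, index: int) -> Optional[str]: ...


class LocalVariable:
    """Local variable model that works with or without symbols attached."""

    def __init__(
        self,
        method_token: int,
        index: int,
        provider: Optional[LocalNameProvider] = None,
    ):
        self.method_token = method_token
        self.index = index
        self._provider = provider

    @property
    def name(self) -> str:
        # Ask the provider only if one is attached; otherwise use a placeholder.
        if self._provider is not None:
            resolved = self._provider.get_local_name(self.method_token, self.index)
            if resolved is not None:
                return resolved
        return f"V_{self.index}"


class DictNameProvider:
    """Toy provider backed by a dictionary instead of a real PDB reader."""

    def __init__(self, names: Dict[Tuple[int, int], str]):
        self._names = dict(names)

    def get_local_name(self, method_token: int, index: int) -> Optional[str]:
        return self._names.get((method_token, index))
```

With this shape the core model only knows about the interface, so the Pdb7 and PortablePdb packages could each supply a provider without the core package referencing either of them.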
Related #91