jasonwhite opened 6 years ago
Is it possible to modify parts of a PDB or rewrite it entirely with this? Browsing through the docs, it looks like it only reads PDBs.

The reason I ask is that I'm considering rewriting my C++ tool Ducible in Rust using this crate. The tool rewrites PDBs to remove non-deterministic data. By far, the greatest effort was in deciphering the PDB format from Microsoft's pile of :hankey:. (It's good that the LLVM guys have documented this a little bit.) So, I'd be happy to switch to a good PDB parsing library and gain Rust's ability to produce rock-solid software.

If you think writing PDBs falls into the purview of this library and it isn't too difficult to add, I could take a stab at implementing it with some guidance.

I'm currently doing it by having an abstract stream type where a stream can be either on-disk or in-memory. An MSF stream is then replaced with an in-memory stream before everything is written back out to disk. In this way, streams are essentially copy-on-write. Doing it like this in Rust could be difficult with the ownership system, so I don't think this is the best approach. I'm definitely open to any good ideas about how to do this.

P.S. Thanks for writing this library. The PDB format is a real pain in the arse to parse.
> Is it possible to modify parts of a PDB or rewrite it entirely with this?

Correct, this library is read-only.
> If you think writing PDBs falls into the purview of this library and it isn't too difficult to add, I could take a stab at implementing it with some guidance.
The bad news is that writing PDBs would be all new code. The good news is that it would be adjacent to code which can already read PDBs, and it's easy to test symmetrical transforms. I think it makes more sense to have PDB-writing code here than in a separate library.
There are two main pieces:

1. The `PDB` object. This is all read-centric and would need corresponding logic to go the other direction, and it'll be different enough that it should be a separate `PDBBuilder`. `PDB`'s API is geared towards random/concurrent reads of a subset of the file; `PDBBuilder`'s API would probably be geared towards sequential writes of the entire file.
2. The record parsing. The existing hand-written code turns bytes into `struct`s; we would need corresponding code to go from `struct`s back to bytes. Instead of writing this by hand, it's probably better to do #10 and replace the handwritten code with machine-generated code that can read and write all the structs (see the sketch below).

If you want to take a stab at this, #10 is the place to start, and it would definitely be good to approach it with read/write fidelity in mind.
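To make the read/write symmetry concrete, here's a hypothetical sketch of the shape machine-generated code could take. The `Record` trait and the `LineInfo` struct are invented for illustration; the real design would fall out of #10.

```rust
// Hypothetical sketch: one trait covering both directions, so a derive
// macro or code generator can emit `read` and `write` from a single
// layout description. All names here are invented for illustration.
trait Record: Sized {
    fn read(buf: &mut &[u8]) -> Option<Self>;
    fn write(&self, out: &mut Vec<u8>);
}

#[derive(Debug, PartialEq)]
struct LineInfo {
    offset: u32,
    line: u16,
}

impl Record for LineInfo {
    fn read(buf: &mut &[u8]) -> Option<Self> {
        let s = *buf;
        if s.len() < 6 {
            return None;
        }
        // Fields are read in declaration order, little-endian, exactly
        // mirroring `write` below.
        let (head, rest) = s.split_at(6);
        *buf = rest;
        Some(LineInfo {
            offset: u32::from_le_bytes(head[..4].try_into().ok()?),
            line: u16::from_le_bytes(head[4..6].try_into().ok()?),
        })
    }

    fn write(&self, out: &mut Vec<u8>) {
        out.extend_from_slice(&self.offset.to_le_bytes());
        out.extend_from_slice(&self.line.to_le_bytes());
    }
}

fn main() {
    // Read/write fidelity as a testable property: a record round-trips
    // through bytes unchanged.
    let rec = LineInfo { offset: 0x1000, line: 42 };
    let mut bytes = Vec::new();
    rec.write(&mut bytes);
    assert_eq!(LineInfo::read(&mut bytes.as_slice()), Some(rec));
}
```

The property a generator should guarantee is exactly that round trip: reading and then writing reproduces the original bytes, including distinctions the current API throws away.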
This would prompt some user-facing API changes. Take this code for example:
```rust
S_LDATA32 | S_LDATA32_ST |
S_GDATA32 | S_GDATA32_ST |
S_LMANDATA | S_LMANDATA_ST |
S_GMANDATA | S_GMANDATA_ST => {
    Ok(SymbolData::DataSymbol(DataSymbol {
        // G* kinds are global; everything else here is local
        global: match kind {
            S_GDATA32 | S_GDATA32_ST | S_GMANDATA | S_GMANDATA_ST => true,
            _ => false,
        },
        // *MANDATA kinds are managed
        managed: match kind {
            S_LMANDATA | S_LMANDATA_ST | S_GMANDATA | S_GMANDATA_ST => true,
            _ => false,
        },
        type_index: buf.parse_u32()?,
        offset: buf.parse_u32()?,
        segment: buf.parse_u16()?,
    }))
}
```
The `global` and `managed` flags together communicate `L` vs `G` and `MANDATA` vs `DATA`, but the `_ST` vs non-`_ST` distinction is discarded. That's okay right now since they're equivalent for reading, but it's a bug if we need to read/write symbol records without changing them.
What happens if `struct DataSymbol` stores `kind: u16`? Well, it would fix the data loss issue. It also suggests discarding `global` and `managed` in favor of methods that check `kind`; users would need to call `.global()` instead of `.global`, but hey. With those changes, `DataSymbol`'s parsing code could then be `#[derive]`d from the data structure itself, since the struct is `u16`, `u32`, `u32`, `u16` just like the on-disk format – and it means that the generating code can be `#[derive]`d too.
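As a rough sketch of that shape (not this crate's actual API; the `S_*` values are as given in Microsoft's cvinfo.h):

```rust
// Hypothetical sketch: DataSymbol keeps the raw kind, so re-serializing
// it is lossless, and the old boolean fields become methods.
const S_GDATA32_ST: u16 = 0x1008;
const S_LMANDATA_ST: u16 = 0x1020;
const S_GMANDATA_ST: u16 = 0x1021;
const S_GDATA32: u16 = 0x110d;
const S_LMANDATA: u16 = 0x111c;
const S_GMANDATA: u16 = 0x111d;

struct DataSymbol {
    kind: u16, // preserved exactly as read, including the _ST distinction
    type_index: u32,
    offset: u32,
    segment: u16,
}

impl DataSymbol {
    fn global(&self) -> bool {
        matches!(
            self.kind,
            S_GDATA32 | S_GDATA32_ST | S_GMANDATA | S_GMANDATA_ST
        )
    }

    fn managed(&self) -> bool {
        matches!(
            self.kind,
            S_LMANDATA | S_LMANDATA_ST | S_GMANDATA | S_GMANDATA_ST
        )
    }
}
```

Since the struct is now exactly the on-disk `u16, u32, u32, u16`, both the parser and the serializer could be generated from the field list.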
Thank you for the detailed response!
This sounds like a good design to me. For my use case, I'm fine with sequential writes to a PDB. In fact, I want all the streams to be sequential instead of having their pages scattered all over the file. I think concurrent writes are possible, but they're definitely more complicated and not something I want to implement. (IIRC, pages are written atomically using the pairs of pages in the free page map.)
I'm not yet sure exactly what the `PDBBuilder` API will look like. It should probably mirror the `PDB` API, writing out the same things that `PDB` reads in. It might also be cleaner to have an `MSFBuilder` that simply writes out the streams and provides a `commit()` function that writes out the stream table at the end of the file (like this).
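To make that concrete, here's a rough sketch of the `MSFBuilder` shape, assuming sequential whole-page writes; the stream table layout and header here are simplified stand-ins, not the real MSF format:

```rust
use std::io::{self, Seek, SeekFrom, Write};

// Hypothetical sketch: streams are written sequentially as whole pages,
// and commit() writes the stream table (and header) last.
struct MsfBuilder<W: Write + Seek> {
    out: W,
    page_size: u64,
    next_page: u64,
    streams: Vec<(u32, u64)>, // (byte length, first page) per stream
}

impl<W: Write + Seek> MsfBuilder<W> {
    fn new(out: W, page_size: u64) -> Self {
        // Page 0 is reserved for the header, which commit() fills in last.
        MsfBuilder { out, page_size, next_page: 1, streams: Vec::new() }
    }

    /// Append a stream as a contiguous run of pages; returns its index.
    fn add_stream(&mut self, data: &[u8]) -> io::Result<usize> {
        let first_page = self.next_page;
        self.out.seek(SeekFrom::Start(first_page * self.page_size))?;
        self.out.write_all(data)?;
        self.next_page += (data.len() as u64 + self.page_size - 1) / self.page_size;
        self.streams.push((data.len() as u32, first_page));
        Ok(self.streams.len() - 1)
    }

    /// Write the stream table as one final stream, then a (simplified)
    /// header pointing at it. A real MSF header also carries the magic,
    /// the free page maps, and the page list of the stream table.
    fn commit(mut self) -> io::Result<W> {
        let mut table = Vec::new();
        table.extend_from_slice(&(self.streams.len() as u32).to_le_bytes());
        for &(len, first_page) in &self.streams {
            table.extend_from_slice(&len.to_le_bytes());
            table.extend_from_slice(&(first_page as u32).to_le_bytes());
        }
        let table_page = self.next_page as u32;
        self.add_stream(&table)?;
        self.out.seek(SeekFrom::Start(0))?;
        self.out.write_all(&table_page.to_le_bytes())?;
        Ok(self.out)
    }
}
```

The useful property is that nothing touches page 0 until `commit()`, so a half-written file never looks valid.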
I'll take a look at solving #10 first though.
The simplifying conceit used by `pdb` is that MSF streams can be represented as contiguous byte slices. This means that there's no complex I/O layer scattered around – all the data in a stream is already a `&[u8]` by the time anything starts parsing it. The in-tree `Source` accomplishes this by reading a whole stream into a `Vec<u8>`, though I also have an out-of-tree implementation that uses `mmap()` to get a `&[u8]` without making a copy.
Carrying this idea through to the output side would suggest passing around a `&mut [u8]` and asking data structures to serialize themselves into that. But… we know exactly how long a stream is before we start to read it. Is that true of writing as well?
If we did know how long a stream would be, it's straightforward to imagine an implementation for `fn create_stream(&mut self, len: usize) -> &mut [u8]`. That implementation could even be trivially parallelized – memory map the output file, have `create_stream()` do all the MSF-related recordkeeping, and let the user mutate the stream contents until they're done.
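A toy version of that fixed-length case, with a `Vec<u8>` standing in for the memory-mapped output file (the recordkeeping is the point, not the backing storage):

```rust
// Hypothetical sketch: create_stream() does the MSF recordkeeping up
// front and hands back a mutable view for the caller to fill in.
struct MsfOutput {
    file: Vec<u8>, // stands in for a memory-mapped output file
    page_size: usize,
    next_page: usize,
    streams: Vec<(usize, usize)>, // (first page, byte length) per stream
}

impl MsfOutput {
    fn create_stream(&mut self, len: usize) -> &mut [u8] {
        // Round the stream up to whole pages and claim them.
        let pages = (len + self.page_size - 1) / self.page_size;
        let first_page = self.next_page;
        self.next_page += pages;
        self.streams.push((first_page, len));

        let start = first_page * self.page_size;
        let end = start + pages * self.page_size;
        if self.file.len() < end {
            self.file.resize(end, 0); // with a real mmap this goes away
        }
        &mut self.file[start..start + len]
    }
}
```

Handing out several of those `&mut [u8]`s at once would take a little more care in Rust (disjoint slices via `split_at_mut`, or raw page mappings), but the allocation bookkeeping stays this simple.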
If we don't know how long every stream will be, then we need a way to get a growable `&mut [u8]` – the bytes need to point to the MSF's storage, and the MSF needs to know the stream's length in order to write the proper records later. The straightforward way to implement this would be to support only a single simultaneous growable stream, since we can always allocate more pages at the end of the file. Note that this still wouldn't require serializing writes, only serializing allocations.
Thinking aloud some more: the main reason `pdb` can't memory map everything on read is because MSF has a page size of 4K, while the Windows virtual memory subsystem has an allocation granularity of 64K. Windows can't map arbitrary MSF streams into `&[u8]`s. But… for writes, `pdb` would control stream allocation, and if `pdb` stuck to 64K boundaries, maybe this could work. You get a `GrowableStream` which you can use as a `&mut [u8]`. If you ask it to grow, it's supposed to grow into a longer `&mut [u8]`. Growing would normally require bumping an internal `length` counter, but if your request exceeds `capacity`, the stream gets more space from the MSF and constructs a new contiguous view. If there's only one writer, each stream will always be contiguous on-disk since we will always just append to the end of the file, but supporting multiple concurrent writers wouldn't be out of reach if we secretly aligned stream allocations on larger boundaries.
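A minimal in-memory sketch of that length/capacity split, with a `Vec<u8>` standing in for 64K-aligned, remappable MSF pages:

```rust
// Hypothetical sketch: growing within capacity just bumps a length;
// exceeding capacity gets more space and rebuilds the contiguous view.
struct GrowableStream {
    storage: Vec<u8>, // stands in for memory-mapped MSF pages
    len: usize,
    alloc_unit: usize, // e.g. 64K, to match Windows allocation granularity
}

impl GrowableStream {
    fn new(alloc_unit: usize) -> Self {
        GrowableStream { storage: vec![0; alloc_unit], len: 0, alloc_unit }
    }

    /// Grow the stream to `new_len` bytes and return the writable view.
    fn grow(&mut self, new_len: usize) -> &mut [u8] {
        if new_len > self.storage.len() {
            // Out of capacity: get more pages from the MSF (modeled here
            // by extending the Vec) and construct a new contiguous view.
            let units = (new_len + self.alloc_unit - 1) / self.alloc_unit;
            self.storage.resize(units * self.alloc_unit, 0);
        }
        self.len = new_len;
        &mut self.storage[..self.len]
    }
}
```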
FYI, while looking at the Microsoft source again (no thanks to you :wink:) I noted this comment about storing the stream table page list in a list of pages:
```cpp
// This organization enables efficient two-phase commit. At commit time,
// after one or more streams have been written (to new pages), a new
// StrmTbl stream is written and the new FPM is written. Then, a single
// write to hdr swaps the roles of the two FPM sets and atomically
// updates the MSF to reflect the new location of the StrmTbl stream.
```
I don't know if this is what Microsoft's tools do in practice, but it seems like it'd be fairly straightforward to support this for cases of updating an existing PDB file: you just write out all the new data to empty pages, write out a new stream table page list (also to empty pages), and then write out the header with the new stream table pages.
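In code, that update path might look roughly like this; the page layout and header contents are invented for illustration, and the FPM write is reduced to a comment:

```rust
use std::fs::File;
use std::io::{self, Seek, SeekFrom, Write};

// Hypothetical sketch of the two-phase commit described above.
fn commit_update(
    f: &mut File,
    new_pages: &[(u64, Vec<u8>)],     // (page number, contents) of new data
    new_stream_table: (u64, Vec<u8>), // page number + bytes of new StrmTbl
    new_header: &[u8],
    page_size: u64,
) -> io::Result<()> {
    // Phase 1: everything lands in previously-free pages, so a crash here
    // leaves the old PDB fully intact. (Writing the new FPM is elided.)
    for (page, data) in new_pages {
        f.seek(SeekFrom::Start(page * page_size))?;
        f.write_all(data)?;
    }
    let (table_page, table) = new_stream_table;
    f.seek(SeekFrom::Start(table_page * page_size))?;
    f.write_all(&table)?;
    f.sync_data()?;

    // Phase 2: a single header write atomically flips the active FPM set
    // and the StrmTbl location over to the new versions.
    f.seek(SeekFrom::Start(0))?;
    f.write_all(new_header)?;
    f.sync_data()
}
```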
Bumping this issue with some new insights. At our company we found ourselves needing to add new symbol information into the PDB. We only cared about adding new `LabelSymbol`s.
We forked this library and modified the following:

- `BlockMapAddr` (a page describing all of the page numbers used by the `StreamDirectory`)
- `StreamDirectory`
- `SymRecordStream`

The PDB code is from the 90s; it's an Elder Scroll. Figured I'd put my $0.02 here and say what we did. :)
For reference:

- `StreamDirectory` - https://llvm.org/docs/PDB/DbiStream.html (Ctrl-F for it on this page)
- `BlockMapAddr` - https://llvm.org/docs/PDB/MsfFile.html (Ctrl-F for it on this page)
- `SectionMap*` - https://llvm.org/docs/PDB/DbiStream.html#section-contribution-substream
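For concreteness, appending one such record at the byte level looks roughly like this. It's a hypothetical sketch following the CodeView `LABELSYM32` layout from cvinfo.h; a real implementation would also pad records to 4-byte alignment and then update the stream directory as described above:

```rust
/// Hypothetical sketch: append an S_LABEL32 record to a symbol stream
/// buffer. LABELSYM32 is a u16 record length (excluding itself), u16
/// kind, u32 offset, u16 segment, u8 flags, then a NUL-terminated name.
fn append_label(stream: &mut Vec<u8>, offset: u32, segment: u16, name: &str) {
    const S_LABEL32: u16 = 0x1105;
    let name_bytes = name.as_bytes();
    let len = (2 + 4 + 2 + 1 + name_bytes.len() + 1) as u16;
    stream.extend_from_slice(&len.to_le_bytes());
    stream.extend_from_slice(&S_LABEL32.to_le_bytes());
    stream.extend_from_slice(&offset.to_le_bytes());
    stream.extend_from_slice(&segment.to_le_bytes());
    stream.push(0); // CV_PROCFLAGS, none set
    stream.extend_from_slice(name_bytes);
    stream.push(0); // NUL terminator
}
```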
Writing anything more complex to the PDB is going to require major changes to pdb-rs; as has been said before, it's read-centric. The `SourceView` creates a linear sequence of bytes out of a non-contiguous system, so you would need a way to map this view back to those pages after changing them. Nightmare fuel. Our fork is not the best solution to the problem, since we end up making huge PDB files.