luser opened this issue 6 years ago
Yep, you've got it.

`Stream` represents a possibly-expensive copy of a PDB stream. Think of it as a `Vec<u8>` (and note that it is a `Vec<u8>` unless you wrote your own `Source`). The `Stream` owns the data for a PDB stream, either by owning a copy of the data or by owning whatever resources are needed to maintain access to the data. The `Stream` therefore needs to live at least as long as any references to things inside the stream.

A `ParseBuffer` is basically a `&[u8]` with convenience functions. (Some of those convenience functions are sensitive to the offset within the stream, so it's actually a `(&[u8], usize)`.) It owns nothing; it's a slice borrowed from a `Stream`. You can get `&[u8]`s from a `ParseBuffer` -- `take()`, `parse_cstring()`, etc. These `&[u8]`s can outlive the `ParseBuffer`, but they can't outlive the `Stream`, since they're borrowed from the `Stream` exactly how a `&[u8]` is borrowed from a `Vec<u8>`.
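A minimal, dependency-free analogue of that borrowing chain, using a plain `Vec<u8>` in place of `Stream` and a slice binding in place of `ParseBuffer`:

```rust
fn main() {
    // The owner, standing in for `Stream`.
    let stream: Vec<u8> = vec![b'h', b'i', 0, 42];

    // A short-lived view, standing in for `ParseBuffer`.
    let sub: &[u8] = {
        let parse_buffer: &[u8] = &stream; // borrows from `stream`
        &parse_buffer[0..2]                // re-borrows from `stream`, not from the binding
    }; // `parse_buffer` goes out of scope here...

    // ...but `sub` is still valid: its lifetime is tied to `stream`, which is alive.
    assert_eq!(sub, b"hi");
}
```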
So far this has worked out okay, with the pattern for the existing features being:

```rust
// create a PDB from a File
let mut pdb = PDB::open(…)?;
// type_information() reads the TPI stream in its entirety, and TypeInformation owns that copy
// note also that reading things can mutate them, hence why `pdb` is `mut`
let type_information = pdb.type_information()?;
// do things with type_information -- parsing its Stream and whatnot -- but without mutating type_information
```
I have also struggled with this at times. For example, I started by assuming that `TypeFinder`'s functions should be part of `TypeInformation`, but attempting to do that was painful. Eventually I stopped fighting and just split them apart. I later found that keeping them separate actually has meaningful benefits; I've since written client code that loads a single `TypeInformation` and does different things with it in multiple threads. One of the threads builds a partial `TypeFinder` for one purpose, while another builds a full `TypeFinder` for another purpose which later processing stages borrow and use in parallel. Even if I had managed to satisfy the borrow checker with my initial combined approach, my `mut TypeInformation` would not have permitted this kind of use without loads of synchronization, while the `&TypeInformation` -> `mut TypeFinder` approach accidentally supported parallel processing right out of the box. ¯\_(ツ)_/¯
For me, I think the biggest mental hurdle is that the `Stream` isn't movable. I feel like I should be able to hold references to e.g. strings within the stream as long as the `Stream` is alive -- I know for a fact these references live behind a pointer, so moving the `Stream` has no effect on my `&[u8]`s -- but AFAIK there's no way to describe that in a way that the borrow checker understands.
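The "behind a pointer" intuition can be checked directly: moving a `Vec` moves only its (pointer, length, capacity) header, not the heap buffer that any slices point into. A small demonstration of the point, outside the borrow checker's view:

```rust
fn main() {
    let v: Vec<u8> = vec![1, 2, 3];
    let before = v.as_ptr();

    // Move the Vec to a new binding: the three-word header moves,
    // but the heap buffer it points to stays at the same address.
    let moved = v;
    assert_eq!(moved.as_ptr(), before);
}
```

So the data really is stable across the move -- the borrow checker just has no vocabulary for expressing "borrows from the heap allocation, not from the owning value".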
Just ran across this again and wanted to drop a note about what I did to solve a similar issue in my `minidump` crate. I made the `Minidump` type generic over the file data storage by declaring it as:

```rust
pub struct Minidump<'a, T>
    where T: Deref<Target=[u8]> + 'a,
{
    data: T,
    // ...
}
```

That way it's possible to create a `Minidump<&[u8]>`, `Minidump<Vec<u8>>`, or `Minidump<Mmap>` and all the lifetimes work sensibly. I implemented a convenience `Minidump::read_path` method on a separate `impl<'a> Minidump<'a, Mmap>` block to handle the common case of reading from a memory-mapped file. Internally everything just winds up calling `self.data.deref()` to get a `&'a [u8]` and things fall out from there.
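A minimal sketch of that pattern, with a hypothetical `Parsed` type standing in for `Minidump` and `Mmap` omitted to stay dependency-free:

```rust
use std::ops::Deref;

// Generic over anything that derefs to [u8]: &[u8], Vec<u8>, Mmap, Arc<[u8]>, ...
struct Parsed<T: Deref<Target = [u8]>> {
    data: T,
}

impl<T: Deref<Target = [u8]>> Parsed<T> {
    fn new(data: T) -> Self {
        Parsed { data }
    }

    // All accessors go through the deref'd slice, so the same code
    // serves the borrowed and the owned case.
    fn first_byte(&self) -> Option<u8> {
        self.data.first().copied()
    }
}

fn main() {
    let owned = Parsed::new(vec![1u8, 2, 3]); // Parsed<Vec<u8>>
    let bytes = [4u8, 5, 6];
    let borrowed = Parsed::new(&bytes[..]);   // Parsed<&[u8]>
    assert_eq!(owned.first_byte(), Some(1));
    assert_eq!(borrowed.first_byte(), Some(4));
}
```

(Current Rust no longer needs the explicit `+ 'a` bound shown in the quoted declaration; the sketch drops it.)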
@willglynn (I don't want to create a new issue since mine is pretty similar to what you wrote above.)

I want to populate the type finder (or a custom structure: hashmap/whatever) in a constructor such that I can use it for lookup later in a method. Right now it doesn't seem to be possible.

Basically, I have something similar in my `new` constructor. I also maintain a separate mapping for type name -> type index lookup. I see no way to store the updated type finder information as part of the structure, to be used later from the method I mentioned.

You can store `TypeInformation`, but it doesn't get updated when you update the type finder, so it's useless for this purpose. Trying to store the type finder itself will lead you nowhere because it depends on local variables, which will get destroyed when the constructor exits, so Rust doesn't allow that, which is fine. However, there doesn't seem to be a way to break this dependency by copying data. I've also tried storing `TypeData` in a hashmap, similar to this. Again, it doesn't seem to be possible due to a dependency on locals. `Box`ing a value or calling `clone` on it doesn't help either.

The only way to achieve this at the moment (that I can think of) is to copy each enum variant manually, but this is not good. Not interested in changing my APIs either. Am I missing something?

In addition, zero-copy is great; however, lifetimes are a pretty complex topic and this complicates things (especially for beginners). Maybe there should be a way to copy everything (you can write giant warnings in docs for such methods, but at least provide them in case people need that).
You're not missing anything; it is indeed this complicated. Your issue is slightly different from the one in the issue description. The complexity comes from two facts:

1. Reading a stream assembles its pages into an owned `Stream`. Since this is a rather expensive process, this is only done on demand for the very streams you read.
2. Self-referential structs could be expressed with crates like `rental`, but they cannot be used with the way we have to read data. This means that you as a user receive the burden to hold on to all data owners.

You can go about this in multiple ways:
To give an example for `TypeFinder`, you can have a look at how `symbolic` deals with this here and here. The relevant bits of the code are below:

```rust
// ItemMap borrows an item iterator and finder, and on-demand advances the
// iterator to populate the finder. It requires the caller to hold on to `TypeInformation`.
struct ItemMap<'s, I: ItemIndex> {
    iter: pdb::ItemIter<'s, I>,
    finder: pdb::ItemFinder<'s, I>,
}

impl<'s, I> ItemMap<'s, I>
where
    I: ItemIndex,
{
    pub fn try_get(&mut self, index: I) -> Result<pdb::Item<'s, I>, PdbError> {
        if index <= self.finder.max_index() {
            return Ok(self.finder.find(index)?);
        }

        while let Some(item) = self.iter.next()? {
            self.finder.update(&self.iter);
            match item.index().partial_cmp(&index) {
                Some(Ordering::Equal) => return Ok(item),
                Some(Ordering::Greater) => break,
                _ => continue,
            }
        }

        Err(pdb::Error::TypeNotFound(index.into()).into())
    }
}

// This holds on to all owned data and must be put somewhere on the stack
// while someone is using TypeMap.
struct MyPdbUtil<'d> {
    type_info: pdb::TypeInformation<'d>,
    // ...
}

impl<'d> MyPdbUtil<'d> {
    fn from_pdb(pdb: &mut PDB<'d>) -> Result<Self, PdbError> {
        Ok(Self {
            type_info: pdb.type_information()?,
            // ...
        })
    }

    fn type_map(&self) -> TypeMap<'_> {
        ItemMap {
            iter: self.type_info.iter(),
            finder: self.type_info.finder(),
        }
    }
}
```
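The shape of that pattern, reduced to a dependency-free sketch with hypothetical `Owner`/`View` names (playing the roles of `MyPdbUtil` and `ItemMap`):

```rust
// `Owner` holds the actual bytes and must outlive every view.
struct Owner {
    data: Vec<u8>,
}

// `View` only borrows, so it is cheap to create and can be rebuilt at
// will -- but the borrow checker stops it from outliving `Owner`.
struct View<'a> {
    slice: &'a [u8],
}

impl Owner {
    fn view(&self) -> View<'_> {
        View { slice: &self.data }
    }
}

fn main() {
    // The owner sits on the stack (or in some longer-lived struct)...
    let owner = Owner { data: vec![10, 20, 30] };
    // ...while any number of views borrow from it.
    let view = owner.view();
    assert_eq!(view.slice, &[10, 20, 30]);
}
```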
Apart from that, the clean solution to achieve true zero-copy and solve the awkward lifetimes is by moving the paging logic. Right now, we read and assemble pages for the entire stream into a Vec. Instead, we could resolve the physical address of RVAs at access time, and would just have to ensure that we never read across page boundaries (so potentially return `Cow`s).
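A rough sketch of what "resolve at access time, return `Cow`" could look like, with a toy 4-byte page size and `Vec<Vec<u8>>` standing in for the real page table (all names here are hypothetical, not crate API):

```rust
use std::borrow::Cow;

const PAGE: usize = 4; // tiny page size for illustration

// Read `len` bytes at `offset` from logically-contiguous pages.
// Borrow when the range fits in one page; copy (Cow::Owned) when it crosses.
fn read<'a>(pages: &'a [Vec<u8>], offset: usize, len: usize) -> Cow<'a, [u8]> {
    let (mut page, mut off) = (offset / PAGE, offset % PAGE);
    if off + len <= PAGE {
        return Cow::Borrowed(&pages[page][off..off + len]);
    }
    let mut buf = Vec::with_capacity(len);
    let mut remaining = len;
    while remaining > 0 {
        let take = remaining.min(PAGE - off);
        buf.extend_from_slice(&pages[page][off..off + take]);
        remaining -= take;
        page += 1;
        off = 0;
    }
    Cow::Owned(buf)
}

fn main() {
    let pages = vec![vec![0u8, 1, 2, 3], vec![4, 5, 6, 7]];
    // Fits in one page: zero-copy borrow.
    assert!(matches!(read(&pages, 1, 2), Cow::Borrowed(_)));
    // Crosses a page boundary: falls back to an owned copy.
    let crossing = read(&pages, 3, 2);
    assert!(matches!(crossing, Cow::Owned(_)));
    assert_eq!(&*crossing, &[3, 4]);
}
```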
@jan-auer Thanks for the quick and detailed reply!

FWIW, I "solved" it for now by creating a copy of `TypeData` that uses `String`s in its variants instead of a lifetime parameter (this way I don't have to use any lifetimes at all). I use a hashmap instead of the type finder for lookups, which I manage myself too. Not good, but at least it compiles now.

I'll take a closer look at your suggestions sometime later. I've already spent too much time on this and need to focus on something else. But I'll let you know if there are any issues!

Thanks again!
I've also run into this problem in pdb-addr2line. I've ended up propagating the awkwardness to the consumer of my API; rather than just creating a `Context`, they need to create a `ContextPdbData` and then a `Context` from that.
I keep running into this when hacking on this crate. I think the root of the problem is that `SourceView::as_slice` returns a slice whose lifetime matches the lifetime of the reference to the `SourceView`, not the contained lifetime, and then `Stream::parse_buffer` returns a `ParseBuffer` with a lifetime derived from a slice returned from `SourceView::as_slice`. This means that in practice you have to keep both the `Stream` and the `ParseBuffer` alive to do anything where you want to hand out references to data borrowed from the stream. This seems to mean that you always need an intermediate type in this situation to own the `ParseBuffer`.

I understand why you've written things this way -- you want to potentially support zero-copy operation by having a `SourceView` that points into a memory-mapped file. However, since `ReadView` owns its data, it can't actually hand out a reference that outlives itself, so you can't just change `SourceView::as_slice` to make the slice lifetime match the `SourceView` lifetime.

I don't know if there's a straightforward way to fix this, but it's frustrated me many times so I figured I'd at least write it down to get it out of my head.
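The two signatures being contrasted can be shown side by side with a hypothetical `View` type. When the data is *borrowed*, the returned slice can be tied to the contained lifetime and outlive the view value; when the data is *owned* (the `ReadView` case), only the `&self`-tied signature is possible:

```rust
// Hypothetical analogue of the borrowed-data case.
struct View<'s> {
    data: &'s [u8],
}

impl<'s> View<'s> {
    // Tied to the borrow of `self`: the slice dies with the reference.
    fn as_slice_short(&self) -> &[u8] {
        self.data
    }

    // Tied to the contained lifetime: the slice may outlive the `View` value.
    fn as_slice_long(&self) -> &'s [u8] {
        self.data
    }
}

fn main() {
    let backing = vec![1u8, 2, 3];
    let slice: &[u8] = {
        let view = View { data: &backing };
        view.as_slice_long() // ok: borrows from `backing`, not from `view`
        // `view.as_slice_short()` would not compile here, since `view` is dropped
    };
    assert_eq!(slice, &[1, 2, 3]);
}
```

An owning view has no `'s` to name for the second signature, which is exactly why `SourceView::as_slice` can't simply be changed.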