GaloisInc / llvm-pretty-bc-parser

Parser for the llvm bitcode format

New API for on-demand parsing #133

Open robdockins opened 5 years ago

robdockins commented 5 years ago

Currently, an entire bitcode module is parsed into AST nodes when it is loaded. This is a problem for the kinds of bitcode files we want to be able to handle, where we link together a lot of code (e.g. the C++ standard library) but only exercise a small amount of it.

We should refactor this library so that it parses portions of the bitcode file only as they are needed; this would greatly reduce memory usage and improve performance for our primary use cases.

Ptival commented 2 years ago

Here are some initial thoughts on how to proceed, hopefully leading to some discussion.

Problem 1: How to represent the delayed parsing?

So the idea is that upon encountering an ENTER_SUBBLOCK, we want to record its block length and the offset in the bitstream at which the block body starts, and only do more work on demand.

I was worried that we'd need to track not only byte offsets but also sub-word bit positions, but thankfully, blocks appear to be 32-bit-word-aligned and their lengths are multiples of 32-bit words!

So it seems we'd want some description of a computation to be performed, closed over the information it needs to run. I don't think Get has a way to reify its current state, but I think we can create a delayed Get that knows where to start parsing.

I'm thinking we might want to do something like an IORef (Either (Get a) a) to allow substituting the result for the delayed computation upon forcing. I'm open to pointers if such data types already exist, or to better alternatives.
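A minimal sketch of that idea, assuming binary's Data.Binary.Get and the 32-bit-word alignment noted above (Delayed, BlockRegion, delay, and force are all invented names):

```haskell
{-# LANGUAGE LambdaCase #-}

import Data.Binary.Get (Get, runGet)
import qualified Data.ByteString.Lazy as BSL
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Where a block's body lives, in 32-bit words (blocks are word-aligned
-- and word-multiple in length, so byte offset = 4 * word offset).
data BlockRegion = BlockRegion
  { blockWordOffset :: Int
  , blockWordLength :: Int
  }

-- A delayed parse: the suspended Get plus its region, or the result.
data Delayed a = Delayed BlockRegion (IORef (Either (Get a) a))

delay :: BlockRegion -> Get a -> IO (Delayed a)
delay region g = Delayed region <$> newIORef (Left g)

-- Force a delayed parse, caching the result for later accesses.
force :: BSL.ByteString -> Delayed a -> IO a
force bits (Delayed (BlockRegion off len) ref) =
  readIORef ref >>= \case
    Right a -> pure a
    Left g -> do
      let slice = BSL.take (fromIntegral (4 * len))
                           (BSL.drop (fromIntegral (4 * off)) bits)
          a = runGet g slice
      writeIORef ref (Right a)
      pure a
```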

Problem 2: Altering pure code that expects those delayed results

There is a lot of pure code that expects to just read some fields from parsed entries. It seems to force us into one of two alterations:

A. We make all such code effectful, so that it can force the (sub-)entries as it accesses them, or

B. We use different types for delayed entries and forced entries, and have these pure functions work over forced entries only.

The likely problem with option B is that entries are recursive, so we'd end up forcing all the sub-entries even if we only access some of them.
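To make that trade-off concrete, here is a sketch of option B with hypothetical entry types, reusing Delayed/force from the sketch above:

```haskell
type Header = String  -- placeholder for whatever per-entry metadata exists

data DelayedEntry = DelayedEntry
  { delayedHeader :: Header
  , delayedBody   :: Delayed [DelayedEntry]  -- sub-entries still unparsed
  }

data ForcedEntry = ForcedEntry
  { forcedHeader :: Header
  , forcedBody   :: [ForcedEntry]            -- recursively parsed
  }

-- Converting between the two must recurse, so forcing the root entry
-- forces every sub-entry, even the ones the caller never inspects.
forceEntry :: BSL.ByteString -> DelayedEntry -> IO ForcedEntry
forceEntry bits (DelayedEntry hdr body) = do
  subs <- force bits body
  ForcedEntry hdr <$> traverse (forceEntry bits) subs
```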

kquick commented 2 years ago

I'd really like to avoid using IORef, or even needing IO in the parsing.

One hypothetical approach (possibly naive):

It seems like the initial parsing will need to read at least the spine of the AST. If the AST representation were extended so that each node is either the actual parsed/resolved object or an (offset, length, Some a) triple, then there could be helper functions to retrieve a node; given the second form, the helper would invoke the Get a at that location in the ByteString as needed. I may be mistaken, but I don't think Get has much internal state beyond the type a you expect it to parse from the provided ByteString.
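A rough, IO-free rendering of that suggestion (hypothetical names; the Some a existential is avoided here by parameterizing the node type instead):

```haskell
import Data.Binary.Get (Get, runGet)
import qualified Data.ByteString.Lazy as BSL

-- Each node is either already resolved, or a recipe for resolving it:
-- the location of its bytes plus the parser to run there.
data Node a
  = Resolved a
  | Unresolved Int Int (Get a)  -- byte offset, byte length, parser

-- Pure helper: resolve a node against the original bitstream.
resolve :: BSL.ByteString -> Node a -> a
resolve _    (Resolved a)           = a
resolve bits (Unresolved off len g) =
  runGet g (BSL.take (fromIntegral len) (BSL.drop (fromIntegral off) bits))
```

Without the IORef, a repeated resolve re-parses the same bytes; sharing could still be recovered without IO by storing the resolved value as a lazy thunk closed over the ByteString rather than mutating in place.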

travitch commented 2 years ago

What @kquick is suggesting is pretty close to how our ELF parser (elf-edit) currently works. The initial parsing step only parses out the necessary metadata to understand the file and is mostly index structures. That returns what is effectively an index object that you can query for other information later, thus decoding on demand. You could imagine either having the library resolve entries transparently as they are accessed, or leaving decoding up to callers, who query the index explicitly.

Leaving it up to callers is kind of nice, since it enables them to control what memory is retained (to the extent possible in Haskell...). There is a bit of a challenge with entities like types or global references that might need to be re-parsed often. I'm not quite sure what I think about that at the moment.

A simple interface might look like:
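(A sketch with invented names; ModuleIndex, loadModuleIndex, and decodeFunction are not an existing API, while Symbol, Type, Global, and Define are the AST types from llvm-pretty's Text.LLVM.AST:)

```haskell
import qualified Data.ByteString as BS
import Data.Map (Map)
import Text.LLVM.AST (Define, Global, Symbol, Type)

data Error = Error String                     -- placeholder error type
data FunctionRegion = FunctionRegion Int Int  -- offset/length of a body

-- Cheap, eagerly parsed metadata; function bodies left undecoded.
data ModuleIndex = ModuleIndex
  { miTypes     :: [Type]
  , miGlobals   :: Map Symbol Global
  , miFunctions :: Map Symbol FunctionRegion
  }

-- Parse just enough of the file to build the index.
loadModuleIndex :: BS.ByteString -> Either Error ModuleIndex
loadModuleIndex = undefined  -- sketch only

-- Decode a single function body on demand.
decodeFunction :: ModuleIndex -> Symbol -> Either Error Define
decodeFunction = undefined  -- sketch only
```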

We could probably eagerly parse all of the types and globals, while making individual functions our unit of laziness. That would save us from the worst of the performance problems.