Currently, parsing is whole-program: the only way to parse a bitcode file is to parse every entity inside it. For most clients of llvm-pretty-bc-parser, however, this is wasteful, as these clients typically only need a fraction of the entities inside the file. Nevertheless, every client must currently pay the cost of both loading the entire file, which can be a time-consuming process, and keeping the loaded contents around in memory, which can use quite a bit of space. This is especially painful for large bitcode files, which are common in C++ programs and in large applications.
Fortunately, we can do better here by deferring some of the parsing work until a client actually needs it. The LLVM Bitcode File Format page has this to say about blocks in a bitstream:
Block definitions allow the reader to efficiently skip blocks in constant time if the reader wants a summary of blocks, or if it wants to efficiently skip data it does not understand. The LLVM IR reader uses this mechanism to skip function bodies, lazily reading them on demand.
I propose that we do just that: defer the parsing of function bodies until a client actually requests it. I haven't precisely worked out the finer details of implementing this, but as with most things LLVM-related, the best place to learn is from the upstream source code for the bitcode reader. In broad terms, we will want to preprocess the bitcode file to construct a mapping from function names to the blocks containing their bodies. Later, when a client requests a particular function, if that function's body has not been parsed, we will do so on demand.
Some currently unresolved design questions:
What should the API for this look like? The rough idea I had in my head is something like this:
Where BitCodeHandle is an abstract data type representing an opened, bitcode file in the middle of being incrementally parsed. We would need to figure out how many operations on top of BitCodeHandle to provide, e.g.:
lookupFunction :: Symbol -> BitCodeHandle -> IO (Either Error Define)
Does it suffice to only parse function bodies incrementally, or do we need to handle other things incrementally as well?
Should an API for incremental parsing replace the current API for eager parsing? Should they live alongside each other? If the latter, should the API for eager parsing be reimplemented in terms of the API for incremental parsing?
See also GaloisInc/llvm-pretty-bc-parser#176, which discusses other possible performance improvements.
Currently, parsing is whole-program: the only way to parse a bitcode file is to parse every entity inside it. For most clients of
llvm-pretty-bc-parser
, however, this is wasteful, as these clients typically only need a fraction of the entities inside the file. Nevertheless, every client must currently pay the cost of both loading the entire file, which can be a time-consuming process, and keeping the loaded contents around in memory, which can use quite a bit of space. This is especially painful for large bitcode files, which are common in C++ programs and in large applications.Fortunately, we can do better here by deferring some of the parsing work until a client actually needs it. The LLVM Bitcode File Format page has this to say about blocks in a bitstream:
I propose that we do just that: defer the parsing of function bodies until a client actually requests it. I haven't precisely worked out the finer details of implementing this, but as with most things LLVM-related, the best place to learn is from the upstream source code for the bitcode reader. In broad terms, we will want to preprocess the bitcode file to construct a mapping from function names to the blocks containing their bodies. Later, when a client requests a particular function, if that function's body has not been parsed, we will do so on demand.
Some currently unresolved design questions:
What should the API for this look like? The rough idea I had in my head is something like this:
Where
BitCodeHandle
is an abstract data type representing an opened, bitcode file in the middle of being incrementally parsed. We would need to figure out how many operations on top ofBitCodeHandle
to provide, e.g.:See also GaloisInc/llvm-pretty-bc-parser#176, which discusses other possible performance improvements.