Open bobbinth opened 2 months ago
- Whether the MAST should contain the original sources when compiled with debug info enabled
- Whether we should support a split debug info format, so that you can publish the MAST without debug info, but ship the debug info metadata separately so that it is still possible to debug the code during execution
- Whether we should support stripping debug info from compiled MAST
My current thinking on this is that we should include the original MASM in `MastForest` as an optional component. It could be stored simply as a vector of `SourceFile`s. The struct itself could look something like this:
```rust
pub struct MastForest {
    /// All of the nodes local to the trees comprising the MAST forest.
    nodes: Vec<MastNode>,

    /// Roots of procedures defined within this MAST forest.
    roots: Vec<MastNodeId>,

    /// MASM source code of this MAST forest indexed by file name.
    source_code: BTreeMap<Arc<str>, SourceFile>,
}
```
Then, on deserialization, we would control whether we want to deserialize `MastForest` with or without debug info. Deserializing without debug info would omit the `source_code` map and would also strip all `AsmOp` decorators (and, in the future, any other debug-related decorators).
We could also add something like `MastForest::strip_debug_info()` to strip debug info (if any) from an already instantiated `MastForest`.
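To make the stripping step concrete, here is a minimal sketch of what `strip_debug_info()` might do. Everything below is a simplified stand-in rather than the actual miden-vm types: the `Decorator` variants, the `is_debug` and `has_debug_info` helpers, and the `decorators` field on `MastNode` are assumptions made only for illustration.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Hypothetical, simplified decorator type (not the real miden-vm enum).
#[derive(Debug, Clone, PartialEq)]
pub enum Decorator {
    /// Debug-only decorator, standing in for `AsmOp`.
    AsmOp(String),
    /// A non-debug decorator that must survive stripping.
    Trace(u32),
}

impl Decorator {
    /// Assumed helper: true for decorators that exist only for debugging.
    fn is_debug(&self) -> bool {
        matches!(self, Decorator::AsmOp(_))
    }
}

/// Hypothetical node that carries its decorators inline.
#[derive(Debug, Clone, PartialEq)]
pub struct MastNode {
    pub decorators: Vec<Decorator>,
}

/// Hypothetical source file: just the raw text.
#[derive(Debug, Clone, PartialEq)]
pub struct SourceFile {
    pub content: String,
}

#[derive(Debug, Default)]
pub struct MastForest {
    pub nodes: Vec<MastNode>,
    pub source_code: BTreeMap<Arc<str>, SourceFile>,
}

impl MastForest {
    /// Drops the embedded sources and every debug-only decorator, in place.
    pub fn strip_debug_info(&mut self) {
        self.source_code.clear();
        for node in &mut self.nodes {
            node.decorators.retain(|d| !d.is_debug());
        }
    }

    /// Assumed helper: true if any sources or debug decorators remain.
    pub fn has_debug_info(&self) -> bool {
        !self.source_code.is_empty()
            || self
                .nodes
                .iter()
                .any(|n| n.decorators.iter().any(|d| d.is_debug()))
    }
}
```

Because stripping works on an already instantiated forest, the same routine could in principle back a "deserialize without debug info" option by running right after a full deserialization, at the cost of reading data that is then thrown away.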
Yeah, that's more or less what I had in mind, though I was imagining that we'd maybe store it as part of the theoretical `Package` type, and simply supply it alongside the `MastForest` when constructing a `Process` with debugging enabled. Having it in the `MastForest` might simplify some things though.
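As a rough illustration of the "supply it alongside" variant, the sketch below keeps debug info outside the forest and hands it to the process only when debugging is wanted. The `Package`, `DebugInfo`, and `Process` shapes here are purely hypothetical, as are all of their fields and methods; none of this reflects an existing API.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Hypothetical placeholder for the real MAST forest.
#[derive(Debug, Default)]
pub struct MastForest {
    pub node_count: usize,
}

/// Hypothetical container for everything debug-related.
#[derive(Debug, Default)]
pub struct DebugInfo {
    /// Original sources (in any language), keyed by file name.
    pub sources: BTreeMap<Arc<str>, String>,
}

/// Hypothetical package: the forest plus optional, separable debug info.
#[derive(Debug, Default)]
pub struct Package {
    pub forest: MastForest,
    pub debug_info: Option<DebugInfo>,
}

impl Package {
    /// Splits the package so the forest can be published on its own.
    pub fn split(self) -> (MastForest, Option<DebugInfo>) {
        (self.forest, self.debug_info)
    }
}

/// Hypothetical process that borrows the forest and, optionally, debug info.
pub struct Process<'a> {
    forest: &'a MastForest,
    debug_info: Option<&'a DebugInfo>,
}

impl<'a> Process<'a> {
    /// Creates a process without debugging support.
    pub fn new(forest: &'a MastForest) -> Self {
        Self { forest, debug_info: None }
    }

    /// Creates a process with debug info supplied alongside the forest.
    pub fn with_debug_info(forest: &'a MastForest, debug_info: &'a DebugInfo) -> Self {
        Self { forest, debug_info: Some(debug_info) }
    }

    pub fn is_debugging(&self) -> bool {
        self.debug_info.is_some()
    }

    pub fn node_count(&self) -> usize {
        self.forest.node_count
    }

    /// Returns the text of a source file, if debug info is attached.
    pub fn source(&self, file: &str) -> Option<&str> {
        self.debug_info?.sources.get(file).map(String::as_str)
    }
}
```

With this layout, split debug info comes almost for free: the two halves returned by `split` can be serialized and shipped independently.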
One thing to note: the sources are unlikely to be MASM in general. The vast majority of the time I expect they will be Rust sources, or other high-level language sources. It doesn't actually matter what the sources are, but we should be sure not to assume anything about them, other than that the `SourceSpan` identifies the specific byte offsets in the relevant `SourceFile` which contains the code from which the given MAST instruction was derived. Depending on how far that code is, in terms of abstraction level, from the underlying MAST, there could be a large number of instructions corresponding to a single line of source code.
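The language-agnostic assumption can be sketched as follows: a span is just a byte range, and the file is just text, so resolving a span never needs to know what language the source is in. The field layout of `SourceSpan` and `SourceFile` and the `snippet` and `line_of` helpers below are hypothetical, chosen only to illustrate the idea.

```rust
/// Hypothetical, simplified span: a half-open byte range into one source file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceSpan {
    pub start: u32,
    pub end: u32,
}

/// Hypothetical source file: a name plus raw text in any language.
#[derive(Debug, Clone)]
pub struct SourceFile {
    pub name: String,
    pub content: String,
}

impl SourceFile {
    /// Returns the text covered by `span`, or `None` if the range is out of
    /// bounds, inverted, or does not fall on UTF-8 character boundaries.
    pub fn snippet(&self, span: SourceSpan) -> Option<&str> {
        self.content.get(span.start as usize..span.end as usize)
    }

    /// Returns the 1-based line number on which `span` starts.
    pub fn line_of(&self, span: SourceSpan) -> Option<usize> {
        let prefix = self.content.get(..span.start as usize)?;
        Some(prefix.bytes().filter(|&b| b == b'\n').count() + 1)
    }
}
```

Many MAST instructions derived from the same expression would simply carry the same `SourceSpan` value.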
The main reason I bring that up is that how we encode the actual locations matters a lot for the size of the files we generate. The actual text files are small in comparison to the raw location data if you store it fully expanded (i.e. where each instruction has its own `Location`). A key thing we'll need to do is determine how to encode that data efficiently, so that streams of tens of instructions that all share the same location don't bloat the debug info by a factor of 10 or more.
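One possible encoding, shown here only as a sketch and not as a settled design, is run-length encoding: store each distinct location once, together with the number of consecutive instructions that share it. The `Location` fields, the `LocationRun` type, and the `encode` and `lookup` functions are all hypothetical names introduced for this example.

```rust
/// Hypothetical location: a file index plus a byte range in that file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Location {
    pub file: u32,
    pub start: u32,
    pub end: u32,
}

/// One run: `count` consecutive instructions that all share `location`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LocationRun {
    pub location: Location,
    pub count: u32,
}

/// Collapses a per-instruction location list into runs of equal locations.
pub fn encode(locations: &[Location]) -> Vec<LocationRun> {
    let mut runs: Vec<LocationRun> = Vec::new();
    for &location in locations {
        if let Some(run) = runs.last_mut() {
            if run.location == location {
                // Same location as the previous instruction: extend the run.
                run.count += 1;
                continue;
            }
        }
        runs.push(LocationRun { location, count: 1 });
    }
    runs
}

/// Finds the location of the instruction at `index` by walking the runs.
pub fn lookup(runs: &[LocationRun], index: usize) -> Option<Location> {
    let mut remaining = index;
    for run in runs {
        let count = run.count as usize;
        if remaining < count {
            return Some(run.location);
        }
        remaining -= count;
    }
    None
}
```

The linear `lookup` is fine for a sketch; a real format would more likely store cumulative instruction offsets so that lookups can use binary search.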
Having something like `strip_debug_info` would be good for publishing on-chain, I would imagine. Supplying debug info there is unlikely to be viable, but we don't want to have to recompile code just to strip the debug info; ideally it can be done as a post-processing step like you describe.
Originally posted by @bitwalker in https://github.com/0xPolygonMiden/miden-vm/issues/990#issuecomment-2270127545