Finish debugging improvements

bobbinth commented 2 months ago

We have all of the necessary primitives for source-level debug info, with only a couple of minor questions outstanding, off the top of my head:

Whether the MAST should contain the original sources when compiled with debug info enabled

Whether we should support a split debug info format, so that you can publish the MAST without debug info, but ship the debug info metadata separately so that it is still possible to debug the code during execution

Whether we should support stripping debug info from compiled MAST

We don't assign source spans to if or while instructions. Those are very primitive forms of the original source code anyway, so in practice all of the "interesting" bits of the condition being evaluated will have source spans, and only the jump itself will not. Whether (and how) to support spans for these instructions remains an open question; it would matter primarily for Miden Assembly sources and for a smoother debugging experience.

Lastly, the way we are encoding debug info in the compiled MAST is "fat"; we could vastly reduce the cost of including debug info with a better encoding. For example, the following process could be used:

1. Is there no source location for this instruction? If so, store u8::MAX as a sentinel value, and proceed to the next instruction.

2. Is there a previous instruction with a source location? If so, see below; otherwise proceed to step 3.
   a. Is the location of the previous instruction in the same source file? If so, see below; otherwise proceed to step 3.
   b. Is the location of the previous instruction identical to that of the current instruction? If so, store a single byte, 0b10000000, in which only the most significant bit is set. This signals that this instruction has the same location as the previous instruction. Proceed to the next instruction.
   c. Otherwise, the location is in the same source file, so the first byte will be 0b11FXXXXX. F is 1 if the offset delta fits in the remaining bits, in which case the span length is decoded starting at the next byte. F is 0 if the remaining bits should be ignored, in which case the offset delta and span length both start at the next byte. Proceed to the next instruction.

3. Write the source file index, byte offset, and length of span as three variable-length integers. Proceed to the next instruction.

That's the gist, anyway. The specific details obviously depend on the precise variable-length encoding, and maybe we can come up with an even cleverer compact encoding, but the idea is to make it extremely compact so that shipping debug info is viable.
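To make the process above concrete, here is a minimal encoder sketch. It is not the Miden VM implementation, and several details are assumptions that the comment leaves open: the `Location` type and all constant names are hypothetical, variable-length integers are unsigned LEB128, offset deltas that do not fit inline are zigzag-encoded so that backward jumps stay small, the fully expanded form gets an assumed `0x00` tag byte so a decoder could tell it apart from the other forms, and inline deltas are capped at 30 because a delta of 31 would produce `0b11111111`, which collides with the `u8::MAX` sentinel.

```rust
/// Hypothetical resolved location: a source file index plus a byte span.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Location {
    pub file: u32,
    pub offset: u32,
    pub len: u32,
}

const NO_LOCATION: u8 = u8::MAX; // step 1: sentinel for "no source location"
const SAME_AS_PREV: u8 = 0b1000_0000; // step 2b: identical to previous location
const SAME_FILE: u8 = 0b1100_0000; // step 2c: 0b11FXXXXX prefix
const INLINE_FLAG: u8 = 0b0010_0000; // the F bit of 0b11FXXXXX
const FULL: u8 = 0; // assumption: tag byte introducing the step 3 form
const MAX_INLINE_DELTA: i64 = 30; // assumption: 31 would collide with NO_LOCATION

/// Appends `value` as an unsigned LEB128 integer.
fn write_varint(out: &mut Vec<u8>, mut value: u64) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Zigzag-maps a signed delta to an unsigned value (0, -1, 1, -2 -> 0, 1, 2, 3).
fn zigzag(value: i64) -> u64 {
    ((value << 1) ^ (value >> 63)) as u64
}

/// Encodes one optional location per instruction, in instruction order.
pub fn encode_locations(locations: &[Option<Location>]) -> Vec<u8> {
    let mut out = Vec::new();
    // The most recent instruction that had a location, if any.
    let mut prev: Option<Location> = None;
    for loc in locations {
        let Some(loc) = *loc else {
            out.push(NO_LOCATION);
            continue;
        };
        match prev {
            Some(p) if p == loc => out.push(SAME_AS_PREV),
            Some(p) if p.file == loc.file => {
                let delta = i64::from(loc.offset) - i64::from(p.offset);
                if (0..=MAX_INLINE_DELTA).contains(&delta) {
                    // Delta fits in the low five bits; only the length follows.
                    out.push(SAME_FILE | INLINE_FLAG | delta as u8);
                } else {
                    // Low bits are ignored; delta and length both follow.
                    out.push(SAME_FILE);
                    write_varint(&mut out, zigzag(delta));
                }
                write_varint(&mut out, u64::from(loc.len));
            }
            _ => {
                out.push(FULL);
                write_varint(&mut out, u64::from(loc.file));
                write_varint(&mut out, u64::from(loc.offset));
                write_varint(&mut out, u64::from(loc.len));
            }
        }
        prev = Some(loc);
    }
    out
}
```

With this layout, an instruction that repeats the previous location costs one byte, and a nearby location in the same file typically costs two. A real format would also need a matching decoder and a decision on whether the sentinel and the inline form should share the `0b11` prefix at all.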

Originally posted by @bitwalker in https://github.com/0xPolygonMiden/miden-vm/issues/990#issuecomment-2270127545

bobbinth commented 2 months ago

Whether the MAST should contain the original sources when compiled with debug info enabled

Whether we should support a split debug info format, so that you can publish the MAST without debug info, but ship the debug info metadata separately so that it is still possible to debug the code during execution

Whether we should support stripping debug info from compiled MAST

My current thinking on this is that we should include the original MASM in MastForest as an optional component. It could be stored simply as a vector of SourceFiles. The struct itself could look something like this:

use std::{collections::BTreeMap, sync::Arc};

pub struct MastForest {
    /// All of the nodes local to the trees comprising the MAST forest.
    nodes: Vec<MastNode>,

    /// Roots of procedures defined within this MAST forest.
    roots: Vec<MastNodeId>,

    /// MASM source code of this MAST forest indexed by file name.
    source_code: BTreeMap<Arc<str>, SourceFile>,
}

Then, on deserialization, we would control whether to deserialize MastForest with or without debug info. Deserializing without debug info would omit the source_code map and would also strip all AsmOp decorators, as well as any other debug-related decorators added in the future.

We could also add something like MastForest::strip_debug_info() to strip debug info (if any) from an already instantiated MastForest.
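As a rough illustration of what such a method might do, the sketch below uses simplified stand-in types rather than the real miden-vm definitions. The `Decorator` variants, the `is_debug_only` and `has_debug_info` helpers, and the use of `String` in place of `SourceFile` are all assumptions made for the example; the `roots` field is omitted because stripping does not touch it.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Simplified stand-in for decorators attached to MAST nodes.
#[derive(Debug, Clone, PartialEq)]
pub enum Decorator {
    /// Debug-only metadata describing the originating assembly op.
    AsmOp(String),
    /// Hypothetical decorator that affects execution and must be kept.
    Advice(u32),
}

impl Decorator {
    /// Returns true for decorators that exist only for debugging.
    fn is_debug_only(&self) -> bool {
        matches!(self, Decorator::AsmOp(_))
    }
}

/// Simplified stand-in for a MAST node: only its decorators are modeled.
#[derive(Debug, Default)]
pub struct MastNode {
    pub decorators: Vec<Decorator>,
}

/// Simplified stand-in for the forest, with sources keyed by file name.
#[derive(Debug, Default)]
pub struct MastForest {
    pub nodes: Vec<MastNode>,
    pub source_code: BTreeMap<Arc<str>, String>,
}

impl MastForest {
    /// Drops embedded sources and every debug-only decorator in place.
    pub fn strip_debug_info(&mut self) {
        self.source_code.clear();
        for node in &mut self.nodes {
            node.decorators.retain(|d| !d.is_debug_only());
        }
    }

    /// Reports whether any sources or debug-only decorators remain.
    pub fn has_debug_info(&self) -> bool {
        !self.source_code.is_empty()
            || self
                .nodes
                .iter()
                .any(|n| n.decorators.iter().any(Decorator::is_debug_only))
    }
}
```

Classifying decorators through a single predicate means the same check could back both the in-place strip and a deserializer that skips debug data as it reads.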

bitwalker commented 2 months ago

Yeah that's more or less what I had in mind - though I was imagining that we'd maybe store it as part of the theoretical Package type, and simply supply it alongside the MastForest when constructing a Process with debugging enabled. Having it in the MastForest might simplify some things though.

One thing to note: the sources are unlikely to be MASM in general - the vast majority of the time I expect they will be Rust sources, or other high-level language sources. It doesn't actually matter what the sources are, but we should be sure not to assume anything about them, other than that the SourceSpan identifies the specific byte offsets in the relevant SourceFile containing the code from which the given MAST instruction was derived. Depending on how far that code is, in terms of abstraction level, from the underlying MAST, there could be a large number of instructions corresponding to a single line of source code.
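A small sketch of that language-agnostic view: the debugger only needs to slice bytes out of a file and, for display, convert an offset to a line and column. The `SourceFile` and `SourceSpan` shapes below are hypothetical simplifications, not the actual miden-vm types, and columns are counted in bytes.

```rust
/// Hypothetical half-open byte range `[start, end)` into a source file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceSpan {
    pub start: usize,
    pub end: usize,
}

/// Hypothetical source file: a name plus its full text in any language.
#[derive(Debug, Clone)]
pub struct SourceFile {
    pub name: String,
    pub content: String,
}

impl SourceFile {
    /// Returns the text covered by `span`, or None if the span is out of
    /// bounds, reversed, or does not fall on UTF-8 character boundaries.
    pub fn snippet(&self, span: SourceSpan) -> Option<&str> {
        self.content.get(span.start..span.end)
    }

    /// Converts a byte offset to a 1-based (line, column) pair, with the
    /// column measured in bytes from the start of the line.
    pub fn line_col(&self, offset: usize) -> Option<(usize, usize)> {
        let prefix = self.content.get(..offset)?;
        let line = prefix.bytes().filter(|b| *b == b'\n').count() + 1;
        let line_start = prefix.rfind('\n').map_or(0, |i| i + 1);
        Some((line, offset - line_start + 1))
    }
}
```

Nothing here inspects the syntax of the file, so the same lookup serves Rust, MASM, or any other source language, and many MAST instructions can share one span.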

The main reason I bring that up is that how we encode the actual locations matters for the size of the files we generate. The actual text files are small in comparison to the raw location data if you store it fully expanded (i.e. where each instruction has its own Location). A key thing we'll need to do is determine how to encode that data efficiently, so that streams of tens of instructions that all share the same location don't bloat the size of the debug info by a factor of 10+.
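To put rough numbers on that, the sketch below collapses consecutive identical locations into runs and compares byte counts. It uses plain run-length encoding purely as an illustration, not as the proposed format, and it assumes a hypothetical 12-byte `Location` made of three `u32` fields plus a `u32` run counter.

```rust
use std::mem::size_of;

/// Hypothetical fully expanded location record (three u32 fields, 12 bytes).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Location {
    pub file: u32,
    pub offset: u32,
    pub len: u32,
}

/// Collapses consecutive equal items into (item, run length) pairs.
pub fn run_length_encode<T: PartialEq + Copy>(items: &[T]) -> Vec<(T, u32)> {
    let mut runs: Vec<(T, u32)> = Vec::new();
    for &item in items {
        if let Some((last, count)) = runs.last_mut() {
            if *last == item {
                *count += 1;
                continue;
            }
        }
        runs.push((item, 1));
    }
    runs
}

/// Bytes needed when every instruction stores its own Location.
pub fn expanded_bytes(instructions: usize) -> usize {
    instructions * size_of::<Location>()
}

/// Bytes needed when each run stores one Location plus a u32 count.
pub fn run_length_bytes(runs: usize) -> usize {
    runs * (size_of::<Location>() + size_of::<u32>())
}
```

For 30 instructions that fall into three runs, the expanded form takes 360 bytes under these assumptions, while the run-length form takes 48.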

Something like strip_debug_info would be good to have for publishing on-chain, I would imagine - supplying debug info there is unlikely to be viable, but we don't want to have to recompile code just to strip the debug info. Ideally it can be done as a post-processing step like you describe.

Finish debugging improvements #1440