Open · bitwalker opened 5 days ago

> This is a great write up - thank you! I really like the idea of representing nodes as dynamic objects. This does look like a fairly significant refactoring though, and I do wonder whether we should attempt it at this moment. Maybe the approach could be: `DecoratorSpan`s as described in #1122 (comment). This should be relatively easy to do.

@bobbinth Agreed, I was imagining this would supersede #1490, while also addressing more generally the problem of `MastForest` drop glue. It's possible that the `DecoratorSpan` implementation will address the issue with dropping decorators specifically, at least to the point where it gives us time to approach the larger refactoring proposed here, so I think it's a perfectly valid approach to start there and work our way up to something like I've described above as/if needed.
## Issue

While investigating 0xPolygonMiden/compiler#317, I discovered via flamegraph analysis that up to 30% of program runtime was being spent just `drop`ing vectors used to store decorators.

The conditions in which this occurs are:

* Nodes have an `AsmOp` decorator at a minimum. For programs compiled from Rust, this is basically every node in the program.
* `Library`s are loaded into memory, potentially cloned up to two times for reasons of ownership, and then dropped when no longer used.

The issue was noticed in our test suite, prior to the change in this repo to allocating `Library`s in `Arc`, and cloning those rather than the actual `Library`. Even though the compiler was caching instances of `Library` and only cloning them (similar to what is now done in `next`), we observed the excessive time spent in `drop_in_place`.

## Impact
However, even though the recent changes in `next` make this a non-issue for `miden-stdlib` (and I believe `miden-base`; I'm not sure if that library is also allocated with a `'static` lifetime), for the time being it will affect any other library, even if allocated via `Arc`. The use of `Arc` does mean that clones are cheap and don't require dropping; however, the actual `Library` still gets dropped when all of its referencing `Arc`s are dropped. In one-shot compiler-like scenarios this is a small difference, but the time spent dropping the data will still impact user-perceived execution time of the compiler/assembler and executor.

The compiler is also planning to switch to a `rustup`-like toolchain management approach for the libraries and tools it uses, which means that there will be no benefit to the `miden-stdlib` strategy of caching the `Library` allocation in a static. All libraries, regardless of whether they are fundamental or user-defined, will be treated the same, so this affects all programs, not just a subset.

## Background
NOTE: The way decorators are managed was recently changed in `next`, but it may actually make the problem worse rather than better, as we now have two `Vec`s for decorators (to partition before and after decorators), rather than just one.

As mentioned above, the problem is that every `MastNode` must be dropped individually when dropping a `MastForest`. If each `MastNode` also has two `Vec`s, both vectors must be dropped as well, unless they are empty, which is rarely the case when debug info is enabled.

More generally though, any heap-allocated type stored in a `MastNode` would present a similar issue. `Vec` is particularly bad because all of its elements must be dropped as well, but it's less about that, and more about the amount of drop glue that is executed when a `MastForest` is dropped - it ends up being a lot.

As programs grow in size, this problem becomes more egregious too - each new function adds anywhere from tens to hundreds of nodes, so the growth in time spent dropping things is superlinear.
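The difference in drop cost is easy to observe directly. Below is a small, self-contained sketch - the type names are hypothetical stand-ins, not the actual VM definitions - contrasting a node that carries two decorator `Vec`s with a drop-glue-free, handle-sized node:

```rust
use std::time::Instant;

// Hypothetical stand-ins for the shapes described above: a node carrying
// two decorator vectors vs. a plain, handle-sized node with no drop glue.
struct DecoratedNode {
    _before: Vec<u32>,
    _after: Vec<u32>,
}

#[derive(Clone, Copy)]
struct HandleNode(u32);

fn main() {
    // `needs_drop` reports whether a type has any drop glue at all.
    assert!(std::mem::needs_drop::<DecoratedNode>());
    assert!(!std::mem::needs_drop::<HandleNode>());

    // Dropping a forest of decorated nodes must visit every node, and both
    // of its vectors, individually...
    let forest: Vec<DecoratedNode> = (0..100_000u32)
        .map(|i| DecoratedNode { _before: vec![i], _after: vec![i] })
        .collect();
    let start = Instant::now();
    drop(forest);
    println!("decorated drop: {:?}", start.elapsed());

    // ...while a forest of drop-glue-free handles is freed as a single
    // deallocation, no matter how large it is.
    let forest: Vec<HandleNode> = (0..100_000u32).map(HandleNode).collect();
    let start = Instant::now();
    drop(forest);
    println!("handle drop: {:?}", start.elapsed());
}
```

On my understanding of the flamegraphs above, the first timing is dominated by exactly the per-element `drop_in_place` calls this issue is about.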
## Proposed Solution
Various refactorings of decorators and MAST are still ongoing, some of which may address this issue. However, I have a specific proposal that I think would also tie into those changes, so rather than wait to see what happens, I wanted to outline one option here, so it can be considered alongside the totality of other changes being made.
Specifically, I'd like to propose moving to a model where a `MastForest` (and all of its contents) is arena-allocated. A `Library` would consist not only of the same items it does today, but also the arena used to allocate the contents of its `MastForest`. This is actually along the lines of what my original design for the MAST refactoring proposed, i.e. the "table-based" structure, which was based on the way the compiler structures dataflow/control-flow graphs - the actual entities in those compiler structures are arena-allocated. We didn't go that far with the implementation of the proposal, but that was primarily to keep the initial implementation simpler so that it could evolve a bit first.

The benefits of a region-based memory management scheme for `MastForest` are as follows:

* No individual contents of the `MastForest` need to be dropped - instead the arena is dropped. This immediately solves the performance issue reported here.
* Everything in the `MastForest` is in close proximity in memory, and because nodes are allocated in an order roughly approximating the order in which they will be executed, in many cases the next node we care about will be in the processor cache. MAST execution isn't really performance-bound in this aspect, but it still reduces the amount of cache thrashing that occurs in more disparate structures.
* `MastNode` can be defined as a simple enum with a handle/pointer as the data member, addressing the issue of `MastNode` being very large. This would let us simplify the handling of `MastNode` in the core VM loop, by allowing us to abstract over `MastNode` behaviors using traits. See the example below.

That was a fairly involved example, but hopefully it makes the benefits obvious. I haven't yet gotten to what the actual arena implementation would look like, but that is because it could take multiple forms. My basic recommendation would be:
* Introduce a `MastArena` type to abstract over arena-specific details.
* Inside `MastArena` would be the actual arena. There are all kinds of options for this out there, but IMO, use `blink_alloc::Blink`, which can be used as an `Allocator` impl, and combines some of the nicest elements of bump- and region-based arena schemes, with support for heterogeneous types. This makes it easy to add new stuff to the arena, and also allows you to allocate everything for some item you want to store in the arena together, rather than in separate type-specific arenas. Most importantly though, the model it uses internally ensures that data never moves, so you can safely hold pointers to data in the arena.
* For `DecoratorList`, I would use something like `cranelift_entity::EntityList` if you need to mutate the list itself (i.e. resizing); it essentially abstracts over using a `Vec` as backing storage for vector-like sets of some entity type `T`. If that kind of access isn't needed, you can simply have `DecoratorList` be a wrapper around a `NonNull<[Decorator]>`, similar to how `MastNodeHandle` is used above.

This turned into a combination bug report/enhancement proposal, but hopefully the bug report part of it provides some context for why such a proposal is worth considering at this juncture.
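To make the `NonNull<[Decorator]>` wrapper idea above concrete, here is a minimal, self-contained sketch. All names are hypothetical, and a leaked boxed slice stands in for arena storage; in a real implementation the stability and lifetime guarantees would come from the arena itself:

```rust
use std::ptr::NonNull;

// Hypothetical decorator type; the real one is defined in the VM crates.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Decorator(u32);

// DecoratorList as a thin, Copy wrapper around a pointer into arena storage.
#[derive(Clone, Copy)]
struct DecoratorList(NonNull<[Decorator]>);

impl DecoratorList {
    /// SAFETY: the caller must guarantee `data` never moves and outlives
    /// every copy of the returned list - the arena provides this in practice.
    unsafe fn from_arena_slice(data: &[Decorator]) -> Self {
        DecoratorList(NonNull::from(data))
    }

    fn as_slice(&self) -> &[Decorator] {
        // SAFETY: upheld by the arena's allocation discipline (the data is
        // immobile and lives as long as the forest that owns it).
        unsafe { self.0.as_ref() }
    }
}

fn main() {
    // Stand-in for arena-allocated storage: a leaked boxed slice never moves.
    let storage: &'static [Decorator] =
        Box::leak(vec![Decorator(1), Decorator(2)].into_boxed_slice());
    let list = unsafe { DecoratorList::from_arena_slice(storage) };
    assert_eq!(list.as_slice(), &[Decorator(1), Decorator(2)]);

    // Unlike a Vec-based list, the wrapper itself has no drop glue at all.
    assert!(!std::mem::needs_drop::<DecoratorList>());
}
```

The wrapper is `Copy` and pointer-sized (plus length), so storing it inline in a node contributes nothing to the drop-glue problem described in the Background section.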
While I've gotten into the details in my proposal, I obviously haven't considered everything, so I expect we'll need to examine the ways in which this change would impact existing APIs and workflows. I'm fairly confident we can keep the fact that arenas are used at all mostly an internal implementation detail, but internally it would certainly impact a number of places in the code (mostly in the construction of `MastForest`s, and in their execution). Some edge cases, such as merging `MastForest`s, would require close attention, but would mostly continue to work as they do now.

I suspect the most annoying effect of this change would be the fact that `MastNodeHandle` cannot implement `Deref` or `AsRef`, because we must dynamically check, and guard, borrows of the underlying data. Calling `.borrow()` or `.borrow_mut()` checks the borrow and obtains a guard, which does implement those traits, but it is obviously an extra step. In many cases, of course, you can do that in just one place and then use a plain reference for the lifetime of the borrow guard, so the guard type itself doesn't need to appear in any APIs, but it does impose some constraints.
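To illustrate those ergonomics, here is a minimal sketch of the handle scheme. All names are hypothetical, and `Rc<RefCell<..>>` stands in for arena-backed storage; the point is just the shape of the API: the handle itself exposes only `.borrow()`/`.borrow_mut()`, and only the returned guard implements `Deref`:

```rust
use std::cell::{Ref, RefCell, RefMut};
use std::rc::Rc;

// Hypothetical node payloads; the real ones would live in the MastForest arena.
struct JoinNode { children: [u32; 2] }
struct BlockNode { _ops: Vec<u32> }

// Stand-in for MastNodeHandle: dynamically-checked borrows, like an
// arena-backed handle would require.
struct Handle<T>(Rc<RefCell<T>>);

impl<T> Handle<T> {
    fn new(value: T) -> Self { Handle(Rc::new(RefCell::new(value))) }
    // The handle cannot implement Deref/AsRef directly; `.borrow()` checks
    // the borrow dynamically and returns a guard, which *does* impl Deref.
    fn borrow(&self) -> Ref<'_, T> { self.0.borrow() }
    fn borrow_mut(&self) -> RefMut<'_, T> { self.0.borrow_mut() }
}

// MastNode becomes a small enum whose variants carry only handles,
// rather than large inline payloads.
enum MastNode {
    Join(Handle<JoinNode>),
    Block(Handle<BlockNode>),
}

fn main() {
    let node = MastNode::Join(Handle::new(JoinNode { children: [1, 2] }));
    let _other = MastNode::Block(Handle::new(BlockNode { _ops: vec![0] }));

    match &node {
        MastNode::Join(handle) => {
            // Borrow once, then use plain references for the guard's lifetime.
            let join = handle.borrow();
            assert_eq!(join.children, [1, 2]);
            drop(join); // release the shared borrow before borrowing mutably
            handle.borrow_mut().children = [3, 4];
            assert_eq!(handle.borrow().children, [3, 4]);
        }
        MastNode::Block(_) => unreachable!(),
    }
}
```

Note how the guard never needs to appear in a public signature: callers borrow once at the top of a scope and work with plain references from there.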