Closed MarcusTomlinson closed 5 months ago
Thanks Marcus - we should profile it and see if there's an obvious hotspot.
What's going on here is that it's duplicating the processor for each of the passthrough instances in the graph, and then compiling this code that many times, and then attempting to inline it all. I'll look through this and see why the duplication is happening - I think we should be able to avoid this.
Interestingly, the resulting optimisation pass is succeeding in turning it into a sensible bit of code. Here's the output for 1000 nodes (100 parallel branches of 10 nodes):
.build_version macos, 14, 0
.globl _initialise ; -- Begin function initialise
.p2align 2
_initialise: ; @initialise
; %bb.0:
ret
; -- End function
.globl _advanceBlock ; -- Begin function advanceBlock
.p2align 2
_advanceBlock: ; @advanceBlock
; %bb.0:
ldr w8, [x0]
cmp w8, w2
b.eq LBB1_3
; %bb.1: ; %.lr.ph.preheader
mov w9, #100
LBB1_2: ; %.lr.ph
; =>This Inner Loop Header: Depth=1
add x8, x1, w8, sxtw #2
ldr w10, [x8]
mul w10, w10, w9
str w10, [x8, #4096]
ldr w8, [x0]
add w8, w8, #1
str w8, [x0]
cmp w8, w2
b.ne LBB1_2
LBB1_3: ; %._crit_edge
str wzr, [x0]
ret
; -- End function```
It's managed to work out that this is basically a *100...
Yeah the resulting optimised graph is mad impressive. It's very fast - after waiting an hour for it compile that is ;)
It turns out it's spending significant time in an O(N^2) algorithm to make names unique, and there's basically lots of similar named functions, one per instance of this processor. So, try dropping this into cmaj_AST_Utilities.h to replace UniqueNameList:
template <typename ObjectType, typename ParentType>
struct UniqueNameList
{
UniqueNameList() = default;
UniqueNameList (const UniqueNameList&) = delete;
std::string getName (const ObjectType& o)
{
auto& name = names[std::addressof (o)];
if (name.empty())
{
auto root = static_cast<ParentType&> (*this).getRootName (o);
if (root.empty())
root = "_";
auto exists = [this] (const std::string& nameToCheck) -> bool
{
for (auto& n : names)
if (n.second == nameToCheck)
return true;
return false;
};
auto uniqueName = root;
auto& suffix = suffixes[root];
if (suffix != 0)
uniqueName = root + "_" + std::to_string (suffix++);
while (exists (uniqueName))
uniqueName = root + "_" + std::to_string (suffix++);
name = uniqueName;
}
return name;
}
void clear()
{
names.clear();
}
std::unordered_map<const ObjectType*, std::string> names;
std::unordered_map<std::string, uint32_t> suffixes;
};
This takes the runtime for 4000 nodes (500/8) from 275 secs to 9 secs on my machine, and 10,000 nodes (500/20) with this algorithm takes 65 secs.
I'll see about getting a fix something like this into the codebase.
I've pushed another performance fix, the 500/8 case is now down to 1.8 secs, and the 10,000 nodes to 9 seconds on my machine. I think that's more reasonable given the size of the graph. The next issue to resolve is to move the duplication of the code for this situation.
In testing graphs of various sizes, I observed more-or-less the following timings (2021 MacBook M1 Pro):
(x-axis = number of nodes, y-axis = number of seconds to compile*)
* or more specifically, the time it takes for
engine.link()
to return